[GIT PULL] f2fs: request for tree inclusion

2012-12-10 Thread Jaegeuk Kim
Hi Linus,

This is the first pull request for tree inclusion of Flash-Friendly File
System (F2FS) towards the 3.8 merge window.

http://lwn.net/Articles/518718/
http://lwn.net/Articles/518988/
http://en.wikipedia.org/wiki/F2FS

F2FS has been in the linux-next tree for a while, and several issues
have been resolved as described in the signed tag below.
I have also tested f2fs successfully on Linux 3.7 with the following
test scenarios.

- Reliability test:
  Run fsstress on an SSD partition.

- Robustness test:
  Repeatedly trigger sudden power-off and examine filesystem
  consistency while the reliability test is running.

So, please pull the f2fs filesystem.
If I'm missing any issues or made mistakes, please let me know.

Thanks,
Jaegeuk Kim

The following changes since commit
29594404d7fe73cd80eaa4ee8c43dcc53970c60e:

  Linux 3.7 (2012-12-10 19:30:57 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
tags/for-3.8-merge

for you to fetch changes up to e6aa9f36b2bfd6b30072c07b34f2a24becf1:

  f2fs: fix tracking parent inode number (2012-12-11 13:43:45 +0900)


Introduce a new file system, Flash-Friendly File System (F2FS), to Linux
3.8.

Highlights:
- Add the initial f2fs source code
- Fix an endian conversion bug
- Fix build failures on random configs
- Fix the power-off-recovery routine
- Minor cleanup, coding style, and typos patches

Greg Kroah-Hartman (1):
  f2fs: move proc files to debugfs

Huajun Li (1):
  f2fs: fix a typo in f2fs documentation

Jaegeuk Kim (22):
  f2fs: add document
  f2fs: add on-disk layout
  f2fs: add superblock and major in-memory structure
  f2fs: add super block operations
  f2fs: add checkpoint operations
  f2fs: add node operations
  f2fs: add segment operations
  f2fs: add file operations
  f2fs: add address space operations for data
  f2fs: add core inode operations
  f2fs: add inode operations for special inodes
  f2fs: add core directory operations
  f2fs: add xattr and acl functionalities
  f2fs: add garbage collection functions
  f2fs: add recovery routines for roll-forward
  f2fs: update Kconfig and Makefile
  f2fs: update the f2fs document
  f2fs: fix endian conversion bugs reported by sparse
  f2fs: adjust kernel coding style
  f2fs: resolve build failures
  f2fs: cleanup the f2fs_bio_alloc routine
  f2fs: fix tracking parent inode number

Namjae Jeon (10):
  f2fs: fix the compiler warning for uninitialized use of variable
  f2fs: show error in case of invalid mount arguments
  f2fs: remove unneeded memset from init_once
  f2fs: check read only condition before beginning write out
  f2fs: remove unneeded initialization
  f2fs: move error condition for mkdir at proper place
  f2fs: rewrite f2fs_bio_alloc to make it simpler
  f2fs: make use of GFP_F2FS_ZERO for setting gfp_mask
  f2fs: remove redundant call to f2fs_put_page in delete entry
  f2fs: introduce accessor to retrieve number of dentry slots

Sachin Kamat (1):
  f2fs: remove unneeded version.h header file from f2fs.h

Wei Yongjun (1):
  f2fs: remove unused variable

 Documentation/filesystems/00-INDEX |    2 +
 Documentation/filesystems/f2fs.txt |  421 +
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/f2fs/Kconfig                    |   53 +
 fs/f2fs/Makefile                   |    7 +
 fs/f2fs/acl.c                      |  414 +
 fs/f2fs/acl.h                      |   57 +
 fs/f2fs/checkpoint.c               |  794 +
 fs/f2fs/data.c                     |  702 +
 fs/f2fs/debug.c                    |  361 +
 fs/f2fs/dir.c                      |  672 +
 fs/f2fs/f2fs.h                     | 1083 +
 fs/f2fs/file.c                     |  636 +
 fs/f2fs/gc.c                       |  742 +
 fs/f2fs/gc.h                       |  117 +
 fs/f2fs/hash.c                     |   97 +
 fs/f2fs/inode.c                    |  268 +
 fs/f2fs/namei.c                    |  503 +
 fs/f2fs/node.c                     | 1764 +
 fs/f2fs/node.h                     |  353 +
 fs/f2fs/recovery.c                 |  375 +
 fs/f2fs/segment.c                  | 1791 +
 fs/f2fs/segment.h                  |  618 +
 fs/f2fs/super.c                    |  657 +
 fs/f2fs/xattr.c                    |  440 +
 fs/f2fs/xattr.h                    |  145 +
 include/linux/f2fs_fs.h            |  413 +
 include/uapi/linux/magic.h         |    1 +
 29 files changed, 13488 insertions(+)
 create mode 100644 Documentation/fi

Re: linux-next: manual merge of the akpm tree with Linus' tree

2012-12-10 Thread Glauber Costa
On 12/11/2012 09:22 AM, Stephen Rothwell wrote:
> Hi Andrew,
> 
> Today's linux-next merge of the akpm tree got a conflict in
> include/linux/gfp.h between commit caf491916b1c ("Revert "revert "Revert
> "mm: remove __GFP_NO_KSWAPD""" and associated damage") from Linus' tree
> and commit "mm: add a __GFP_KMEMCG flag" from the akpm tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no action
> is required).
> 
Fix is fine, thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mfd: wm5102: Mark only extant DSP registers volatile

2012-12-10 Thread Mark Brown
Since regmap sometimes uses volatile as a proxy for readable, simply
having a blanket condition can mark too many registers as readable.

Signed-off-by: Mark Brown 
---
 drivers/mfd/wm5102-tables.c |   11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/mfd/wm5102-tables.c b/drivers/mfd/wm5102-tables.c
index 4a01192..0317d11 100644
--- a/drivers/mfd/wm5102-tables.c
+++ b/drivers/mfd/wm5102-tables.c
@@ -1837,9 +1837,6 @@ static bool wm5102_readable_register(struct device *dev, unsigned int reg)
 
 static bool wm5102_volatile_register(struct device *dev, unsigned int reg)
 {
-   if (reg > 0x)
-   return true;
-
switch (reg) {
case ARIZONA_SOFTWARE_RESET:
case ARIZONA_DEVICE_REVISION:
@@ -1884,7 +1881,13 @@ static bool wm5102_volatile_register(struct device *dev, unsigned int reg)
case ARIZONA_MIC_DETECT_3:
return true;
default:
-   return false;
+   if ((reg >= 0x10 && reg < 0x106000) ||
+   (reg >= 0x18 && reg < 0x180800) ||
+   (reg >= 0x19 && reg < 0x194800) ||
+   (reg >= 0x1a8000 && reg < 0x1a9800))
+   return true;
+   else
+   return false;
}
 }
 
-- 
1.7.10.4



Re: feel Re: [PATCH v2 1/2] zsmalloc: add function to query object size

2012-12-10 Thread Minchan Kim
On Mon, Dec 10, 2012 at 11:19:49PM -0800, Nitin Gupta wrote:
> On 12/10/2012 07:59 PM, Minchan Kim wrote:
> > On Fri, Dec 07, 2012 at 04:45:53PM -0800, Nitin Gupta wrote:
> >> On Sun, Dec 2, 2012 at 11:52 PM, Minchan Kim  wrote:
> >>> On Sun, Dec 02, 2012 at 11:20:42PM -0800, Nitin Gupta wrote:
> 
> 
>  On Nov 30, 2012, at 5:54 AM, Minchan Kim  
>  wrote:
> 
> > On Thu, Nov 29, 2012 at 10:54:48PM -0800, Nitin Gupta wrote:
> >> Changelog v2 vs v1:
> >> - None
> >>
> >> Adds zs_get_object_size(handle) which provides the size of
> >> the given object. This is useful since the user (zram etc.)
> >> no longer has to maintain object sizes separately, saving some
> >> metadata (4 bytes per page).
> >>
> >> The object handle encodes the <page, offset> pair, which currently points
> >> to the start of the object. Now, the handle implicitly stores the size
> >> information by pointing to the object's end instead. Since zsmalloc is
> >> a slab based allocator, the start of the object can be easily 
> >> determined
> >> and the difference between the end offset encoded in the handle and the
> >> start gives us the object size.
> >>
> >> Signed-off-by: Nitin Gupta 
> > Acked-by: Minchan Kim 
> >
> > I already had a few comments on your previous version.
> > I'm OK even if you ignore them, since I can send a follow-up patch for
> > my nitpicks, but could you answer my question below?
> >
> >> ---
> >> drivers/staging/zsmalloc/zsmalloc-main.c |  177 +-
> >> drivers/staging/zsmalloc/zsmalloc.h  |1 +
> >> 2 files changed, 127 insertions(+), 51 deletions(-)
> >>
> >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
> >> b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> index 09a9d35..65c9d3b 100644
> >> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> @@ -112,20 +112,20 @@
> >> #define MAX_PHYSMEM_BITS 36
> >> #else /* !CONFIG_HIGHMEM64G */
> >> /*
> >> - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS 
> >> will just
> >> + * If this definition of MAX_PHYSMEM_BITS is used, OFFSET_BITS will 
> >> just
> >>  * be PAGE_SHIFT
> >>  */
> >> #define MAX_PHYSMEM_BITS BITS_PER_LONG
> >> #endif
> >> #endif
> >> #define _PFN_BITS(MAX_PHYSMEM_BITS - PAGE_SHIFT)
> >> -#define OBJ_INDEX_BITS(BITS_PER_LONG - _PFN_BITS)
> >> -#define OBJ_INDEX_MASK((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
> >> +#define OFFSET_BITS(BITS_PER_LONG - _PFN_BITS)
> >> +#define OFFSET_MASK((_AC(1, UL) << OFFSET_BITS) - 1)
> >>
> >> #define MAX(a, b) ((a) >= (b) ? (a) : (b))
> >> /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
> >> #define ZS_MIN_ALLOC_SIZE \
> >> -MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
> >> +MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OFFSET_BITS))
> >> #define ZS_MAX_ALLOC_SIZEPAGE_SIZE
> >>
> >> /*
> >> @@ -256,6 +256,11 @@ static int is_last_page(struct page *page)
> >>return PagePrivate2(page);
> >> }
> >>
> >> +static unsigned long get_page_index(struct page *page)
> >> +{
> >> +return is_first_page(page) ? 0 : page->index;
> >> +}
> >> +
> >> static void get_zspage_mapping(struct page *page, unsigned int 
> >> *class_idx,
> >>enum fullness_group *fullness)
> >> {
> >> @@ -433,39 +438,86 @@ static struct page *get_next_page(struct page 
> >> *page)
> >>return next;
> >> }
> >>
> >> -/* Encode <page, obj_idx> as a single handle value */
> >> -static void *obj_location_to_handle(struct page *page, unsigned long 
> >> obj_idx)
> >> +static struct page *get_prev_page(struct page *page)
> >> {
> >> -unsigned long handle;
> >> +struct page *prev, *first_page;
> >>
> >> -if (!page) {
> >> -BUG_ON(obj_idx);
> >> -return NULL;
> >> -}
> >> +first_page = get_first_page(page);
> >> +if (page == first_page)
> >> +prev = NULL;
> >> +else if (page == (struct page *)first_page->private)
> >> +prev = first_page;
> >> +else
> >> +prev = list_entry(page->lru.prev, struct page, lru);
> >>
> >> -handle = page_to_pfn(page) << OBJ_INDEX_BITS;
> >> -handle |= (obj_idx & OBJ_INDEX_MASK);
> >> +return prev;
> >>
> >> -return (void *)handle;
> >> }
> >>
> >> -/* Decode <page, obj_idx> pair from the given object handle */
> >> -static void obj_handle_to_location(unsigned long handle, struct page 
> >> **page,
> >> -unsigned long *obj_idx)
> >> +static void *encode_ptr(struct page *page, unsigned long offset)
> >> {
> >> -*pag

[GIT PULL] core/locking change for v3.8

2012-12-10 Thread Ingo Molnar
Linus,

Please pull the latest core-locking-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 
core-locking-for-linus

   HEAD: 99fb4a122e96203dfd6c67d99d908aafd20f4753 lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name()

Just a one-liner cleanup.

 Thanks,

Ingo

-->
Cyrill Gorcunov (1):
  lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name()


 kernel/lockdep_proc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/lockdep_proc.c b/kernel/lockdep_proc.c
index 91c32a0..b2c71c5 100644
--- a/kernel/lockdep_proc.c
+++ b/kernel/lockdep_proc.c
@@ -39,7 +39,7 @@ static void l_stop(struct seq_file *m, void *v)
 
 static void print_name(struct seq_file *m, struct lock_class *class)
 {
-   char str[128];
+   char str[KSYM_NAME_LEN];
const char *name = class->name;
 
if (!name) {



Re: [RFC v3] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
On Tue, Dec 11, 2012 at 08:17:42AM +0100, Mike Hommey wrote:
> On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> > - What is madvise(addr, length, MADV_VOLATILE)?
> > 
> >   It's a hint that the user delivers to the kernel so the kernel can
> >   *discard* pages in the range at any time.
> > 
> > - What happens if the user accesses a page (i.e., a virtual address)
> >   discarded by the kernel?
> > 
> >   The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).
> 
> What happened to getting SIGBUS?

I thought it would force the user to handle the signal. But if the user
receives a signal, what can he do? In my old version he could call
madvise(NOVOLATILE), but I removed that in this version so the user
doesn't need any signal handling.

The problem with madvise(NOVOLATILE) is the time delay between when the
allocator hands a free chunk to the user and when the user actually
accesses the memory. Normally the allocator should call
madvise(NOVOLATILE) when it returns a free chunk to the caller, but the
user may not access that memory until long afterwards. During that
window the pages could be swapped out, which undermines the patch's
goal.

Yes, it's not good for tmpfs volatile pages. If you are interested in
tmpfs-volatile, please look at this.

https://lkml.org/lkml/2012/12/10/695

> 
> Mike
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Kind regards,
Minchan Kim


Re: [PATCH v3 4/4] leds: leds-pwm: Add device tree bindings

2012-12-10 Thread Thierry Reding
On Mon, Dec 10, 2012 at 11:00:37AM +0100, Peter Ujfalusi wrote:
[...]
> +LED sub-node properties:
> +- pwms : PWM property, please refer to: 
> +  Documentation/devicetree/bindings/pwm/pwm.txt

Instead of only referring to the generic PWM binding document, this
should probably explain what the PWM device is used for.

> +err:
> + if (priv->num_leds > 0) {
> + for (count = priv->num_leds - 1; count >= 0; count--) {
> + led_classdev_unregister(&priv->leds[count].cdev);
> + pwm_put(priv->leds[count].pwm);
> + }
> + }

Can this not be written more simply as follows?

while (priv->num_leds--) {
...
}

>  static int led_pwm_remove(struct platform_device *pdev)
>  {
> + struct led_pwm_platform_data *pdata = pdev->dev.platform_data;
>   struct led_pwm_priv *priv = platform_get_drvdata(pdev);
>   int i;
>  
> - for (i = 0; i < priv->num_leds; i++)
> + for (i = 0; i < priv->num_leds; i++) {
>   led_classdev_unregister(&priv->leds[i].cdev);
> + if (!pdata)
> + pwm_put(priv->leds[i].pwm);
> + }

Perhaps while at it we can add devm_of_pwm_get() along with exporting
of_pwm_get() so that you don't have to special-case this?

> +static const struct of_device_id of_pwm_leds_match[] = {
> + { .compatible = "pwm-leds", },
> + {},
> +};

Doesn't this cause a compiler warning for !OF builds?

Thierry




Re: [PATCH 1/1] media: saa7146: don't use mutex_lock_interruptible() in device_release().

2012-12-10 Thread Hans Verkuil
On Tue December 11 2012 04:05:28 Cyril Roelandt wrote:
> Use an uninterruptible mutex_lock in the release() file op to make sure all
> resources are properly freed when a process is being terminated. Returning
> -ERESTARTSYS has no effect for a terminating process, and this may cause
> driver resources not to be released.

Acked-by: Hans Verkuil 

Thanks!

Hans

> This was found using the following semantic patch 
> (http://coccinelle.lip6.fr/):
> 
> 
> @r@
> identifier fops;
> identifier release_func;
> @@
> static const struct v4l2_file_operations fops = {
> .release = release_func
> };
> 
> @depends on r@
> identifier r.release_func;
> expression E;
> @@
> static int release_func(...)
> {
> ...
> - if (mutex_lock_interruptible(E)) return -ERESTARTSYS;
> + mutex_lock(E);
> ...
> }
> 
> 
> Signed-off-by: Cyril Roelandt 
> ---
>  drivers/media/common/saa7146/saa7146_fops.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/media/common/saa7146/saa7146_fops.c 
> b/drivers/media/common/saa7146/saa7146_fops.c
> index b3890bd..0afe98d 100644
> --- a/drivers/media/common/saa7146/saa7146_fops.c
> +++ b/drivers/media/common/saa7146/saa7146_fops.c
> @@ -265,8 +265,7 @@ static int fops_release(struct file *file)
>  
>   DEB_EE("file:%p\n", file);
>  
> - if (mutex_lock_interruptible(vdev->lock))
> - return -ERESTARTSYS;
> + mutex_lock(vdev->lock);
>  
>   if (vdev->vfl_type == VFL_TYPE_VBI) {
>   if (dev->ext_vv_data->capabilities & V4L2_CAP_VBI_CAPTURE)
> 


feel Re: [PATCH v2 1/2] zsmalloc: add function to query object size

2012-12-10 Thread Nitin Gupta
On 12/10/2012 07:59 PM, Minchan Kim wrote:
> On Fri, Dec 07, 2012 at 04:45:53PM -0800, Nitin Gupta wrote:
>> On Sun, Dec 2, 2012 at 11:52 PM, Minchan Kim  wrote:
>>> On Sun, Dec 02, 2012 at 11:20:42PM -0800, Nitin Gupta wrote:


 On Nov 30, 2012, at 5:54 AM, Minchan Kim  
 wrote:

> On Thu, Nov 29, 2012 at 10:54:48PM -0800, Nitin Gupta wrote:
>> Changelog v2 vs v1:
>> - None
>>
>> Adds zs_get_object_size(handle) which provides the size of
>> the given object. This is useful since the user (zram etc.)
>> no longer has to maintain object sizes separately, saving some
>> metadata (4 bytes per page).
>>
>> The object handle encodes the <page, offset> pair, which currently points
>> to the start of the object. Now, the handle implicitly stores the size
>> information by pointing to the object's end instead. Since zsmalloc is
>> a slab based allocator, the start of the object can be easily determined
>> and the difference between the end offset encoded in the handle and the
>> start gives us the object size.
>>
>> Signed-off-by: Nitin Gupta 
> Acked-by: Minchan Kim 
>
> I already had a few comments on your previous version.
> I'm OK even if you ignore them, since I can send a follow-up patch for
> my nitpicks, but could you answer my question below?
>
>> ---
>> drivers/staging/zsmalloc/zsmalloc-main.c |  177 +-
>> drivers/staging/zsmalloc/zsmalloc.h  |1 +
>> 2 files changed, 127 insertions(+), 51 deletions(-)
>>
>> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
>> b/drivers/staging/zsmalloc/zsmalloc-main.c
>> index 09a9d35..65c9d3b 100644
>> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
>> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
>> @@ -112,20 +112,20 @@
>> #define MAX_PHYSMEM_BITS 36
>> #else /* !CONFIG_HIGHMEM64G */
>> /*
>> - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will 
>> just
>> + * If this definition of MAX_PHYSMEM_BITS is used, OFFSET_BITS will just
>>  * be PAGE_SHIFT
>>  */
>> #define MAX_PHYSMEM_BITS BITS_PER_LONG
>> #endif
>> #endif
>> #define _PFN_BITS(MAX_PHYSMEM_BITS - PAGE_SHIFT)
>> -#define OBJ_INDEX_BITS(BITS_PER_LONG - _PFN_BITS)
>> -#define OBJ_INDEX_MASK((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
>> +#define OFFSET_BITS(BITS_PER_LONG - _PFN_BITS)
>> +#define OFFSET_MASK((_AC(1, UL) << OFFSET_BITS) - 1)
>>
>> #define MAX(a, b) ((a) >= (b) ? (a) : (b))
>> /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
>> #define ZS_MIN_ALLOC_SIZE \
>> -MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
>> +MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OFFSET_BITS))
>> #define ZS_MAX_ALLOC_SIZEPAGE_SIZE
>>
>> /*
>> @@ -256,6 +256,11 @@ static int is_last_page(struct page *page)
>>return PagePrivate2(page);
>> }
>>
>> +static unsigned long get_page_index(struct page *page)
>> +{
>> +return is_first_page(page) ? 0 : page->index;
>> +}
>> +
>> static void get_zspage_mapping(struct page *page, unsigned int 
>> *class_idx,
>>enum fullness_group *fullness)
>> {
>> @@ -433,39 +438,86 @@ static struct page *get_next_page(struct page 
>> *page)
>>return next;
>> }
>>
>> -/* Encode <page, obj_idx> as a single handle value */
>> -static void *obj_location_to_handle(struct page *page, unsigned long 
>> obj_idx)
>> +static struct page *get_prev_page(struct page *page)
>> {
>> -unsigned long handle;
>> +struct page *prev, *first_page;
>>
>> -if (!page) {
>> -BUG_ON(obj_idx);
>> -return NULL;
>> -}
>> +first_page = get_first_page(page);
>> +if (page == first_page)
>> +prev = NULL;
>> +else if (page == (struct page *)first_page->private)
>> +prev = first_page;
>> +else
>> +prev = list_entry(page->lru.prev, struct page, lru);
>>
>> -handle = page_to_pfn(page) << OBJ_INDEX_BITS;
>> -handle |= (obj_idx & OBJ_INDEX_MASK);
>> +return prev;
>>
>> -return (void *)handle;
>> }
>>
>> -/* Decode <page, obj_idx> pair from the given object handle */
>> -static void obj_handle_to_location(unsigned long handle, struct page 
>> **page,
>> -unsigned long *obj_idx)
>> +static void *encode_ptr(struct page *page, unsigned long offset)
>> {
>> -*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
>> -*obj_idx = handle & OBJ_INDEX_MASK;
>> +unsigned long ptr;
>> +ptr = page_to_pfn(page) << OFFSET_BITS;
>> +ptr |= offset & OFFSET_MASK;
>> +return (void *)ptr;
>> +}
>> +
>> +static void decode_ptr(unsigned long ptr, struct 

Re: [PATCH] kvm/vmx: fix the return value of handle_vmcall()

2012-12-10 Thread Gleb Natapov
On Mon, Dec 10, 2012 at 03:28:13PM -0600, Jesse Larrew wrote:
> 
> The return value of kvm_emulate_hypercall() is intended to inform callers
> whether or not we need to exit to userspace. However, handle_vmcall()
> currently ignores the return value.
> 
No, it is not. KVM does not handle vmcalls in userspace.

> This patch simply propagates the return value from kvm_emulate_hypercall()
> to callers so that it can be acted upon appropriately.
> 
> Signed-off-by: Jesse Larrew 
> ---
>  arch/x86/kvm/vmx.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f858159..8b37f5f 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4682,8 +4682,7 @@ static int handle_halt(struct kvm_vcpu *vcpu)
>  static int handle_vmcall(struct kvm_vcpu *vcpu)
>  {
>   skip_emulated_instruction(vcpu);
> - kvm_emulate_hypercall(vcpu);
> - return 1;
> + return kvm_emulate_hypercall(vcpu);
>  }
>  
>  static int handle_invd(struct kvm_vcpu *vcpu)
> -- 
> 1.7.11.7
> 
> Jesse Larrew
> Software Engineer, KVM Team
> IBM Linux Technology Center
> Phone: (512) 973-2052 (T/L: 363-2052)
> jlar...@linux.vnet.ibm.com
> 

--
Gleb.


Re: [RFC v3] Support volatile range for anon vma

2012-12-10 Thread Mike Hommey
On Tue, Dec 11, 2012 at 11:41:04AM +0900, Minchan Kim wrote:
> - What is madvise(addr, length, MADV_VOLATILE)?
> 
>   It's a hint that the user delivers to the kernel so the kernel can
>   *discard* pages in the range at any time.
> 
> - What happens if the user accesses a page (i.e., a virtual address)
>   discarded by the kernel?
> 
>   The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).

What happened to getting SIGBUS?

Mike


Re: [PATCH 1/2] regulator: lp3971: Convert to get_voltage_sel

2012-12-10 Thread Marek Szyprowski

Hello,

On 12/10/2012 12:46 PM, Axel Lin wrote:

regulator_list_voltage_table() returns -EINVAL if selector >= n_voltages.
Thus we don't need to check if reg is greater than BUCK_TARGET_VOL_MAX_IDX in
lp3971_dcdc_get_voltage_sel.

BUCK_TARGET_VOL_MIN_IDX and BUCK_TARGET_VOL_MAX_IDX are not used, remove them.

Signed-off-by: Axel Lin 


Acked-by: Marek Szyprowski 


---
  drivers/regulator/lp3971.c |   22 ++
  1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/drivers/regulator/lp3971.c b/drivers/regulator/lp3971.c
index 5f68ff1..9cb2c0f 100644
--- a/drivers/regulator/lp3971.c
+++ b/drivers/regulator/lp3971.c
@@ -73,8 +73,6 @@ static const unsigned int buck_voltage_map[] = {
  };

  #define BUCK_TARGET_VOL_MASK 0x3f
-#define BUCK_TARGET_VOL_MIN_IDX 0x01
-#define BUCK_TARGET_VOL_MAX_IDX 0x19

  #define LP3971_BUCK_RAMP_REG(x)   (buck_base_addr[x]+2)

@@ -140,7 +138,7 @@ static int lp3971_ldo_disable(struct regulator_dev *dev)
return lp3971_set_bits(lp3971, LP3971_LDO_ENABLE_REG, mask, 0);
  }

-static int lp3971_ldo_get_voltage(struct regulator_dev *dev)
+static int lp3971_ldo_get_voltage_sel(struct regulator_dev *dev)
  {
struct lp3971 *lp3971 = rdev_get_drvdata(dev);
int ldo = rdev_get_id(dev) - LP3971_LDO1;
@@ -149,7 +147,7 @@ static int lp3971_ldo_get_voltage(struct regulator_dev *dev)
reg = lp3971_reg_read(lp3971, LP3971_LDO_VOL_CONTR_REG(ldo));
val = (reg >> LDO_VOL_CONTR_SHIFT(ldo)) & LDO_VOL_CONTR_MASK;

-   return dev->desc->volt_table[val];
+   return val;
  }

  static int lp3971_ldo_set_voltage_sel(struct regulator_dev *dev,
@@ -168,7 +166,7 @@ static struct regulator_ops lp3971_ldo_ops = {
.is_enabled = lp3971_ldo_is_enabled,
.enable = lp3971_ldo_enable,
.disable = lp3971_ldo_disable,
-   .get_voltage = lp3971_ldo_get_voltage,
+   .get_voltage_sel = lp3971_ldo_get_voltage_sel,
.set_voltage_sel = lp3971_ldo_set_voltage_sel,
  };

@@ -201,24 +199,16 @@ static int lp3971_dcdc_disable(struct regulator_dev *dev)
return lp3971_set_bits(lp3971, LP3971_BUCK_VOL_ENABLE_REG, mask, 0);
  }

-static int lp3971_dcdc_get_voltage(struct regulator_dev *dev)
+static int lp3971_dcdc_get_voltage_sel(struct regulator_dev *dev)
  {
struct lp3971 *lp3971 = rdev_get_drvdata(dev);
int buck = rdev_get_id(dev) - LP3971_DCDC1;
u16 reg;
-   int val;

reg = lp3971_reg_read(lp3971, LP3971_BUCK_TARGET_VOL1_REG(buck));
reg &= BUCK_TARGET_VOL_MASK;

-   if (reg <= BUCK_TARGET_VOL_MAX_IDX)
-   val = buck_voltage_map[reg];
-   else {
-   val = 0;
-   dev_warn(&dev->dev, "chip reported incorrect voltage value.\n");
-   }
-
-   return val;
+   return reg;
  }

  static int lp3971_dcdc_set_voltage_sel(struct regulator_dev *dev,
@@ -249,7 +239,7 @@ static struct regulator_ops lp3971_dcdc_ops = {
.is_enabled = lp3971_dcdc_is_enabled,
.enable = lp3971_dcdc_enable,
.disable = lp3971_dcdc_disable,
-   .get_voltage = lp3971_dcdc_get_voltage,
+   .get_voltage_sel = lp3971_dcdc_get_voltage_sel,
.set_voltage_sel = lp3971_dcdc_set_voltage_sel,
  };





Best regards
--
Marek Szyprowski
Samsung Poland R&D Center



Re: [PATCH v3 2/4] leds: leds-pwm: Preparing the driver for device tree support

2012-12-10 Thread Thierry Reding
On Mon, Dec 10, 2012 at 11:00:35AM +0100, Peter Ujfalusi wrote:
> In order to be able to add device tree support for the leds-pwm driver we
> need to rearrange the data structures used by the driver.
> 
> Signed-off-by: Peter Ujfalusi 
> ---
>  drivers/leds/leds-pwm.c | 39 +++
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/leds/leds-pwm.c b/drivers/leds/leds-pwm.c
> index 351257c..02f0c0c 100644
> --- a/drivers/leds/leds-pwm.c
> +++ b/drivers/leds/leds-pwm.c
> @@ -30,6 +30,11 @@ struct led_pwm_data {
>   unsigned intperiod;
>  };
>  
> +struct led_pwm_priv {
> + int num_leds;
> + struct led_pwm_data leds[];
> +};

I think you want leds[0] here. Otherwise your structure is too large by
sizeof(struct led_pwm_data *).

> +
>  static void led_pwm_set(struct led_classdev *led_cdev,
>   enum led_brightness brightness)
>  {
> @@ -47,25 +52,29 @@ static void led_pwm_set(struct led_classdev *led_cdev,
>   }
>  }
>  
> +static inline int sizeof_pwm_leds_priv(int num_leds)

Perhaps this should return size_t?

> +{
> + return sizeof(struct led_pwm_priv) +
> +   (sizeof(struct led_pwm_data) * num_leds);
> +}
> +
>  static int led_pwm_probe(struct platform_device *pdev)
>  {
>   struct led_pwm_platform_data *pdata = pdev->dev.platform_data;
> - struct led_pwm *cur_led;
> - struct led_pwm_data *leds_data, *led_dat;
> + struct led_pwm_priv *priv;
>   int i, ret = 0;
>  
>   if (!pdata)
>   return -EBUSY;
>  
> - leds_data = devm_kzalloc(&pdev->dev,
> - sizeof(struct led_pwm_data) * pdata->num_leds,
> - GFP_KERNEL);
> - if (!leds_data)
> + priv = devm_kzalloc(&pdev->dev, sizeof_pwm_leds_priv(pdata->num_leds),
> + GFP_KERNEL);

I'm not sure if sizeof_pwm_leds_priv() requires to be a separate
function. You could make it shorter by doing something like:

size_t extra = sizeof(*led_dat) * pdata->num_leds;

priv = devm_kzalloc(&pdev->dev, sizeof(*priv) + extra, GFP_KERNEL);

But that's really just a matter of taste, so no further objections if
you want to keep the inline function.

Thierry




Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2012-12-10 Thread Mike Galbraith
On Mon, 2012-12-10 at 20:53 -0500, Steven Rostedt wrote: 
> On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
> 
> > I should have also mentioned some previous experience using IPIs to
> > avoid runq lock contention on wake up.  Someone encountered IPI
> > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > for CONFIG_PREEMPT_RT_FULL:
> > 
> >   #ifndef CONFIG_PREEMPT_RT_FULL
> >   /*
> >* Queue remote wakeups on the target CPU and process them
> >* using the scheduler IPI. Reduces rq->lock contention/bounces.
> >*/
> >   SCHED_FEAT(TTWU_QUEUE, true)
> >   #else
> >   SCHED_FEAT(TTWU_QUEUE, false)
> > 
> 
> Interesting, but I'm wondering if this also does it for every wakeup? If
> you have 1000 tasks waking up on another CPU, this could potentially
> send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> waking up, and perhaps more than that, as there could be multiple
> instances that try to wake up the same task.

Yeah.  In mainline, wakeup via IPI is disabled within a socket, because
it's too much of a performance hit for high frequency switchers.  (It
seems we're limited by the max rate at which we can IPI)

-Mike



Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-10 Thread Yinghai Lu
On Mon, Dec 10, 2012 at 10:34 PM, H. Peter Anvin  wrote:
>
> That doesn't work if the microcode is replaced at runtime.  However, vmalloc
> doesn't work either since 32 bits needs any one blob to be physically
> contiguous.  I have suggested Fenghua replace it with a linked list of
> kmalloc areas, one for each blob.

You mean: keep all of the versions, and the update code needs to walk
the list to find the latest one before applying the update?

BTW, do we really need to update the microcode so early?

Yinghai


[GIT PULL] clk: changes for 3.8

2012-12-10 Thread Mike Turquette
The following changes since commit 8f0d8163b50e01f398b14bcd4dc039ac5ab18d64:

  Linux 3.7-rc3 (2012-10-28 12:24:48 -0700)

are available in the git repository at:

  git://git.linaro.org/people/mturquette/linux.git tags/clk-for-linus

for you to fetch changes up to 8f87189653d60656e262060665f52c855508a301:

  MAINTAINERS: bad email address for Mike Turquette (2012-12-10 22:35:32 -0800)


The common clock framework changes for 3.8 are comprised of lots of
fixes for existing platforms as well as new ports for some ARM
platforms.  In addition there are new clk drivers for audio devices and
MFDs.


Axel Lin (1):
  clk: spear: Add stub functions for spear3[0|1|2]0_clk_init()

Deepak Sikri (2):
  CLK: SPEAr: Update clock rate table
  CLK: SPEAr: Correct index scanning done for clock synths

Fabio Estevam (1):
  clk: mxs: Use a better name for the USB PHY clock

Linus Walleij (4):
  clk: add GPLv2 headers to the Versatile clock files
  clk: make ICST driver handle the VCO registers
  clk: move IM-PD1 clocks to drivers/clk
  clk: ux500: fix bit error

Martin Fuzzey (1):
  clk: clock multiplexers may register out of order

Mike Turquette (2):
  clk: introduce optional disable_unused callback
  MAINTAINERS: bad email address for Mike Turquette

Pawel Moll (2):
  clk: Versatile Express clock generators ("osc") driver
  clk: Common clocks implementation for Versatile Express

Peter Ujfalusi (1):
  CLK: clk-twl6040: Initial clock driver for OMAP4+ McPDM fclk clock

Rajeev Kumar (1):
  CLK: SPEAr: Fix dev_id & con_id for multiple clocks

Shiraz Hashim (2):
  CLK: SPEAr13xx: Fix mux clock names
  CLK: SPEAr13xx: fix parent names of multiple clocks

Stephen Boyd (6):
  clk: Document .is_enabled op
  clk: Fix documentation typos
  clk: Don't return negative numbers for unsigned values with !clk
  clk: wm831x: Fix clk_register() error code checking
  clk: Add devm_clk_{register,unregister}()
  clk: wm831x: Use devm_clk_register() to simplify code

Tony Prisk (1):
  CLK: vt8500: Fix SDMMC clk special cases

Ulf Hansson (19):
  mfd: dbx500: Export prmcu_request_ape_opp_100_voltage
  clk: ux500: Support prcmu ape opp voltage clock
  clk: ux500: Update sdmmc clock to 100MHz for u8500
  ARM: ux500: Remove cpufreq platform device
  mfd: db8500: Provide cpufreq table as platform data
  cpufreq: db8500: Register as a platform driver
  cpufreq: db8500: Fetch cpufreq table from platform data
  mfd: db8500: Connect ARMSS clk to ARM OPP
  clk: ux500: Support for prcmu_scalable_rate clock
  clk: ux500: Add armss clk and fixup smp_twd clk for u8500
  cpufreq: db8500: Use armss clk to update frequency
  clk: ux500: Register i2c clock lookups for u8500
  clk: ux500: Register ssp clock lookups for u8500
  clk: ux500: Register msp clock lookups for u8500
  clk: ux500: Update rtc clock lookup for u8500
  clk: ux500: Register slimbus clock lookups for u8500
  clk: ux500: Register rng clock lookups for u8500
  clk: ux500: Register nomadik keypad clock lookups for u8500
  clk: ux500: Initial support for abx500 clock driver

Vipul Kumar Samar (3):
  CLK: SPEAr: Set CLK_SET_RATE_PARENT for few clocks
  CLK: SPEAr: Add missing clocks
  CLK: SPEAr: Remove unused dummy apb_pclk

Viresh Kumar (1):
  clk: SPEAr: Vco-pll: Fix compilation warning

Wei Yongjun (4):
  clk: fix return value check in of_fixed_clk_setup()
  clk: fix return value check in sirfsoc_of_clk_init()
  clk: fix return value check in bcm2835_init_clocks()
  CLK: clk-twl6040: fix return value check in twl6040_clk_probe()

 .../devicetree/bindings/clock/imx23-clock.txt  |2 +-
 .../devicetree/bindings/clock/imx28-clock.txt  |4 +-
 MAINTAINERS|1 -
 arch/arm/include/asm/hardware/sp810.h  |2 +
 arch/arm/mach-integrator/impd1.c   |   69 +-
 arch/arm/mach-ux500/cpu-db8500.c   |6 -
 drivers/clk/Kconfig|   16 +-
 drivers/clk/Makefile   |1 +
 drivers/clk/clk-bcm2835.c  |8 +-
 drivers/clk/clk-fixed-rate.c   |2 +-
 drivers/clk/clk-prima2.c   |   84 +++
 drivers/clk/clk-twl6040.c  |  126 +++
 drivers/clk/clk-vt8500.c   |   18 ++
 drivers/clk/clk-wm831x.c   |   34 +--
 drivers/clk/clk.c  |  154 ++---
 drivers/clk/mxs/clk-imx23.c|6 +-
 drivers/clk/mxs/clk-imx28.c|   10 +-
 drivers/clk/spear/clk-aux-synth.c  |3 +-
 drivers/clk/spear

Re: [PATCH v3 1/4] leds: leds-pwm: Convert to use devm_get_pwm

2012-12-10 Thread Thierry Reding
On Mon, Dec 10, 2012 at 11:00:34AM +0100, Peter Ujfalusi wrote:
> Update the driver to use the new API for requesting pwm so we can take
> advantage of the pwm_lookup table to find the correct pwm to be used for the
> LED functionality.
> If the devm_get_pwm fails we fall back to legacy mode to try to get the pwm.
> 
> Signed-off-by: Peter Ujfalusi 
> ---
>  drivers/leds/leds-pwm.c  | 19 ++-
>  include/linux/leds_pwm.h |  2 +-
>  2 files changed, 7 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/leds/leds-pwm.c b/drivers/leds/leds-pwm.c
> index 2157524..351257c 100644
> --- a/drivers/leds/leds-pwm.c
> +++ b/drivers/leds/leds-pwm.c
> @@ -67,12 +67,11 @@ static int led_pwm_probe(struct platform_device *pdev)
>   cur_led = &pdata->leds[i];
>   led_dat = &leds_data[i];
>  
> - led_dat->pwm = pwm_request(cur_led->pwm_id,
> - cur_led->name);
> + led_dat->pwm = devm_pwm_get(&pdev->dev, cur_led->name);
>   if (IS_ERR(led_dat->pwm)) {
>   ret = PTR_ERR(led_dat->pwm);
> - dev_err(&pdev->dev, "unable to request PWM %d\n",
> - cur_led->pwm_id);
> + dev_err(&pdev->dev, "unable to request PWM for %s\n",
> + cur_led->name);
>   goto err;
>   }

The commit message says that legacy mode is used as fallback if
devm_get_pwm() (that should really be devm_pwm_get() btw) fails but I
don't see where pwm_request() is called.
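For what it's worth, the fallback described in the commit message would presumably look something like this (a sketch only, not taken from the patch):

```c
led_dat->pwm = devm_pwm_get(&pdev->dev, cur_led->name);
if (IS_ERR(led_dat->pwm)) {
	/* Legacy fallback: look the PWM up by its numeric pwm_id. */
	led_dat->pwm = pwm_request(cur_led->pwm_id, cur_led->name);
	if (IS_ERR(led_dat->pwm)) {
		ret = PTR_ERR(led_dat->pwm);
		dev_err(&pdev->dev, "unable to request PWM for %s\n",
			cur_led->name);
		goto err;
	}
}
```

Note that a PWM obtained via the legacy pwm_request() would also need an explicit pwm_free() on the removal path, unlike the devm-managed one.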

Thierry


pgpDhuDzzX4rT.pgp
Description: PGP signature


Re: [PATCH v3 3/4] pwm: core: Export of_pwm_request() so client drivers can also use it

2012-12-10 Thread Thierry Reding
On Mon, Dec 10, 2012 at 11:00:36AM +0100, Peter Ujfalusi wrote:
> Allow client drivers to use of_pwm_request() to get the PWMs they need. This
> is needed for drivers which handle more than one PWM separately, like the
> leds-pwm driver, which has:

Hi Peter,

I really was hoping that we didn't have to export this function, but I
can't think of any other way to solve the problem at hand either. I'd
prefer to rename the function to of_pwm_get() at the same time to keep
consistent with other subsystems that provide similar functionality.
Also, please use all-caps for PWM in prose. And while at it, you can
drop the "core:" and "so client drivers can also use it" from the
subject line.

> pwmleds {
>   compatible = "pwm-leds";
>   kpad {
>   label = "omap4::keypad";
>   pwms = <&twl_pwm 0 7812500>;
>   max-brightness = <127>;
>   };
> 
>   charging {
>   label = "omap4:green:chrg";
>   pwms = <&twl_pwmled 0 7812500>;
>   max-brightness = <255>;
>   };
> };
> 
> in the dts files.
> 
> Signed-off-by: Peter Ujfalusi 
> ---
>  drivers/pwm/core.c  | 2 +-
>  include/linux/pwm.h | 7 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
> index 903138b..3a7ebcc 100644
> --- a/drivers/pwm/core.c
> +++ b/drivers/pwm/core.c
> @@ -486,7 +486,7 @@ static struct pwm_chip *of_node_to_pwmchip(struct 
> device_node *np)
>   * becomes mandatory for devices that look up the PWM device via the con_id
>   * parameter.
>   */
> -static struct pwm_device *of_pwm_request(struct device_node *np,
> +struct pwm_device *of_pwm_request(struct device_node *np,
>const char *con_id)
>  {
>   struct pwm_device *pwm = NULL;

This is missing an EXPORT_SYMBOL_GPL.
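i.e. something like the following, folding in the rename suggested above (a sketch, not the final patch):

```c
struct pwm_device *of_pwm_get(struct device_node *np, const char *con_id)
{
	struct pwm_device *pwm = NULL;

	/* ... existing lookup body, unchanged ... */

	return pwm;
}
EXPORT_SYMBOL_GPL(of_pwm_get);
```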

> diff --git a/include/linux/pwm.h b/include/linux/pwm.h
> index 6d661f3..d70ffe3 100644
> --- a/include/linux/pwm.h
> +++ b/include/linux/pwm.h
> @@ -175,6 +175,7 @@ struct pwm_device *of_pwm_xlate_with_flags(struct 
> pwm_chip *pc,
>   const struct of_phandle_args *args);
>  
>  struct pwm_device *pwm_get(struct device *dev, const char *consumer);
> +struct pwm_device *of_pwm_request(struct device_node *np, const char 
> *con_id);

While at it, maybe rename the con_id parameter as well to match
pwm_get().

Thierry


pgp23R2yAVsxG.pgp
Description: PGP signature


performance drop after using blkcg

2012-12-10 Thread Zhao Shuai
Hi,

I plan to use blkcg (proportional BW) in my system, but I encountered a
large performance drop after enabling blkcg.

The testing tool is fio (version 2.0.7) and both the BW and IOPS fields
are recorded. Two instances of the fio program are run simultaneously,
each operating on a separate disk file (say /data/testfile1, /data/testfile2).

System environment:
kernel: 3.7.0-rc5
CFQ's slice_idle is disabled (slice_idle=0) while group_idle is
enabled (group_idle=8).

FIO configuration(e.g. "read") for the first fio program(say FIO1):

[global]
description=Emulation of Intel IOmeter File Server Access Pattern

[iometer]
bssplit=4k/30:8k/40:16k/30
rw=read
direct=1
time_based
runtime=180s
ioengine=sync
filename=/data/testfile1
numjobs=32
group_reporting


Result before using blkcg (BW values are in KB/s):

   FIO1 BW/IOPSFIO2 BW/IOPS
---
read   26799/2911  25861/2810
write  138618/15071138578/15069
rw 72159/7838(r)   71851/7811(r)
   72171/7840(w)   71799/7805(w)
randread   4982/5435370/585
randwrite  5192/5666010/654
randrw 2369/258(r) 3027/330(r)
   2369/258(w) 3016/328(w)

Result after enabling blkcg (two blkio cgroups were created with the
default blkio.weight of 500, and FIO1 and FIO2 were placed into these
cgroups respectively):

   FIO1 BW/IOPSFIO2 BW/IOPS
---
read   36651/3985  36470/3943
write  75738/8229  75641/8221
rw 49169/5342(r)   49168/5346(r)
   49200/5348(w)   49140/5341(w)
randread   4876/5324905/534
randwrite  5535/6035497/599
randrw 2521/274(r) 2527/275(r)
   2510/273(w) 2532/274(w)

Comparing these results, we see a large performance drop
(30%-40%) in some test cases (especially the "write" and "rw" cases).
Is it normal for write/rw bandwidth to decrease by 40% after enabling
blkio-cgroup? If not, is there any way to improve or tune the performance?
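For reference, the setup described above can be reproduced roughly as follows (cgroup v1; the mount point, PIDs, and the disk name sdX are placeholders and vary by system):

```sh
# Mount the v1 blkio controller if it is not already mounted.
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

# Create one group per fio instance; 500 is the default blkio.weight.
mkdir /sys/fs/cgroup/blkio/fio1 /sys/fs/cgroup/blkio/fio2
echo 500 > /sys/fs/cgroup/blkio/fio1/blkio.weight
echo 500 > /sys/fs/cgroup/blkio/fio2/blkio.weight

# Move each running fio instance into its group before the run.
echo "$FIO1_PID" > /sys/fs/cgroup/blkio/fio1/tasks
echo "$FIO2_PID" > /sys/fs/cgroup/blkio/fio2/tasks

# CFQ tunables used in the test above.
echo 0 > /sys/block/sdX/queue/iosched/slice_idle
echo 8 > /sys/block/sdX/queue/iosched/group_idle
```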

Thanks.

--
Regards,
Zhao Shuai


[GIT PULL] pinctrl changes for v3.8

2012-12-10 Thread Linus Walleij
Hi Linus,

these are the first and major pinctrl changes for the v3.8 merge cycle.
Some of this is used as merge base for other trees so I better be early
on the trigger.

The major changes are described in the signed tag. All has been in
linux-next for a while.

This is the first time I've had to pull in external branches and use
some parallel topics for pinctrl so if I've done it wrong somehow
just tell me.

Anyway, please pull it in!

Yours,
Linus Walleij

The following changes since commit 77b67063bb6bce6d475e910d3b886a606d0d91f7:

  Linux 3.7-rc5 (2012-11-11 13:44:33 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl.git
tags/pinctrl-for-v3.8

for you to fetch changes up to 7c8f86a451fe8c010eb93c62d4d69727ccdbe435:

  ARM: mmp: select PINCTRL for ARCH_MMP (2012-12-02 00:09:09 +0100)


This is the pinctrl big pull request for v3.8.

As can be seen from the diffstat the major changes
are:

- A big conversion of the AT91 pinctrl driver and
  the associated ACKed platform changes under
  arch/arm/mach-at91 and its device trees. This
  has been coordinated with the AT91 maintainers
  to go in through the pinctrl tree.

- A larger chunk of changes to the SPEAr drivers
  and the addition of the "plgpio" driver for the
  SPEAr as well.

- The removal of the remnants of the Nomadik driver
  from the arch/arm tree and fusion of that into
  the Nomadik driver and platform data header files.

- Some local movement in the Marvell MVEBU drivers,
  these now have their own subdirectory.

- The addition of a chunk of code to gpiolib under
  drivers/gpio to register gpio-to-pin range mappings
  from the GPIO side of things. This has been
  requested by Grant Likely and is now implemented,
  it is particularly useful for device tree work.

Then we have incremental updates all over the place,
many of these are cleanups and fixes from Axel Lin
who has done a great job of removing minor mistakes
and compilation annoyances.


Axel Lin (25):
  pinctrl: nomadik: Add terminating entry for platform_device_id table
  pinctrl: at91: Staticize non-exported symbols
  pinctrl: exynos: Add terminating entry for of_device_id table
  pinctrl: u300: Staticize non-exported symbols
  pinctrl: sirf: Staticize non-exported symbol
  pinctrl: Staticize pinconf_ops
  pinctrl: lantiq: Remove ltq_pmx_disable() function
  pinctrl: lantiq: Staticize non-exported symbols
  pinctrl: pinmux: Release all taken pins in pinmux_enable_setting
error paths
  pinctrl: spear: Staticize non-exported symbols
  pinctrl: mxs: Make PINCTRL_MXS select PINMUX && PINCONF
  pinctrl: tegra: Make PINCTRL_TEGRA select PINMUX && PINCONF
  pinctrl: pxa3xx: Use devm_request_and_ioremap
  pinctrl: pxa3xx: Remove phy_base and phy_size from struct
pxa3xx_pinmux_info
  pinctrl: tegra: Staticize non-exported symbols
  pinctrl: imx: Fix the logic checking if not able to find pin reg map
  pinctrl: spear: Fix the logic of setting reg in
pmx_init_gpio_pingroup_addr
  pinctrl: coh901: Return proper error if irq_domain_add_linear() fails
  pinctrl: spear: Make get_gpio_pingroup return NULL when no
gpio_pingroup found
  pinctrl: plgpio: Call clk_disable_unprepare only if
clk_prepare_enable is called
  pinctrl: nomadik: Prevent NULL dereference if of_match_device returns NULL
  pinctrl: nomadik: Staticize non-exported symbols
  gpiolib: Fix use after free in gpiochip_add_pin_range
  pinctrl: Drop selecting PINCONF for MMP2, PXA168 and PXA910
  ARM: mmp: select PINCTRL for ARCH_MMP

Barry Song (1):
  pinctrl: sirf: enable the driver support new SiRFmarco SoC

Haojian Zhuang (3):
  pinctrl: single: dump pinmux register value
  pinctrl: generic: add input schmitt disable parameter
  pinctrl: single: support gpio request and free

Jean-Christophe PLAGNIOL-VILLARD (29):
  arm: at91: use macro to declare soc boot data
  ARM: at91: gpio: implement request
  at91: regroup gpio and pinctrl under the same ranges
  arm: at91: at91sam9x5: fix gpio number per bank
  ARM: at91: add dummies pinctrl for non dt platform
  ARM: at91: add pinctrl support
  arm: at91: dt: at91sam9 add pinctrl support
  arm: at91: dt: at91sam9 add serial pinctrl support
  tty: atmel_serial: add pinctrl support
  arm: at91: dt: sam9m10g45ek: use rts/cts pinctrl group for uart1
  arm: at91: dt: sam9263ek: use rts/cts pinctrl group for uart0
  arm: at91: dt: sam9g20ek: use rts/cts/dtr/dsr/dcd/ri pinctrl
group for uart0
  arm: at91: dt: at91sam9 add nand pinctrl support
  MTD: atmel_nand: add pinctrl consumer support
  pinctrl: at91: fix typo on PULL_UP
  gpio/at91: auto request and configure the pio as input when the
interrupt is used via DT
  pinctrl/:

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
>>> Hi Simon,

>>>
>>> If we use "/sys/devices/system/memory/soft_offline_page" to offline a
>>> free page, the value of mce_bad_pages is incremented. The page is then marked
>>> HWPoison, but it is still managed by the page buddy allocator.
>>>
>>> So if we offline it again, mce_bad_pages is incremented again (assuming
>>> the page is not allocated during this short window).
>>>
>>> soft_offline_page()
>>> get_any_page()
>>> "else if (is_free_buddy_page(p))" branch return 0
>>> "goto done";
>>> "atomic_long_add(1, &mce_bad_pages);"
>>>
>>> I think it would be better to move "if (PageHWPoison(page))" to the
>>> beginning of soft_offline_page(). However, I don't know what this comment
>>> means: "Synchronized using the page lock with memory_failure()"
> 
> Hi Xishi,
> 
> Unpoison clears the PG_hwpoison flag while holding the page lock;
> memory_failure() and soft_offline_page() take the lock to prevent unpoison
> from clearing the flag behind their backs.
> 
> Regards,
> Wanpeng Li 
> 

Hi Wanpeng,

As you say, it is necessary to take the page lock each time before checking
the HWPoison flag, in order to avoid this race, right?

So why not use a global lock here, the way lock_memory_hotplug() is used in
online_pages() and offline_pages()?

Thanks,
Xishi Qiu



RE: [PATCH] xen/swiotlb: Exchange to contiguous memory for map_sg hook

2012-12-10 Thread Xu, Dongxiao
> -Original Message-
> From: Konrad Rzeszutek Wilk [mailto:konrad.w...@oracle.com]
> Sent: Friday, December 07, 2012 10:09 PM
> To: Xu, Dongxiao
> Cc: xen-de...@lists.xen.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] xen/swiotlb: Exchange to contiguous memory for map_sg
> hook
> 
> On Thu, Dec 06, 2012 at 09:08:42PM +0800, Dongxiao Xu wrote:
> > While mapping sg buffers, checking to cross page DMA buffer is also
> > needed. If the guest DMA buffer crosses page boundary, Xen should
> > exchange contiguous memory for it.
> 
> So this is when we cross those 2MB contiguous swaths of buffers.
> Wouldn't we get the same problem with the 'map_page' call? If the driver tried
> to map, say, a 4MB DMA region?

Yes, it also needs such check, as I just replied to Jan's mail.

> 
> What if this check was done in the routines that provide the software static
> buffers, and we tried there to provide a nice contiguous swath of DMA pages?

Yes, this approach also came to our mind, but it requires modifying the
driver itself. If so, drivers could not use such static buffers (e.g., from
kmalloc()) for DMA even when the buffer is contiguous in the guest.
Is this acceptable for kernel/driver upstream?

Thanks,
Dongxiao

> 
> >
> > Besides, it is needed to backup the original page contents and copy it
> > back after memory exchange is done.
> >
> > This fixes issues if device DMA into software static buffers, and in
> > case the static buffer cross page boundary which pages are not
> > contiguous in real hardware.
> >
> > Signed-off-by: Dongxiao Xu 
> > Signed-off-by: Xiantao Zhang 
> > ---
> >  drivers/xen/swiotlb-xen.c |   47
> -
> >  1 files changed, 46 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > index 58db6df..e8f0cfb 100644
> > --- a/drivers/xen/swiotlb-xen.c
> > +++ b/drivers/xen/swiotlb-xen.c
> > @@ -461,6 +461,22 @@ xen_swiotlb_sync_single_for_device(struct device
> > *hwdev, dma_addr_t dev_addr,  }
> > EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_device);
> >
> > +static bool
> > +check_continguous_region(unsigned long vstart, unsigned long order) {
> > +   unsigned long prev_ma = xen_virt_to_bus((void *)vstart);
> > +   unsigned long next_ma;
> > +   int i;
> > +
> > +   for (i = 1; i < (1 << order); i++) {
> > +   next_ma = xen_virt_to_bus((void *)(vstart + i * PAGE_SIZE));
> > +   if (next_ma != prev_ma + PAGE_SIZE)
> > +   return false;
> > +   prev_ma = next_ma;
> > +   }
> > +   return true;
> > +}
> > +
> >  /*
> >   * Map a set of buffers described by scatterlist in streaming mode for
> DMA.
> >   * This is the scatter-gather version of the above
> > xen_swiotlb_map_page @@ -489,7 +505,36 @@
> > xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist
> > *sgl,
> >
> > for_each_sg(sgl, sg, nelems, i) {
> > phys_addr_t paddr = sg_phys(sg);
> > -   dma_addr_t dev_addr = xen_phys_to_bus(paddr);
> > +   unsigned long vstart, order;
> > +   dma_addr_t dev_addr;
> > +
> > +   /*
> > +* While mapping sg buffers, checking to cross page DMA buffer
> > +* is also needed. If the guest DMA buffer crosses page
> > +* boundary, Xen should exchange contiguous memory for it.
> > +* Besides, it is needed to backup the original page contents
> > +* and copy it back after memory exchange is done.
> > +*/
> > +   if (range_straddles_page_boundary(paddr, sg->length)) {
> > +   vstart = (unsigned long)__va(paddr & PAGE_MASK);
> > +   order = get_order(sg->length + (paddr & ~PAGE_MASK));
> > +   if (!check_continguous_region(vstart, order)) {
> > +   unsigned long buf;
> > +   buf = __get_free_pages(GFP_KERNEL, order);
> > +   memcpy((void *)buf, (void *)vstart,
> > +   PAGE_SIZE * (1 << order));
> > +   if (xen_create_contiguous_region(vstart, order,
> > +   fls64(paddr))) {
> > +   free_pages(buf, order);
> > +   return 0;
> > +   }
> > +   memcpy((void *)vstart, (void *)buf,
> > +   PAGE_SIZE * (1 << order));
> > +   free_pages(buf, order);
> > +   }
> > +   }
> > +
> > +   dev_addr = xen_phys_to_bus(paddr);
> >
> > if (swiotlb_force ||
> > !dma_capable(hwdev, dev_addr, sg->length) ||
> > --
> > 1.7.1
> >

Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-10 Thread H. Peter Anvin

On 12/10/2012 07:55 PM, Yinghai Lu wrote:


And my suggestion is: after scanning and finding the ucode, save it to the
BRK area, so there is no need to adjust the pointer again, nor to copy the
blob and update the pointer again.



That doesn't work if the microcode is replaced at runtime.  However, 
vmalloc doesn't work either since 32 bits needs any one blob to be 
physically contiguous.  I have suggested Fenghua replace it with a 
linked list of kmalloc areas, one for each blob.


-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [PATCH v2] staging: rtl8712: avoid a useless call to memset().

2012-12-10 Thread Dan Carpenter
On Tue, Dec 11, 2012 at 01:20:48AM +0100, Cyril Roelandt wrote:
> In r8711_wx_get_wap(), make sure we do not call memcpy() on a memory area that
> has just been zeroed by a call to memset().
> 

Acked-by: Dan Carpenter 



Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-10 Thread Preeti U Murthy
On 12/11/2012 10:58 AM, Alex Shi wrote:
> On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
>> Hi Alex,
>>
>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>> It is impossible to miss a task's allowed CPU in an eligible group.
>>
>> The one thing I am concerned with here is if there is a possibility of
>> the task changing its tsk_cpus_allowed() while this code is running.
>>
>> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
>> for the task changes,perhaps by the user himself,which might not include
>> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
>> a race condition in short.Then we might not have an eligible cpu in that
>> group right?
> 
> Your worry makes sense, but the code handles this situation: in
> select_task_rq(), the allowed CPUs are checked again. If the answer is
> no, it falls back to the old cpu.
>>
>>> And since find_idlest_group() only returns a different group which
>>> excludes the old cpu, it is also impossible to find a new cpu that is
>>> the same as the old cpu.

I doubt this will work correctly. Consider the following situation: the
sched domain walk begins with an sd that encloses both socket1 and socket2

cpu0 cpu1  | cpu2 cpu3
---|-
 socket1   |  socket2

old cpu = cpu1

Iteration1:
1.find_idlest_group() returns socket2 to be idlest.
2.task changes tsk_allowed_cpus to 0,1
3.find_idlest_cpu() returns cpu2

* without your patch
   1. The condition after find_idlest_cpu() returns -1, and sd->child is
chosen, which happens to be socket1.
   2. In the next iteration, find_idlest_group() and find_idlest_cpu()
will probably choose cpu0, which happens to be idler than cpu1 and is
in tsk_allowed_cpus.

* with your patch
   1. The condition after find_idlest_cpu() does not exist, therefore
a sched domain to which cpu2 belongs is chosen; this is socket2 (under
the for_each_domain() loop).
   2. In the next iteration, find_idlest_group() returns NULL, because
there is no cpu which intersects with tsk_allowed_cpus.
   3. In select_task_rq(), the fallback cpu is chosen even though an idle
cpu existed.

So my concern is that even though select_task_rq() checks
tsk_cpus_allowed(), you might end up choosing a different path through the
sched_domains compared to the code without this patch, as shown above.

In short, without the "if (new_cpu == -1)" condition we might be misled
into doing unnecessary iterations over the wrong sched domains in
select_task_rq_fair(). (Think about situations where not all the cpus of
socket2 are disallowed by the task; then there will be even more iterations
on the wrong path of sched_domains before exit, compared to what is shown
above.)

Regards
Preeti U Murthy




Re: [PATCH] smpboot: calling smpboot_register_percpu_thread is unsafe during one CPU being down

2012-12-10 Thread Srivatsa S. Bhat
On 12/11/2012 07:47 PM, Chuansheng Liu wrote:
> 
> When one CPU is going down and smpboot_register_percpu_thread() is called
> concurrently, there is the race below:
> T1 (CPUA):                          T2 (CPUB):
> _cpu_down()                         smpboot_register_percpu_thread()
>   smpboot_park_threads()              ...
>   __stop_machine()                    __smpboot_create_thread(CPU_Dying)
>     take_cpu_down()                   [the CPU going down is still online]
>       __cpu_disable()                 smpboot_unpark_thread(CPU_Dying)
>         native_cpu_disable()          [the new kthread can now start
>     set_cpu_online(cpu, false)         running on the dying CPU]
>     cpu_notify(CPU_DYING)
> 
> After the CPU_DYING notification, the newly created kthread for the dying
> CPU will be migrated to another CPU in migration_call().
> 
> Hence we need to wrap the call to smpboot_register_percpu_thread() in
> get_online_cpus()/put_online_cpus().
> 
> Signed-off-by: liu chuansheng 

Reviewed-by: Srivatsa S. Bhat 

Regards,
Srivatsa S. Bhat

> ---
>  kernel/smpboot.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/smpboot.c b/kernel/smpboot.c
> index d6c5fc0..3fe708a 100644
> --- a/kernel/smpboot.c
> +++ b/kernel/smpboot.c
> @@ -266,6 +266,7 @@ int smpboot_register_percpu_thread(struct 
> smp_hotplug_thread *plug_thread)
>   unsigned int cpu;
>   int ret = 0;
> 
> + get_online_cpus();
>   mutex_lock(&smpboot_threads_lock);
>   for_each_online_cpu(cpu) {
>   ret = __smpboot_create_thread(plug_thread, cpu);
> @@ -278,6 +279,7 @@ int smpboot_register_percpu_thread(struct 
> smp_hotplug_thread *plug_thread)
>   list_add(&plug_thread->list, &hotplug_threads);
>  out:
>   mutex_unlock(&smpboot_threads_lock);
> + put_online_cpus();
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
> 



RE: [Xen-devel] [PATCH] xen/swiotlb: Exchange to contiguous memory for map_sg hook

2012-12-10 Thread Xu, Dongxiao
> -Original Message-
> From: Jan Beulich [mailto:jbeul...@suse.com]
> Sent: Thursday, December 06, 2012 9:38 PM
> To: Xu, Dongxiao
> Cc: xen-de...@lists.xen.org; konrad.w...@oracle.com;
> linux-kernel@vger.kernel.org
> Subject: Re: [Xen-devel] [PATCH] xen/swiotlb: Exchange to contiguous memory
> for map_sg hook
> 
> >>> On 06.12.12 at 14:08, Dongxiao Xu  wrote:
> > While mapping sg buffers, checking to cross page DMA buffer is also
> > needed. If the guest DMA buffer crosses page boundary, Xen should
> > exchange contiguous memory for it.
> >
> > Besides, it is needed to backup the original page contents and copy it
> > back after memory exchange is done.
> >
> > This fixes issues if device DMA into software static buffers, and in
> > case the static buffer cross page boundary which pages are not
> > contiguous in real hardware.
> >
> > Signed-off-by: Dongxiao Xu 
> > Signed-off-by: Xiantao Zhang 
> > ---
> >  drivers/xen/swiotlb-xen.c |   47
> > -
> >  1 files changed, 46 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
> > index 58db6df..e8f0cfb 100644
> > --- a/drivers/xen/swiotlb-xen.c
> > +++ b/drivers/xen/swiotlb-xen.c
> > @@ -461,6 +461,22 @@ xen_swiotlb_sync_single_for_device(struct device
> > *hwdev, dma_addr_t dev_addr,  }
> > EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_device);
> >
> > +static bool
> > +check_continguous_region(unsigned long vstart, unsigned long order)
> 
> check_continguous_region(unsigned long vstart, unsigned int order)
> 
> But - why do you need to do this check order based in the first place? 
> Checking
> the actual length of the buffer should suffice.

Thanks, the word "continguous" is mistyped in the function, it should be 
"contiguous".
  
The check_contiguous_region() function is used to check whether the pages are 
contiguous in hardware. The length only indicates whether the buffer crosses 
a page boundary. If the buffer crosses pages and they are not contiguous in 
hardware, we do need to exchange memory in Xen.

> 
> > +{
> > +   unsigned long prev_ma = xen_virt_to_bus((void *)vstart);
> > +   unsigned long next_ma;
> 
> phys_addr_t or some such for both of them.

Thanks.
Should be dma_addr_t?

> 
> > +   int i;
> 
> unsigned long

Thanks.

> 
> > +
> > +   for (i = 1; i < (1 << order); i++) {
> 
> 1UL

Thanks.

> 
> > +   next_ma = xen_virt_to_bus((void *)(vstart + i * PAGE_SIZE));
> > +   if (next_ma != prev_ma + PAGE_SIZE)
> > +   return false;
> > +   prev_ma = next_ma;
> > +   }
> > +   return true;
> > +}
> > +
> >  /*
> >   * Map a set of buffers described by scatterlist in streaming mode for
> DMA.
> >   * This is the scatter-gather version of the above
> > xen_swiotlb_map_page @@ -489,7 +505,36 @@
> > xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist
> > *sgl,
> >
> > for_each_sg(sgl, sg, nelems, i) {
> > phys_addr_t paddr = sg_phys(sg);
> > -   dma_addr_t dev_addr = xen_phys_to_bus(paddr);
> > +   unsigned long vstart, order;
> > +   dma_addr_t dev_addr;
> > +
> > +   /*
> > +* While mapping sg buffers, checking to cross page DMA buffer
> > +* is also needed. If the guest DMA buffer crosses page
> > +* boundary, Xen should exchange contiguous memory for it.
> > +* Besides, it is needed to backup the original page contents
> > +* and copy it back after memory exchange is done.
> > +*/
> > +   if (range_straddles_page_boundary(paddr, sg->length)) {
> > +   vstart = (unsigned long)__va(paddr & PAGE_MASK);
> > +   order = get_order(sg->length + (paddr & ~PAGE_MASK));
> > +   if (!check_continguous_region(vstart, order)) {
> > +   unsigned long buf;
> > +   buf = __get_free_pages(GFP_KERNEL, order);
> > +   memcpy((void *)buf, (void *)vstart,
> > +   PAGE_SIZE * (1 << order));
> > +   if (xen_create_contiguous_region(vstart, order,
> > +   fls64(paddr))) {
> > +   free_pages(buf, order);
> > +   return 0;
> > +   }
> > +   memcpy((void *)vstart, (void *)buf,
> > +   PAGE_SIZE * (1 << order));
> > +   free_pages(buf, order);
> > +   }
> > +   }
> > +
> > +   dev_addr = xen_phys_to_bus(paddr);
> >
> > if (swiotlb_force ||
> > !dma_capable(hwdev, dev_addr, sg->length) ||
> 
> How about swiotlb_map_page() (for the compound page case)?

Yes! This also needs similar handling.

One thing needs further consideration is that, the above approach introdu

[GIT PULL] please pull infiniband.git

2012-12-10 Thread Roland Dreier
Hi Linus,

Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git 
tags/rdma-for-linus



First batch of InfiniBand/RDMA changes for the 3.8 merge window:
 - A good chunk of Bart Van Assche's SRP fixes
 - UAPI disintegration from David Howells
 - mlx4 support for "64-byte CQE" hardware feature from Or Gerlitz
 - Other miscellaneous fixes


Alan Cox (2):
  IB/ipath: Remove unreachable code
  RDMA/amso1100: Fix missing break

Bart Van Assche (14):
  IB/srp: Increase block layer timeout
  IB/srp: Eliminate state SRP_TARGET_CONNECTING
  IB/srp: Keep processing commands during host removal
  IB/srp: Simplify SCSI error handling
  IB/srp: Introduce srp_handle_qp_err()
  IB/srp: Process all error completions
  IB/srp: Suppress superfluous error messages
  IB/srp: Introduce the helper function srp_remove_target()
  IB/srp: Eliminate state SRP_TARGET_DEAD
  IB/srp: Document sysfs attributes
  srp_transport: Fix attribute registration
  srp_transport: Simplify attribute initialization code
  srp_transport: Document sysfs attributes
  IB/srp: Allow SRP disconnect through sysfs

David Howells (1):
  UAPI: (Scripted) Disintegrate include/rdma

Ishai Rabinovitz (1):
  IB/srp: destroy and recreate QP and CQs when reconnecting

Jack Morgenstein (2):
  IB/mlx4: Fix spinlock order to avoid lockdep warnings
  mlx4_core: Fix potential deadlock in mlx4_eq_int()

Julia Lawall (3):
  RDMA/nes: Use WARN()
  RDMA/cxgb4: use WARN
  RDMA/cxgb3: use WARN

Or Gerlitz (1):
  mlx4: 64-byte CQE/EQE support

Roland Dreier (4):
  Merge branches 'cxgb4', 'misc', 'mlx4', 'nes' and 'uapi' into for-next
  Merge branches 'cma' and 'mlx4' into for-next
  Merge branch 'srp' into for-next
  Merge branch 'nes' into for-next

Tatyana Nikolova (7):
  RDMA/nes: Fix incorrect address of IP header
  RDMA/nes: Fix for unlinking skbs from empty list
  RDMA/nes: Fix for sending fpdus in order to hardware
  RDMA/nes: Fix for incorrect multicast address in the perfect filter table
  RDMA/nes: Fix for BUG_ON due to adding already-pending timer
  RDMA/nes: Fix for terminate timer crash
  RDMA/nes: Fix for crash when registering zero length MR for CQ

Vu Pham (1):
  IB/srp: send disconnect request without waiting for CM timewait exit

shefty (1):
  RDMA/cm: Change return value from find_gid_port()

 Documentation/ABI/stable/sysfs-driver-ib_srp   |  156 
 Documentation/ABI/stable/sysfs-transport-srp   |   19 ++
 drivers/infiniband/core/cma.c  |9 +-
 drivers/infiniband/hw/amso1100/c2_ae.c |1 +
 drivers/infiniband/hw/cxgb3/iwch_cm.c  |6 +-
 drivers/infiniband/hw/cxgb4/cm.c   |6 +-
 drivers/infiniband/hw/ipath/ipath_init_chip.c  |   10 -
 drivers/infiniband/hw/mlx4/cm.c|4 +-
 drivers/infiniband/hw/mlx4/cq.c|   34 ++-
 drivers/infiniband/hw/mlx4/main.c  |   27 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h   |1 +
 drivers/infiniband/hw/mlx4/user.h  |   12 +-
 drivers/infiniband/hw/nes/nes.h|1 +
 drivers/infiniband/hw/nes/nes_cm.c |   32 +--
 drivers/infiniband/hw/nes/nes_hw.c |9 +-
 drivers/infiniband/hw/nes/nes_mgt.c|   42 ++--
 drivers/infiniband/hw/nes/nes_nic.c|   13 +-
 drivers/infiniband/hw/nes/nes_verbs.c  |9 +-
 drivers/infiniband/ulp/srp/ib_srp.c|  314 ++--
 drivers/infiniband/ulp/srp/ib_srp.h|   11 +-
 drivers/net/ethernet/mellanox/mlx4/cmd.c   |   11 +-
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |5 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |5 +-
 drivers/net/ethernet/mellanox/mlx4/eq.c|   36 ++-
 drivers/net/ethernet/mellanox/mlx4/fw.c|   30 ++-
 drivers/net/ethernet/mellanox/mlx4/fw.h|1 +
 drivers/net/ethernet/mellanox/mlx4/main.c  |   38 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |1 +
 drivers/scsi/scsi_transport_srp.c  |   51 ++--
 include/linux/mlx4/device.h|   21 ++
 include/rdma/Kbuild|6 -
 include/rdma/rdma_netlink.h|   36 +--
 include/scsi/scsi_transport_srp.h  |8 +
 include/uapi/rdma/Kbuild   |6 +
 include/{ => uapi}/rdma/ib_user_cm.h   |0
 include/{ => uapi}/rdma/ib_user_mad.h  |0
 include/{ => uapi}/rdma/ib_user_sa.h   |0
 include/{ => uapi}/rdma/ib_user_verbs.h|0
 include/uapi/rdma/rdma_netlink.h   |   37 +++
 inc

[GIT PULL] hwmon updates for 3.8-rc1

2012-12-10 Thread Guenter Roeck
Hi Linus,

Please pull hwmon updates for Linux 3.8-rc1 from signed tag:

git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git 
hwmon-for-linus

Thanks,
Guenter
--

The following changes since commit 9489e9dcae718d5fde988e4a684a0f55b5f94d17:

  Linux 3.7-rc7 (2012-11-25 17:59:19 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git 
tags/hwmon-for-linus

for you to fetch changes up to 44f751cee1b4baef9e3b49c6bd954f8b12b097a6:

  hwmon: (da9055) Fix chan_mux[DA9055_ADC_ADCIN3] setting (2012-12-05 10:55:55 
-0800)


New driver: DA9055
Added/improved support for new chips in existing drivers: Z650/670, N550/570,
ADS7830, AMD 16h family


Ashish Jangam (1):
  hwmon: DA9055 HWMON driver

Axel Lin (2):
  hwmon: da9052: Use da9052_reg_update for rmw operations
  hwmon: (da9055) Fix chan_mux[DA9055_ADC_ADCIN3] setting

Boris Ostrovsky (1):
  x86,AMD: Power driver support for AMD's family 16h processors

Guenter Roeck (4):
  hwmon: (coretemp) Drop dependency on PCI for TjMax detection on Atom CPUs
  hwmon: (coretemp) Use model table instead of if/else to identify CPU 
models
  hwmon: (coretemp) Drop N4xx, N5xx, D4xx, D5xx CPUs from tjmax table
  hwmon: (coretemp) List TjMax for Z650/670 and N550/570

Guillaume Roguez (1):
  hwmon: (ads7828) add support for ADS7830

Vivien Didelot (1):
  hwmon: (ads7828) driver cleanup

Wei Yongjun (1):
  hwmon: (ina2xx) use module_i2c_driver to simplify the code

 Documentation/hwmon/ads7828   |   46 +++--
 Documentation/hwmon/coretemp  |2 +
 Documentation/hwmon/da9055|   47 +
 drivers/hwmon/Kconfig |   19 +-
 drivers/hwmon/Makefile|1 +
 drivers/hwmon/ads7828.c   |  247 ++--
 drivers/hwmon/coretemp.c  |   60 +++---
 drivers/hwmon/da9052-hwmon.c  |   27 +--
 drivers/hwmon/da9055-hwmon.c  |  336 +
 drivers/hwmon/fam15h_power.c  |4 +
 drivers/hwmon/ina2xx.c|   13 +-
 include/linux/platform_data/ads7828.h |   29 +++
 12 files changed, 607 insertions(+), 224 deletions(-)
 create mode 100644 Documentation/hwmon/da9055
 create mode 100644 drivers/hwmon/da9055-hwmon.c
 create mode 100644 include/linux/platform_data/ads7828.h




Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/11 11:48, Simon Jeons wrote:

> On Tue, 2012-12-11 at 04:19 +0100, Andi Kleen wrote:
>> On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote:
>>> On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote:
> Oh, it will be putback to lru list during migration. So does your "some
> time" mean before call check_new_page?

 Yes until the next check_new_page() whenever that is. If the migration
 works it will be earlier, otherwise later.
>>>
>>> But I can't figure out any page reclaim path check if the page is set
>>> PG_hwpoison, can poisoned pages be rclaimed?
>>
>> The only way to reclaim a page is to free and reallocate it.
> 
> Then why there doesn't have check in reclaim path to avoid relcaim
> poisoned page?
> 
>   -Simon

Hi Simon,

If the page is free, it will be marked PG_hwpoison, and soft_offline_page() is
done. When the page is allocated later, check_new_page() will find the poisoned
page and isolate the whole buddy block (just drop the block).

If the page is not free, soft_offline_page() tries to free it first; if that
fails, it migrates the page, but the page is still in the LRU list after
migration:
migrate_pages()
  unmap_and_move()
    if (rc != -EAGAIN) {
    ...
    putback_lru_page(page);
    }
We can use lru_add_drain_all() to drain the lru pagevec; at last
free_hot_cold_page() will be called, and free_pages_prepare() checks the
poisoned page:
free_pages_prepare()
  free_pages_check()
    bad_page()
Is this right, Andi?

Thanks
Xishi Qiu

>>
>> -Andi
> 
> 
> 
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 02/18] sched: fix find_idlest_group mess logical

2012-12-10 Thread Preeti U Murthy
Hi Alex,
On 12/11/2012 10:59 AM, Alex Shi wrote:
> On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
>> Hi Alex,
>>
>> On 12/10/2012 01:52 PM, Alex Shi wrote:
>>> There is 4 situations in the function:
>>> 1, no task allowed group;
>>> so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>>> 2, only local group task allowed;
>>> so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>>> 3, only non-local task group allowed;
>>> so min_load assigned, this_load = 0, idlest != NULL
>>> 4, local group + another group are task allowed.
>>> so min_load assigned, this_load assigned, idlest != NULL
>>>
>>> Current logical will return NULL in first 3 kinds of scenarios.
>>> And still return NULL, if idlest group is heavier then the
>>> local group in the 4th situation.
>>>
>>> Actually, I thought groups in situation 2,3 are also eligible to host
>>> the task. And in 4th situation, agree to bias toward local group.
>>> So, has this patch.
>>>
>>> Signed-off-by: Alex Shi 
>>> ---
>>>  kernel/sched/fair.c |   12 +---
>>>  1 files changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index df99456..b40bc2b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct 
>>> task_struct *p,
>>>   int this_cpu, int load_idx)
>>>  {
>>> struct sched_group *idlest = NULL, *group = sd->groups;
>>> +   struct sched_group *this_group = NULL;
>>> unsigned long min_load = ULONG_MAX, this_load = 0;
>>> int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>>  
>>> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct 
>>> task_struct *p,
>>>  
>>> if (local_group) {
>>> this_load = avg_load;
>>> -   } else if (avg_load < min_load) {
>>> +   this_group = group;
>>> +   }
>>> +   if (avg_load < min_load) {
>>> min_load = avg_load;
>>> idlest = group;
>>> }
>>> } while (group = group->next, group != sd->groups);
>>>  
>>> -   if (!idlest || 100*this_load < imbalance*min_load)
>>> -   return NULL;
>>> +   if (this_group && idlest != this_group)
>>> +   /* Bias toward our group again */
>>> +   if (100*this_load < imbalance*min_load)
>>> +   idlest = this_group;
>>
>> If the idlest group is heavier than this_group(or to put it better if
>> the difference in the loads of the local group and idlest group is less
>> than a threshold,it means there is no point moving the load from the
>> local group) you return NULL,that immediately means this_group is chosen
>> as the candidate group for the task to run,one does not have to
>> explicitly return that.
> 
> In situation 4, this_group is not NULL.

True. The return value of find_idlest_group() indicates that there is no idler
group than the local group (the group to which the cpu belongs); it does not
indicate that there is no host group for the task. If this is the case,
select_task_rq_fair() falls back to the group (sd->child) to which the cpu
chosen in the previous iteration belongs, which is nothing but this_group in
the current iteration.

Regards
Preeti U Murthy



Re: [PATCH v2 1/4] kprobes/powerpc: Do not disable External interrupts during single step

2012-12-10 Thread Suzuki K. Poulose

On 12/03/2012 08:37 PM, Suzuki K. Poulose wrote:

From: Suzuki K. Poulose 

External/Decrementer exceptions have lower priority than the Debug Exception,
so we don't have to disable External interrupts before a single step.
However, on BookE, the Critical Input Exception (CE) has higher priority than
the Debug Exception. Hence we mask it.

Signed-off-by:  Suzuki K. Poulose 
Cc: Sebastian Andrzej Siewior 
Cc: Ananth N Mavinakaynahalli 
Cc: Kumar Gala 
Cc: linuxppc-...@ozlabs.org
---
  arch/powerpc/kernel/kprobes.c |   10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index e88c643..4901b34 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -104,13 +104,13 @@ void __kprobes arch_remove_kprobe(struct kprobe *p)

  static void __kprobes prepare_singlestep(struct kprobe *p, struct pt_regs 
*regs)
  {
-   /* We turn off async exceptions to ensure that the single step will
-* be for the instruction we have the kprobe on, if we dont its
-* possible we'd get the single step reported for an exception handler
-* like Decrementer or External Interrupt */
-   regs->msr &= ~MSR_EE;
regs->msr |= MSR_SINGLESTEP;
  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
+   /*
+* We turn off Critical Input Exception(CE) to ensure that the single
+* step will be for the instruction we have the probe on; if we don't,
+* it is possible we'd get the single step reported for CE.
+*/
regs->msr &= ~MSR_CE;
mtspr(SPRN_DBCR0, mfspr(SPRN_DBCR0) | DBCR0_IC | DBCR0_IDM);
  #ifdef CONFIG_PPC_47x



Ben, Kumar,

Could you please review this patch ?


Thanks
Suzuki



[PATCH 2/2] net: remove obsolete simple_strto

2012-12-10 Thread Abhijit Pawar
This patch removes the redundant occurrences of simple_strto

Signed-off-by: Abhijit Pawar 
---
 net/core/netpoll.c|1 -
 net/mac80211/debugfs_sta.c|1 -
 net/netfilter/nf_conntrack_core.c |1 -
 3 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 12c129f..3151acf 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -706,7 +706,6 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
*delim = 0;
if (*cur == ' ' || *cur == '\t')
np_info(np, "warning: whitespace is not allowed\n");
-   np->remote_port = simple_strtol(cur, NULL, 10);
if (kstrtou16(cur, 10, &np->remote_port))
goto parse_failed;
cur = delim;
diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
index 0dedb4b..6fb1168 100644
--- a/net/mac80211/debugfs_sta.c
+++ b/net/mac80211/debugfs_sta.c
@@ -220,7 +220,6 @@ static ssize_t sta_agg_status_write(struct file *file, 
const char __user *userbu
} else
return -EINVAL;
 
-   tid = simple_strtoul(buf, NULL, 0);
ret = kstrtoul(buf, 0, &tid);
if (ret)
return ret;
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 37d9e62..08cdc71 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1422,7 +1422,6 @@ int nf_conntrack_set_hashsize(const char *val, struct 
kernel_param *kp)
if (!nf_conntrack_htable_size)
return param_set_uint(val, kp);
 
-   hashsize = simple_strtoul(val, NULL, 0);
rc = kstrtouint(val, 0, &hashsize);
if (rc)
return rc;
-- 
1.7.7.6



Re: [PATCH RESEND RESEND] net: remove obsolete simple_strto

2012-12-10 Thread Abhijit Pawar
On 12/11/2012 10:19 AM, David Miller wrote:
> From: Abhijit Pawar 
> Date: Tue, 11 Dec 2012 09:04:20 +0530
> 
>> This patch replace the obsolete simple_strto with kstrto
>>
>> Signed-off-by: Abhijit Pawar 
> 
> You can't submit replacement patches for ones which I have already
> applied.
> 
> Patches I apply are permanently applied, and therefore you must submit
> changes relative the ones I've applied already.
> 
I am sorry for creating this confusion. I have created and sent a new
patch which you can apply over the old one to fix the issues.

-- 
-
Abhijit


Re: [PATCH 02/18] sched: fix find_idlest_group mess logical

2012-12-10 Thread Alex Shi
On 12/11/2012 01:08 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> There is 4 situations in the function:
>> 1, no task allowed group;
>>  so min_load = ULONG_MAX, this_load = 0, idlest = NULL
>> 2, only local group task allowed;
>>  so min_load = ULONG_MAX, this_load assigned, idlest = NULL
>> 3, only non-local task group allowed;
>>  so min_load assigned, this_load = 0, idlest != NULL
>> 4, local group + another group are task allowed.
>>  so min_load assigned, this_load assigned, idlest != NULL
>>
>> Current logical will return NULL in first 3 kinds of scenarios.
>> And still return NULL, if idlest group is heavier then the
>> local group in the 4th situation.
>>
>> Actually, I thought groups in situation 2,3 are also eligible to host
>> the task. And in 4th situation, agree to bias toward local group.
>> So, has this patch.
>>
>> Signed-off-by: Alex Shi 
>> ---
>>  kernel/sched/fair.c |   12 +---
>>  1 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index df99456..b40bc2b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct 
>> task_struct *p,
>>int this_cpu, int load_idx)
>>  {
>>  struct sched_group *idlest = NULL, *group = sd->groups;
>> +struct sched_group *this_group = NULL;
>>  unsigned long min_load = ULONG_MAX, this_load = 0;
>>  int imbalance = 100 + (sd->imbalance_pct-100)/2;
>>  
>> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct 
>> task_struct *p,
>>  
>>  if (local_group) {
>>  this_load = avg_load;
>> -} else if (avg_load < min_load) {
>> +this_group = group;
>> +}
>> +if (avg_load < min_load) {
>>  min_load = avg_load;
>>  idlest = group;
>>  }
>>  } while (group = group->next, group != sd->groups);
>>  
>> -if (!idlest || 100*this_load < imbalance*min_load)
>> -return NULL;
>> +if (this_group && idlest != this_group)
>> +/* Bias toward our group again */
>> +if (100*this_load < imbalance*min_load)
>> +idlest = this_group;
> 
> If the idlest group is heavier than this_group(or to put it better if
> the difference in the loads of the local group and idlest group is less
> than a threshold,it means there is no point moving the load from the
> local group) you return NULL,that immediately means this_group is chosen
> as the candidate group for the task to run,one does not have to
> explicitly return that.

In situation 4, this_group is not NULL.
> 
> Let me explain:
> find_idlest_group()-if it returns NULL to mark your case4,it means there
> is no idler group than the group to which this_cpu belongs to, at that
> level of sched domain.Which is fair enough.
> 
> So now the question is under such a circumstance which is the idlest
> group so far.It is the group containing this_cpu,i.e.this_group.After
> this sd->child is chosen which is nothing but this_group(sd hierarchy
> moves towards the cpu it belongs to). Again here the idlest group search
> begins.
> 
>> +
>>  return idlest;
>>  }
>>  
>>
> Regards
> Preeti U Murthy
> 


-- 
Thanks
Alex


Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-10 Thread Alex Shi
On 12/11/2012 12:23 PM, Preeti U Murthy wrote:
> Hi Alex,
> 
> On 12/10/2012 01:52 PM, Alex Shi wrote:
>> It is impossible to miss a task allowed cpu in a eligible group.
> 
> The one thing I am concerned with here is if there is a possibility of
> the task changing its tsk_cpus_allowed() while this code is running.
> 
> i.e find_idlest_group() finds an idle group,then the tsk_cpus_allowed()
> for the task changes,perhaps by the user himself,which might not include
> the cpus in the idle group.After this find_idlest_cpu() is called.I mean
> a race condition in short.Then we might not have an eligible cpu in that
> group right?

your worry make sense, but the code handle the situation, in
select_task_rq(), it will check the cpu allowed again. if the answer is
no, it will fallback to old cpu.
> 
>> And since find_idlest_group only return a different group which
>> excludes old cpu, it's also imporissible to find a new cpu same as old
>> cpu.
> 
> This I agree with.
> 
>> Signed-off-by: Alex Shi 
>> ---
>>  kernel/sched/fair.c |5 -
>>  1 files changed, 0 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 59e072b..df99456 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int 
>> sd_flag, int wake_flags)
>>  }
>>  
>>  new_cpu = find_idlest_cpu(group, p, cpu);
>> -if (new_cpu == -1 || new_cpu == cpu) {
>> -/* Now try balancing at a lower domain level of cpu */
>> -sd = sd->child;
>> -continue;
>> -}
>>  
>>  /* Now try balancing at a lower domain level of new_cpu */
>>  cpu = new_cpu;
>>
> Regards
> Preeti U Murthy
> 


-- 
Thanks
Alex


linux-next: manual merge of the akpm tree with Linus' tree

2012-12-10 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in
include/linux/gfp.h between commit caf491916b1c ("Revert "revert "Revert
"mm: remove __GFP_NO_KSWAPD""" and associated damage") from Linus' tree
and commit "mm: add a reminder comment for __GFP_BITS_SHIFT" from the
akpm tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc include/linux/gfp.h
index 976a8e3,c0fb4d8..000
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@@ -30,11 -30,11 +30,12 @@@ struct vm_area_struct
  #define ___GFP_HARDWALL   0x2u
  #define ___GFP_THISNODE   0x4u
  #define ___GFP_RECLAIMABLE0x8u
 -#define ___GFP_NOTRACK0x10u
 -#define ___GFP_OTHER_NODE 0x20u
 -#define ___GFP_WRITE  0x40u
 -#define ___GFP_KMEMCG 0x80u
 +#define ___GFP_NOTRACK0x20u
 +#define ___GFP_NO_KSWAPD  0x40u
 +#define ___GFP_OTHER_NODE 0x80u
 +#define ___GFP_WRITE  0x100u
 +#define ___GFP_KMEMCG 0x200u
+ /* If the above are modified, __GFP_BITS_SHIFT may need updating */
  
  /*
   * GFP bitmasks..




linux-next: manual merge of the akpm tree with Linus' tree

2012-12-10 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm tree got a conflict in
include/linux/gfp.h between commit caf491916b1c ("Revert "revert "Revert
"mm: remove __GFP_NO_KSWAPD""" and associated damage") from Linus' tree
and commit "mm: add a __GFP_KMEMCG flag" from the akpm tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au

diff --cc include/linux/gfp.h
index 31e8041,5520344..000
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@@ -30,10 -30,10 +30,11 @@@ struct vm_area_struct
  #define ___GFP_HARDWALL   0x2u
  #define ___GFP_THISNODE   0x4u
  #define ___GFP_RECLAIMABLE0x8u
 -#define ___GFP_NOTRACK0x10u
 -#define ___GFP_OTHER_NODE 0x20u
 -#define ___GFP_WRITE  0x40u
 -#define ___GFP_KMEMCG 0x80u
 +#define ___GFP_NOTRACK0x20u
 +#define ___GFP_NO_KSWAPD  0x40u
 +#define ___GFP_OTHER_NODE 0x80u
 +#define ___GFP_WRITE  0x100u
++#define ___GFP_KMEMCG 0x200u
  
  /*
   * GFP bitmasks..
@@@ -86,17 -86,16 +87,17 @@@
  #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is 
reclaimable */
  #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK)  /* Don't track with 
kmemcheck */
  
 +#define __GFP_NO_KSWAPD   ((__force gfp_t)___GFP_NO_KSWAPD)
  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of 
other node */
  #define __GFP_WRITE   ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to 
dirty page */
- 
+ #define __GFP_KMEMCG  ((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from 
a memcg-accounted resource */
  /*
   * This may seem redundant, but it's a way of annotating false positives vs.
   * allocations that simply cannot be supported (e.g. page tables).
   */
  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
  
- #define __GFP_BITS_SHIFT 25   /* Room for N __GFP_FOO bits */
 -#define __GFP_BITS_SHIFT 24   /* Room for N __GFP_FOO bits */
++#define __GFP_BITS_SHIFT 26   /* Room for N __GFP_FOO bits */
  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
  
  /* This equals 0, but use constants in case they ever change */




[PATCH] smpboot: calling smpboot_register_percpu_thread is unsafe during one CPU being down

2012-12-10 Thread Chuansheng Liu

When one CPU is going down, and smpboot_register_percpu_thread is called,
there is the race issue below:
T1 (CPUA):                      T2 (CPUB):
_cpu_down()                     smpboot_register_percpu_thread()
  smpboot_park_threads()          ...
  __stop_machine()                __smpboot_create_thread(CPU_Dying)
                                  [currently, the CPU being taken down
                                   is still online]
  take_cpu_down()                 smpboot_unpark_thread(CPU_Dying)
    __cpu_disable()
      native_cpu_disable()        Here the new kthread will get running
                                  based on the CPU_Dying
    set_cpu_online(cpu, false)

    cpu_notify(CPU_DYING)

After CPU_DYING is notified, the newly created kthread for the dying CPU
will be migrated to another CPU in migration_call().

Here we need to use get_online_cpus()/put_online_cpus() when calling
smpboot_register_percpu_thread().

Signed-off-by: liu chuansheng 
---
 kernel/smpboot.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/smpboot.c b/kernel/smpboot.c
index d6c5fc0..3fe708a 100644
--- a/kernel/smpboot.c
+++ b/kernel/smpboot.c
@@ -266,6 +266,7 @@ int smpboot_register_percpu_thread(struct 
smp_hotplug_thread *plug_thread)
unsigned int cpu;
int ret = 0;
 
+   get_online_cpus();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
ret = __smpboot_create_thread(plug_thread, cpu);
@@ -278,6 +279,7 @@ int smpboot_register_percpu_thread(struct 
smp_hotplug_thread *plug_thread)
list_add(&plug_thread->list, &hotplug_threads);
 out:
mutex_unlock(&smpboot_threads_lock);
+   put_online_cpus();
return ret;
 }
 EXPORT_SYMBOL_GPL(smpboot_register_percpu_thread);
-- 
1.7.0.4





Re: [PATCH 02/18] sched: fix find_idlest_group mess logical

2012-12-10 Thread Preeti U Murthy
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
> There is 4 situations in the function:
> 1, no task allowed group;
>   so min_load = ULONG_MAX, this_load = 0, idlest = NULL
> 2, only local group task allowed;
>   so min_load = ULONG_MAX, this_load assigned, idlest = NULL
> 3, only non-local task group allowed;
>   so min_load assigned, this_load = 0, idlest != NULL
> 4, local group + another group are task allowed.
>   so min_load assigned, this_load assigned, idlest != NULL
> 
> Current logical will return NULL in first 3 kinds of scenarios.
> And still return NULL, if idlest group is heavier then the
> local group in the 4th situation.
> 
> Actually, I thought groups in situation 2,3 are also eligible to host
> the task. And in 4th situation, agree to bias toward local group.
> So, has this patch.
> 
> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c |   12 +---
>  1 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df99456..b40bc2b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2953,6 +2953,7 @@ find_idlest_group(struct sched_domain *sd, struct 
> task_struct *p,
> int this_cpu, int load_idx)
>  {
>   struct sched_group *idlest = NULL, *group = sd->groups;
> + struct sched_group *this_group = NULL;
>   unsigned long min_load = ULONG_MAX, this_load = 0;
>   int imbalance = 100 + (sd->imbalance_pct-100)/2;
>  
> @@ -2987,14 +2988,19 @@ find_idlest_group(struct sched_domain *sd, struct 
> task_struct *p,
>  
>   if (local_group) {
>   this_load = avg_load;
> - } else if (avg_load < min_load) {
> + this_group = group;
> + }
> + if (avg_load < min_load) {
>   min_load = avg_load;
>   idlest = group;
>   }
>   } while (group = group->next, group != sd->groups);
>  
> - if (!idlest || 100*this_load < imbalance*min_load)
> - return NULL;
> + if (this_group && idlest != this_group)
> + /* Bias toward our group again */
> + if (100*this_load < imbalance*min_load)
> + idlest = this_group;

If the idlest group is heavier than this_group (or, to put it better, if the
difference between the loads of the local group and the idlest group is less
than a threshold, meaning there is no point moving the load away from the
local group), you return NULL. That immediately means this_group is chosen as
the candidate group for the task to run; one does not have to explicitly
return it.

Let me explain: if find_idlest_group() returns NULL to mark your case 4, it
means there is no idler group than the group to which this_cpu belongs, at
that level of the sched domain, which is fair enough.

So now the question is, under such a circumstance, which is the idlest group
so far? It is the group containing this_cpu, i.e. this_group. After this,
sd->child is chosen, which is nothing but this_group (the sd hierarchy moves
towards the cpu it belongs to). Here the idlest group search begins again.

> +
>   return idlest;
>  }
>  
> 
Regards
Preeti U Murthy



[PATCH v3 4/5][RESEND] page_alloc: Make movablecore_map has higher priority

2012-12-10 Thread Tang Chen
If kernelcore or movablecore is specified at the same time as
movablecore_map, movablecore_map will have higher priority and
will be satisfied first.
This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from
zone_movable_limit[].

change log:
Move find_usable_zone_for_movable() to free_area_init_nodes()
so that sanitize_zone_movable_limit() in patch 3 could use
initialized movable_zone.

Reported-by: Wu Jianguo 

Signed-off-by: Tang Chen 
Reviewed-by: Wen Congyang 
Reviewed-by: Lai Jiangshan 
Tested-by: Lin Feng 
---
 mm/page_alloc.c |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52c368e..00fa67d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4839,9 +4839,17 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
 
-   /* If kernelcore was not specified, there is no ZONE_MOVABLE */
-   if (!required_kernelcore)
+   /*
+* If neither kernelcore/movablecore nor movablecore_map is specified,
+* there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+* start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+*/
+   if (!required_kernelcore) {
+   if (movablecore_map.nr_map)
+   memcpy(zone_movable_pfn, zone_movable_limit,
+   sizeof(zone_movable_pfn));
goto out;
+   }
 
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
@@ -4871,10 +4879,24 @@ restart:
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;
 
+   /*
+* Find more memory for kernelcore in
+* [zone_movable_pfn[nid], zone_movable_limit[nid]).
+*/
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;
 
+   if (zone_movable_limit[nid]) {
+   end_pfn = min(end_pfn, zone_movable_limit[nid]);
+   /* No range left for kernelcore in this node */
+   if (start_pfn >= end_pfn) {
+   zone_movable_pfn[nid] =
+   zone_movable_limit[nid];
+   break;
+   }
+   }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -4934,12 +4956,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
 
+out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
/* restore the node_state */
node_states[N_HIGH_MEMORY] = saved_node_state;
 }
-- 
1.7.1



[PATCH v3 3/5][RESEND] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-10 Thread Tang Chen
This patch introduces a new array, zone_movable_limit[], to store the
ZONE_MOVABLE limit from the movablecore_map boot option for all nodes.
The function sanitize_zone_movable_limit() finds out which node each
range in movablecore_map.map[] belongs to, and calculates the low
boundary of ZONE_MOVABLE for each node.

change log:
Call find_usable_zone_for_movable() to initialize movable_zone so that
sanitize_zone_movable_limit() can use it.
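What sanitize_zone_movable_limit() computes per node can be sketched in
isolation: the low boundary of ZONE_MOVABLE is the higher of the node's
first usable pfn and the start of the first movablecore_map range that
overlaps it. Hypothetical types and names below, not the kernel's:

```c
/* Ranges are [start_pfn, end_pfn); the map is sorted by start_pfn.
 * A return of 0 means no overlap, i.e. no limit for this node. */
struct toy_range { unsigned long start_pfn, end_pfn; };

static unsigned long toy_movable_limit(unsigned long start_pfn,
				       unsigned long end_pfn,
				       const struct toy_range *map,
				       int nr_map)
{
	for (int i = 0; i < nr_map; i++) {
		if (end_pfn <= map[i].start_pfn)
			break;		/* sorted map: nothing later can overlap */
		if (start_pfn >= map[i].end_pfn)
			continue;	/* node range entirely above this entry */
		/* First overlap: the limit is the higher of the two starts */
		return start_pfn > map[i].start_pfn ? start_pfn
						    : map[i].start_pfn;
	}
	return 0;
}
```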

Reported-by: Wu Jianguo 


Signed-off-by: Tang Chen 
Signed-off-by: Liu Jiang 
Reviewed-by: Wen Congyang 
Reviewed-by: Lai Jiangshan 
Tested-by: Lin Feng 
---
 mm/page_alloc.c |   79 ++-
 1 files changed, 78 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c91d16..52c368e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,6 +206,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4340,6 +4341,77 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit is initialized as 0. This function will try to get
+ * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * assign them to zone_movable_limit.
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Each range is represented as [start_pfn, end_pfn)
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+   int map_pos = 0, i, nid;
+   unsigned long start_pfn, end_pfn;
+
+   if (!movablecore_map.nr_map)
+   return;
+
+   /* Iterate all ranges from minimum to maximum */
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+   /*
+* If we have found lowest pfn of ZONE_MOVABLE of the node
+* specified by user, just go on to check next range.
+*/
+   if (zone_movable_limit[nid])
+   continue;
+
+#ifdef CONFIG_ZONE_DMA
+   /* Skip DMA memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+   /* Skip DMA32 memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
+#endif
+
+#ifdef CONFIG_HIGHMEM
+   /* Skip lowmem if ZONE_MOVABLE is highmem. */
+   if (zone_movable_is_highmem() &&
+   start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+   start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+
+   if (start_pfn >= end_pfn)
+   continue;
+
+   while (map_pos < movablecore_map.nr_map) {
+   if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
+   break;
+
+   if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
+   map_pos++;
+   continue;
+   }
+
+   /*
+* The start_pfn of ZONE_MOVABLE is either the minimum
+* pfn specified by movablecore_map, or 0, which means
+* the node has no ZONE_MOVABLE.
+*/
+   zone_movable_limit[nid] = max(start_pfn,
+   movablecore_map.map[map_pos].start_pfn);
+
+   break;
+   }
+   }
+}
+
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -4358,6 +4430,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
return zholes_size[zone_type];
 }
 
+static void __meminit sanitize_zone_movable_limit(void)
+{
+}
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4768,7 +4844,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
 
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-   find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -4923,6 +4998,8 @@ 

RE: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes required by fsl PAMU driver.

2012-12-10 Thread Sethi Varun-B16395


> -Original Message-
> From: Wood Scott-B07421
> Sent: Tuesday, December 11, 2012 6:31 AM
> To: Sethi Varun-B16395
> Cc: Wood Scott-B07421; Joerg Roedel; linux-kernel@vger.kernel.org;
> io...@lists.linux-foundation.org; linuxppc-...@lists.ozlabs.org; Tabi
> Timur-B04825
> Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes
> required by fsl PAMU driver.
> 
> On 12/10/2012 04:10:06 AM, Sethi Varun-B16395 wrote:
> >
> >
> > > -Original Message-
> > > From: Wood Scott-B07421
> > > Sent: Tuesday, December 04, 2012 11:53 PM
> > > To: Sethi Varun-B16395
> > > Cc: Wood Scott-B07421; Joerg Roedel; linux-kernel@vger.kernel.org;
> > > io...@lists.linux-foundation.org; linuxppc-...@lists.ozlabs.org;
> > Tabi
> > > Timur-B04825
> > > Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain attributes
> > > required by fsl PAMU driver.
> > >
> > > On 12/04/2012 05:53:33 AM, Sethi Varun-B16395 wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Wood Scott-B07421
> > > > > Sent: Monday, December 03, 2012 10:34 PM
> > > > > To: Sethi Varun-B16395
> > > > > Cc: Joerg Roedel; linux-kernel@vger.kernel.org;
> > iommu@lists.linux-
> > > > > foundation.org; Wood Scott-B07421;
> > linuxppc-...@lists.ozlabs.org;
> > > > Tabi
> > > > > Timur-B04825
> > > > > Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain
> > attributes
> > > > > required by fsl PAMU driver.
> > > > >
> > > > > On 12/03/2012 10:57:29 AM, Sethi Varun-B16395 wrote:
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: iommu-boun...@lists.linux-foundation.org
> > [mailto:iommu-
> > > > > > > boun...@lists.linux-foundation.org] On Behalf Of Joerg
> > Roedel
> > > > > > > Sent: Sunday, December 02, 2012 7:33 PM
> > > > > > > To: Sethi Varun-B16395
> > > > > > > Cc: linux-kernel@vger.kernel.org;
> > > > io...@lists.linux-foundation.org;
> > > > > > Wood
> > > > > > > Scott-B07421; linuxppc-...@lists.ozlabs.org; Tabi
> > Timur-B04825
> > > > > > > Subject: Re: [PATCH 3/4 v5] iommu/fsl: Add iommu domain
> > > > attributes
> > > > > > > required by fsl PAMU driver.
> > > > > > >
> > > > > > > Hmm, we need to work out a good abstraction for this.
> > > > > > >
> > > > > > > On Tue, Nov 20, 2012 at 07:24:56PM +0530, Varun Sethi wrote:
> > > > > > > > Added the following domain attributes required by FSL PAMU
> > > > driver:
> > > > > > > > 1. Subwindows field added to the iommu domain geometry
> > > > attribute.
> > > > > > >
> > > > > > > Are the Subwindows mapped with full size or do you map only
> > > > parts
> > > > > > of the
> > > > > > > subwindows?
> > > > > > >
> > > > > > [Sethi Varun-B16395] It's possible to map a part of the
> > subwindow
> > > > i.e.
> > > > > > size of the mapping can be less than the sub window size.
> > > > > >
> > > > > > > > +* This attribute indicates number of DMA subwindows
> > > > > supported
> > > > > > by
> > > > > > > > +* the geometry. If there is a single window that maps
> > > > the
> > > > > > entire
> > > > > > > > +* geometry, attribute must be set to "1". A value of
> > > > "0"
> > > > > > implies
> > > > > > > > +* that this mechanism is not used at all(normal paging
> > > > is
> > > > > > used).
> > > > > > > > +* Value other than* "0" or "1" indicates the actual
> > > > number
> > > > > of
> > > > > > > > +* subwindows.
> > > > > > > > +*/
> > > > > > >
> > > > > > > This semantic is ugly, how about a feature detection
> > mechanism?
> > > > > > >
> > > > > > [Sethi Varun-B16395] A feature mechanism to query the type of
> > > > IOMMU?
> > > > >
> > > > > A feature mechanism to determine whether this subwindow
> > mechanism is
> > > > > available, and what the limits are.
> > > > >
> > > > So, we use the IOMMU capability interface to find out if IOMMU
> > > > supports sub windows or not, right? But still number of sub
> > windows
> > > > would be specified as a part of the geometry and the valid value
> > for
> > > > sub windows would  0,1 or actual number of sub windows.
> > >
> > > How does a user of the interface find out what values are possible
> > for
> > > the "actual number of subwindows"?  How does a user of the
> > interface find
> > > out whether there are any limitations on specifying a value of zero
> > (in
> > > the case of PAMU, that would be a maximum 1 MiB naturally-aligned
> > > aperture to support arbitrary 4KiB mappings)?
> > How about if we say that the default value for subwindows is zero,
> > and this is what you get when you read the geometry (iommu_get_attr)
> > after initializing the domain? In that case the user would know the
> > implication of setting subwindows to zero with respect to the
> > aperture size.
> 
> So it would default to the maximum aperture size possible with no
> subwindows?  That might be OK, though is there a way to reset the domain
> later on to get back to that informational state?
> 
[Sethi Varun-B16395] Yes, that can be done via
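The 0/1/N semantics debated in this thread can be captured in a tiny
validity check. This is a hypothetical helper sketching the proposed
attribute meaning, not the actual iommu attribute API:

```c
/* Meaning of the subwindows attribute as discussed above:
 *   0     - subwindow mechanism unused (normal paging)
 *   1     - a single window mapping the entire geometry
 *   N > 1 - that many subwindows, bounded by what the hardware supports
 * Returns nonzero if the requested count is acceptable. */
static int subwindow_count_valid(unsigned int count,
				 unsigned int max_subwindows)
{
	if (count <= 1)
		return 1;	/* 0 and 1 are always meaningful */
	return count <= max_subwindows;
}
```

The open question in the thread is how a caller would discover
max_subwindows in the first place, i.e. the feature-detection mechanism
Joerg and Scott ask for.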

Re: [PATCH RESEND RESEND] net: remove obsolete simple_strto

2012-12-10 Thread David Miller
From: Abhijit Pawar 
Date: Tue, 11 Dec 2012 09:04:20 +0530

> This patch replace the obsolete simple_strto with kstrto
> 
> Signed-off-by: Abhijit Pawar 

You can't submit replacement patches for ones which I have already
applied.

Patches I apply are permanently applied, and therefore you must submit
changes relative the ones I've applied already.


Re: [PATCH RESEND] net: remove obsolete simple_strto

2012-12-10 Thread David Miller
From: Abhijit Pawar 
Date: Tue, 11 Dec 2012 09:03:41 +0530

> On 12/11/2012 12:40 AM, David Miller wrote:
>> From: Abhijit Pawar 
>> Date: Mon, 10 Dec 2012 14:42:28 +0530
>> 
>>> This patch replace the obsolete simple_strto with kstrto
>>>
>>> Signed-off-by: Abhijit Pawar 
>> 
>> Applied.
>> 
> Hi David,
> It seems that there are occurrences of simple_strto* still present in a
> couple of files which were not removed correctly by this patch. I will
> send a modified patch shortly. Please revert this commit and use the
> newly sent patch to merge with the tree.

Again, you cannot send "modified" patches.

When I say I've applied your patch, that cannot be undone.

You must therefore send me fixup patches relative to the ones
I've applied already.


Re: [PATCH RESEND] net: remove obsolete simple_strto

2012-12-10 Thread David Miller
From: Abhijit Pawar 
Date: Tue, 11 Dec 2012 06:36:59 +0530

> It looks like there are two occurrences of simple_strtoul which have
> not been removed cleanly by the patch.
> They are in netpoll.c and debugfs_sta.c.
> I will send a modified, corrected, clean patch shortly.

You can't simply send me a replacement patch, since I already applied
the original one and that patch will not be reverted.


Re: [RFC v2] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
On Fri, Dec 07, 2012 at 04:49:56PM -0800, John Stultz wrote:
> On 12/04/2012 08:18 PM, Minchan Kim wrote:
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged, its that when we mark the vma as *non-volatile*, and
> >>remove the VM_VOLATILE flag we merge the non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >Understood. Thanks.
> >Below patch solves your problems? It's simple than yours.
> 
> Yea, this is nicer then my fix.
> Although I still need the purged handling in the vma merge code for
> me to see the behavior I expect in my tests.
> 
> I've integrated your patch and repushed my queue here:
> http://git.linaro.org/gitweb?p=people/jstultz/android-dev.git;a=shortlog;h=refs/heads/dev/minchan-anonvol
> 
> git://git.linaro.org/people/jstultz/android-dev.git dev/minchan-anonvol
> 
> >Anyway, both yours and mine are not right fix.
> >As I mentioned, locking scheme is broken.
> >We need anon_vma_lock to handle purged and we should consider fork
> >case, too.
> Hrm. I'm sure you're right, as I've not yet fully grasped all the
> locking rules here.  Could you clarify how it is broken? And why is
> the anon_vma_lock needed to manage the purged state that is part of
> the vma itself?

If you don't hold anon->lock, merge/split/fork can race with try_to_unmap
so vma->[purged|vm_flags] would lose consistency.

> 
> thanks
> -john
> 

-- 
Kind regards,
Minchan Kim


Re: [patch v2 4/6] memcg: simplify mem_cgroup_iter

2012-12-10 Thread Ying Han
On Mon, Nov 26, 2012 at 10:47 AM, Michal Hocko  wrote:
> Current implementation of mem_cgroup_iter has to consider both css and
> memcg to find out whether no group has been found (css==NULL - aka the
> loop is completed) and that no memcg is associated with the found node
> (!memcg - aka css_tryget failed because the group is no longer alive).
> This leads to awkward tweaks like tests for css && !memcg to skip the
> current node.
>
> It will be much easier if we got rid off css variable altogether and
> only rely on memcg. In order to do that the iteration part has to skip
> dead nodes. This sounds natural to me and as a nice side effect we will
> get a simple invariant that memcg is always alive when non-NULL and all
> nodes have been visited otherwise.
>
> We could get rid of the surrounding while loop but keep it in for now to
> make review easier. It will go away in the following patch.
>
> Signed-off-by: Michal Hocko 
> ---
>  mm/memcontrol.c |   56 
> +++
>  1 file changed, 27 insertions(+), 29 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6bcc97b..d1bc0e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
> *root,
> rcu_read_lock();
> while (!memcg) {
> struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> -   struct cgroup_subsys_state *css = NULL;
>
> if (reclaim) {
> int nid = zone_to_nid(reclaim->zone);
> @@ -1112,53 +,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup 
> *root,
>  * explicit visit.
>  */
> if (!last_visited) {
> -   css = &root->css;
> +   memcg = root;
> } else {
> struct cgroup *prev_cgroup, *next_cgroup;
>
> prev_cgroup = (last_visited == root) ? NULL
> : last_visited->css.cgroup;
> -   next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
> -   root->css.cgroup);
> -   if (next_cgroup)
> -   css = cgroup_subsys_state(next_cgroup,
> -   mem_cgroup_subsys_id);
> -   }
> +skip_node:
> +   next_cgroup = cgroup_next_descendant_pre(
> +   prev_cgroup, root->css.cgroup);
>
> -   /*
> -* Even if we found a group we have to make sure it is alive.
> -* css && !memcg means that the groups should be skipped and
> -* we should continue the tree walk.
> -* last_visited css is safe to use because it is protected by
> -* css_get and the tree walk is rcu safe.
> -*/
> -   if (css == &root->css || (css && css_tryget(css)))
> -   memcg = mem_cgroup_from_css(css);
> +   /*
> +* Even if we found a group we have to make sure it is
> +* alive. css && !memcg means that the groups should 
> be
> +* skipped and we should continue the tree walk.
> +* last_visited css is safe to use because it is
> +* protected by css_get and the tree walk is rcu safe.
> +*/
> +   if (next_cgroup) {
> +   struct mem_cgroup *mem = mem_cgroup_from_cont(
> +   next_cgroup);
> +   if (css_tryget(&mem->css))
> +   memcg = mem;

I see a functional change after this: we now hold a refcount on the css
if memcg is root. That was not the case before this change.

--Ying

> +   else {
> +   prev_cgroup = next_cgroup;
> +   goto skip_node;
> +   }
> +   }
> +   }
>
> if (reclaim) {
> -   struct mem_cgroup *curr = memcg;
> -
> if (last_visited)
> css_put(&last_visited->css);
>
> -   if (css && !memcg)
> -   curr = mem_cgroup_from_css(css);
> -
> /* make sure that the cached memcg is not removed */
> -   if (curr)
> -   css_get(&curr->css);
> -   iter->last_visited = curr;
> +   if (memcg)
> +   css_get(&memcg->css);
> +   iter->last_visited = memcg;
>
> -   if (!

Re: [RFC v2] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
On Fri, Dec 07, 2012 at 04:20:30PM -0800, John Stultz wrote:
> On 12/04/2012 11:01 PM, Minchan Kim wrote:
> >Hi John,
> >
> >On Tue, Dec 04, 2012 at 11:13:40AM -0800, John Stultz wrote:
> >>
> >>I don't think the problem is when vmas being marked VM_VOLATILE are
> >>being merged, its that when we mark the vma as *non-volatile*, and
> >>remove the VM_VOLATILE flag we merge the non-volatile vmas with
> >>neighboring vmas. So preserving the purged flag during that merge is
> >>important. Again, the example I used to trigger this was an
> >>alternating pattern of volatile and non volatile vmas, then marking
> >>the entire range non-volatile (though sometimes in two overlapping
> >>passes).
> >If I understand correctly, you mean following as.
> >
> >chunk1 = mmap(8M)
> >chunk2 = chunk1 + 2M;
> >chunk3 = chunk2 + 2M
> >chunk4 = chunk3 + 2M
> >
> >madvise(chunk1, 2M, VOLATILE);
> >madvise(chunk4, 2M, VOLATILE);
> >
> >/*
> >  * V : volatile vma
> >  * N : non volatile vma
> >  * So Now vma is VNVN.
> >  */
> >And chunk4 is purged.
> >
> >int ret = madvise(chunk1, 8M, NOVOLATILE);
> >ASSERT(ret == 1);
> >/* And you expect VNVN->N ?*/
> >
> >Right?
> 
> Yes. That's exactly right.
> 
> >If so, why should non-volatile function semantic allow it which cross over
> >non-volatile areas in a range? I would like to fail such case because
> >in case of MADV_REMOVE, it fails in the middle of operation if it encounter
> >VM_LOCKED.
> >
> >What do you think about it?
> Right, so I think this issue is maybe a problematic part of the VMA
> based approach.  While marking an area as nonvolatile twice might
> not make a lot of sense, I think userland applications would not
> appreciate the constraint that madvise(VOLATILE/NONVOLATILE) calls
> be made in perfect pairs of identical sizes.
> 
> For instance, if a browser has rendered a web page, but the page is
> so large that only a sliding window/view of that page is visible at
> one time, it may want to mark the regions not currently in the view
> as volatile.   So it would be nice (albeit naive) for that
> application that when the view location changed, it would just mark
> the new region as non-volatile, and any region not in the current
> view as volatile.  This would be easier then trying to calculate the
> diff of the old view region boundaries vs the new and modifying only
> the ranges that changed. Granted, doing so might be more efficient,
> but I'm not sure we can be sure every similar case would be more
> efficient.
> 
> So in my mind, double-clearing a flag should be allowed (as well as
> double-setting), as well as allowing for setting/clearing
> overlapping regions.

It might, and as you said, it's not matched by normal madvise operation
semantics. So if users really want it, we might need another interface,
such as a new system call like mlock.

Although we can implement it, my concern is mmap_sem hold time. The VMA
approach needs the exclusive mmap_sem, which ends up preventing
concurrent page fault handling, so it would undermine anon volatile's
goal for user-space allocators. I would therefore like to avoid doing
more work under the exclusive mmap_sem as far as possible.

Of course, you can argue that if we don't support such semantics, users
will call madvise(NOVOLATILE) several times with several ranges, which
could be even worse. Right. But I suggest the plumbers implement range
management smartly, and let's keep the kernel implementation simple and
fast.

I'm not solid on this. If users really want such semantics, I can
support them with a new system call.

Frankly speaking, I would like to remove the madvise(NOVOLATILE) call.
If you already saw the patch I sent this morning, you can guess what it
is. The problem with anon volatile plus madvise(NOVOLATILE) is the time
delay between the allocator handing out a free chunk and the user really
accessing the memory. Normally, when the allocator returns a free chunk
to the customer, the allocator should call madvise(NOVOLATILE), but the
user might access the memory long after that. During that window, the
pages could be swapped out. So I decided to remove madvise(NOVOLATILE)
and handle it at the first page fault.

Yes, the same rule can't be applied to tmpfs volatile, which does need
NOVOLATILE semantics. Hmm, I am leaning toward a new system call:

int mvolatile(const void *addr, size_t len, int mode);
int munvolatile(const void *addr, size_t len);

If someone calls mvolatile with AUTO mode, it would work as my anon
volatile does, while in MANUAL mode the user must call munvolatile
before using the memory. That might meet both your goal and mine. But
adding a new system call is a last resort. :)
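Usage of the interface floated above might look like this from
userspace. The syscalls do not exist, so they are stubbed here purely to
show the intended calling convention; the mode constants are likewise
hypothetical:

```c
#include <stddef.h>

#define MVOLATILE_AUTO   0	/* purged state handled at first fault */
#define MVOLATILE_MANUAL 1	/* user must call munvolatile before use */

/* Stub: a real implementation would be a system call marking
 * [addr, addr + len) as reclaimable-without-swap. */
static int mvolatile(const void *addr, size_t len, int mode)
{
	(void)addr; (void)len; (void)mode;
	return 0;		/* pretend success */
}

/* Stub: would clear the volatile state and report whether any pages in
 * the range were purged while volatile. */
static int munvolatile(const void *addr, size_t len)
{
	(void)addr; (void)len;
	return 0;		/* pretend success, nothing purged */
}
```

In AUTO mode an allocator would only ever call mvolatile() on freed
chunks; in MANUAL mode it would pair every reuse with munvolatile() and
check the return value for purged pages.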

> 
> Aside from if the behavior should be allowed or not, the error mode
> of madvise is problematic as well, since failures can happen mid way
> through the operation, leaving the vmas in the range specified
> inconsistent. Since usually its only advisory, such inconsistent
> states aren't really problematic, and repeating the last action is
> probably fine.

True.

> 
> The problem with NOVOLATILE's  purged state, with vmas, is that if
> we hi

Re: [PATCH v4 1/7] DMA: PL330: use prefix in reg names to build under x86

2012-12-10 Thread Jassi Brar
On 10 December 2012 19:12, Davide Ciminaghi  wrote:
> From: Alessandro Rubini 
>
> This driver would not compile if ARM_AMBA is selected under x86,
> because "CS" and "DS" are already defined there.  But AMBA
> is used in the x86 world by a PCI-to-AMBA bridge, to be submitted.
>
> The patch just adds the "PL330_" prefix to all registers,
> so it can be built by randomconfig after ARM_AMBA appears within x86.
> No other technical changes have been performed.
> The patch was build-tested only.
>
> Signed-off-by: Alessandro Rubini 
> Acked-by: Giancarlo Asnaghi 
> [Davide Ciminaghi : only registers prefixed]
> Signed-off-by: Davide Ciminaghi 
> ---

 Acked-by: Jassi Brar 

Thanks.


Re: [PATCH 01/18] sched: select_task_rq_fair clean up

2012-12-10 Thread Preeti U Murthy
Hi Alex,

On 12/10/2012 01:52 PM, Alex Shi wrote:
> It is impossible to miss a task allowed cpu in a eligible group.

The one thing I am concerned about here is whether there is a
possibility of the task changing its tsk_cpus_allowed() while this code
is running.

i.e. find_idlest_group() finds an idle group, then the tsk_cpus_allowed()
for the task changes, perhaps by the user himself, so that it no longer
includes the cpus in the idle group. After this, find_idlest_cpu() is
called. A race condition, in short. Then we might not have an eligible
cpu in that group, right?
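The race described above can be modeled with a toy mask intersection.
The names are illustrative, not the kernel's cpumask API:

```c
/* If the allowed mask changes between picking a group and picking a
 * cpu, the group may hold no eligible cpu and the search yields -1,
 * which is the case the removed branch used to handle. */
static int toy_find_idlest_cpu(unsigned int group_mask,
			       unsigned int allowed_mask)
{
	unsigned int eligible = group_mask & allowed_mask;
	int cpu = 0;

	if (!eligible)
		return -1;
	/* pick the lowest-numbered eligible cpu */
	while (!(eligible & 1u)) {
		eligible >>= 1;
		cpu++;
	}
	return cpu;
}
```

The patch under discussion assumes the -1 case cannot happen; the
question raised here is whether a concurrent affinity change can still
produce it.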

> And since find_idlest_group only return a different group which
> excludes old cpu, it's also imporissible to find a new cpu same as old
> cpu.

This I agree with.

> Signed-off-by: Alex Shi 
> ---
>  kernel/sched/fair.c |5 -
>  1 files changed, 0 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 59e072b..df99456 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3150,11 +3150,6 @@ select_task_rq_fair(struct task_struct *p, int 
> sd_flag, int wake_flags)
>   }
>  
>   new_cpu = find_idlest_cpu(group, p, cpu);
> - if (new_cpu == -1 || new_cpu == cpu) {
> - /* Now try balancing at a lower domain level of cpu */
> - sd = sd->child;
> - continue;
> - }
>  
>   /* Now try balancing at a lower domain level of new_cpu */
>   cpu = new_cpu;
> 
Regards
Preeti U Murthy



Re: Linux 3.7

2012-12-10 Thread Stephen Rothwell
Hi all,

On Mon, 10 Dec 2012 19:59:50 -0800 Linus Torvalds 
 wrote:
>
> Anyway, it's been a somewhat drawn out release despite the 3.7 merge
> window having otherwise appeared pretty straightforward, and none of
> the rc's were all that big either. But we're done, and this means that
> the merge window will close on Christmas eve.
> 
> Or rather, I'll probably close it a couple of days early. For obvious
> reasons. It's the main commercial holiday of the year, after all.
> 
> So aim for winter solstice, and no later. Deal? And even then, I might
> be deep into the glögg.

Hopefully people will submit earlier rather than later, as there are
currently more commits in linux-next than ever before ...  Also, please
resist the usual rebase-before-submitting frenzy :-(

See http://neuling.org/linux-next-size.html (double click to see the
complete history).

-- 
Cheers,
Stephen Rothwell s...@canb.auug.org.au




linux-next: manual merge of the gpio-lw tree with the driver-core tree

2012-12-10 Thread Stephen Rothwell
Hi Linus,

Today's linux-next merge of the gpio-lw tree got a conflict in
drivers/gpio/gpio-stmpe.c between commit 3836309d9346 ("gpio: remove use
of __devinit") from the driver-core tree and commit fc13d5a5b17c ("gpio:
Provide the STMPE GPIO driver with its own IRQ Domain") from the gpio-lw
tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell s...@canb.auug.org.au

diff --cc drivers/gpio/gpio-stmpe.c
index 6411600,3e1d398..000
--- a/drivers/gpio/gpio-stmpe.c
+++ b/drivers/gpio/gpio-stmpe.c
@@@ -288,21 -292,37 +292,37 @@@ int stmpe_gpio_irq_map(struct irq_domai
return 0;
  }
  
- static void stmpe_gpio_irq_remove(struct stmpe_gpio *stmpe_gpio)
+ void stmpe_gpio_irq_unmap(struct irq_domain *d, unsigned int virq)
  {
-   int base = stmpe_gpio->irq_base;
-   int irq;
- 
-   for (irq = base; irq < base + stmpe_gpio->chip.ngpio; irq++) {
  #ifdef CONFIG_ARM
-   set_irq_flags(irq, 0);
+   set_irq_flags(virq, 0);
  #endif
-   irq_set_chip_and_handler(irq, NULL, NULL);
-   irq_set_chip_data(irq, NULL);
+   irq_set_chip_and_handler(virq, NULL, NULL);
+   irq_set_chip_data(virq, NULL);
+ }
+ 
+ static const struct irq_domain_ops stmpe_gpio_irq_simple_ops = {
+   .unmap = stmpe_gpio_irq_unmap,
+   .map = stmpe_gpio_irq_map,
+   .xlate = irq_domain_xlate_twocell,
+ };
+ 
 -static int __devinit stmpe_gpio_irq_init(struct stmpe_gpio *stmpe_gpio)
++static int stmpe_gpio_irq_init(struct stmpe_gpio *stmpe_gpio)
+ {
+   int base = stmpe_gpio->irq_base;
+ 
+   stmpe_gpio->domain = irq_domain_add_simple(NULL,
+   stmpe_gpio->chip.ngpio, base,
+   &stmpe_gpio_irq_simple_ops, stmpe_gpio);
+   if (!stmpe_gpio->domain) {
+   dev_err(stmpe_gpio->dev, "failed to create irqdomain\n");
+   return -ENOSYS;
}
+ 
+   return 0;
  }
  
 -static int __devinit stmpe_gpio_probe(struct platform_device *pdev)
 +static int stmpe_gpio_probe(struct platform_device *pdev)
  {
struct stmpe *stmpe = dev_get_drvdata(pdev->dev.parent);
struct device_node *np = pdev->dev.of_node;




[GIT PULL] MMC updates for 3.8-rc1

2012-12-10 Thread Chris Ball
Hi Linus,

Please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc.git 
tags/mmc-updates-for-3.8-rc1

to receive the MMC merge for 3.8.  There are currently no conflicts,
and these patches have been tested in linux-next.  Thanks.


The following changes since commit 91ab252ac5a5c3461dd6910797611e9172626aed:

  mmc: sh-mmcif: avoid oops on spurious interrupts (second try) (2012-12-06 
13:54:35 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc.git 
tags/mmc-updates-for-3.8-rc1

for you to fetch changes up to 71e69211eac889898dec5a21270347591eb2d001:

  mmc: sdhci: implement the .card_event() method (2012-12-07 13:56:03 -0500)


MMC highlights for 3.8:
Core:
 - Expose access to the eMMC RPMB ("Replay Protected Memory Block") area
   by extending the existing mmc_block ioctl.
 - Add SDIO powered-suspend DT properties to the core MMC DT binding.
 - Add no-1-8-v DT flag for boards where the SD controller reports that it
   supports 1.8V but the board itself has no way to switch to 1.8V.
 - More work on switching to 1.8V UHS support using a vqmmc regulator.
 - Fix up a case where the slot-gpio helper may fail to reset the host
   controller properly if a card was removed during a transfer.
 - Fix several cases where a broken device could cause an infinite loop
   while we wait for a register to update.

Drivers:
 - at91-mci: Remove obsolete driver, atmel-mci handles these devices now.
 - sdhci-dove: Allow using GPIOs for card-detect notifications.
 - sdhci-esdhc: Fix for recovering from ADMA errors on broken silicon.
 - sdhci-s3c: Add pinctrl support.
 - wmt-sdmmc: New driver for WonderMedia SD/MMC controllers.


Abhilash Kesavan (4):
  mmc: dt: Fix typo in filename
  mmc: dt: Add optional pm properties to core binding
  mmc: sdhci-pltfm: Support optional pm properties
  mmc: dw_mmc: Add sdio power bindings

Al Cooper (1):
  mmc: Limit MMC speed to 52MHz if not HS200

Andy Shevchenko (2):
  mmc: dw_mmc: use __devexit_p macro for .remove()
  mmc: dw_mmc: use helper macro module_platform_driver()

Arnd Bergmann (1):
  mmc: dw_mmc: fix more const pointer warnings

Balaji T K (4):
  mmc: omap_hsmmc: remove warning message for debounce clock
  mmc: omap_hsmmc: Fix Oops in case of data errors
  mmc: omap_hsmmc: No reset of cmd state machine for DCRC
  mmc: omap_hsmmc: Update error code for response_busy cmd

Daniel Drake (2):
  mmc: sdhci: add quirk for lack of 1.8v support
  mmc: dt: add no-1-8-v device tree flag

Daniel Mack (2):
  mmc: omap_hsmmc: claim pinctrl at probe time
  mmc: omap_hsmmc: add DT property for max bus frequency

Fabio Estevam (1):
  mmc: mxs-mmc: Remove platform data

Felipe Balbi (1):
  mmc: omap_hsmmc: Introduce omap_hsmmc_prepare/complete

Guennadi Liakhovetski (6):
  mmc: sh_mobile_sdhi: fix clock frequency printing
  mmc: sh_mobile_sdhi: remove unneeded clock connection ID
  mmc: sh_mmcif: remove unneeded clock connection ID
  mmc: add a card-event host operation
  mmc: extend the slot-gpio card-detection to use host's .card_event() 
method
  mmc: sdhci: implement the .card_event() method

Haijun Zhang (1):
  mmc: eSDHC: Recover from ADMA errors

Hebbar, Gururaja (1):
  mmc: omap_hsmmc: Enable HSPE bit for high speed cards

Jaehoon Chung (2):
  mmc: dw_mmc: relocate where dw_mci_setup_bus() is called from
  mmc: dw_mmc: remove duplicated buswidth code

Javier Martin (1):
  mmc: mxcmmc: fix SD cards not being detected sometimes.

Jerry Huang (1):
  mmc: sdhci-of-esdhc: support commands with busy response expecting TC

Johan Rudholm (1):
  mmc: core: debugfs: Add signal_voltage to ios dump

Kevin Liu (5):
  mmc: sdhci: Balance vmmc regulator_enable(), and always enable vqmmc
  mmc: sdhci-pxav3: Add base clock quirk
  mmc: host: Make UHS timing values fully unique
  mmc: sdhci: Use regulator min/max voltage range according to spec
  mmc: sdhci-pxav3: add quirks2

Kyoungil Kim (1):
  mmc: sdio: Use multiple scatter/gather list

Lee Jones (1):
  mmc: Standardise capability type

Loic Pallardy (5):
  mmc: core: Expose access to RPMB partition
  mmc: card: Do not scan RPMB partitions
  mmc: core: Extend sysfs to ext_csd parameters for RPMB support
  mmc: core: Add mmc_set_blockcount feature
  mmc: card: Add RPMB support in IOCTL interface

Ludovic Desroches (1):
  mmc: at91-mci: remove obsolete driver

Madhvapathi Sriram (1):
  mmc: sdhci-pci: Enable SDHCI_CAN_DO_HISPD for Ricoh SDHCI controller

Marina Makienko (1):
  mmc: vub300: add missing usb_put_dev

Rafael J. Wysocki (1):
  mmc: sdio: Add empty bus-level suspend/resume callbacks

Russell King (3):
  mmc: sdhci-dove: use devm_clk_get()
  mmc: sdhci-dov

Re: [[PATCH v9 3/3] 1/1] virtio_console: Remove buffers from out_vq at port removal

2012-12-10 Thread Amit Shah
On (Tue) 11 Dec 2012 [09:39:41], Rusty Russell wrote:
> Amit Shah  writes:
> 
> > On (Fri) 16 Nov 2012 [11:22:09], Rusty Russell wrote:
> >> Amit Shah  writes:
> >> > From: Sjur Brændeland 
> >> >
> >> > Remove buffers from the out-queue when a port is removed. Rproc_serial
> >> > communicates with remote processors that may crash and leave buffers in
> >> > the out-queue. The virtio serial ports may have buffers in the out-queue
> >> > as well, e.g. for non-blocking ports and the host didn't consume them
> >> > yet.
> >> >
> >> > [Amit: Remove WARN_ON for generic ports case.]
> >> >
> >> > Signed-off-by: Sjur Brændeland 
> >> > Signed-off-by: Amit Shah 
> >> 
> >> I already have this in my pending queue; I've promoted it to my
> >> virtio-next branch now.
> >
> > Rusty, I still see this series in your pending queue, not in
> > virtio-next.  Did anything change in the meantime?
> 
> Hmm:
> 
> 40e625ac50f40d87ddba93280d0a503425aa68e9?

I'm sorry, I meant the remoteproc code, not this patch.

Amit
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Linux 3.7

2012-12-10 Thread Linus Torvalds
Whee. After an extra rc release, 3.7 is now out. After a few more
trials at fixing things, in the end we ended up reverting the kswapd
changes that caused problems. And with the extra rc, I had decided to
risk doing the buffer.c cleanups that would otherwise have just been
marked for stable during the next merge window, and had enough time to
fix a few problems that people found there too.

There's also a fix for a SCSI driver bug that was exposed by the
last-minute workqueue fixes in rc8.

Other than that, there's a few networking fixes, and some trivial
fixes for sparc and MIPS.

Anyway, it's been a somewhat drawn out release despite the 3.7 merge
window having otherwise appeared pretty straightforward, and none of
the rc's were all that big either. But we're done, and this means that
the merge window will close on Christmas eve.

Or rather, I'll probably close it a couple of days early. For obvious
reasons. It's the main commercial holiday of the year, after all.

So aim for winter solstice, and no later. Deal? And even then, I might
be deep into the glögg.

Linus

---

Chris Ball (1):
  Revert misapplied "mmc: sh-mmcif: avoid oops on spurious interrupts"

Dan Carpenter (1):
  vfs: clear to the end of the buffer on partial buffer reads

David Daney (1):
  MIPS: Avoid mcheck by flushing page range in huge_ptep_set_access_flags()

David Howells (2):
  MODSIGN: Don't use enum-type bitfields in module signature info block
  ASN.1: Fix an indefinite length skip error

David S. Miller (2):
  sparc64: exit_group should kill register windows just like plain exit.
  sparc: Fix piggyback with newer binutils.

Dmitry Adamushko (1):
  MIPS: Fix endless loop when processing signals for kernel tasks

Eric Dumazet (1):
  net: gro: fix possible panic in skb_gro_receive()

Florian Fainelli (1):
  Input: matrix-keymap - provide proper module license

Guennadi Liakhovetski (1):
  mmc: sh-mmcif: avoid oops on spurious interrupts (second try)

Heiko Stübner (1):
  mmc: sdhci-s3c: fix missing clock for gpio card-detect

James Hogan (2):
  linux/kernel.h: define SYMBOL_PREFIX
  modsign: add symbol prefix to certificate list

Johannes Berg (1):
  ipv4: ip_check_defrag must not modify skb before unsharing

Johannes Weiner (2):
  mm: vmscan: do not keep kswapd looping forever due to individual
uncompactable zones
  mm: vmscan: fix inappropriate zone congestion clearing

Linus Torvalds (5):
  vfs: avoid "attempt to access beyond end of device" warnings
  vfs: fix O_DIRECT read past end of block device
  Revert "mm: avoid waking kswapd for THP allocations when
compaction is deferred or contended"
  Revert "revert "Revert "mm: remove __GFP_NO_KSWAPD""" and
associated damage
  Linux 3.7

Mel Gorman (2):
  mm: compaction: validate pfn range passed to isolate_freepages_block
  tmpfs: fix shared mempolicy leak

Neal Cardwell (4):
  inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
  inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
  inet_diag: avoid unsafe and nonsensical prefix matches in
inet_diag_bc_run()
  inet_diag: validate port comparison byte code to prevent unsafe reads

Ralf Baechle (3):
  MIPS: N32: Fix preadv(2) and pwritev(2) entry points.
  MIPS: N32: Fix signalfd4 syscall entry point
  MIPS: R3000/R3081: Fix CPU detection.

Richard Weinberger (2):
  UBI: remove PEB from free tree in get_peb_for_wl()
  UBI: dont call ubi_self_check_all_ff() in __wl_get_peb()

Tejun Heo (1):
  workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s

Thomas Gleixner (1):
  watchdog: Fix CPU hotplug regression

Tim Gardner (1):
  lib/Makefile: Fix oid_registry build dependency

Xiaotian Feng (1):
  megaraid: fix BUG_ON() from incorrect use of delayed work

Yuchung Cheng (1):
  tcp: bug fix Fast Open client retransmission


Re: [PATCH v2 1/2] zsmalloc: add function to query object size

2012-12-10 Thread Minchan Kim
On Fri, Dec 07, 2012 at 04:45:53PM -0800, Nitin Gupta wrote:
> On Sun, Dec 2, 2012 at 11:52 PM, Minchan Kim  wrote:
> > On Sun, Dec 02, 2012 at 11:20:42PM -0800, Nitin Gupta wrote:
> >>
> >>
> >> On Nov 30, 2012, at 5:54 AM, Minchan Kim  
> >> wrote:
> >>
> >> > On Thu, Nov 29, 2012 at 10:54:48PM -0800, Nitin Gupta wrote:
> >> >> Changelog v2 vs v1:
> >> >> - None
> >> >>
> >> >> Adds zs_get_object_size(handle) which provides the size of
> >> >> the given object. This is useful since the user (zram etc.)
> >> >> now do not have to maintain object sizes separately, saving
> >> >> on some metadata size (4b per page).
> >> >>
> >> >> The object handle encodes a <page, offset> pair which currently points
> >> >> to the start of the object. Now, the handle implicitly stores the size
> >> >> information by pointing to the object's end instead. Since zsmalloc is
> >> >> a slab based allocator, the start of the object can be easily determined
> >> >> and the difference between the end offset encoded in the handle and the
> >> >> start gives us the object size.
> >> >>
> >> >> Signed-off-by: Nitin Gupta 
> >> > Acked-by: Minchan Kim 
> >> >
> >> > I already had a few comments on your previous version.
> >> > I'm OK even if you ignore them, since I can make a follow-up patch for
> >> > my nitpicks, but could you answer my question below?
> >> >
> >> >> ---
> >> >> drivers/staging/zsmalloc/zsmalloc-main.c |  177 
> >> >> +-
> >> >> drivers/staging/zsmalloc/zsmalloc.h  |1 +
> >> >> 2 files changed, 127 insertions(+), 51 deletions(-)
> >> >>
> >> >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c 
> >> >> b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> >> index 09a9d35..65c9d3b 100644
> >> >> --- a/drivers/staging/zsmalloc/zsmalloc-main.c
> >> >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c
> >> >> @@ -112,20 +112,20 @@
> >> >> #define MAX_PHYSMEM_BITS 36
> >> >> #else /* !CONFIG_HIGHMEM64G */
> >> >> /*
> >> >> - * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will 
> >> >> just
> >> >> + * If this definition of MAX_PHYSMEM_BITS is used, OFFSET_BITS will 
> >> >> just
> >> >>  * be PAGE_SHIFT
> >> >>  */
> >> >> #define MAX_PHYSMEM_BITS BITS_PER_LONG
> >> >> #endif
> >> >> #endif
> >> >> #define _PFN_BITS(MAX_PHYSMEM_BITS - PAGE_SHIFT)
> >> >> -#define OBJ_INDEX_BITS(BITS_PER_LONG - _PFN_BITS)
> >> >> -#define OBJ_INDEX_MASK((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
> >> >> +#define OFFSET_BITS(BITS_PER_LONG - _PFN_BITS)
> >> >> +#define OFFSET_MASK((_AC(1, UL) << OFFSET_BITS) - 1)
> >> >>
> >> >> #define MAX(a, b) ((a) >= (b) ? (a) : (b))
> >> >> /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
> >> >> #define ZS_MIN_ALLOC_SIZE \
> >> >> -MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
> >> >> +MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OFFSET_BITS))
> >> >> #define ZS_MAX_ALLOC_SIZEPAGE_SIZE
> >> >>
> >> >> /*
> >> >> @@ -256,6 +256,11 @@ static int is_last_page(struct page *page)
> >> >>return PagePrivate2(page);
> >> >> }
> >> >>
> >> >> +static unsigned long get_page_index(struct page *page)
> >> >> +{
> >> >> +return is_first_page(page) ? 0 : page->index;
> >> >> +}
> >> >> +
> >> >> static void get_zspage_mapping(struct page *page, unsigned int 
> >> >> *class_idx,
> >> >>enum fullness_group *fullness)
> >> >> {
> >> >> @@ -433,39 +438,86 @@ static struct page *get_next_page(struct page 
> >> >> *page)
> >> >>return next;
> >> >> }
> >> >>
> >> >> -/* Encode <page, obj_idx> as a single handle value */
> >> >> -static void *obj_location_to_handle(struct page *page, unsigned long 
> >> >> obj_idx)
> >> >> +static struct page *get_prev_page(struct page *page)
> >> >> {
> >> >> -unsigned long handle;
> >> >> +struct page *prev, *first_page;
> >> >>
> >> >> -if (!page) {
> >> >> -BUG_ON(obj_idx);
> >> >> -return NULL;
> >> >> -}
> >> >> +first_page = get_first_page(page);
> >> >> +if (page == first_page)
> >> >> +prev = NULL;
> >> >> +else if (page == (struct page *)first_page->private)
> >> >> +prev = first_page;
> >> >> +else
> >> >> +prev = list_entry(page->lru.prev, struct page, lru);
> >> >>
> >> >> -handle = page_to_pfn(page) << OBJ_INDEX_BITS;
> >> >> -handle |= (obj_idx & OBJ_INDEX_MASK);
> >> >> +return prev;
> >> >>
> >> >> -return (void *)handle;
> >> >> }
> >> >>
> >> >> -/* Decode <page, obj_idx> pair from the given object handle */
> >> >> -static void obj_handle_to_location(unsigned long handle, struct page 
> >> >> **page,
> >> >> -unsigned long *obj_idx)
> >> >> +static void *encode_ptr(struct page *page, unsigned long offset)
> >> >> {
> >> >> -*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
> >> >> -*obj_idx = handle & OBJ_INDEX_MASK;
> >> >> +unsigned long ptr;
> >> >> +ptr = page_to_pfn(page) << OFFSET_BITS;
> >> >> +ptr |= offset & OFFSET_MASK;
> >> >> +return (void *)ptr;

Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-10 Thread Yinghai Lu
On Mon, Dec 10, 2012 at 7:41 PM, H. Peter Anvin  wrote:
> On 12/10/2012 06:39 PM, Yinghai Lu wrote:
>>
>> No, you should not copy that several times.
>>
>> just pre-allocate some kbytes in BRK, and copy to there one time.
>>
>
> He doesn't copy it several times.  He just saves an offset into the
> initrd blob.

The ucode is bundled together with the initrd blob; the code scans that
blob and saves a pointer to the ucode. The BSP uses it, then the APs use
it, and after that, when the initrd gets freed, the ucode is copied
elsewhere.

His patch also missed that the initrd can get relocated on both 64-bit
and 32-bit, so the APs would not find the saved ucode.

After I pointed that out, he said he will update the pointer when
relocating the initrd for the APs.

And my suggestion is: after scanning and finding the ucode, save it to
the BRK, so there is no need to adjust the pointer again, and no need to
copy the blob and update the pointer again.

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] lpfc: init: fix misspelling word in mailbox command waiting comments

2012-12-10 Thread Ren Mingxin

On 12/11/2012 11:53 AM, re...@cn.fujitsu.com wrote:

From: Ren Mingxin


Superfluous, sorry for disturbing everyone :-(

Ren


[PATCH] lpfc: init: fix misspelling word in mailbox command waiting comments

2012-12-10 Thread renmx
From: Ren Mingxin 

Correct misspelling of "outstanding" in mailbox command waiting comments.

Signed-off-by: Ren Mingxin 
Signed-off-by: Pan Dayu 
---
 drivers/scsi/lpfc/lpfc_init.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 7dc4218..8533160 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -2566,7 +2566,7 @@ lpfc_block_mgmt_io(struct lpfc_hba *phba, int mbx_action)
}
spin_unlock_irqrestore(&phba->hbalock, iflag);
 
-   /* Wait for the outstnading mailbox command to complete */
+   /* Wait for the outstanding mailbox command to complete */
while (phba->sli.mbox_active) {
/* Check active mailbox complete status every 2ms */
msleep(2);
-- 
1.7.1



Re: [PATCH 3/4] regulator: s5m8767: Fix to work even if no DVS gpio present

2012-12-10 Thread Mark Brown
On Mon, Dec 10, 2012 at 06:19:41PM +0530, Amit Daniel Kachhap wrote:
> Signed-off-by: Amit Daniel Kachhap 

Applied, thanks.



Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 04:19 +0100, Andi Kleen wrote:
> On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote:
> > On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote:
> > > > Oh, it will be put back to the LRU list during migration. So does your
> > > > "some time" mean before calling check_new_page()?
> > > 
> > > Yes until the next check_new_page() whenever that is. If the migration
> > > works it will be earlier, otherwise later.
> > 
> > But I can't find any check in the page reclaim path for whether the page
> > has PG_hwpoison set; can poisoned pages be reclaimed?
> 
> The only way to reclaim a page is to free and reallocate it.

Then why isn't there a check in the reclaim path to avoid reclaiming a
poisoned page?

-Simon

> 
> -Andi




Re: [PATCH 2/4] regulator: s5m8767: Fix to read the first DVS register.

2012-12-10 Thread Mark Brown
On Mon, Dec 10, 2012 at 06:19:40PM +0530, Amit Daniel Kachhap wrote:
> This patch modifies the DVS register read function to select correct DVS1
> register. This change is required because the GPIO select pin is 000 in
> uninitialized state and hence selects the DVS1 register.

Applied, thanks.


resend--[PATCH] improve read ahead in kernel

2012-12-10 Thread xtu4

Resending due to a formatting error.

Subject: [PATCH] In a low-memory scenario, imagine an mp3 or a video is
 playing: we need to read the mp3 or video file from disk
 into the page cache, but when the system is short of memory,
 the page cache of the mp3 or video file will be reclaimed.
 Being read into memory and then reclaimed over and over
 causes audio or video glitches, and it increases I/O
 operations at the same time.

Signed-off-by: xiaobing tu 
---
 include/linux/mm_types.h |4 
 mm/filemap.c |4 
 mm/readahead.c   |   20 +---
 mm/vmscan.c  |   10 --
 4 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5b42f1b..2739995 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -149,6 +149,10 @@ struct page {
  */
 void *shadow;
 #endif
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+unsigned int ioprio;
+#endif
+
 }
 /*
  * If another subsystem starts using the double word pairing for atomic
diff --git a/mm/filemap.c b/mm/filemap.c
index a0701e6..ca3a3e8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -129,6 +129,10 @@ void __delete_from_page_cache(struct page *page)
 page->mapping = NULL;
 /* Leave page->index set: truncation lookup relies upon it */
 mapping->nrpages--;
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+page->ioprio = 0;
+#endif
+
 __dec_zone_page_state(page, NR_FILE_PAGES);
 if (PageSwapBacked(page))
 __dec_zone_page_state(page, NR_SHMEM);
diff --git a/mm/readahead.c b/mm/readahead.c
index cbcbb02..5c2d2ff 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -159,6 +159,11 @@ __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,

 int page_idx;
 int ret = 0;
 loff_t isize = i_size_read(inode);
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+int class = 0;
+if (current->io_context)
+class = IOPRIO_PRIO_CLASS(current->io_context->ioprio);
+#endif

 if (isize == 0)
 goto out;
@@ -177,12 +182,21 @@ __do_page_cache_readahead(struct address_space 
*mapping, struct file *filp,

 rcu_read_lock();
 page = radix_tree_lookup(&mapping->page_tree, page_offset);
 rcu_read_unlock();
-if (page)
-continue;
-
+if (page) {
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+if (class == IOPRIO_CLASS_RT)
+page->ioprio = 1;
+#endif
+continue;
+}
 page = page_cache_alloc_readahead(mapping);
 if (!page)
 break;
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+if (class == IOPRIO_CLASS_RT)
+page->ioprio = 1;
+#endif
+
 page->index = page_offset;
 list_add(&page->lru, &page_pool);
 if (page_idx == nr_to_read - lookahead_size)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 753a2dc..0a1cae8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -728,8 +728,14 @@ static enum page_references 
page_check_references(struct page *page,

 }

 /* Reclaim if clean, defer dirty pages to writeback */
-if (referenced_page && !PageSwapBacked(page))
-return PAGEREF_RECLAIM_CLEAN;
+if (referenced_page && !PageSwapBacked(page)) {
+#ifdef CONFIG_LOWMEMORY_READAHEAD
+if (page->ioprio == 1) {
+return PAGEREF_ACTIVATE;
+} else
+#endif
+return PAGEREF_RECLAIM_CLEAN;
+}

 return PAGEREF_RECLAIM;
 }
--
1.7.6



Re: [PATCH 1/4] regulator: s5m8767: Fix to work when platform registers less regulators

2012-12-10 Thread Mark Brown
On Mon, Dec 10, 2012 at 06:19:39PM +0530, Amit Daniel Kachhap wrote:
> Signed-off-by: Amit Daniel Kachhap 

Applied, thanks.


Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-10 Thread H. Peter Anvin
On 12/10/2012 06:39 PM, Yinghai Lu wrote:
> 
> No, you should not copy that several times.
> 
> just pre-allocate some kbytes in BRK, and copy to there one time.
> 

He doesn't copy it several times.  He just saves an offset into the
initrd blob.

-hpa




Re: [PATCH 00/12] Refactoring the ab8500 battery management drivers

2012-12-10 Thread Anton Vorontsov
On Fri, Nov 30, 2012 at 11:57:22AM +, Lee Jones wrote:
> The aim of this and subsequent patch-sets is to refactor battery
> management services provided by the ab8500 MFD. This first patch-set
> brings a few modifications to the collection which happened on the
> internal kernel, but were never Mainlined. There are lots more of
> these to come. We also tidy-up some of the Device Tree related patches
> which are currently pending in -next.

It fails to apply...

  Applying: ab8500_charger: Charger current step-up/down
  Applying: ab8500_fg: Don't clear the CCMuxOffset bit
  Applying: ab8500_btemp: Detect battery type in workqueue
  Applying: ab8500_btemp: Fix crazy tabbing implementation
  fatal: sha1 information is lacking or useless (drivers/power/ab8500_bmdata.c).
  Repository lacks necessary blobs to fall back on 3-way merge.
  Cannot fall back to three-way merge.
  Patch failed at 0001 ab8500_btemp: Fix crazy tabbing implementation

I have tried battery tree (as of eba3b670a9166a91be5a, Nov 18), I have
tried pristine Linus' tree, and I have tried linux-next. All failed in
different places.

I have tried to apply the "ab8500_btemp: Fix crazy tabbing implementation"
manually (which applied with fuzz), but then the other patches failed. So
I gave up.

What is the base of the patches?

Looking at the patch...

  diff --git a/drivers/power/ab8500_bmdata.c b/drivers/power/ab8500_bmdata.c
  index f16b60c..2623b16 100644
  --- a/drivers/power/ab8500_bmdata.c
  +++ b/drivers/power/ab8500_bmdata.c

There is really no f16b60c object in any tree, which makes me think that
you use some private tree.

Thanks,
Anton.


Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> "There are not so many free pages in a typical server system", sorry I don't
> quite understand it.

Linux tries to keep most memory in caches. As Linus says "free memory is
bad memory"

>
> buffered_rmqueue()
>   prep_new_page()
>   check_new_page()
>   bad_page()
> 
> If we alloc 2^10 pages and one of them is a poisoned page, then the whole 4M
> memory will be dropped.

prep_new_page() is only called on whatever is allocated.
MAX_ORDER is much smaller than 2^10

If you allocate a large order page then yes the complete page is
dropped. This is today generally true in hwpoison. It would be one
possible area of improvement (probably mostly if 1GB pages become
more common than they are today)

It's usually not a problem because usually most allocations are
small order and systems have generally very few memory errors,
and even the largest MAX_ORDER pages are a small fraction of the 
total memory.

If you lose larger amounts of memory usually you quickly hit something
that HWPoison cannot handle.

-Andi


[PATCH RESEND RESEND] net: remove obsolete simple_strto

2012-12-10 Thread Abhijit Pawar
This patch replaces the obsolete simple_strto* calls with their kstrto* equivalents.

Signed-off-by: Abhijit Pawar 
---
 net/core/netpoll.c |6 --
 net/ipv4/netfilter/ipt_CLUSTERIP.c |9 +++--
 net/mac80211/debugfs_sta.c |4 +++-
 net/netfilter/nf_conntrack_core.c  |6 --
 4 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 77a0388..3151acf 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -674,7 +674,8 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
if ((delim = strchr(cur, '@')) == NULL)
goto parse_failed;
*delim = 0;
-   np->local_port = simple_strtol(cur, NULL, 10);
+   if (kstrtou16(cur, 10, &np->local_port))
+   goto parse_failed;
cur = delim;
}
cur++;
@@ -705,7 +706,8 @@ int netpoll_parse_options(struct netpoll *np, char *opt)
*delim = 0;
if (*cur == ' ' || *cur == '\t')
np_info(np, "warning: whitespace is not allowed\n");
-   np->remote_port = simple_strtol(cur, NULL, 10);
+   if (kstrtou16(cur, 10, &np->remote_port))
+   goto parse_failed;
cur = delim;
}
cur++;
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c 
b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index fe5daea..75e33a7 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -661,6 +661,7 @@ static ssize_t clusterip_proc_write(struct file *file, 
const char __user *input,
 #define PROC_WRITELEN  10
char buffer[PROC_WRITELEN+1];
unsigned long nodenum;
+   int rc;
 
if (size > PROC_WRITELEN)
return -EIO;
@@ -669,11 +670,15 @@ static ssize_t clusterip_proc_write(struct file *file, 
const char __user *input,
buffer[size] = 0;
 
if (*buffer == '+') {
-   nodenum = simple_strtoul(buffer+1, NULL, 10);
+   rc = kstrtoul(buffer+1, 10, &nodenum);
+   if (rc)
+   return rc;
if (clusterip_add_node(c, nodenum))
return -ENOMEM;
} else if (*buffer == '-') {
-   nodenum = simple_strtoul(buffer+1, NULL,10);
+   rc = kstrtoul(buffer+1, 10, &nodenum);
+   if (rc)
+   return rc;
if (clusterip_del_node(c, nodenum))
return -ENOENT;
} else
diff --git a/net/mac80211/debugfs_sta.c b/net/mac80211/debugfs_sta.c
index 49a1c70..6fb1168 100644
--- a/net/mac80211/debugfs_sta.c
+++ b/net/mac80211/debugfs_sta.c
@@ -220,7 +220,9 @@ static ssize_t sta_agg_status_write(struct file *file, 
const char __user *userbu
} else
return -EINVAL;
 
-   tid = simple_strtoul(buf, NULL, 0);
+   ret = kstrtoul(buf, 0, &tid);
+   if (ret)
+   return ret;
 
if (tid >= IEEE80211_NUM_TIDS)
return -EINVAL;
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index af17516..08cdc71 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1409,7 +1409,7 @@ EXPORT_SYMBOL_GPL(nf_ct_alloc_hashtable);
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp)
 {
-   int i, bucket;
+   int i, bucket, rc;
unsigned int hashsize, old_size;
struct hlist_nulls_head *hash, *old_hash;
struct nf_conntrack_tuple_hash *h;
@@ -1422,7 +1422,9 @@ int nf_conntrack_set_hashsize(const char *val, struct 
kernel_param *kp)
if (!nf_conntrack_htable_size)
return param_set_uint(val, kp);
 
-   hashsize = simple_strtoul(val, NULL, 0);
+   rc = kstrtouint(val, 0, &hashsize);
+   if (rc)
+   return rc;
if (!hashsize)
return -EINVAL;
 
-- 
1.7.7.6



Re: [PATCH RESEND] net: remove obsolete simple_strto

2012-12-10 Thread Abhijit Pawar
On 12/11/2012 12:40 AM, David Miller wrote:
> From: Abhijit Pawar 
> Date: Mon, 10 Dec 2012 14:42:28 +0530
> 
>> This patch replace the obsolete simple_strto with kstrto
>>
>> Signed-off-by: Abhijit Pawar 
> 
> Applied.
> 
Hi David,
It seems that there are still occurrences of simple_strto* present in a
couple of files which were not converted correctly by this patch. I
will send a modified patch shortly. Please revert this commit and use
the newly sent patch to merge with the tree.

-- 
-
Abhijit


Re: [PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-10 Thread Tang Chen

On 12/11/2012 11:07 AM, Jianguo Wu wrote:

On 2012/12/11 10:33, Tang Chen wrote:


This patch introduces a new array zone_movable_limit[] to store the
ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
The function sanitize_zone_movable_limit() will find out to which
node each range in movablecore_map.map[] belongs, and calculate the
low boundary of ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen
Signed-off-by: Jiang Liu
Reviewed-by: Wen Congyang
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
---
  mm/page_alloc.c |   77 +++
  1 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c91d16..4853619 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,6 +206,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
  static unsigned long __initdata required_kernelcore;
  static unsigned long __initdata required_movablecore;
  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];

  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
  int movable_zone;
@@ -4340,6 +4341,77 @@ static unsigned long __meminit 
zone_absent_pages_in_node(int nid,
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
  }

+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit is initialized as 0. This function will try to get
+ * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * assign them to zone_movable_limit.
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Each range is represented as [start_pfn, end_pfn)
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+   int map_pos = 0, i, nid;
+   unsigned long start_pfn, end_pfn;
+
+   if (!movablecore_map.nr_map)
+   return;
+
+   /* Iterate all ranges from minimum to maximum */
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+   /*
+* If we have found lowest pfn of ZONE_MOVABLE of the node
+* specified by user, just go on to check next range.
+*/
+   if (zone_movable_limit[nid])
+   continue;
+
+#ifdef CONFIG_ZONE_DMA
+   /* Skip DMA memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+   /* Skip DMA32 memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
+#endif
+
+#ifdef CONFIG_HIGHMEM
+   /* Skip lowmem if ZONE_MOVABLE is highmem. */
+   if (zone_movable_is_highmem() &&


Hi Tang,

I think zone_movable_is_highmem() does not work correctly here.
sanitize_zone_movable_limit
zone_movable_is_highmem<--using movable_zone here
find_zone_movable_pfns_for_nodes
find_usable_zone_for_movable<--movable_zone is specified here

I think Jiang Liu's patch works fine for highmem, please refer to:
http://marc.info/?l=linux-mm&m=135476085816087&w=2


Hi Wu,

Yes, I forgot about movable_zone, I think. Thanks for reminding me. :)

As for Liu's patch that you just mentioned, I didn't use it because I
don't think we should skip kernelcore when movablecore_map is specified.
If these two options don't conflict, we should satisfy them both. :)

Of course, I also think Liu's suggestion is wonderful. But I think we
need more discussion on it. :)

I'll fix it soon.
Thanks. :)



Thanks,
Jianguo Wu


+   start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+   start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+
+   if (start_pfn >= end_pfn)
+   continue;
+
+   while (map_pos < movablecore_map.nr_map) {
+   if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
+   break;
+
+   if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
+   map_pos++;
+   continue;
+   }
+
+   /*
+* The start_pfn of ZONE_MOVABLE is either the minimum
+* pfn specified by movablecore_map, or 0, which means
+* the node has no ZONE_MOVABLE.
+*/
+   zone_movable_limit[nid] = max(start_pfn,
+   movablecore_map.map[map_pos].start_pfn);
+
+   break;
+   }
+   }
+}
+
  #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
  static inline unsigned lon

[PATCH] Fix Irq Subsystem menu

2012-12-10 Thread Paul Thompson
Hi;

In menuconfig, General setup -> Irq subsystem contains two
possible menu-items. Sometimes, neither menu-item exists. This
patch prevents the Irq subsystem menu from appearing at all unless it
will contain at least one menu-item, preventing a confusing, empty menu.

--- linux-3.7-rc8/kernel/irq/Kconfig.orig   2012-12-05 20:59:00.963707538 
-0500
+++ linux-3.7-rc8/kernel/irq/Kconfig2012-12-05 21:00:18.454788693 -0500
@@ -3,7 +3,6 @@ config HAVE_GENERIC_HARDIRQS
bool
 
 if HAVE_GENERIC_HARDIRQS
-menu "IRQ subsystem"
 #
 # Interrupt subsystem related configuration options
 #
@@ -56,6 +55,13 @@ config GENERIC_IRQ_CHIP
 config IRQ_DOMAIN
bool
 
+# Support forced irq threading
+config IRQ_FORCED_THREADING
+   bool
+
+menu "IRQ subsystem"
+   depends on ( IRQ_DOMAIN && DEBUG_FS ) || MAY_HAVE_SPARSE_IRQ
+
 config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
@@ -66,10 +72,6 @@ config IRQ_DOMAIN_DEBUG
 
  If you don't know what this means you don't need it.
 
-# Support forced irq threading
-config IRQ_FORCED_THREADING
-   bool
-
 config SPARSE_IRQ
bool "Support sparse irq numbering" if MAY_HAVE_SPARSE_IRQ
---help---

Signed-off-by: Paul Thompson 



Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/11 10:58, Andi Kleen wrote:

>> That sounds like overkill. There are not so many free pages in a
>> typical server system.
> 
> As Fengguang said -- memory error handling is tricky. Lots of things
> could be done in theory, but they all have a cost in testing and 
> maintenance. 
> 
> In general they are only worth doing if the situation is common and
> represents a significant percentage of the total pages of a relevant server
> workload.
> 
> -Andi
> 

Hi Andi and Fengguang,

"There are not so many free pages in a typical server system" -- sorry,
I don't quite understand that.

buffered_rmqueue()
prep_new_page()
check_new_page()
bad_page()

If we alloc 2^10 pages and one of them is a poisoned page, then the whole 4M
memory will be dropped.

Thanks,
Xishi Qiu



Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
On Mon, Dec 10, 2012 at 09:13:11PM -0600, Simon Jeons wrote:
> On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote:
> > > Oh, it will be putback to lru list during migration. So does your "some
> > > time" mean before call check_new_page?
> > 
> > Yes until the next check_new_page() whenever that is. If the migration
> > works it will be earlier, otherwise later.
> 
> But I can't find any page reclaim path that checks whether the page has
> PG_hwpoison set; can poisoned pages be reclaimed?

The only way to reclaim a page is to free and reallocate it.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 04:01 +0100, Andi Kleen wrote:
> > Oh, it will be putback to lru list during migration. So does your "some
> > time" mean before call check_new_page?
> 
> Yes until the next check_new_page() whenever that is. If the migration
> works it will be earlier, otherwise later.

But I can't find any page reclaim path that checks whether the page has
PG_hwpoison set; can poisoned pages be reclaimed?

-Simon

> 
> -andi




[PATCH 1/1] media: saa7146: don't use mutex_lock_interruptible() in device_release().

2012-12-10 Thread Cyril Roelandt
Use uninterruptible mutex_lock in the release() file op to make sure all
resources are properly freed when a process is being terminated. Returning
-ERESTARTSYS has no effect for a terminating process and this may cause driver
resources not to be released.

This was found using the following semantic patch (http://coccinelle.lip6.fr/):


@r@
identifier fops;
identifier release_func;
@@
static const struct v4l2_file_operations fops = {
.release = release_func
};

@depends on r@
identifier r.release_func;
expression E;
@@
static int release_func(...)
{
...
- if (mutex_lock_interruptible(E)) return -ERESTARTSYS;
+ mutex_lock(E);
...
}


Signed-off-by: Cyril Roelandt 
---
 drivers/media/common/saa7146/saa7146_fops.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/media/common/saa7146/saa7146_fops.c 
b/drivers/media/common/saa7146/saa7146_fops.c
index b3890bd..0afe98d 100644
--- a/drivers/media/common/saa7146/saa7146_fops.c
+++ b/drivers/media/common/saa7146/saa7146_fops.c
@@ -265,8 +265,7 @@ static int fops_release(struct file *file)
 
DEB_EE("file:%p\n", file);
 
-   if (mutex_lock_interruptible(vdev->lock))
-   return -ERESTARTSYS;
+   mutex_lock(vdev->lock);
 
if (vdev->vfl_type == VFL_TYPE_VBI) {
if (dev->ext_vv_data->capabilities & V4L2_CAP_VBI_CAPTURE)
-- 
1.7.10.4



[PATCH 0/1] media: saa7146: don't use mutex_lock_interruptible in

2012-12-10 Thread Cyril Roelandt
This is the same kind of bug as the one fixed by
ddc43d6dc7df0849fe41b91460fa76145cf87b67 : mutex_lock() must be used in the
device_release file operation in order for all resources to be freed, since
returning -ERESTARTSYS has no effect here.

I stole the commit log from Sylwester Nawrocki, who fixed a few of these issues,
since I could not formulate it better.

---

Cyril Roelandt (1):
  media: saa7146: don't use mutex_lock_interruptible() in
device_release().

 drivers/media/common/saa7146/saa7146_fops.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

-- 
1.7.10.4



Re: [PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-10 Thread Jianguo Wu
On 2012/12/11 10:33, Tang Chen wrote:

> This patch introduces a new array zone_movable_limit[] to store the
> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
> The function sanitize_zone_movable_limit() will find out to which
> node each range in movablecore_map.map[] belongs, and calculate the
> low boundary of ZONE_MOVABLE for each node.
> 
> Signed-off-by: Tang Chen 
> Signed-off-by: Jiang Liu 
> Reviewed-by: Wen Congyang 
> Reviewed-by: Lai Jiangshan 
> Tested-by: Lin Feng 
> ---
>  mm/page_alloc.c |   77 
> +++
>  1 files changed, 77 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1c91d16..4853619 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -206,6 +206,7 @@ static unsigned long __meminitdata 
> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>  static unsigned long __initdata required_kernelcore;
>  static unsigned long __initdata required_movablecore;
>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
>  
>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>  int movable_zone;
> @@ -4340,6 +4341,77 @@ static unsigned long __meminit 
> zone_absent_pages_in_node(int nid,
>   return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
>  }
>  
> +/**
> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
> + *
> + * zone_movable_limit is initialized as 0. This function will try to get
> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
> + * assign them to zone_movable_limit.
> + * zone_movable_limit[nid] == 0 means no limit for the node.
> + *
> + * Note: Each range is represented as [start_pfn, end_pfn)
> + */
> +static void __meminit sanitize_zone_movable_limit(void)
> +{
> + int map_pos = 0, i, nid;
> + unsigned long start_pfn, end_pfn;
> +
> + if (!movablecore_map.nr_map)
> + return;
> +
> + /* Iterate all ranges from minimum to maximum */
> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> + /*
> +  * If we have found lowest pfn of ZONE_MOVABLE of the node
> +  * specified by user, just go on to check next range.
> +  */
> + if (zone_movable_limit[nid])
> + continue;
> +
> +#ifdef CONFIG_ZONE_DMA
> + /* Skip DMA memory. */
> + if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
> + start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
> +#endif
> +
> +#ifdef CONFIG_ZONE_DMA32
> + /* Skip DMA32 memory. */
> + if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
> + start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
> +#endif
> +
> +#ifdef CONFIG_HIGHMEM
> + /* Skip lowmem if ZONE_MOVABLE is highmem. */
> + if (zone_movable_is_highmem() &&

Hi Tang,

I think zone_movable_is_highmem() does not work correctly here.
sanitize_zone_movable_limit
zone_movable_is_highmem  <--using movable_zone here
find_zone_movable_pfns_for_nodes
find_usable_zone_for_movable <--movable_zone is specified here

I think Jiang Liu's patch works fine for highmem, please refer to:
http://marc.info/?l=linux-mm&m=135476085816087&w=2

Thanks,
Jianguo Wu

> + start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
> + start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
> +#endif
> +
> + if (start_pfn >= end_pfn)
> + continue;
> +
> + while (map_pos < movablecore_map.nr_map) {
> + if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
> + break;
> +
> + if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
> + map_pos++;
> + continue;
> + }
> +
> + /*
> +  * The start_pfn of ZONE_MOVABLE is either the minimum
> +  * pfn specified by movablecore_map, or 0, which means
> +  * the node has no ZONE_MOVABLE.
> +  */
> + zone_movable_limit[nid] = max(start_pfn,
> + movablecore_map.map[map_pos].start_pfn);
> +
> + break;
> + }
> + }
> +}
> +
>  #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
>   unsigned long zone_type,
> @@ -4358,6 +4430,10 @@ static inline unsigned long __meminit 
> zone_absent_pages_in_node(int nid,
>   return zholes_size[zone_type];
>  }
>  
> +static void __meminit sanitize_zone_movable_limit(v

Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> Oh, it will be putback to lru list during migration. So does your "some
> time" mean before call check_new_page?

Yes until the next check_new_page() whenever that is. If the migration
works it will be earlier, otherwise later.

-andi


Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> That sounds like overkill. There are not so many free pages in a
> typical server system.

As Fengguang said -- memory error handling is tricky. Lots of things
could be done in theory, but they all have a cost in testing and 
maintenance. 

In general they are only worth doing if the situation is common and
represents a significant percentage of the total pages of a relevant server
workload.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.


Re: Serial8250 doesn't populate in /proc/iomem?

2012-12-10 Thread Bjorn Helgaas
[+cc linux-arm, linux-samsung-soc, linux-serial]

On Sun, Dec 9, 2012 at 11:25 PM, Woody Wu  wrote:
> Hi, list
>
> I found some I/O memory information is missing from /proc/iomem and want
> to find out why.
>
> I have a 2.6.16 kernel running on a ARM board (Samsung S3C2410). From
> the kernel log, I see 16 8250 serial ports were detected, and each of
> those ports has a memory address:
>
> Serial: 8250/16550 driver $Revision: 1.90 $ 16 ports, IRQ sharing enabled
> serial8250: tts0 at MMIO 0xe140 (irq = 16) is a 16550A
> serial8250: tts1 at MMIO 0xe148 (irq = 16) is a 16550A
> serial8250: tts2 at MMIO 0xe1400010 (irq = 17) is a 16550A
> serial8250: tts3 at MMIO 0xe1400018 (irq = 17) is a 16550A
> serial8250: tts4 at MMIO 0xe1400020 (irq = 18) is a 16550A
> serial8250: tts5 at MMIO 0xe1400028 (irq = 18) is a 16550A
> serial8250: tts6 at MMIO 0xe1400030 (irq = 19) is a 16550A
> serial8250: tts7 at MMIO 0xe1400038 (irq = 19) is a 16550A
> serial8250: tts8 at MMIO 0xe1400040 (irq = 48) is a 16550A
> serial8250: tts9 at MMIO 0xe1400048 (irq = 48) is a 16550A
> serial8250: tts10 at MMIO 0xe1400050 (irq = 49) is a 16550A
> serial8250: tts11 at MMIO 0xe1400058 (irq = 49) is a 16550A
> serial8250: tts12 at MMIO 0xe1400060 (irq = 60) is a 16550A
> serial8250: tts13 at MMIO 0xe1400068 (irq = 60) is a 16550A
> serial8250: tts14 at MMIO 0xe1400070 (irq = 61) is a 16550A
> serial8250: tts15 at MMIO 0xe1400078 (irq = 61) is a 16550A
>
> I can read/write these serial ports from /dev/ttys*, in other words,
> they do exist.  I also can find the driver info from /proc/devices:
>
> Character devices:
>
>   ...
>   4 /dev/vc/0
>   4 tty
>   4 tts
>   5 /dev/tty
>   5 /dev/console
>   5 /dev/ptmx
>   7 vcs
>   ...
>
> The problem is, I don't understand why there is no information about
> these ports in /proc/iomem file. The 'iomem' file now contains:
>
> 1100-11000ffe : AX88796B
> 1900-19000ffe : AX88796B
> 3000-33ff : System RAM
>   3001c000-301e826b : Kernel text
>   301ea000-302234a3 : Kernel data
> 4900-490f : s3c2410-ohci
>   4900-490f : ohci_hcd
> 4e00-4e0f : s3c2410-nand
>   4e00-4e0f : s3c2410-nand
> 5000-50003fff : s3c2410-uart.0
>   5000-50ff : s3c2410-uart
> 50004000-50007fff : s3c2410-uart.1
>   50004000-500040ff : s3c2410-uart
> 50008000-5000bfff : s3c2410-uart.2
>   50008000-500080ff : s3c2410-uart
> 5300-530f : s3c2410-wdt
> 5400-540f : s3c2410-i2c
> 5a00-5a0f : s3c2410-sdi
>   5a00-5a0f : mmci-s3c2410
>
> You see, there is no serial8250 informations.
>
> Can anyone here please tell me how this can happen?  Does it mean the
> serial8250 driver doesn't populate or register in /proc/iomem?

That looks like a bug to me.  There should be entries in /proc/iomem
for the hardware device (showing that something responds at that
address) and for the driver that claims the device.

I think the 8250 core does the reservation in
serial8250_request_std_resource().  You could try putting some printks
in that path to see whether it's exercised.

You're running a fairly old kernel, so it's possible the bug has
already been fixed.

Bjorn


Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Fengguang Wu
On Tue, Dec 11, 2012 at 10:25:00AM +0800, Xishi Qiu wrote:
> On 2012/12/10 23:38, Andi Kleen wrote:
> 
> >> It is another topic; I mean, since the page is poisoned, why not
> >> isolate it from the page buddy allocator in soft_offline_page()
> >> rather than in check_new_page().
> >> I find soft_offline_page() only migrates the page and marks it
> >> HWPoison; the poisoned page is still managed by the page buddy
> >> allocator.
> > 
> > Doing it in check_new_page is the only way if the page is currently
> > allocated by someone. Since that's not uncommon it's simplest to always
> > do it this way.
> > 
> > -Andi
> > 
> 
> Hi Andi,
> 
> The poisoned page is isolated in check_new_page, however the whole buddy 
> block will
> be dropped, it seems to be a waste of memory.
> 
> Can we separate the poisoned page from the buddy block, then *only* drop the 
> poisoned
> page?

That sounds like overkill. There are not so many free pages in a
typical server system.

Thanks,
Fengguang


Re: [RFC v3] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
Sorry, resending with the compile error fixed. :(

From 0cfd3b65e4e90ab59abe8a337334414f92423cad Mon Sep 17 00:00:00 2001
From: Minchan Kim 
Date: Tue, 11 Dec 2012 11:38:30 +0900
Subject: [RFC v3] Support volatile range for anon vma

This is still [RFC v3] because it has only passed my simple test
with TCMalloc tweaking.

I hope for more input from user-space allocator people, and for tests
of this patch with their allocators, because it might need a design
change in arena management to get real value.

Changelog from v2

 * Removing madvise(addr, length, MADV_NOVOLATILE).
 * add vmstat about the number of discarded volatile pages
 * discard volatile pages without promotion in reclaim path

This is based on v3.6.

- What's the madvise(addr, length, MADV_VOLATILE)?

  It's a hint the user delivers to the kernel so that the kernel can
  *discard* pages in the range at any time.

- What happens if the user accesses a page (i.e., virtual address)
  discarded by the kernel?

  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).

- What happens if the user accesses a page (i.e., virtual address)
  that wasn't discarded by the kernel?

  The user sees the old data without a page fault.

- What's different from madvise(DONTNEED)?

  System call semantics

  DONTNEED makes sure the user always sees zero-fill pages after the
  call, while with VOLATILE he may see either zero-fill pages or the
  old data.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead increases linearly with the number of mapped pages. Worse,
  if the user then writes to a zapped page, a page fault, a page
  allocation, and a memset all happen.

  madvise(VOLATILE) only marks a flag in the range (i.e., the VMA).
  It doesn't touch the pages at all, so the overhead of the system call
  is very small. Under memory pressure, the VM can discard pages in
  VMAs marked VOLATILE. If the user writes to an address whose page the
  VM has discarded, he sees zero-fill pages, so the cost is the same as
  DONTNEED; but if memory pressure isn't severe, the user sees the old
  data without (page fault + page allocation + memset).

  The VOLATILE mark is removed in the page fault handler when the first
  page fault occurs in a marked VMA, so subsequent page faults follow
  the normal path. That's why the user doesn't need a
  madvise(MADV_NOVOLATILE) interface.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because VOLATILE just marks
     a flag on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate overheads (e.g., page fault +
     page allocation + memset(PAGE_SIZE)).

- Isn't there any drawback?

  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page
  faults from other threads are allowed. But VOLATILE needs exclusive
  mmap_sem, so other threads will block if they try to access
  not-mapped pages. That's why I designed madvise(VOLATILE)'s overhead
  to be as small as possible.

  Another concern with exclusive mmap_sem is the page fault that occurs
  in a VOLATILE-marked VMA. We have to remove the flag from the VMA and
  merge adjacent VMAs, which needs exclusive mmap_sem. This can slow
  down page fault handling and prevent concurrent page faults. But such
  handling is needed only once, on the first page fault after we mark a
  VMA VOLATILE, and only if memory pressure has caused pages to be
  discarded. So it wouldn't be common, and the benefit we get from this
  feature should be bigger than the loss.

- What is this targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, or the
  heap management of virtual machines like Dalvik. It also comes in
  handy for embedded systems that have no swap device and so can't
  reclaim anonymous pages. By discarding instead of swapping, it can
  be used on non-swap systems. For that, we have to age the anon LRU
  list even though we have no swap, because I don't want to discard
  volatile pages at top priority when memory pressure happens:
  volatile in this patch means "we don't need to swap out because the
  user can handle data disappearing suddenly", NOT "they are useless,
  so hurry up and reclaim them". So I want to apply the same aging
  rule as for normal pages to them.

  Background aging of anonymous pages on a non-swap system would be
  the trade-off for getting this feature. We had even done that until
  two years ago, when [1] was merged, and I believe the gain from this
  patch will beat the loss from anon LRU aging overhead once all the
  allocators start to use madvise.
  (This patch doesn't include background aging for the non-swap case,
  but it's trivial to add if we decide to.)

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system

Cc: Michael Kerrisk 
Cc: Arun Sharma 
Cc: san...@google.com
Cc: Paul Turner 
CC: David Rientjes 
Cc: John Stultz 
Cc: Andrew Morton 
Cc: Christoph Lameter 
Cc: Android Kernel Team 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: KOSAKI Motohiro 
Cc: Ch

Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-10 Thread Yinghai Lu
On Mon, Dec 3, 2012 at 4:18 PM, Yu, Fenghua  wrote:
>> From: yhlu.ker...@gmail.com [mailto:yhlu.ker...@gmail.com] On Behalf Of
>> Yinghai Lu
>>
>> may need to copy the ucode.bin to the BRK first. That will make the
>> code much simpler, and then there is no need to
>> copy it back in free_bootmem_initrd.
>>
>> aka, this patchset is not ready for 3.8 even.
>>
>
> I will relocate saved microcode blob (mc_saved_in_initrd) after initrd is
> relocated in updated patches. Thus, mc_saved_in_initrd always points to
> the right initrd during boot time.

No, you should not copy that several times.

just pre-allocate a few kbytes in the BRK, and copy there one time.

Yinghai


Re: TIP tree's master branch failed to boot up

2012-12-10 Thread Michael Wang
On 12/11/2012 10:34 AM, H. Peter Anvin wrote:
> On 12/10/2012 06:22 PM, Michael Wang wrote:
>> On 12/11/2012 01:02 AM, H. Peter Anvin wrote:
>>> On 12/09/2012 08:50 PM, Michael Wang wrote:
 Hi, Folks

 I'm testing with the latest tip tree's master branch 3.7.0-rc8 and
 failed to boot up my server; it hangs at the very beginning and I could
 not catch any useful log. Has anyone else hit this problem, or am I the
 only one?

 Regards,
 Michael Wang

>>>
>>> 32 or 64 bits?
>>
>> 64 bits.
>>
> 
> Thanks.  We're working on that patchset and have found a bunch of
> issues, hopefully we can get them resolved very soon.

I see, let me know if you need more info :)

Regards,
Michael Wang

> 
>   -hpa
> 
> 



[PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-10 Thread Tang Chen
This patch introduces a new array zone_movable_limit[] to store the
ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
The function sanitize_zone_movable_limit() will find out to which
node each range in movablecore_map.map[] belongs, and calculate the
low boundary of ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen 
Signed-off-by: Jiang Liu 
Reviewed-by: Wen Congyang 
Reviewed-by: Lai Jiangshan 
Tested-by: Lin Feng 
---
 mm/page_alloc.c |   77 +++
 1 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c91d16..4853619 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -206,6 +206,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4340,6 +4341,77 @@ static unsigned long __meminit 
zone_absent_pages_in_node(int nid,
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
 }
 
+/**
+ * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
+ *
+ * zone_movable_limit is initialized as 0. This function will try to get
+ * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
+ * assign them to zone_movable_limit.
+ * zone_movable_limit[nid] == 0 means no limit for the node.
+ *
+ * Note: Each range is represented as [start_pfn, end_pfn)
+ */
+static void __meminit sanitize_zone_movable_limit(void)
+{
+   int map_pos = 0, i, nid;
+   unsigned long start_pfn, end_pfn;
+
+   if (!movablecore_map.nr_map)
+   return;
+
+   /* Iterate all ranges from minimum to maximum */
+   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+   /*
+* If we have found lowest pfn of ZONE_MOVABLE of the node
+* specified by user, just go on to check next range.
+*/
+   if (zone_movable_limit[nid])
+   continue;
+
+#ifdef CONFIG_ZONE_DMA
+   /* Skip DMA memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
+#endif
+
+#ifdef CONFIG_ZONE_DMA32
+   /* Skip DMA32 memory. */
+   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
+   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
+#endif
+
+#ifdef CONFIG_HIGHMEM
+   /* Skip lowmem if ZONE_MOVABLE is highmem. */
+   if (zone_movable_is_highmem() &&
+   start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
+   start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
+#endif
+
+   if (start_pfn >= end_pfn)
+   continue;
+
+   while (map_pos < movablecore_map.nr_map) {
+   if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
+   break;
+
+   if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
+   map_pos++;
+   continue;
+   }
+
+   /*
+* The start_pfn of ZONE_MOVABLE is either the minimum
+* pfn specified by movablecore_map, or 0, which means
+* the node has no ZONE_MOVABLE.
+*/
+   zone_movable_limit[nid] = max(start_pfn,
+   movablecore_map.map[map_pos].start_pfn);
+
+   break;
+   }
+   }
+}
+
 #else /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
@@ -4358,6 +4430,10 @@ static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
return zholes_size[zone_type];
 }
 
+static void __meminit sanitize_zone_movable_limit(void)
+{
+}
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
@@ -4923,6 +4999,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 
/* Find the PFNs that ZONE_MOVABLE begins at in each node */
memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+   sanitize_zone_movable_limit();
find_zone_movable_pfns_for_nodes();
 
/* Print out the zone ranges */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v3 5/5] page_alloc: Bootmem limit with movablecore_map

2012-12-10 Thread Tang Chen
This patch makes sure bootmem will not allocate memory from areas that
may become ZONE_MOVABLE. The map info comes from the movablecore_map boot option.

Signed-off-by: Tang Chen 
Reviewed-by: Wen Congyang 
Reviewed-by: Lai Jiangshan 
Tested-by: Lin Feng 
---
 include/linux/memblock.h |1 +
 mm/memblock.c|   18 +-
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d452ee1..6e25597 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -42,6 +42,7 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
+extern struct movablecore_map movablecore_map;
 
 #define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
diff --git a/mm/memblock.c b/mm/memblock.c
index 6259055..197c3be 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -101,6 +101,7 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
 {
phys_addr_t this_start, this_end, cand;
u64 i;
+   int curr = movablecore_map.nr_map - 1;
 
/* pump up @end */
if (end == MEMBLOCK_ALLOC_ACCESSIBLE)
@@ -114,13 +115,28 @@ phys_addr_t __init_memblock memblock_find_in_range_node(phys_addr_t start,
this_start = clamp(this_start, start, end);
this_end = clamp(this_end, start, end);
 
-   if (this_end < size)
+restart:
+   if (this_end <= this_start || this_end < size)
continue;
 
+   for (; curr >= 0; curr--) {
+   if ((movablecore_map.map[curr].start_pfn << PAGE_SHIFT)
+   < this_end)
+   break;
+   }
+
cand = round_down(this_end - size, align);
+   if (curr >= 0 &&
+   cand < movablecore_map.map[curr].end_pfn << PAGE_SHIFT) {
+   this_end = movablecore_map.map[curr].start_pfn
+  << PAGE_SHIFT;
+   goto restart;
+   }
+
if (cand >= this_start)
return cand;
}
+
return 0;
 }
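
The hunk above makes the top-down candidate search restart below any
movablecore_map range it collides with. For illustration only, here is a
self-contained userspace model of that idea (byte-based spans and all
names are mine, not the kernel's):

```c
#include <assert.h>

struct span { unsigned long long start, end; }; /* [start, end) in bytes */

/* Highest `align`-aligned candidate of `size` bytes inside [lo, hi) that
 * does not intersect any span in avoid[]; 0 means no fit. A simplified
 * sketch of the patched memblock_find_in_range_node() walking top-down. */
static unsigned long long find_top_down(unsigned long long lo,
                                        unsigned long long hi,
                                        unsigned long long size,
                                        unsigned long long align,
                                        const struct span *avoid, int n)
{
    while (hi > lo && hi - lo >= size) {
        unsigned long long cand = ((hi - size) / align) * align;
        int i, clipped = 0;

        if (cand < lo)
            break;
        /* If the candidate block overlaps a forbidden span, retry below it. */
        for (i = n - 1; i >= 0; i--) {
            if (avoid[i].start < cand + size && cand < avoid[i].end) {
                hi = avoid[i].start;
                clipped = 1;
                break;
            }
        }
        if (!clipped)
            return cand;
    }
    return 0;
}
```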
 
-- 
1.7.1



Re: TIP tree's master branch failed to boot up

2012-12-10 Thread H. Peter Anvin
On 12/10/2012 06:22 PM, Michael Wang wrote:
> On 12/11/2012 01:02 AM, H. Peter Anvin wrote:
>> On 12/09/2012 08:50 PM, Michael Wang wrote:
>>> Hi, Folks
>>>
>>> I'm testing with the latest tip tree's master branch 3.7.0-rc8 and
>>> failed to boot up my server, it's hung at very beginning and I could not
>>> catch any useful log, is there any one else got this problem or I'm the
>>> only one?.
>>>
>>> Regards,
>>> Michael Wang
>>>
>>
>> 32 or 64 bits?
> 
> 64 bits.
> 

Thanks.  We're working on that patchset and have found a bunch of
issues, hopefully we can get them resolved very soon.

-hpa




Re: TIP tree's master branch failed to boot up

2012-12-10 Thread Michael Wang
On 12/10/2012 06:54 PM, Ingo Molnar wrote:
> 
> I've Cc:-ed hpa, he merged the x86/microcode bits.
> 
> Michael Wang, I've excluded x86/microcode from the latest 
> version of the -tip tree which I just pushed out:
> 
>   4e00fd4c93e0 Merge branch 'x86/cleanups'
> 
> Can you confirm that your server boots now?

Yes, it works.

Regards,
Michael Wang

> 
> Thanks,
> 
>   Ingo
> 
> * Michael Wang  wrote:
> 
>> On 12/10/2012 12:50 PM, Michael Wang wrote:
>>> Hi, Folks
>>>
>>> I'm testing with the latest tip tree's master branch 3.7.0-rc8 and
>>> failed to boot up my server, it's hung at very beginning and I could not
>>> catch any useful log, is there any one else got this problem or I'm the
>>> only one?.
>>
>> And bisect catch below commit:
>>
>> commit 56e7dba100a50f674627a3764fd4da4a6ec93295
>> Merge: ea8432f 16544f8
>> Author: Ingo Molnar 
>> Date:   Fri Dec 7 12:13:11 2012 +0100
>>
>> Merge branch 'x86/microcode'
>>
>> Regards,
>> Michael Wang
>>
>>
>>>
>>> Regards,
>>> Michael Wang
>>>
>>
> 



[PATCH v3 0/5] Add movablecore_map boot option

2012-12-10 Thread Tang Chen
[What we are doing]
This patchset provides a boot option for the user to specify the
ZONE_MOVABLE memory map for each node in the system.

movablecore_map=nn[KMG]@ss[KMG]

This option makes sure the memory range from ss to ss+nn is movable memory.


[Why we do this]
If we hot remove memory, that memory must not contain kernel memory,
because Linux cannot migrate kernel memory currently. Therefore, we
have to guarantee that the hot removed memory contains only movable
memory.

Linux has two boot options, kernelcore= and movablecore=, for
creating movable memory. These boot options can specify the amount
of memory to use as kernel or movable memory. Using them, we can
create a ZONE_MOVABLE which has only movable memory.

But this does not fulfill the requirements of memory hot remove,
because even if we specify these boot options, movable memory is
distributed evenly across the nodes. So when we want to hot remove
memory whose range is 0x8000-0c000, we have no way to specify
that memory as movable memory.

So we propose a new feature which specifies a memory range to use as
movable memory.


[Ways to do this]
There may be 2 ways to specify movable memory.
 1. use firmware information
 2. use boot option

1. use firmware information
  According to the ACPI 5.0 spec, the SRAT table has a memory affinity
  structure, and the structure has a Hot Pluggable Field. See "5.2.16.2
  Memory Affinity Structure". Using this information, we might be able
  to specify movable memory via firmware. For example, if the Hot
  Pluggable Field is enabled, Linux sets the memory as movable memory.

2. use boot option
  This is our proposal. A new boot option can specify the memory range
  to use as movable memory.


[How we do this]
We chose the second way, because with the first way users cannot easily
change the memory range used as movable memory. We think that if we
create movable memory, a performance regression may occur due to NUMA.
In that case, the user can easily turn the feature off if we provide
the boot option. And with the boot option, the user can easily select
which memory to use as movable memory.


[How to use]
Specify the following boot option:
movablecore_map=nn[KMG]@ss[KMG]

That means physical address range from ss to ss+nn will be allocated as
ZONE_MOVABLE.

And the following points should be considered.

1) If the range falls within a single node, then from ss to the end of
   that node will be ZONE_MOVABLE.
2) If the range covers two or more nodes, then from ss to the end of
   the first node will be ZONE_MOVABLE, and all the other covered nodes
   will have only ZONE_MOVABLE.
3) If no range is in a node, then that node will have no ZONE_MOVABLE
   unless kernelcore or movablecore is specified.
4) This option can be specified at most MAX_NUMNODES times.
5) If kernelcore or movablecore is also specified, movablecore_map will
   have higher priority to be satisfied.
6) This option does not conflict with the memmap option.
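
As a quick illustration of the option format, here is a hedged userspace
sketch of how "nn[KMG]@ss[KMG]" could be parsed (a stand-in for the
kernel's memparse()-based parsing; the function names here are mine,
not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a size with an optional K/M/G suffix, e.g. "512M".
 * Hypothetical userspace stand-in for the kernel's memparse(). */
static unsigned long long parse_size(const char *s, const char **endp)
{
    char *end;
    unsigned long long v = strtoull(s, &end, 0);

    switch (*end) {
    case 'G': case 'g': v <<= 30; end++; break;
    case 'M': case 'm': v <<= 20; end++; break;
    case 'K': case 'k': v <<= 10; end++; break;
    }
    if (endp)
        *endp = (const char *)end;
    return v;
}

/* Parse "nn[KMG]@ss[KMG]" into a [start, start+size) byte range.
 * Returns 0 on success, -1 on a malformed argument. */
static int parse_movablecore_map(const char *arg,
                                 unsigned long long *start,
                                 unsigned long long *size)
{
    const char *p;

    *size = parse_size(arg, &p);
    if (*p != '@' || *size == 0)
        return -1;
    *start = parse_size(p + 1, &p);
    return *p == '\0' ? 0 : -1;
}
```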


Change log:

v2 -> v3:
1) Use memblock_alloc_try_nid() instead of memblock_alloc_nid() so the
   allocation is retried if a whole node is ZONE_MOVABLE.
2) Add DMA and DMA32 address checks to make sure ZONE_MOVABLE won't use
   these addresses. Suggested by Wu Jianguo 
3) Add a lowmem address check: when the system has highmem, make sure
   ZONE_MOVABLE won't use lowmem. Suggested by Liu Jiang 
4) Fix misuse of pfns in movablecore_map.map[] as physical addresses.

Tang Chen (4):
  page_alloc: add movable_memmap kernel parameter
  page_alloc: Introduce zone_movable_limit[] to keep movable limit for
nodes
  page_alloc: Make movablecore_map has higher priority
  page_alloc: Bootmem limit with movablecore_map

Yasuaki Ishimatsu (1):
  x86: get pg_data_t's memory from other node

 Documentation/kernel-parameters.txt |   17 +++
 arch/x86/mm/numa.c  |5 +-
 include/linux/memblock.h|1 +
 include/linux/mm.h  |   11 ++
 mm/memblock.c   |   18 +++-
 mm/page_alloc.c |  238 ++-
 6 files changed, 282 insertions(+), 8 deletions(-)



[PATCH v3 2/5] page_alloc: add movable_memmap kernel parameter

2012-12-10 Thread Tang Chen
This patch adds functions to parse the movablecore_map boot option. Since
the option can be specified more than once, all the maps are stored in the
global movablecore_map.map array.

We also keep the array in monotonically increasing order of start_pfn,
and merge all overlapped ranges.

Signed-off-by: Tang Chen 
Signed-off-by: Lai Jiangshan 
Reviewed-by: Wen Congyang 
Tested-by: Lin Feng 
---
 Documentation/kernel-parameters.txt |   17 +
 include/linux/mm.h  |   11 +++
 mm/page_alloc.c |  126 +++
 3 files changed, 154 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9776f06..785f878 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1620,6 +1620,23 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
 
+   movablecore_map=nn[KMG]@ss[KMG]
+   [KNL,X86,IA-64,PPC] This parameter is similar to
+   memmap except it specifies the memory map of
+   ZONE_MOVABLE.
+   If more areas are all within one node, then from
+   lowest ss to the end of the node will be ZONE_MOVABLE.
+   If an area covers two or more nodes, the area from
+   ss to the end of the 1st node will be ZONE_MOVABLE,
+   and all the rest nodes will only have ZONE_MOVABLE.
+   If memmap is specified at the same time, the
+   movablecore_map will be limited within the memmap
+   areas. If kernelcore or movablecore is also specified,
+   movablecore_map will have higher priority to be
+   satisfied. So the administrator should be careful that
+   the amount of movablecore_map areas are not too large.
+   Otherwise kernel won't have enough memory to start.
+
MTD_Partition=  [MTD]
Format: ,,,
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcaab4e..29622c2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1328,6 +1328,17 @@ extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
 
+#define MOVABLECORE_MAP_MAX MAX_NUMNODES
+struct movablecore_entry {
+   unsigned long start_pfn;/* start pfn of memory segment */
+   unsigned long end_pfn;  /* end pfn of memory segment */
+};
+
+struct movablecore_map {
+   int nr_map;
+   struct movablecore_entry map[MOVABLECORE_MAP_MAX];
+};
+
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
 #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8f2c87..1c91d16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -198,6 +198,9 @@ static unsigned long __meminitdata nr_all_pages;
 static unsigned long __meminitdata dma_reserve;
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
+/* Movable memory ranges, will also be used by memblock subsystem. */
+struct movablecore_map movablecore_map;
+
 static unsigned long __meminitdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
@@ -5003,6 +5006,129 @@ static int __init cmdline_parse_movablecore(char *p)
 early_param("kernelcore", cmdline_parse_kernelcore);
 early_param("movablecore", cmdline_parse_movablecore);
 
+/**
+ * insert_movablecore_map - Insert a memory range in to movablecore_map.map.
+ * @start_pfn: start pfn of the range
+ * @end_pfn: end pfn of the range
+ *
+ * This function will also merge the overlapped ranges, and sort the array
+ * by start_pfn in monotonic increasing order.
+ */
+static void __init insert_movablecore_map(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+   int pos, overlap;
+
+   /*
+* pos will be at the 1st overlapped range, or the position
+* where the element should be inserted.
+*/
+   for (pos = 0; pos < movablecore_map.nr_map; pos++)
+   if (start_pfn <= movablecore_map.map[pos].end_pfn)
+   break;
+
+   /* If there is no overlapped range, just insert the element. */
+   if (pos == movablecore_map.nr_map ||
+   end_pfn < movablecore_map.map[pos].start_pfn) {
+   /*
+* If pos is not the end of array, we need to move all
+* the rest elements backward.
+*/
+   if (pos < movablecore_map.nr_map)
+   memmove(&
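
The diff is truncated at this point. For illustration only, here is a
self-contained userspace sketch of the insert-and-merge behavior the
comments describe (fixed-size table, half-open pfn ranges; simplified
from, and not identical to, the kernel code):

```c
#include <assert.h>
#include <string.h>

#define MAP_MAX 32  /* stand-in for MOVABLECORE_MAP_MAX */

struct range { unsigned long start_pfn, end_pfn; };

static struct range map[MAP_MAX];
static int nr_map;

/* Insert [start_pfn, end_pfn) keeping the array sorted by start_pfn
 * and merging any ranges that overlap or touch the new one. */
static void insert_range(unsigned long start_pfn, unsigned long end_pfn)
{
    int pos, overlap;

    /* pos: first range that could overlap, or the insertion point. */
    for (pos = 0; pos < nr_map; pos++)
        if (start_pfn <= map[pos].end_pfn)
            break;

    /* No overlap: shift the tail and insert a new element. */
    if (pos == nr_map || end_pfn < map[pos].start_pfn) {
        if (nr_map == MAP_MAX)
            return; /* table full; silently drop in this sketch */
        memmove(&map[pos + 1], &map[pos],
                (nr_map - pos) * sizeof(map[0]));
        map[pos].start_pfn = start_pfn;
        map[pos].end_pfn = end_pfn;
        nr_map++;
        return;
    }

    /* Overlap: find the last overlapped range, then merge them all. */
    for (overlap = pos; overlap < nr_map; overlap++)
        if (overlap + 1 == nr_map ||
            end_pfn < map[overlap + 1].start_pfn)
            break;

    if (start_pfn < map[pos].start_pfn)
        map[pos].start_pfn = start_pfn;
    map[pos].end_pfn = end_pfn > map[overlap].end_pfn ?
                       end_pfn : map[overlap].end_pfn;

    memmove(&map[pos + 1], &map[overlap + 1],
            (nr_map - overlap - 1) * sizeof(map[0]));
    nr_map -= overlap - pos;
}
```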

[PATCH v3 4/5] page_alloc: Make movablecore_map has higher priority

2012-12-10 Thread Tang Chen
If kernelcore or movablecore is specified at the same time as
movablecore_map, movablecore_map will have higher priority to be
satisfied. This patch makes find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from
zone_movable_limit[].

Signed-off-by: Tang Chen 
Reviewed-by: Wen Congyang 
Reviewed-by: Lai Jiangshan 
Tested-by: Lin Feng 
---
 mm/page_alloc.c |   35 +++
 1 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4853619..e7b6db5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4839,12 +4839,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
required_kernelcore = max(required_kernelcore, corepages);
}
 
-   /* If kernelcore was not specified, there is no ZONE_MOVABLE */
-   if (!required_kernelcore)
+   /*
+* No matter kernelcore/movablecore was limited or not, movable_zone
+* should always be set to a usable zone index.
+*/
+   find_usable_zone_for_movable();
+
+   /*
+* If neither kernelcore/movablecore nor movablecore_map is specified,
+* there is no ZONE_MOVABLE. But if movablecore_map is specified, the
+* start pfn of ZONE_MOVABLE has been stored in zone_movable_limit[].
+*/
+   if (!required_kernelcore) {
+   if (movablecore_map.nr_map)
+   memcpy(zone_movable_pfn, zone_movable_limit,
+   sizeof(zone_movable_pfn));
goto out;
+   }
 
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
-   find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
 
 restart:
@@ -4872,10 +4885,24 @@ restart:
for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
unsigned long size_pages;
 
+   /*
+* Find more memory for kernelcore in
+* [zone_movable_pfn[nid], zone_movable_limit[nid]).
+*/
start_pfn = max(start_pfn, zone_movable_pfn[nid]);
if (start_pfn >= end_pfn)
continue;
 
+   if (zone_movable_limit[nid]) {
+   end_pfn = min(end_pfn, zone_movable_limit[nid]);
+   /* No range left for kernelcore in this node */
+   if (start_pfn >= end_pfn) {
+   zone_movable_pfn[nid] =
+   zone_movable_limit[nid];
+   break;
+   }
+   }
+
/* Account for what is only usable for kernelcore */
if (start_pfn < usable_startpfn) {
unsigned long kernel_pages;
@@ -4935,12 +4962,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
 
+out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
 
-out:
/* restore the node_state */
node_states[N_HIGH_MEMORY] = saved_node_state;
 }
-- 
1.7.1



[PATCH v3 1/5] x86: get pg_data_t's memory from other node

2012-12-10 Thread Tang Chen
From: Yasuaki Ishimatsu 

If the system can create a movable node, in which all of the node's
memory is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t from that node.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
to retry when the first allocation fails.

Signed-off-by: Yasuaki Ishimatsu 
Signed-off-by: Lai Jiangshan 
Signed-off-by: Tang Chen 
Signed-off-by: Jiang Liu 
---
 arch/x86/mm/numa.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..db939b6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -222,10 +222,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
nd_pa = __pa(nd);
remapped = true;
} else {
-   nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+   nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
-   pr_err("Cannot find %zu bytes in node %d\n",
-  nd_size, nid);
+   pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
-- 
1.7.1



[RFC v3] Support volatile range for anon vma

2012-12-10 Thread Minchan Kim
This is still [RFC v3] because it has only passed my simple test
with TCMalloc tweaking.

I hope for more input from user-space allocator people, and for testing
of the patch with their allocators, because getting real value out of it
might require a change to their arena management design.

Changelog from v2

 * Removing madvise(addr, length, MADV_NOVOLATILE).
 * add vmstat about the number of discarded volatile pages
 * discard volatile pages without promotion in reclaim path

This is based on v3.6.

- What's the madvise(addr, length, MADV_VOLATILE)?

  It's a hint that the user delivers to the kernel so that the kernel
  can *discard* pages in the range at any time.

- What happens if the user accesses a page (i.e., virtual address)
  discarded by the kernel?

  The user sees zero-fill-on-demand pages, as with madvise(DONTNEED).

- What happens if the user accesses a page (i.e., virtual address)
  that has not been discarded by the kernel?

  The user sees the old data without a page fault.

- What's different from madvise(DONTNEED)?

  System call semantics

  DONTNEED makes sure the user always sees zero-fill pages after
  calling madvise, while with VOLATILE he may see either zero-fill
  pages or the old data.

  Internal implementation

  madvise(DONTNEED) has to zap all mapped pages in the range, so its
  overhead increases linearly with the number of mapped pages. Worse,
  if the user then writes to a zapped page, a page fault + page
  allocation + memset has to happen.

  madvise(VOLATILE) just marks a flag in the range (i.e., the VMA).
  It doesn't touch the pages at all, so the overhead of the system
  call is very small. If memory pressure happens, the VM can discard
  pages in VMAs marked VOLATILE. If the user writes to an address
  whose page was discarded by the VM, he sees zero-fill pages, so the
  cost is the same as DONTNEED; but if memory pressure isn't severe,
  the user sees the old data without (page fault + page allocation +
  memset).

  The VOLATILE mark is removed in the page fault handler when the
  first page fault occurs in the marked VMA, so subsequent page
  faults follow the normal page fault path. That's why the user
  doesn't need a madvise(MADV_NOVOLATILE) interface.

- What's the benefit compared to DONTNEED?

  1. The system call overhead is smaller because VOLATILE just marks
     a flag on the VMA instead of zapping all the pages in the range.

  2. It has a chance to eliminate overheads (e.g., page fault +
     page allocation + memset(PAGE_SIZE)).

- Isn't there any drawback?

  DONTNEED doesn't need exclusive mmap_sem locking, so concurrent page
  faults from other threads are allowed. But VOLATILE needs exclusive
  mmap_sem, so other threads will block if they try to access
  not-yet-mapped pages. That's why I designed madvise(VOLATILE)'s
  overhead to be as small as possible.

  Another concern with exclusive mmap_sem is when a page fault occurs
  in a VOLATILE-marked VMA. We have to remove the flag from the VMA
  and merge adjacent VMAs, which needs exclusive mmap_sem. That can
  slow down page fault handling and prevent concurrent page faults.
  But such handling is needed just once, on the first page fault after
  we mark the VMA VOLATILE, and only if memory pressure has happened
  and a page was discarded. So it wouldn't be common, and the benefit
  we get from this feature should be bigger than the loss.

- What is this targeting?

  Firstly, user-space allocators like ptmalloc and tcmalloc, or the
  heap management of virtual machines like Dalvik. It also comes in
  handy for embedded systems which don't have a swap device, so they
  can't reclaim anonymous pages. By discarding instead of swapping,
  it can be used on non-swap systems. For that, we have to age the
  anon LRU list even though we have no swap, because I don't want to
  discard volatile pages as the top priority when memory pressure
  happens. Volatile in this patch means "we don't need to swap out
  because the user can handle data disappearing suddenly", NOT "they
  are useless, so hurry up and reclaim them". So I want to apply the
  same aging rule as for normal pages to them.

  Anonymous page background aging on a non-swap system would be a
  trade-off for getting this feature. We had even done it until two
  years ago, when [1] was merged, and I believe the gain from this
  patch will beat the loss from the anon LRU aging overhead once all
  allocators start to use the madvise.
  (This patch doesn't include background aging for the non-swap case,
  but it's trivial to add if we decide to.)

[1] 74e3f3c3, vmscan: prevent background aging of anon page in no swap system
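
To make the DONTNEED/VOLATILE contrast above concrete, here is a toy
userspace model of the described semantics (this is NOT the kernel
implementation: one flag stands in for a VMA, one byte per "page"):

```c
#include <assert.h>
#include <string.h>

#define NPAGES 4

/* Toy model: volatile pages may be discarded under memory pressure,
 * and a discarded page reads back as zero-fill on the next access. */
static char page[NPAGES];
static int page_present[NPAGES];
static int vma_volatile; /* one flag per "VMA", as the patch describes */

static void mark_volatile(void) { vma_volatile = 1; }

/* Reclaim under pressure: discard pages only while the VMA is volatile. */
static void memory_pressure(void)
{
    if (!vma_volatile)
        return;
    memset(page_present, 0, sizeof(page_present));
}

/* First fault clears the flag, so later pressure keeps data intact
 * (the reason no MADV_NOVOLATILE call is needed in the proposal). */
static char read_page(int i)
{
    if (!page_present[i]) {
        page[i] = 0;        /* zero-fill-on-demand */
        page_present[i] = 1;
        vma_volatile = 0;   /* fault handler removes the mark */
    }
    return page[i];
}

static void write_page(int i, char c)
{
    page[i] = c;
    page_present[i] = 1;
}
```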

Cc: Michael Kerrisk 
Cc: Arun Sharma 
Cc: san...@google.com
Cc: Paul Turner 
CC: David Rientjes 
Cc: John Stultz 
Cc: Andrew Morton 
Cc: Christoph Lameter 
Cc: Android Kernel Team 
Cc: Robert Love 
Cc: Mel Gorman 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Dave Chinner 
Cc: Neil Brown 
Cc: Mike Hommey 
Cc: Taras Glek 
Cc: KOSAKI Motohiro 
Cc: Christoph Lameter 
Cc: KAMEZAWA Hiroyuki 
Signed-off-by: Minchan Kim 
---
 arch/x86/mm/fault.c   |2 +
 include/asm-generic/mman-common.h |6 ++
 include/linux/mm.h|7 ++-
 include/linux/rmap.h

Re: [PATCH 5/6] ACPI: Replace struct acpi_bus_ops with enum type

2012-12-10 Thread Yinghai Lu
On Mon, Dec 10, 2012 at 5:28 PM, Rafael J. Wysocki  wrote:
>>
>> OK, thanks for the pointers.  I actually see more differences between our
>> patchsets.  For one example, you seem to have left the parent->ops.bind()
>> stuff in acpi_add_single_object() which calls it even drivers_autoprobe is
>> set.
>
> Sorry, that should have been "which calls it even when drivers_autoprobe is
> not set".  I need to be more careful.
>

oh,  Jiang Liu had one patch to remove that workaround.

http://git.kernel.org/?p=linux/kernel/git/yinghai/linux-yinghai.git;a=commitdiff;h=b40dba80c2b8395570d8357e6b3f417c27c84504

ACPI/pci-bind: remove bind/unbind callbacks from acpi_device_ops

Maybe you can review those patches in my for-pci-next2...
those are ACPI related anyway.

Those patches have been there for a while, and Bjorn hasn't had time
to digest them.

Or would you prefer I resend an updated version as one huge patchset?

Thanks

Yinghai


Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Xishi Qiu
On 2012/12/10 23:38, Andi Kleen wrote:

>> It is another topic, I mean since the page is poisoned, so why not isolate it
>> from page buddy alocator in soft_offline_page() rather than in 
>> check_new_page().
>> I find soft_offline_page() only migrate the page and mark HWPoison, the 
>> poisoned
>> page is still managed by page buddy alocator.
> 
> Doing it in check_new_page is the only way if the page is currently
> allocated by someone. Since that's not uncommon it's simplest to always
> do it this way.
> 
> -Andi
> 

Hi Andi,

The poisoned page is isolated in check_new_page(); however, the whole
buddy block will be dropped, which seems to be a waste of memory.

Can we separate the poisoned page from the buddy block, and then drop
*only* the poisoned page?

Thanks
Xishi Qiu




Re: TIP tree's master branch failed to boot up

2012-12-10 Thread Michael Wang
On 12/11/2012 01:02 AM, H. Peter Anvin wrote:
> On 12/09/2012 08:50 PM, Michael Wang wrote:
>> Hi, Folks
>>
>> I'm testing with the latest tip tree's master branch 3.7.0-rc8 and
>> failed to boot up my server, it's hung at very beginning and I could not
>> catch any useful log, is there any one else got this problem or I'm the
>> only one?.
>>
>> Regards,
>> Michael Wang
>>
> 
> 32 or 64 bits?

64 bits.

Regards,
Michael Wang

> 
> -hpa
> 



Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Simon Jeons
On Tue, 2012-12-11 at 03:03 +0100, Andi Kleen wrote:
> > IIUC, soft offlining will isolate and migrate hwpoisoned page, and this
> > page will not be accessed by memory management subsystem until unpoison,
> > correct?
> 
> No, soft offlining can still allow accesses for some time. It'll never kill
> anything.

Oh, it will be put back to the LRU list during migration. So does your
"some time" mean before check_new_page() is called?

   -Simon 

> 
> Hard tries much harder and will kill.
> 
> In some cases (unshrinkable kernel allocation) they end up doing the same
> because there isn't any other alternative though. However these are
> expected to only apply to a small percentage of pages in a typical
> system.
> 
> -Andi




Re: [PATCH V2] MCE: fix an error of mce_bad_pages statistics

2012-12-10 Thread Andi Kleen
> IIUC, soft offlining will isolate and migrate hwpoisoned page, and this
> page will not be accessed by memory management subsystem until unpoison,
> correct?

No, soft offlining can still allow accesses for some time. It'll never kill
anything.

Hard tries much harder and will kill.

In some cases (unshrinkable kernel allocation) they end up doing the same
because there isn't any other alternative though. However these are
expected to only apply to a small percentage of pages in a typical
system.

-Andi

