Re: [RFC] PTR_ERR: return 0 if ptr isn't an error value.

2013-06-12 Thread Rusty Russell
Julia Lawall  writes:
> On Mon, 3 Jun 2013, Uwe Kleine-König wrote:
> For a random example, here is a function that currently uses PTR_RET:

Heheh, nice choice: I think I wrote that code originally :)

> static int __net_init iptable_raw_net_init(struct net *net)
> {
> struct ipt_replace *repl;
>
> repl = ipt_alloc_initial_table(&packet_raw);
>   if (repl == NULL)
> return -ENOMEM;
> net->ipv4.iptable_raw =
> ipt_register_table(net, &packet_raw, repl);
>   kfree(repl);
> return PTR_RET(net->ipv4.iptable_raw);
> }
>
> If it becomes return PTR_ERR(...); at the end, won't it look like the 
> function always fails?

That is a valid point, though in this case the reader will know that
can't be the case.

On the other hand, there's an incremental learning curve cost to every
convenience function we add.  There are only 50 places where we use
PTR_RET(), so it's not saving us very much typing over the clearest
solution: open-coding the test.

I think using PTR_ERR() is a less bad solution than promoting PTR_RET,
which has a non-obvious name.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] MDIO: FSL_PQ_MDIO: Fix bug on incorrect offset of tbipa register

2013-06-12 Thread Oded Gabbay

On 06/12/2013 09:31 PM, Scott Wood wrote:

On 06/12/2013 10:08:29 AM, Sebastian Andrzej Siewior wrote:

On 06/12/2013 02:47 PM, Oded Gabbay wrote:
> This patch fixes a bug in the fsl_pq_mdio.c module and in relevant
> device-tree files regarding the correct offset of the tbipa register in
> the eTSEC controller in some of Freescale's PQ3 and QorIQ SoCs.
> The bug happens when the mdio in the device tree is configured to be
> compatible to "fsl,gianfar-tbi". Because the mdio device in the device
> tree points to addresses 25520, 26520 or 27520 (depending on the
> controller ID), the variable priv->map in function fsl_pq_mdio_probe
> points to that address. However, later in the function there is a write
> to register tbipa, which is actually located at 25030, 26030 or 27030.
> Because the correct address is not io-mapped, the contents are written
> to a different register in the controller.
> The fix sets the address of the mdio device to start at 25000, 26000 or
> 27000 and changes the mii_offset field to 0x520 in the relevant entry
> (fsl,gianfar-tbi) of the fsl_pq_mdio_match array.
>
> Note: This patch may break MDIO functionality of some old Freescale
> SoCs until Freescale fixes their device tree files. Basically, every
> device tree which contains an mdio device that is compatible to
> "fsl,gianfar-tbi" should be examined.

Not as is.
Please add a check for the original address. If it has 0x520 at the end,
print a warning and fix it up. Please add to the patch description
which register is modified instead if this patch is not applied.
Depending on how critical this is, it might have to go to stable.


I'm not sure it's stable material if this is something that has never 
worked...


The device tree binding will also need to be fixed to note the
difference in "reg" between "fsl,gianfar-mdio" and "fsl,gianfar-tbi"
-- and should give an example of the latter.


-Scott
I read the 2 comments and I'm not sure what the best way to move ahead
is.

I would like to describe the impact of not accepting this patch:
When you connect any eTSEC, except the first one, using SGMII, you must
configure the TBIPA register, because the MII management configuration
uses the TBIPA address as part of the SGMII initialization sequence, as
described in the P2020 Reference Manual.
So, if that register is not initialized, the sequence is broken and the
eTSEC is not functioning (can not send/receive packets).
I still think the best way to fix it is what I did:
1. Point priv->map to the start of the whole register range of the
eTSEC.
2. Set mii_offset to 0x520 in the "gianfar-tbi" entry of the
"fsl_pq_mdio_match" array.
3. Fix all the usages of "gianfar-tbi" in the device tree files -
change the starting address and reg range.


I think this is the best way because it is stated in the
"fsl_pq_mdio_probe" function that:

/*
 * Some device tree nodes represent only the MII registers, and
 * others represent the MAC and MII registers.  The 'mii_offset' field
 * contains the offset of the MII registers inside the mapped register
 * space.
 */

and that's why we have priv->map and priv->regs. So my fix goes
according to the current design of the driver.
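A hypothetical before/after device tree fragment under the proposed fix might look as follows. The addresses are taken from the mail (read as hex offsets for the first eTSEC); the node names and reg lengths are illustrative, not from any shipping dts:

```dts
/* Before: the node maps only the TBI MII registers at offset 0x25520. */
mdio@25520 {
        compatible = "fsl,gianfar-tbi";
        reg = <0x25520 0x20>;
};

/* After: the node maps the whole eTSEC block starting at 0x25000; the
 * driver then applies mii_offset = 0x520 internally, and the TBIPA
 * register at offset 0x030 falls inside the mapped range. */
mdio@25000 {
        compatible = "fsl,gianfar-tbi";
        reg = <0x25000 0x1000>;
};
```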


-Oded


Re: [PATCH] ncpfs: fix rmdir returns Device or resource busy

2013-06-12 Thread Al Viro
On Thu, Jun 13, 2013 at 03:01:22AM +0100, Al Viro wrote:
> On Fri, Jun 07, 2013 at 05:14:52PM +0100, Al Viro wrote:
> > On Fri, Jun 07, 2013 at 11:09:05AM -0500, Dave Chiluk wrote:
> > > Can't you just use the patch from my original e-mail?  Anyhow I attached
> > > it an already signed-off patch.
> > > 
> > > Al Viro Can you integrate it now?
> > 
> > Applied...  FWIW, patch directly in mail body is more convenient to deal 
> > with.
> 
> Actually, looking at that stuff...  Why are we bothering with -EBUSY for
> removal of busy directories on ncpfs, anyway?  It's not just rmdir(), it's
> overwriting rename() as well.  IS_DEADDIR checks in fs/namei.c and 
> fs/readdir.c
> mean that the only method of ncpfs directories that might get called after
> successful removal is ->setattr() and it would be trivial to add the check
> in ncp_notify_change() that would make it fail for dead directories without
> bothering the server at all...
> 
> Related question: what happens if you open / unlink / fchmod on ncpfs?

Speaking of crap used only by ncpfs: I think we can use ->d_iput() to get rid
of d_validate() for good.  The only remaining user is ncpfs; what happens there
is that we use the page cache of directory to cache the references to dentries
made by readdir.  We could do the following trick:
 * have ->d_fsdata for these dentries be a pointer into the cache page
   where the reference back to the dentry is stored
 * ->freepage() for those pages consisting of
	grab global spinlock
	go through all dentries still pointed to by pointers in that
	page, zeroing ->d_fsdata
	drop the spinlock
 * ->d_iput() for those dentries consisting of
	grab the same spinlock
	if ->d_fsdata is non-zero, store NULL at the address pointed
	to by it
	drop the spinlock
 * ncp_dget_fpos() would
	grab that spinlock
	check if the reference to dentry in the position we are
	interested in is non-NULL
	grab ->d_lock
	if DCACHE_DENTRY_KILLED is not set
		bump ->d_count
		drop ->d_lock
		drop the spinlock
		return dentry
	// dentry is doomed
	clear the reference
	drop ->d_lock
	drop the spinlock
	return NULL
 * ncp_fill_cache() would insert the sucker into the cache and set
   ->d_fsdata under the same spinlock.

IOW, instead of wanking with untrusted pointers to dentries, we simply make
sure we clean the pointer when dentry is going away and clean the reference
from dentry to the location of that pointer when the page is going away.

Objections?  I can do a patch along those lines, but I've nothing to test it
on.  Had that been cifs, I could at least use samba to test the fucker, but
I've no idea how to do that with ncpfs and I'm not too fond of checking how
much bitrot mars_nwe has suffered...


[net-next PATCH V2 1/2] macvtap: silence sparse warnings

2013-06-12 Thread Jason Wang
This patch silences the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different address spaces)
drivers/net/macvtap.c:98:9:    expected struct macvtap_queue *
drivers/net/macvtap.c:98:9:    got struct macvtap_queue [noderef] *
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different address spaces)
drivers/net/macvtap.c:120:9:    expected struct macvtap_queue *
drivers/net/macvtap.c:120:9:    got struct macvtap_queue [noderef] *
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression (different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression (different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression (different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression (different address spaces)
  CC [M]  drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression (different address spaces)

Signed-off-by: Jason Wang 
---
 drivers/net/macvtap.c  |2 +-
 include/linux/if_macvlan.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 992151c..edcbf1c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -429,7 +429,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
if (!q)
goto out;
 
-   q->sock.wq = &q->wq;
+   RCU_INIT_POINTER(q->sock.wq, &q->wq);
init_waitqueue_head(&q->wq.wait);
q->sock.type = SOCK_RAW;
q->sock.state = SS_CONNECTED;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 1912133..f49a9f6 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -70,7 +70,7 @@ struct macvlan_dev {
int (*receive)(struct sk_buff *skb);
int (*forward)(struct net_device *dev, struct sk_buff *skb);
/* This array tracks active taps. */
-   struct macvtap_queue*taps[MAX_MACVTAP_QUEUES];
+   struct macvtap_queue__rcu *taps[MAX_MACVTAP_QUEUES];
/* This list tracks all taps (both enabled and disabled) */
struct list_headqueue_list;
int numvtaps;
-- 
1.7.1



[net-next PATCH V2 2/2] macvtap: fix uninitialized return value in macvtap_ioctl_set_queue()

2013-06-12 Thread Jason Wang
Return -EINVAL on an illegal flag instead of an uninitialized value. This
fixes the kbuild test warning.

Signed-off-by: Jason Wang 
---
 drivers/net/macvtap.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index edcbf1c..5a76f20 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -969,6 +969,8 @@ static int macvtap_ioctl_set_queue(struct file *file, unsigned int flags)
ret = macvtap_enable_queue(vlan->dev, file, q);
else if (flags & IFF_DETACH_QUEUE)
ret = macvtap_disable_queue(q);
+   else
+   ret = -EINVAL;
 
macvtap_put_vlan(vlan);
return ret;
-- 
1.7.1



Re: [PATCH 4/8] vrange: Clear volatility on new mmaps

2013-06-12 Thread Minchan Kim
Hey John,

On Tue, Jun 11, 2013 at 09:22:47PM -0700, John Stultz wrote:
> At lsf-mm, the issue was brought up that there is a precedent with
> interfaces like mlock, such that new mappings in a pre-existing range
> do not inherit the mlock state.
> 
> This is mostly because mlock only modifies the existing vmas, and so
> any new mmaps create new vmas, which won't be mlocked.
> 
> Since volatility is not stored in the vma (for good cause, specifically
> as we'd have to manage file volatility differently from anonymous
> volatility, and we're likely to manage volatility on small chunks of
> memory, which would cause lots of vma splitting and churn), this patch
> clears volatility on new mappings, to ensure that we don't inherit
> volatility if memory in an existing volatile range is unmapped and then
> re-mapped with something else.
> 
> Thus, this patch forces any volatility to be cleared on mmap.

If we have lots of nodes on vroot but they don't include the newly
mmaped vma range, it's purely unnecessary cost and that's never what we
want.

> 
> XXX: We expect this patch to be not well loved by mm folks, and are open
> to alternative methods here. Its more of a place holder to address
> the issue from lsf-mm and hopefully will spur some further discussion.

Another idea is that we could add "bool is_vrange" to struct
vm_area_struct, protected by vrange_lock. The scenario is as follows:

When do_vrange is called with VRANGE_VOLATILE, it iterates over the vmas
and sets vma->is_vrange to true. So we can avoid the tree traversal if
is_vrange is false when munmap is called, and a newly mmaped vma doesn't
need to clear any volatility.

And it would also help the performance of the purging path in finding
whether a page is volatile or not (for now, it traverses vroot to find
out, but we could do it easily by checking vma->is_vrange).

-- 
Kind regards,
Minchan Kim


Re: [net-next PATCH 1/2] macvtap: silence sparse warnings

2013-06-12 Thread Jason Wang
On 06/13/2013 01:48 PM, Eric Dumazet wrote:
> On Thu, 2013-06-13 at 12:21 +0800, Jason Wang wrote:
>> This patch silences the following sparse warnings:
>>
>> [...]
>> Signed-off-by: Jason Wang 
>> ---
>>  drivers/net/macvtap.c  |2 +-
>>  include/linux/if_macvlan.h |2 +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> index 992151c..acf55ba 100644
>> --- a/drivers/net/macvtap.c
>> +++ b/drivers/net/macvtap.c
>> @@ -429,7 +429,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
>>  if (!q)
>>  goto out;
>>  
>> -q->sock.wq = &q->wq;
>> +rcu_assign_pointer(q->sock.wq, &q->wq);
> Since nobody but you [1] can access this object at that time, you could
> rather use
>   RCU_INIT_POINTER(q->sock.wq, &q->wq);
>
> rcu_assign_pointer() is also a documentation point (memory barrier)
>
> By using RCU_INIT_POINTER() you clearly show that you know you need no
> memory barrier.
>

True.
>>  init_waitqueue_head(&q->wq.wait);
> [1] :
> or else, this init_waitqueue_head() should be done _before_ the
> rcu_assign_pointer()

Thanks, I will send v2 by using RCU_INIT_POINTER().
>
>>  q->sock.type = SOCK_RAW;
>>  q->sock.state = SS_CONNECTED;
>



Re: [PATCH] arm64: Add support to pass earlyprintk argument via device tree

2013-06-12 Thread Pranavkumar Sawargaonkar
Hi Grant,

On 12 June 2013 18:58, Grant Likely  wrote:
> On Mon,  3 Jun 2013 21:21:11 +0530, Pranavkumar Sawargaonkar  wrote:
>> This patch adds support for defining and passing earlyprintk
>> related information i.e. device and address information via
>> device tree by adding it inside "chosen" node.
>>
>> This will help user to just specify "earlyprintk" from bootargs
>> without actually knowing the address and device to enable
>> earlyprintk.
>>
>> Mechanism:
>>
>> One can just append earlyprintk=device-type,address (same as we pass
>> through command line) in "/chosen" node to notify kernel which is the
>> earlyprintk device and what is its address.
>>
>> Backward Compatibility:
>>
>> This patch also allows existing method of specifying earlyprintk
>> parameter via bootargs.
>>
>> Existing method i.e. passing via bootargs will still have precedence
>> over device tree i.e. if one specifies earlyprintk=device-type,address
>> in bootargs then kernel will use information from bootargs instead of
>> device tree.
>>
>> If user just specifies earlyprintk (without =...) then kernel will
>> look for device tree earlyprintk parameter.
>>
>> Signed-off-by: Pranavkumar Sawargaonkar 
>> Signed-off-by: Anup Patel 
>
> I'm not a big fan of this. It seems to be short-circuiting around
> existing properties.

Agreed, but the intention was to have a simple solution, since it is
just used for early printing/debugging, and also to keep it in sync
with the existing method of specifying earlyprintk on the command line.

> The kernel /should/ be able to use the linux,stdout-path property to
> determine what the earlyprintk device to use is.

For this there are two problems:

1. Early printk code gets initialized before the device tree is
unflattened. Hence trying to find the node from stdout-path is tricky,
as we do not have of_find_node_by_path available.

2. The current compatible strings in the arm64 early printk code are
not in sync with (or differ from) the actual compatible strings used in
drivers - e.g. for PL011, in the earlyprintk code the match name is
just "pl011", but in the dts it is specified as "arm,pl011".
Hence we would need multiple changes to implement it.
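The proposed /chosen usage would look something like the fragment below. The property format follows the mail (same device-type,address syntax as the command line); the UART address is purely illustrative:

```dts
/ {
        chosen {
                /* "earlyprintk" with no =... makes the kernel consult
                 * the device tree property below. */
                bootargs = "earlyprintk console=ttyAMA0";
                earlyprintk = "pl011,0x9000000";
        };
};
```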

>
> g.

Thanks,
Pranav

>
>> ---
>>  arch/arm64/kernel/early_printk.c |7 +++
>>  arch/arm64/kernel/setup.c|   21 -
>>  2 files changed, 27 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kernel/early_printk.c b/arch/arm64/kernel/early_printk.c
>> index fbb6e18..4e6f845 100644
>> --- a/arch/arm64/kernel/early_printk.c
>> +++ b/arch/arm64/kernel/early_printk.c
>> @@ -29,6 +29,8 @@
>>  static void __iomem *early_base;
>>  static void (*printch)(char ch);
>>
>> +extern char *earlyprintk_dt_args;
>> +
>>  /*
>>   * PL011 single character TX.
>>   */
>> @@ -116,6 +118,11 @@ static int __init setup_early_printk(char *buf)
>>   phys_addr_t paddr = 0;
>>
>>   if (!buf) {
>> + /* Try to check if Device Tree has this argument or not ? */
>> + buf = earlyprintk_dt_args;
>> + }
>> +
>> + if (!buf) {
>>   pr_warning("No earlyprintk arguments passed.\n");
>>   return 0;
>>   }
>> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
>> index 6a9a532..fb2d56f 100644
>> --- a/arch/arm64/kernel/setup.c
>> +++ b/arch/arm64/kernel/setup.c
>> @@ -60,6 +60,8 @@ EXPORT_SYMBOL(processor_id);
>>  unsigned int elf_hwcap __read_mostly;
>>  EXPORT_SYMBOL_GPL(elf_hwcap);
>>
>> +char *earlyprintk_dt_args;
>> +
>>  static const char *cpu_name;
>>  static const char *machine_name;
>>  phys_addr_t __fdt_pointer __initdata;
>> @@ -122,6 +124,23 @@ static void __init setup_processor(void)
>>   elf_hwcap = 0;
>>  }
>>
>> +int __init early_init_dt_scan_chosen_arm64(unsigned long node,
>> +const char *uname,
>> +int depth, void *data)
>> +{
>> + char *prop;
>> +
>> + /* Check if this is chosen node */
>> + if (early_init_dt_scan_chosen(node, uname, depth, data) == 0)
>> + return 0;
>> +
>> + prop = of_get_flat_dt_prop(node, "earlyprintk", NULL);
>> + if (prop)
>> + earlyprintk_dt_args = prop;
>> +
>> + return 1;
>> +}
>> +
>>  static void __init setup_machine_fdt(phys_addr_t dt_phys)
>>  {
>>   struct boot_param_header *devtree;
>> @@ -165,7 +184,7 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
>>   pr_info("Machine: %s\n", machine_name);
>>
>>   /* Retrieve various information from the /chosen node */
>> - of_scan_flat_dt(early_init_dt_scan_chosen, boot_command_line);
>> + of_scan_flat_dt(early_init_dt_scan_chosen_arm64, boot_command_line);
>>   /* Initialize {size,address}-cells info */
>>   of_scan_flat_dt(early_init_dt_scan_root, NULL);
>>   /* Setup memory, calling early_init_dt_add_memory_arch */
>> --
>> 1.7.9.5
>>

[PATCH] uprobes: fix return value in error handling path

2013-06-12 Thread zhangwei(Jovi)
When I inject an incorrect argument into uprobe_events:

[root@jovi tracing]# echo 'p:myprobe /bin/bash' > uprobe_events
[root@jovi tracing]#

no error value is returned there; this patch fixes it.

Signed-off-by: zhangwei(Jovi) 
---
 kernel/trace/trace_uprobe.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 32494fb0..d5d0cd3 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -283,8 +283,10 @@ static int create_trace_uprobe(int argc, char **argv)
return -EINVAL;
}
arg = strchr(argv[1], ':');
-   if (!arg)
+   if (!arg) {
+   ret = -EINVAL;
goto fail_address_parse;
+   }

*arg++ = '\0';
filename = argv[1];
-- 
1.7.9.7




Re: [PATCH 03/11] gpio: davinci: Modify to platform driver

2013-06-12 Thread Sekhar Nori
On 6/12/2013 5:40 PM, Philip, Avinash wrote:
> On Wed, Jun 12, 2013 at 13:13:59, Nori, Sekhar wrote:
>> On 6/11/2013 6:25 PM, Philip, Avinash wrote:
>>
>>> On Tue, Jun 11, 2013 at 17:26:06, Nori, Sekhar wrote:
>>
 On 5/22/2013 12:40 PM, Philip Avinash wrote:
>>
> @@ -179,13 +204,10 @@ static int __init davinci_gpio_setup(void)
>   gpiochip_add(&ctlrs[i].chip);
>   }
>  
> - soc_info->gpio_ctlrs = ctlrs;

> - soc_info->gpio_ctlrs_num = DIV_ROUND_UP(ngpio, 32);

 You drop setting gpio_ctlrs_num here and don't introduce it anywhere
 else in the patchset so in effect you render the inline gpio get/set API
 useless. Looks like this initialization should be moved to platform code?
>>>
>>> With [PATCH 08/11] ARM: davinci: start using gpiolib support, the gpio
>>> get/set API has no more dependency on soc_info->gpio_ctlrs_num.
>>
>> With this series, you have removed support for inline gpio get/set API.
>> I see that the inline functions are still available for use on
>> tnetv107x. I wonder why it is important to keep these for tnetv107x when
>> not necessary for other DaVinci devices?
> 
> To support DT boot on da850, gpio-davinci has to be converted to a
> driver and references to davinci_soc_info removed from the driver. But
> tnetv107x has a separate GPIO driver, and its reference to
> davinci_soc_info can also be removed. But I didn't find defconfig
> support for tnetv107x platforms and left it untouched.
> As I will not be able to build and test on tnetv107x, I prefer not to
> touch it.

You can always build it by enabling it in menuconfig. It's an ARMv6
platform, so you will have to disable the ARMv5 based platforms while
enabling it. ARMv5 and ARMv6 cannot co-exist in the same image.

> 
>>
>> When you are removing this feature, please note it prominently in your
>> cover letter and patch description.
> 
> Ok
> 
>> Also, please provide some data on 
>> how the latency now compares to that of inline access earlier. This is
>> important especially for the read.
> 
> I am not sure whether I understood correctly or not. Meanwhile I did an
> experiment reading the printk timing before and after gpio_get_value
> from a test module. I think this will show the software latency of
> inline access versus gpiolib access.
> 
> gpio_get_value latency testing code
> 
> +
> +   local_irq_disable();
> +   pr_emerg("%d %lx\n", __LINE__, jiffies);
> +   gpio_get_value(gpio_num);
> +   pr_emerg("%d %lx\n", __LINE__, jiffies);
> +   local_irq_enable();
> 
> inline gpio functions with interrupt disabled
> [   29.734337] 81 966c
> [   29.736847] 83 966c
> 
> Time diff =   0.00251
> 
> gpio library with interrupt disabled
> 
> [  272.876763] 81 f567
> [  272.879291] 83 f567
> 
> Time diff =   0.002528
> 
> Inline function takes less time as expected.

Okay, please note these experiments in your cover letter. It's an 18 usec
difference. I have no reference to say whether that will affect any
application, but it will at least serve as information for someone who
may get affected by it.

> 
>> For the writes, gpio clock will
>> mostly determine how soon the value changes on the pin.
>>
>> I am okay with removing the inline access feature. It helps simplify
>> code and most arm machines don't use them. I would just like to see some
>> data for justification as this can be seen as feature regression. Also,
>> if we are removing it, its better to also remove it completely and get
>> the LOC savings instead of just stopping its usage.
> 
> I see a build failure with the patch below for tnetv107x:
> [v6,6/6] Davinci: tnetv107x default configuration

Where is this patch? What is the commit-id if it is in mainline? Where
is the failure log?

> 
> So I prefer to leave tnetv107x platform for now.

I don't think that's acceptable. At least by me.

Thanks,
Sekhar


Re: [PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref

2013-06-12 Thread Li Zefan
On 2013/6/13 12:04, Tejun Heo wrote:
> Hello,
> 
> The changes from the last take[L] are,
> 
> * Rebased on top of further percpu-refcount updates.
> 
> * 0003: Broken patch description updated.
> 
> * 0004: Stupid list_del_init() conversions from the last decade
> dropped.
> 
> * 0005: Typo fix.
> 
> * 0011: percpu_ref_init() error handling fixed.  Premature and
> duplicate base css ref puts fixed.
> 
> This patchset does a lot of cleanup and then updates css
> (cgroup_subsys_state) refcnt to use the new percpu reference counter.
> A css (cgroup_subsys_state) is how each cgroup is represented to a
> controller.  As such, it can be used in hot paths across the various
> subsystems different controllers are associated with.
> 
> One of the common operations is reference counting, which up until now
> has been implemented using a global atomic counter and can have
> significant adverse impact on scalability.  For example, css refcnt
> can be gotten and put multiple times by blkcg for each IO request.
> For high-iops configurations which try to do as much per-cpu as
> possible, the global frequent refcnting can be very expensive.
> 
> In general, given the various hugely diverse paths css's end up being
> used from, we need to make it cheap and highly scalable.  In its
> usage, css refcnting isn't very different from module refcnting.
> 
> This patchset contains the following 11 patches.
> 
>  0001-cgroup-remove-now-unused-css_depth.patch
>  0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch
>  0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch
>  0004-cgroup-use-kzalloc-instead-of-kmalloc.patch
>  0005-cgroup-clean-up-css_-try-get-and-css_put.patch
>  0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch
>  0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch
>  0008-cgroup-remove-cgroup-count-and-use.patch
>  0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch
>  0010-cgroup-split-cgroup-destruction-into-two-steps.patch
>  0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch
> 
> 0001-0007 are cleanups, many of them cosmetic.  They clean up the code
> paths which are related with the refcnting changes.  If you're only
> interested in the precpu usage, you can probably skip these.
> 
> 0008-0010 updates the cgroup destruction path so that percpu refcnting
> can be used.
> 
> 0011 updates css refcnting to use percpu_ref.
> 
> This patchset is on top of
> 
>   cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base cgroup files")
> + [1] percpu/review-percpu-ref-tryget
> 
> and available in the following git branch.
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-css-percpu-ref
> 
> diffstat follows.  Thanks.
> 
>  include/linux/cgroup.h |   87 ++---
>  kernel/cgroup.c|  723 ++---
>  2 files changed, 420 insertions(+), 390 deletions(-)
> 

Acked-by: Li Zefan 



Re: [PATCH v2 3/3] ARM: s5p64x0: Use common uncompress.h part for plat-samsung

2013-06-12 Thread Tushar Behera
On 06/04/2013 09:49 AM, Tushar Behera wrote:
> From: Tomasz Figa 
> 
> Since uart_base can be set dynamically in arch_detect_cpu(), there is no
> need to have a copy of all code locally, just to override UART base
> address.
> 
> This patch removes any duplicate code in uncompress.h variant of s5p64x0
> and implements proper arch_detect_cpu() function to initialize UART with
> SoC-specific parameters.
> 
> While at it, replace hard-coded register address with macro.
> 
> Signed-off-by: Tomasz Figa 
> Signed-off-by: Tushar Behera 
> ---
> (This patch replaces the original patch in this patchset as it was an
> earlier-posted near-identical patch.)
> 
> Changes for v2:
> * Remove the declaration of uart_base (taken care in plat/uncompress.h)
> 

Kukjin,

Would you please queue this up for 3.11?

Thanks.

-- 
Tushar Behera


Re: [PATCH v2 2/3] ARM: SAMSUNG: Consolidate uncompress subroutine

2013-06-12 Thread Tushar Behera
On 06/04/2013 09:49 AM, Tushar Behera wrote:
> For mach-exynos, uart_base is a pointer and the value is calculated
> in the machine folder. For other machines, uart_base is defined as
> a macro in platform directory. For symmetry, the uart_base macro
> definition is removed and the uart_base calculation is moved to
> specific machine folders.
> 
> This would help us consolidating uncompress subroutine for s5p64x0.
> 
> Signed-off-by: Tushar Behera 
> ---
> Changes for v2:
> * Remove ifdef's while calculating uart_base value.
> 

Kukjin,

Would you please queue this up for 3.11?

Thanks.

-- 
Tushar Behera


Re: [PATCH net] macvlan: don't touch promisc without passthrough

2013-06-12 Thread John Fastabend

On 06/12/2013 06:56 AM, Sergei Shtylyov wrote:
> Hello.
>
> On 12-06-2013 15:34, Michael S. Tsirkin wrote:
>
>> commit df8ef8f3aaa6692970a436204c4429210addb23a in linux 3.5 added a way
>> to control the NOPROMISC macvlan flag through netlink.
>
> Please also specify that commit's summary line in parens.
>
>> However, with a non-passthrough device we don't set promisc on open or
>> clear it on stop, even if NOPROMISC is off.  As a result:
>>
>> If userspace clears NOPROMISC on open, then does not clear it on a
>> netlink command, the promisc counter is not decremented on stop and
>> there will be no way to clear it once the macvlan is detached.
>>
>> If userspace does not clear NOPROMISC on open, then sets NOPROMISC on a
>> netlink command, the promisc counter will be decremented from 0 and
>> underflow, with no way to clear promisc afterwards.
>>
>> To fix, simply ignore the NOPROMISC flag in a netlink command for
>> non-passthrough devices, same as we do at open/stop.
>>
>> While at it - since we touch this code anyway - check the
>> dev_set_promiscuity return code and pass it to users (though an error
>> here is unlikely).
>>
>> Cc: "David S. Miller" 
>> CC: Roopa Prabhu 
>> Cc: John Fastabend 
>> Signed-off-by: Michael S. Tsirkin 
>> ---
>>
>> Please review, and consider for 3.10 and -stable.


Other than those few nits looks good to me thanks!

Reviewed-by: John Fastabend 

--
John Fastabend Intel Corporation


Re: [PATCH v4] cpufreq: fix governor start/stop race condition

2013-06-12 Thread Viresh Kumar
On 13 June 2013 11:10, Xiaoguang Chen  wrote:
> 2013/6/12 Viresh Kumar :
>> On 12 June 2013 14:39, Xiaoguang Chen  wrote:
>>
>>> ret = policy->governor->governor(policy, event);
>>
>> We again reached to the same problem. We shouldn't call
>> this between taking locks, otherwise recursive locks problems
>> would be seen again.
>
> But this is not the same lock as the deadlock case, it is a new lock,
> and only used in this function. no other functions use this lock.
> I don't know how can we get dead lock again?

I believe I have seen the recursive lock issue with different locks but
I am not sure. Anyway, I believe the implementation can be simpler than
that.

Check below patch (attached too):

x--x

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 2d53f47..80b9798 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -46,6 +46,7 @@ static DEFINE_PER_CPU(struct cpufreq_policy *,
cpufreq_cpu_data);
 static DEFINE_PER_CPU(char[CPUFREQ_NAME_LEN], cpufreq_cpu_governor);
 #endif
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+static DEFINE_MUTEX(cpufreq_governor_lock);

 /*
  * cpu_policy_rwsem is a per CPU reader-writer semaphore designed to cure
@@ -1541,7 +1542,6 @@ static int __cpufreq_governor(struct
cpufreq_policy *policy,
 #else
struct cpufreq_governor *gov = NULL;
 #endif
-
if (policy->governor->max_transition_latency &&
policy->cpuinfo.transition_latency >
policy->governor->max_transition_latency) {
@@ -1562,6 +1562,21 @@ static int __cpufreq_governor(struct
cpufreq_policy *policy,

pr_debug("__cpufreq_governor for CPU %u, event %u\n",
policy->cpu, event);
+
+   mutex_lock(&cpufreq_governor_lock);
+   if ((!policy->governor_enabled && (event == CPUFREQ_GOV_STOP)) ||
+   (policy->governor_enabled && (event == CPUFREQ_GOV_START))) {
+   mutex_unlock(&cpufreq_governor_lock);
+   return -EBUSY;
+   }
+
+   if (event == CPUFREQ_GOV_STOP)
+   policy->governor_enabled = 0;
+   else if (event == CPUFREQ_GOV_START)
+   policy->governor_enabled = 1;
+
+   mutex_unlock(&cpufreq_governor_lock);
+
ret = policy->governor->governor(policy, event);

if (!ret) {
@@ -1569,6 +1584,14 @@ static int __cpufreq_governor(struct
cpufreq_policy *policy,
policy->governor->initialized++;
else if (event == CPUFREQ_GOV_POLICY_EXIT)
policy->governor->initialized--;
+   } else {
+   /* Restore original values */
+   mutex_lock(&cpufreq_governor_lock);
+   if (event == CPUFREQ_GOV_STOP)
+   policy->governor_enabled = 1;
+   else if (event == CPUFREQ_GOV_START)
+   policy->governor_enabled = 0;
+   mutex_unlock(&cpufreq_governor_lock);
}

/* we keep one module reference alive for
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 037d36a..c12db73 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -107,6 +107,7 @@ struct cpufreq_policy {
unsigned intpolicy; /* see above */
struct cpufreq_governor *governor; /* see below */
void*governor_data;
+   int governor_enabled; /* governor start/stop flag */

struct work_struct  update; /* if update_policy() needs to be
 * called, but you're in IRQ context */




Re: [net-next PATCH 1/2] macvtap: slient sparse warnings

2013-06-12 Thread Eric Dumazet
On Thu, 2013-06-13 at 12:21 +0800, Jason Wang wrote:
> This patch silents the following sparse warnings:
> 
> dr
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/macvtap.c  |2 +-
>  include/linux/if_macvlan.h |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 992151c..acf55ba 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -429,7 +429,7 @@ static int macvtap_open(struct inode *inode, struct file 
> *file)
>   if (!q)
>   goto out;
>  
> - q->sock.wq = &q->wq;
> + rcu_assign_pointer(q->sock.wq, &q->wq);

Since nobody but you [1] can access this object at that time, you could
rather use
RCU_INIT_POINTER(q->sock.wq, &q->wq);

rcu_assign_pointer() is also a documentation point (memory barrier)

By using RCU_INIT_POINTER() you clearly show that you know you need no
memory barrier.

>   init_waitqueue_head(&q->wq.wait);

[1] :
or else, this init_waitqueue_head() should be done _before_ the
rcu_assign_pointer()

>   q->sock.type = SOCK_RAW;
>   q->sock.state = SS_CONNECTED;




Re: [PATCH] Percpu tag allocator

2013-06-12 Thread Andrew Morton
On Wed, 12 Jun 2013 20:54:32 -0700 Kent Overstreet  
wrote:

> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet  
> > wrote:
> > 
> > > ...
> > >
> > > > Why can't we use ida_get_new_above()?
> > > > 
> > > >If it is "speed" then how bad is the problem and what efforts have
> > > >been made to address them within the idr code?  (A per-cpu magazine
> > > >frontend to ida_get_new_above() might suit).
> > > > 
> > > >If it is "functionality" then what efforts have been made to
> > > >suitably modify the ida code?
> > > 
> > > Originally, it was because every time I opened idr.[ch] I was confronted
> > > with an enormous pile of little functions, with very little indication
> > > in the way of what they were all trying to do or which ones I might want
> > > to start with.
> > > 
> > > Having finally read enough of the code to maybe get a handle on what
> > > it's about - performance is a large part of it, but idr is also a more
> > > complicated api that does more than what I wanted.
> > 
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
> 
> Well, the way I see it right now, idr and this tag allocator are doing
> two somewhat different things and I'm really not sure how one piece of
> code could do both jobs.
> 
> Supposing idr could be made percpu and just as fast as my tag allocator:

We don't know how fast either is, nor what the performance requirements are.

> the problem is, using idr where right now I'd use the tag allocator
> forces you to do two allocations instead of one:
> 
> First, you allocate your idr slot - but now that has to point to
> something, so you need a kmem_cache_alloc() too.

Only a single allocation is needed - for the bit used to reserve
ida_get_new_above()'s ID, plus epsilon for idr metadata. 
ida_get_new_above() will call kmalloc approximately 0.001 times per
invocation.

> So right now at least I honestly think letting the tag allocator and idr
> be distinct things is probably the way to go.

That depends on performance testing results and upon performance
requirements.  Neither are known at this time.

> (now someone will point out to me all the fancy percpu optimizations in
> idr I missed).

idr use percpu data for reliability, not for performance.

Cross-cpu costs might be significant.  There are surely ways to reduce
them.  I could suggest one but I've found that when I make a
suggestion, others busily work on shooting down my suggestion while
avoiding thinking of their own ideas.

> > I think all this could be done with test_and_set_bit() and friends,
> > btw.  xchg() hurts my brain.
> 
> Ahh, you were talking about with a slightly bigger rework.

Not really.  Just that the xchg() in this code could be replaced in a
straightforward fashion with test_and_set_bit().  Or maybe not.

> > > You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> > > allocation, I suppose.
> > > 
> > > Did you have something else in mind that could be implemented? I don't
> > > want to add code for a reserve to this code - IMO, if a reserve is
> > > needed it should be done elsewhere (like how mempools work on top of
> > > existing allocators).
> > 
> > Dunno, really - I'm just wondering what the implications of an
> > allocation failure will be.  I suspect it's EIO?  Which in some
> > circumstances could be as serious as total system failure (loss of
> > data), so reliability/robustness is a pretty big deal.
> 
> Ahh. That's just outside the scope of this code - IME, in driver code
> GFP_NOWAIT allocations are not the norm - most tags are created when
> you're submitting a bio or processing a request, and then you're in
> process context. But in your error handling code you could also need to
> allocate tags to resubmit things - that'd be an issue if you're silly
> enough to stick all your error handling code in your interrupt handling
> path.

Our experience with alloc_pages(GFP_ATOMIC) has no relevance to this
code, because this code doesn't call alloc_pages()!  Instead of failing
if page reserves are exhausted, it will fail if no tags are available
in this new data structure.

The reliability (or otherwise) of alloc_pages(GFP_ATOMIC) is well
understood whereas the reliability of percpu_tag_alloc(GFP_ATOMIC) is
not understood at all.  So let us think about both this and about the
consequences of allocation failures.

> > Another thing: CPU hot-unplug.  If it's "don't care" then the
> > forthcoming changelog should thoroughly explain why, please.  Otherwise
> > it will need a notifier to spill back any reservation.
> 
> Oh - doesn't care, becasue steal_tags() will eventually get it. We
> _could_ satisfy more allocations if we had a notifier - but we'll still
> meet our guarantees and it's absolutely not a correctness issue, so I
> lean towards just leaving it out.

I agree.

Re: [PATCH] PCI: Remove not needed check in disable aspm link

2013-06-12 Thread Yinghai Lu
On Wed, Jun 12, 2013 at 8:50 PM, Bjorn Helgaas  wrote:
> On Wed, Jun 12, 2013 at 1:41 PM, Yinghai Lu  wrote:
>
> I think you're saying that in systems that support both acpiphp and
> pciehp, we should be using pciehp, but we currently use acpiphp.  If
> so, that's certainly a bug.  How serious is it?  Is it a disaster if
> we use acpiphp until we can resolve this cleanly?  Are there a lot of
> systems that claim to support acpiphp but it doesn't actually work?

Not sure. Making acpiphp work would need more expertise in the BIOS.
Normally the BIOS vendor would have half-done work there, and the
OEM or system vendor will need someone to make it work.
You would not want to read the ASL code in the DSDT to help them out.
That is not something that we can control.

Yinghai


Re: [PATCH 09/15] net: can: Convert to use devm_ioremap_resource

2013-06-12 Thread Sachin Kamat
On 13 June 2013 11:00, Tushar Behera  wrote:
> On 06/10/2013 05:05 PM, Tushar Behera wrote:
>> Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
>> introduced devm_ioremap_resource() and deprecated the use of
>> devm_request_and_ioremap().
>>
>> Signed-off-by: Tushar Behera 
>> CC: net...@vger.kernel.org
>> CC: linux-...@vger.kernel.org
>> CC: Marc Kleine-Budde 
>> CC: Wolfgang Grandegger 
>> ---
>>  drivers/net/can/c_can/c_can_platform.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/can/c_can/c_can_platform.c 
>> b/drivers/net/can/c_can/c_can_platform.c
>> index 6b6130b..b918c73 100644
>> --- a/drivers/net/can/c_can/c_can_platform.c
>> +++ b/drivers/net/can/c_can/c_can_platform.c
>> @@ -201,8 +201,8 @@ static int c_can_plat_probe(struct platform_device *pdev)
>>   priv->instance = pdev->id;
>>
>>   res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
>> - priv->raminit_ctrlreg = devm_request_and_ioremap(&pdev->dev, 
>> res);
>> - if (!priv->raminit_ctrlreg || priv->instance < 0)
>> + priv->raminit_ctrlreg = devm_ioremap_resource(&pdev->dev, res);
>> + if (IS_ERR(priv->raminit_ctrlreg) || priv->instance < 0)
>>   dev_info(&pdev->dev, "control memory is not used for 
>> raminit\n");
>>   else
>>   priv->raminit = c_can_hw_raminit;
>>
>
> IS_ERR() check on the return value of devm_ioremap_resource produces
> following sparse warning. This is because of the __iomem address space
> attribute. How should we fix this?
>
> drivers/net/can/c_can/c_can_platform.c:205:32: warning: incorrect type
> in argument 1 (different address spaces)
> drivers/net/can/c_can/c_can_platform.c:205:32:expected void const *ptr
> drivers/net/can/c_can/c_can_platform.c:205:32:got unsigned int
> [noderef] [usertype] *raminit_ctrlreg
>

This is a known issue since the time devm_ioremap_resource was
introduced. Thierry had already submitted some fixes for this.
I am not sure if they have been applied yet.
More details here:
https://patchwork.kernel.org/patch/2071361/

-- 
With warm regards,
Sachin


Re: [PATCH v4] cpufreq: fix governor start/stop race condition

2013-06-12 Thread Xiaoguang Chen
2013/6/12 Viresh Kumar :
> On 12 June 2013 14:39, Xiaoguang Chen  wrote:
>
>> ret = policy->governor->governor(policy, event);
>
> We again reached to the same problem. We shouldn't call
> this between taking locks, otherwise recursive locks problems
> would be seen again.

But this is not the same lock as in the deadlock case; it is a new lock,
used only in this function. No other functions use this lock.
I don't see how we could get a deadlock again.

Thanks
Xiaoguang


Re: [PATCH 1/3] skbuff: Update truesize in pskb_expand_head

2013-06-12 Thread Eric Dumazet
On Thu, 2013-06-13 at 09:35 +1000, Dave Wiltshire wrote:

> Firstly, from my cover letter: "Perhaps I don't understand something,
> but I thought it best to generate the change and then ask. So is this
> correct?".

Sure I have no problems with that.

>  But secondly, I understand that the only reason for truesize
> is for memory accounting on sockets. Indeed that's why I thought this
> was incorrect. Something being complex is not a good reason not to do
> it.

OK but right now your patch adds many regressions, that will take weeks
for other dev to discover, understand and fix.

If you change skb->truesize not properly while skb is owned by a socket,
we can hold references forever and prevent sockets from being
dismantled/freed, or worse we'll free sockets while they are still in
use and panic the machine.

Some callers of pskb_expand_head() do not want skb->truesize to be
modified, because skb might be orphaned or not.

There are hundred of call sites.

Really, your patch is way too risky and will consume too much time for
very little gain. We cannot change conventions without a full audit.

There are some cases where truesize can not be exactly tracked.
For example, when we pull all data from a frag into skb->head, and the
frag can be release (put_page() on it), we do not know its original size
and can not reduce skb->truesize by this amount.
Thats fine, most of the time, because we pull network headers and they
are limited in size.

Your changelog used : "This is likely a memory audit leak", which is
weak considering the amount of time needed for us to validate such a big
change.





Re: [PATCH V4 3/3] ARM: dts: Add pcie controller node for Samsung EXYNOS5440 SoC

2013-06-12 Thread Jingoo Han
On Wednesday, June 12, 2013 7:56 PM, Arnd Bergmann wrote:
> 
> Thanks for the update! A few comments again:
> 
> On Wednesday 12 June 2013 19:20:00 Jingoo Han wrote:
> >
> > diff --git a/arch/arm/boot/dts/exynos5440-ssdk5440.dts 
> > b/arch/arm/boot/dts/exynos5440-ssdk5440.dts
> > index d55042b..efe7d39 100644
> > --- a/arch/arm/boot/dts/exynos5440-ssdk5440.dts
> > +++ b/arch/arm/boot/dts/exynos5440-ssdk5440.dts
> > @@ -30,4 +30,12 @@
> > clock-frequency = <5000>;
> > };
> > };
> > +
> > +   pcie0@4000 {
> > +   reset-gpio = <5>;
> > +   };
> > +
> > +   pcie1@6000 {
> > +   reset-gpio = <22>;
> > +   };
> >  };
> 
> As mentioned before, please use the gpio binding to pass gpio numbers.

OK, I will use gpio binding.

> 
> > diff --git a/arch/arm/boot/dts/exynos5440.dtsi 
> > b/arch/arm/boot/dts/exynos5440.dtsi
> > index f6b1c89..2c15f9d 100644
> > --- a/arch/arm/boot/dts/exynos5440.dtsi
> > +++ b/arch/arm/boot/dts/exynos5440.dtsi
> > @@ -216,4 +216,42 @@
> > clock-names = "rtc";
> > status = "disabled";
> > };
> > +
> > +   pcie0@0x29 {
> > +   compatible = "samsung,exynos5440-pcie";
> > +   reg = <0x29 0x1000
> > +   0x27 0x1000
> > +   0x271000 0x40>;
> > +   interrupts = <0 20 0>, <0 21 0>, <0 22 0>;
> > +   clocks = <&clock 28>, <&clock 27>;
> > +   clock-names = "pcie", "pcie_bus";
> > +   #address-cells = <3>;
> > +   #size-cells = <2>;
> > +   device_type = "pci";
> > +   ranges = <0x0800 0 0x4000 0x4000 0 0x0020   
> > /* configuration space */
> > + 0x8100 0 0  0x4020 0 0x0001   
> > /* downstream I/O */
> > + 0x8200 0 0x4021 0x4021 0 0x1000>; 
> > /* non-prefetchable memory */
> 
> I think you did not reply to my question regarding the size of the
> memory space. Does it extend from 0x4021 to 0x5021,
> or from 0x4021 to 0x5000. You probably meant the latter
> but wrote the former. If not, please add a comment for clarification.

OK, I see.
It extends to 0x6000. I will modify it.

> 
> > +   #interrupt-cells = <1>;
> > +   interrupt-map-mask = <0 0 0 0>;
> > +   interrupt-map = <0x0 0 &gic 53>;
> > +   };
> 
> So all PCI IntA interrupts are mapped to a single gic interrupt? That
> sounds like a bottleneck when you have a lot of devices on the bus.
> Do you have MSI support?

INTA, INTB, INTC, and INTD are mapped to a single gic interrupt.
Exynos5440 PCIe has MSI support; however, MSI support patch will
be posted later.

> 
>   Arnd



Re: [PATCH 09/15] net: can: Convert to use devm_ioremap_resource

2013-06-12 Thread Tushar Behera
On 06/10/2013 05:05 PM, Tushar Behera wrote:
> Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
> introduced devm_ioremap_resource() and deprecated the use of
> devm_request_and_ioremap().
> 
> Signed-off-by: Tushar Behera 
> CC: net...@vger.kernel.org
> CC: linux-...@vger.kernel.org
> CC: Marc Kleine-Budde 
> CC: Wolfgang Grandegger 
> ---
>  drivers/net/can/c_can/c_can_platform.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/can/c_can/c_can_platform.c 
> b/drivers/net/can/c_can/c_can_platform.c
> index 6b6130b..b918c73 100644
> --- a/drivers/net/can/c_can/c_can_platform.c
> +++ b/drivers/net/can/c_can/c_can_platform.c
> @@ -201,8 +201,8 @@ static int c_can_plat_probe(struct platform_device *pdev)
>   priv->instance = pdev->id;
>  
>   res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
> - priv->raminit_ctrlreg = devm_request_and_ioremap(&pdev->dev, 
> res);
> - if (!priv->raminit_ctrlreg || priv->instance < 0)
> + priv->raminit_ctrlreg = devm_ioremap_resource(&pdev->dev, res);
> + if (IS_ERR(priv->raminit_ctrlreg) || priv->instance < 0)
>   dev_info(&pdev->dev, "control memory is not used for 
> raminit\n");
>   else
>   priv->raminit = c_can_hw_raminit;
> 

IS_ERR() check on the return value of devm_ioremap_resource produces
following sparse warning. This is because of the __iomem address space
attribute. How should we fix this?

drivers/net/can/c_can/c_can_platform.c:205:32: warning: incorrect type
in argument 1 (different address spaces)
drivers/net/can/c_can/c_can_platform.c:205:32:expected void const *ptr
drivers/net/can/c_can/c_can_platform.c:205:32:got unsigned int
[noderef] [usertype] *raminit_ctrlreg


-- 
Tushar Behera


Re: [RFC: PATCH] err.h: silence warning when using IS_ERR on void __iomem *

2013-06-12 Thread Sachin Kamat
On 13 June 2013 03:01, Marc Kleine-Budde  wrote:
> Commit 75096579c3ac ("lib: devres: Introduce devm_ioremap_resource()")
> introduced devm_ioremap_resource() and encourage to check its return value 
> with
> IS_ERR(). This however leads to the following sparse warnings, as
> devm_ioremap_resource() returns a void __iomem pointer:
>
> drivers/net/can/c_can/c_can_platform.c:205:32: warning: incorrect type in 
> argument 1 (different address spaces)
> drivers/net/can/c_can/c_can_platform.c:205:32:expected void const *ptr
> drivers/net/can/c_can/c_can_platform.c:205:32:got unsigned int [noderef] 
> [usertype] *raminit_ctrlreg

CC ing Thierry who has solved this issue some time back.

-- 
With warm regards,
Sachin


Re: [PATCH v2 3/3] arm: dt: zynq: Add support for the zed platform

2013-06-12 Thread Michal Simek
On 06/12/2013 10:16 PM, Sören Brinkmann wrote:
> On Wed, Jun 12, 2013 at 09:33:58PM +0200, Steffen Trumtrar wrote:
>> On Wed, Jun 12, 2013 at 11:26:34AM -0700, Sören Brinkmann wrote:
>>> On Wed, Jun 12, 2013 at 08:23:45PM +0200, Steffen Trumtrar wrote:
 On Wed, Jun 12, 2013 at 09:41:08AM -0700, Soren Brinkmann wrote:
> Add a DT fragment for the Zed Zynq platform and a corresponding
> target to the Makefile
>
> Signed-off-by: Soren Brinkmann 
> ---
> I used the 'xlnx,...' compat strings since it seems this is what is
> used in the Xilinx and Digilent vendor trees.
>
> +/include/ "zynq-7000.dtsi"
> +
> +/ {
> + model = "Zynq Zed Development Board";
> + compatible = "xlnx,zynq-zed", "xlnx,zynq-7000";
> +
> + memory {
> + device_type = "memory";
> + reg = <0 0x2000>;
> + };
> +
> + chosen {
> + bootargs = "console=ttyPS1,115200 earlyprintk";
> + };
> +
> +};

 Hi!

This looks a little bit too basic. No?! Not even a UART?
>>> The UART is imported from the common zynq-7000.dtsi.
>>
>> Hm, you are actually right, although I think that you shouldn't be.
>> It is possible to NOT use the UARTs, isn't it? So, default on for both UARTs 
>> is wrong.
> Well, in that case the dtsi has to be fixed to add the 'status = "disabled"
> property to the UARTs which then can be overridden in the board dts files
> as needed. I guess I'll prepare another patch for a v3 for adding status
> properties to the UART nodes.

I am OK with this. Just be aware that we will probably need to start
using port-number to reflect which port is first and which is second.


 The compatible should include digilent or avnet. Digilent only sells to 
 academic
 customers, Avnet doesn't.
>>> I don't care at all. So, who makes the decision which one is the correct
>>> one? Actually we could even drop the zed specific one completely and go
>>> with 'xlnx,zynq-7000' only.
>>
>> I'm okay with that.
> Okay, let's wait a little and see if there are other opinions and then I
> can prepare a v3.

AFAIK Digilent is the producer and Avnet is a reseller. I think it is enough
to write ZedBoard in the description and not mention the manufacturer.

By my previous comment I meant having just the zynq-7000 compatible string
there, without any additional zc702/zc706 or zed board properties.

Thanks,
Michal

-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform






RE: [PATCH 1/1] thermal: consider emul_temperature while computing trend

2013-06-12 Thread R, Durgadoss
> -Original Message-
> From: linux-pm-ow...@vger.kernel.org [mailto:linux-pm-
> ow...@vger.kernel.org] On Behalf Of Eduardo Valentin
> Sent: Thursday, May 30, 2013 3:07 AM
> To: Zhang, Rui
> Cc: Eduardo Valentin; Amit Daniel Kachhap; R, Durgadoss; linux-
> p...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 1/1] thermal: consider emul_temperature while computing
> trend
> 
> In case emulated temperature is in use, using the trend
> provided by driver layer can lead to bogus situation.
> In this case, debugger user would set a temperature value,
> but the trend would be from driver computation.
> 
> To avoid this situation, this patch changes the get_tz_trend()
> to consider the emulated temperature whenever that is in use.

Nice catch.
Sorry, I missed looking at it earlier. Just got woken up by Rui's reply
on this. The patch looks fine to me.

Thanks,
Durga
> 
> Cc: Zhang Rui 
> Cc: Amit Daniel Kachhap 
> Cc: Durgadoss R 
> Cc: linux...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Eduardo Valentin 
> ---
>  drivers/thermal/thermal_core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/thermal/thermal_core.c
> b/drivers/thermal/thermal_core.c
> index d755440..c00dc92 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -155,7 +155,8 @@ int get_tz_trend(struct thermal_zone_device *tz, int
> trip)
>  {
>   enum thermal_trend trend;
> 
> - if (!tz->ops->get_trend || tz->ops->get_trend(tz, trip, &trend)) {
> + if (tz->emul_temperature || !tz->ops->get_trend ||
> + tz->ops->get_trend(tz, trip, &trend)) {
>   if (tz->temperature > tz->last_temperature)
>   trend = THERMAL_TREND_RAISING;
>   else if (tz->temperature < tz->last_temperature)
> --
> 1.8.2.1.342.gfa7285d
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] scsi: ufs: Set fDeviceInit flag to initiate device initialization

2013-06-12 Thread Sujit Reddy Thumma

On 6/12/2013 11:04 AM, Santosh Y wrote:


  /**
+ *  ufshcd_query_request() - API for issuing query request to the device.
+ *  @hba: ufs driver context
+ *  @query: params for query request
+ *  @descriptor: buffer for sending/receiving descriptor
+ *  @retries: number of times to try executing the command
+ *
+ *   All necessary fields for issuing a query and receiving its response
+ *   are stored in the UFS hba struct. We can use this method since we know
+ *   there is only one active query request or any internal command at all
+ *   times.
+ */
+static int ufshcd_send_query_request(struct ufs_hba *hba,
+   struct ufs_query_req *query,
+   u8 *descriptor,
+   struct ufs_query_res *response)
+{
+   int ret;
+
+   BUG_ON(!hba);
+   if (!query || !response) {
+   dev_err(hba->dev,
+   "%s: NULL pointer query = %p, response = %p\n",
+   __func__, query, response);
+   return -EINVAL;
+   }
+
+   mutex_lock(&hba->i_cmd.dev_cmd_lock);
+   hba->i_cmd.query.request = query;
+   hba->i_cmd.query.response = response;
+   hba->i_cmd.query.descriptor = descriptor;
+
+   ret = ufshcd_exec_internal_cmd(hba, DEV_CMD_TYPE_QUERY,
+   QUERY_REQ_TIMEOUT);


Can this be generic, as external query commands might also use it?


External query commands can call ufshcd_send_query_request() directly,
without going through the hassle of taking the mutex lock and filling the
internal cmd structure.





+
+   hba->i_cmd.query.request = NULL;
+   hba->i_cmd.query.response = NULL;
+   hba->i_cmd.query.descriptor = NULL;
+   mutex_unlock(&hba->i_cmd.dev_cmd_lock);
+
+   return ret;
+}
+
+/**





--
Regards,
Sujit


Re: power-efficient scheduling design

2013-06-12 Thread Preeti U Murthy
Hi,

On 06/11/2013 06:20 AM, Rafael J. Wysocki wrote:
> 
> OK, so let's try to take one step more and think about what part should belong
> to the scheduler and what part should be taken care of by the "idle" driver.
> 
> Do you have any specific view on that?

I gave it some thought and went through Ingo's mail once again. I have
some view points which I have stated at the end of this mail.

>> Of course, you could export more scheduler information to cpuidle,
>> various hooks (task wakeup etc.) but then we have another framework,
>> cpufreq. It also decides the CPU parameters (frequency) based on the
>> load controlled by the scheduler. Can cpufreq decide whether it's
>> better to keep the CPU at higher frequency so that it gets to idle
>> quicker and therefore deeper sleep states? I don't think it has enough
>> information because there are at least three deciding factors
>> (cpufreq, cpuidle and scheduler's load balancing) which are not
>> unified.
>
> Why not? When the cpu load is high, cpu frequency governor knows it has
> to boost the frequency of that CPU. The task gets over quickly, the CPU
> goes idle. Then the cpuidle governor kicks in to put the CPU to deeper
> sleep state gradually.

 The cpufreq governor boosts the frequency enough to cover the load,
 which means reducing the idle time. It does not know whether it is
 better to boost the frequency twice as high so that it gets to idle
 quicker. You can change the governor's policy but does it have any
 information from cpuidle?
>>>
>>> Well, it may get that information directly from the hardware.  Actually,
>>> intel_pstate does that, but intel_pstate is the governor and the scaling
>>> driver combined.
>>
>> To add to this, cpufreq currently functions in the below fashion. I am
>> talking of the on demand governor, since it is more relevant to our
>> discussion.
>>
>>             stepped up frequency
>> ----------- threshold -----------
>>             stepped down freq level1
>>             stepped down freq level2
>>             stepped down freq level3
>>
>> If the cpu idle time is below a threshold, it boosts the frequency to the
>> stepped-up value at once; above the threshold, the frequency is stepped
>> down gradually through the levels shown.
> 
> Did you mean "above the threshold"?

No, I meant "below". I am referring to the cpu *idle* time.

>> Also an idea about how cpu frequency governor can decide on the scaling
>> frequency is stated above.
> 
> Actually, intel_pstate uses a PID controller for making those decisions and
> I think this may be just the right thing to do.

But don't you think we need to include the current cpu load during this
decision making as well? I mean a fn(idle_time) logic in cpu frequency
governor, which is currently absent. Today, it just checks if idle_time
< threshold, and sets one specific frequency. Of course the PID could
then make the decision about the frequencies which can be candidates for
scaling up, but cpu freq governor could decide which among these to pick
based on fn(idle_time).
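To make the PID-style decision making mentioned above concrete, here is a minimal userspace sketch of such a controller. The structure, gains, and scaling are purely illustrative assumptions, not intel_pstate's actual implementation.

```c
#include <assert.h>

/* Hypothetical PID controller for frequency selection, loosely in the
 * spirit of what is discussed above; all names and gains are made up. */
struct pid_ctrl {
	double kp, ki, kd;	/* proportional/integral/derivative gains */
	double integral;	/* accumulated error */
	double last_err;	/* previous error, for the derivative term */
};

/* setpoint/measured are busy fractions in [0, 1]; the result is a
 * frequency adjustment in kHz (positive means "raise the frequency"). */
static double pid_step(struct pid_ctrl *p, double setpoint, double measured)
{
	double err = measured - setpoint;	/* busier than target => positive */
	double deriv = err - p->last_err;

	p->integral += err;
	p->last_err = err;
	return p->kp * err + p->ki * p->integral + p->kd * deriv;
}
```

A governor built this way would feed the measured utilisation (a fn(idle_time)) into pid_step() each sampling period and clamp the result to the available frequency table.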

> 
> [...]
> 
>>>
>>> Well, there's nothing like "predicted load".  At best, we may be able to
>>> make more or less educated guesses about it, so in my opinion it is better
>>> to use the information about what happened in the past for making decisions
>>> regarding the current settings and re-adjust them over time as we get more
>>> information.
>>
>> Agree with this as well. The scheduler can at best supply information
>> regarding the historic load and hope that it is what defines the future
>> as well. Apart from this I don't know what other information the scheduler
>> can supply the cpuidle governor with.
>>>
>>> So how much decision making regarding the idle state to put the given CPU
>>> into should be there in the scheduler?  I believe the only information
>>> coming out of the scheduler regarding that should be "OK, this CPU is now
>>> idle and I'll need it in X nanoseconds from now" plus possibly a hint about
>>> the wakeup latency tolerance (but those hints may come from other places
>>> too).  That said the decision *which* CPU should become idle at the moment
>>> very well may require some information about what options are available
>>> from the layer below (for example, "putting core X into idle for Y of time
>>> will save us Z energy" or something like that).
>>
>> Agree. Except that the information should be "Ok , this CPU is now idle
>> and it has not done much work in the recent past,it is a 10% loaded CPU".
> 
> And what would that be useful for to the "idle" layer?  What matters is the
> "I'll need it in X nanoseconds from now" part.
> 
> Yes, the load part would be interesting to the "frequency" layer.

>>> What if we could integrate cpuidle with cpufreq so that there is one code
>>> layer representing what the hardware can do to the scheduler?  What benefits
>>> can we get from that, if any?
>>
>> We could debate on this point. I am a bit confused about this. As I see
>> it, there is no problem with keeping them separate

Re: [PATCH 1/2] scsi: ufs: Add support for sending NOP OUT UPIU

2013-06-12 Thread Sujit Reddy Thumma

On 6/12/2013 11:00 AM, Santosh Y wrote:

+/*
+ * ufshcd_wait_for_register - wait for register value to change
+ * @hba - per-adapter interface
+ * @reg - mmio register offset
+ * @mask - mask to apply to read register value
+ * @val - wait condition
+ * @interval_us - polling interval in microsecs
+ * @timeout_ms - timeout in millisecs
+ *
+ * Returns final register value after iteration
+ */
+static u32 ufshcd_wait_for_register(struct ufs_hba *hba, u32 reg, u32 mask,
+   u32 val, unsigned long interval_us, unsigned long timeout_ms)
+{
+   u32 tmp;
+   ktime_t start;
+   unsigned long diff;
+
+   tmp = ufshcd_readl(hba, reg);
+
+   start = ktime_get();
+   while ((tmp & mask) == val) {



...as now I notice it, 'val' is the wait condition and the loop
continues if the wait condition is met. I feel it's a bit confusing.
Wouldn't something like (x != wait_condition) be appropriate?


Makes sense.




+   /* wakeup within 50us of expiry */
+   usleep_range(interval_us, interval_us + 50);
+   tmp = ufshcd_readl(hba, reg);
+   diff = ktime_to_ms(ktime_sub(ktime_get(), start));
+   if (diff > timeout_ms) {
+   tmp = ufshcd_readl(hba, reg);


Why this extra read? The register value might have changed during the
execution of the 2 previous statements, is that the assumption?


Yes, if there is a preemption between the last register read and the
diff calculation and the CPU comes back after a long time, it can happen
that diff is greater than the timeout and we enter the if condition. So, it
is better to read the value after the timeout and return it to the caller.





+   break;
+   }
+   }
+
+   return tmp;
+}
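The inverted loop condition agreed on above ("loop while the wait condition is not yet met") can be sketched in a self-contained form. The "register" here is a plain variable and the timeout is reduced to a poll budget so the sketch compiles standalone; the real code would usleep_range() between reads and compare ktime deltas, as in the patch.

```c
#include <assert.h>

/* Sketch of the polling helper with (tmp & mask) != val as the loop
 * condition, i.e. keep polling while the value has NOT reached the wait
 * condition.  Simplified stand-in, not the actual ufshcd code. */
static unsigned int wait_for_value(volatile unsigned int *reg,
				   unsigned int mask, unsigned int val,
				   unsigned int max_polls)
{
	unsigned int tmp = *reg;

	while ((tmp & mask) != val && max_polls--) {
		/* usleep_range(interval_us, interval_us + 50) would go here */
		tmp = *reg;
	}
	return tmp;	/* caller checks whether the condition was met */
}
```

Returning the final value, rather than a success flag, matches the design above: the caller re-applies the mask to decide between success and timeout.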
+
  /**
   * ufshcd_get_intr_mask - Get the interrupt bit mask
   * @hba - Pointer to adapter instance
@@ -223,18 +267,13 @@ static inline void ufshcd_free_hba_memory(struct ufs_hba 
*hba)
  }

  /**
- * ufshcd_is_valid_req_rsp - checks if controller TR response is valid
+ * ufshcd_get_req_rsp - returns the TR response
   * @ucd_rsp_ptr: pointer to response UPIU
- *
- * This function checks the response UPIU for valid transaction type in
- * response field
- * Returns 0 on success, non-zero on failure
   */
  static inline int
-ufshcd_is_valid_req_rsp(struct utp_upiu_rsp *ucd_rsp_ptr)
+ufshcd_get_req_rsp(struct utp_upiu_rsp *ucd_rsp_ptr)
  {
-   return ((be32_to_cpu(ucd_rsp_ptr->header.dword_0) >> 24) ==
-UPIU_TRANSACTION_RESPONSE) ? 0 : DID_ERROR << 16;
+   return be32_to_cpu(ucd_rsp_ptr->header.dword_0) >> 24;
  }

  /**
@@ -331,9 +370,9 @@ static inline void ufshcd_copy_sense_data(struct ufshcd_lrb 
*lrbp)
  {
 int len;
 if (lrbp->sense_buffer) {
-   len = be16_to_cpu(lrbp->ucd_rsp_ptr->sense_data_len);
+   len = be16_to_cpu(lrbp->ucd_rsp_ptr->sc.sense_data_len);
 memcpy(lrbp->sense_buffer,
-   lrbp->ucd_rsp_ptr->sense_data,
+   lrbp->ucd_rsp_ptr->sc.sense_data,
 min_t(int, len, SCSI_SENSE_BUFFERSIZE));
 }
  }
@@ -551,76 +590,128 @@ static void ufshcd_disable_intr(struct ufs_hba *hba, u32 
intrs)
  }

  /**
+ * ufshcd_prepare_req_desc() - Fills the requests header
+ * descriptor according to request
+ * @lrbp: pointer to local reference block
+ * @upiu_flags: flags required in the header
+ * @cmd_dir: requests data direction
+ */
+static void ufshcd_prepare_req_desc(struct ufshcd_lrb *lrbp, u32 *upiu_flags,
+   enum dma_data_direction cmd_dir)
+{
+   struct utp_transfer_req_desc *req_desc = lrbp->utr_descriptor_ptr;
+   u32 data_direction;
+   u32 dword_0;
+
+   if (cmd_dir == DMA_FROM_DEVICE) {
+   data_direction = UTP_DEVICE_TO_HOST;
+   *upiu_flags = UPIU_CMD_FLAGS_READ;
+   } else if (cmd_dir == DMA_TO_DEVICE) {
+   data_direction = UTP_HOST_TO_DEVICE;
+   *upiu_flags = UPIU_CMD_FLAGS_WRITE;
+   } else {
+   data_direction = UTP_NO_DATA_TRANSFER;
+   *upiu_flags = UPIU_CMD_FLAGS_NONE;
+   }
+
+   dword_0 = data_direction | (lrbp->command_type
+   << UPIU_COMMAND_TYPE_OFFSET);
+   if (lrbp->intr_cmd)
+   dword_0 |= UTP_REQ_DESC_INT_CMD;
+
+   /* Transfer request descriptor header fields */
+   req_desc->header.dword_0 = cpu_to_le32(dword_0);
+
+   /*
+* assigning invalid value for command status. Controller
+* updates OCS on command completion, with the command
+* status
+*/
+   req_desc->header.dword_2 =
+   cpu_to_le32(OCS_INVALID_COMMAND_STATUS);
+}
+
+/**
+ * ufshcd_prepare_utp_scsi_cmd_upiu() - fills the utp_transfer_req_desc,
+ * for scsi commands
+ * @lrbp - local reference block pointer
+ * @upiu_flags - flags
+ */
+static
+void ufshcd_prepare_utp_scsi_cmd_upiu(struct ufshcd_lrb *lrbp,

[net-next PATCH 2/2] macvtap: fix uninitialized return value in macvtap_ioctl_set_queue()

2013-06-12 Thread Jason Wang
Return -EINVAL on an illegal flag instead of an uninitialized value. This fixes
the kbuild test warning.

Signed-off-by: Jason Wang 
---
 drivers/net/macvtap.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index acf55ba..9f03ce3 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -969,6 +969,8 @@ static int macvtap_ioctl_set_queue(struct file *file, 
unsigned int flags)
ret = macvtap_enable_queue(vlan->dev, file, q);
else if (flags & IFF_DETACH_QUEUE)
ret = macvtap_disable_queue(q);
+   else
+   ret = -EINVAL;
 
macvtap_put_vlan(vlan);
return ret;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[net-next PATCH 1/2] macvtap: silence sparse warnings

2013-06-12 Thread Jason Wang
This patch silences the following sparse warnings:

drivers/net/macvtap.c:98:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:98:9:expected struct macvtap_queue *
drivers/net/macvtap.c:98:9:got struct macvtap_queue [noderef]
*
drivers/net/macvtap.c:120:9: warning: incorrect type in assignment (different
address spaces)
drivers/net/macvtap.c:120:9:expected struct macvtap_queue *
drivers/net/macvtap.c:120:9:got struct macvtap_queue [noderef]
*
drivers/net/macvtap.c:151:22: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:233:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:243:23: error: incompatible types in comparison expression
(different address spaces)
drivers/net/macvtap.c:247:15: error: incompatible types in comparison expression
(different address spaces)
  CC [M]  drivers/net/macvtap.o
drivers/net/macvlan.c:232:24: error: incompatible types in comparison expression
(different address spaces)

Signed-off-by: Jason Wang 
---
 drivers/net/macvtap.c  |2 +-
 include/linux/if_macvlan.h |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 992151c..acf55ba 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -429,7 +429,7 @@ static int macvtap_open(struct inode *inode, struct file 
*file)
if (!q)
goto out;
 
-   q->sock.wq = &q->wq;
+   rcu_assign_pointer(q->sock.wq, &q->wq);
init_waitqueue_head(&q->wq.wait);
q->sock.type = SOCK_RAW;
q->sock.state = SS_CONNECTED;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 1912133..f49a9f6 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -70,7 +70,7 @@ struct macvlan_dev {
int (*receive)(struct sk_buff *skb);
int (*forward)(struct net_device *dev, struct sk_buff *skb);
/* This array tracks active taps. */
-   struct macvtap_queue*taps[MAX_MACVTAP_QUEUES];
+   struct macvtap_queue__rcu *taps[MAX_MACVTAP_QUEUES];
/* This list tracks all taps (both enabled and disabled) */
struct list_headqueue_list;
int numvtaps;
-- 
1.7.1
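The `__rcu` annotation added above only makes sparse check that the pointer is accessed through the RCU accessors; the runtime half of the contract is the publish/consume ordering that rcu_assign_pointer() provides. A userspace analogue of that ordering, using C11 atomics with invented names, looks like this:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace analogue of rcu_assign_pointer()'s ordering contract: a
 * release-store publishes a fully initialised object, and readers pick it
 * up with an acquire/consume load (rcu_dereference() in the kernel).
 * The struct and function names here are illustrative only. */
struct queue {
	int id;
};

static _Atomic(struct queue *) published;

static void publish(struct queue *q)
{
	/* all prior stores to *q happen-before this release store */
	atomic_store_explicit(&published, q, memory_order_release);
}

static struct queue *read_published(void)
{
	return atomic_load_explicit(&published, memory_order_acquire);
}
```

The point of the annotation in the patch is that sparse can now flag any access to the `taps[]` pointers that bypasses such accessors.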



[PATCH v6] ARM: dts: OMAP5: Add Palmas MFD node and regulator nodes

2013-06-12 Thread J Keerthy
This patch adds Palmas MFD node and the regulator nodes for OMAP5.

The node definitions are based on: https://lkml.org/lkml/2013/6/6/25

Boot tested on omap5-uevm board.

Signed-off-by: Graeme Gregory 
Signed-off-by: J Keerthy 
Reviewed-by: Stephen Warren 
---
V6:
Changed the order of properties.

V5:
Corrected the IRQ_TYPE flag for OMAP5 board.

V4:
Removed splitting Palmas node.

V3:
Moved the entire Palmas device tree node to board file.

 arch/arm/boot/dts/omap5-uevm.dts |  167 ++
 1 files changed, 167 insertions(+), 0 deletions(-)

diff --git a/arch/arm/boot/dts/omap5-uevm.dts b/arch/arm/boot/dts/omap5-uevm.dts
index 927db1e..3d0b7b6 100644
--- a/arch/arm/boot/dts/omap5-uevm.dts
+++ b/arch/arm/boot/dts/omap5-uevm.dts
@@ -8,6 +8,8 @@
 /dts-v1/;
 
 #include "omap5.dtsi"
+#include 
+#include 
 
 / {
model = "TI OMAP5 uEVM board";
@@ -254,6 +256,171 @@
pinctrl-0 = <&i2c1_pins>;
 
clock-frequency = <40>;
+
+   palmas: palmas@48 {
+   compatible = "ti,palmas";
+   interrupts = ; /* IRQ_SYS_1N */
+   interrupt-parent = <&gic>;
+   reg = <0x48>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+
+   palmas_pmic {
+   compatible = "ti,palmas-pmic";
+   interrupt-parent = <&palmas>;
+   interrupts = <14 IRQ_TYPE_NONE>;
+   interrupt-name = "short-irq";
+
+   ti,ldo6-vibrator;
+
+   regulators {
+   smps123_reg: smps123 {
+   regulator-name = "smps123";
+   regulator-min-microvolt = < 60>;
+   regulator-max-microvolt = <150>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   smps45_reg: smps45 {
+   regulator-name = "smps45";
+   regulator-min-microvolt = < 60>;
+   regulator-max-microvolt = <131>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   smps6_reg: smps6 {
+   regulator-name = "smps6";
+   regulator-min-microvolt = <120>;
+   regulator-max-microvolt = <120>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   smps7_reg: smps7 {
+   regulator-name = "smps7";
+   regulator-min-microvolt = <180>;
+   regulator-max-microvolt = <180>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   smps8_reg: smps8 {
+   regulator-name = "smps8";
+   regulator-min-microvolt = < 60>;
+   regulator-max-microvolt = <131>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   smps9_reg: smps9 {
+   regulator-name = "smps9";
+   regulator-min-microvolt = <210>;
+   regulator-max-microvolt = <210>;
+   regulator-always-on;
+   regulator-boot-on;
+   ti,smps-range = <0x80>;
+   };
+
+   smps10_reg: smps10 {
+   regulator-name = "smps10";
+   regulator-min-microvolt = <500>;
+   regulator-max-microvolt = <500>;
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   ldo1_reg: ldo1 {
+   regulator-name = "ldo1";
+   regulator-min-microvolt = <280>;
+   regulator-max-microvolt = <280>;
+   regulator-always-on;
+


Re: [PATCH 10/11] cpufreq: tegra: select CPU_FREQ_TABLE for ARCH_TEGRA

2013-06-12 Thread Viresh Kumar
On 12 June 2013 21:20, Stephen Warren  wrote:
> On 06/12/2013 02:15 AM, Viresh Kumar wrote:
>> ARCH_TEGRA selects ARCH_HAS_CPUFREQ, so CPUFREQ will be enabled for all
>> variants of TEGRA. The CPUFreq driver for tegra is enabled if ARCH_TEGRA is
>> selected. The driver uses APIs from freq_table.c and so we must select
>> CPU_FREQ_TABLE for ARCH_TEGRA.
>>
>> This also removes select CPU_FREQ_TABLE from the individual tegra variants.
>
> I guess the real issue here is that drivers/cpufreq/tegra-cpufreq.c gets
> built based on ARCH_TEGRA, which doesn't depend on nor select CPU_FREQ
> itself, so:
>
> select CPU_FREQ_TABLE if CPU_FREQ
>
> ... isn't guaranteed to fire.
>
> The correct solution seems to be:
>
> * Add CONFIG_ARM_TEGRA_CPUFREQ to drivers/cpufreq/Kconfig.arm.
> * Make that Kconfig option select CPU_FREQ_TABLE.
> * Make that Kconfig option be def_bool ARCH_TEGRA.
> * Modify drivers/cpufreq/Makefile to build tegra-cpufreq.c based on that.
> * Remove all the cpufreq-related stuff from arch/arm/mach-tegra/Kconfig.
>
> That way, tegra-cpufreq.c can't be built if !CPU_FREQ, and Tegra's
> cpufreq works the same way as all the other cpufreq drivers.

Hmmm. check this out (attached too for you to test):

--x---x

From: Viresh Kumar 
Date: Wed, 12 Jun 2013 12:05:48 +0530
Subject: [PATCH] cpufreq: tegra: create CONFIG_ARM_TEGRA_CPUFREQ

Currently the Tegra cpufreq driver gets built based on ARCH_TEGRA, which doesn't
depend on nor select CPU_FREQ itself, so:

select CPU_FREQ_TABLE if CPU_FREQ

... isn't guaranteed to fire.

The correct solution seems to be:

* Add CONFIG_ARM_TEGRA_CPUFREQ to drivers/cpufreq/Kconfig.arm.
* Make that Kconfig option select CPU_FREQ_TABLE.
* Make that Kconfig option be def_bool ARCH_TEGRA.
* Modify drivers/cpufreq/Makefile to build tegra-cpufreq.c based on that.
* Remove all the cpufreq-related stuff from arch/arm/mach-tegra/Kconfig.

That way, tegra-cpufreq.c can't be built if !CPU_FREQ, and Tegra's
cpufreq works the same way as all the other cpufreq drivers.

This patch does it.

Suggested-by: Stephen Warren 
Acked-by: Arnd Bergmann 
Signed-off-by: Viresh Kumar 
---
 arch/arm/mach-tegra/Kconfig | 3 ---
 drivers/cpufreq/Kconfig.arm | 8 
 drivers/cpufreq/Makefile| 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-tegra/Kconfig b/arch/arm/mach-tegra/Kconfig
index 84d72fc..5c0db06 100644
--- a/arch/arm/mach-tegra/Kconfig
+++ b/arch/arm/mach-tegra/Kconfig
@@ -28,7 +28,6 @@ config ARCH_TEGRA_2x_SOC
select ARM_ERRATA_754327 if SMP
select ARM_ERRATA_764369 if SMP
select ARM_GIC
-   select CPU_FREQ_TABLE if CPU_FREQ
select CPU_V7
select PINCTRL
select PINCTRL_TEGRA20
@@ -46,7 +45,6 @@ config ARCH_TEGRA_3x_SOC
select ARM_ERRATA_754322
select ARM_ERRATA_764369 if SMP
select ARM_GIC
-   select CPU_FREQ_TABLE if CPU_FREQ
select CPU_V7
select PINCTRL
select PINCTRL_TEGRA30
@@ -63,7 +61,6 @@ config ARCH_TEGRA_114_SOC
select ARM_ARCH_TIMER
select ARM_GIC
select ARM_L1_CACHE_SHIFT_6
-   select CPU_FREQ_TABLE if CPU_FREQ
select CPU_V7
select PINCTRL
select PINCTRL_TEGRA114
diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index d52261b..5085427 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -154,3 +154,11 @@ config ARM_SPEAR_CPUFREQ
default y
help
  This adds the CPUFreq driver support for SPEAr SOCs.
+
+config ARM_TEGRA_CPUFREQ
+   bool "TEGRA CPUFreq support"
+   depends on ARCH_TEGRA
+   select CPU_FREQ_TABLE
+   default y
+   help
+ This adds the CPUFreq driver support for TEGRA SOCs.
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 13c3f83..9c873e7 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_ARM_S5PV210_CPUFREQ) += s5pv210-cpufreq.o
 obj-$(CONFIG_ARM_SA1100_CPUFREQ)   += sa1100-cpufreq.o
 obj-$(CONFIG_ARM_SA1110_CPUFREQ)   += sa1110-cpufreq.o
 obj-$(CONFIG_ARM_SPEAR_CPUFREQ)+= spear-cpufreq.o
-obj-$(CONFIG_ARCH_TEGRA)   += tegra-cpufreq.o
+obj-$(CONFIG_ARM_TEGRA_CPUFREQ)+= tegra-cpufreq.o

 
##
 # PowerPC platform drivers




Re: [RFCv2] security: smack: add a hash table to quicken smk_find_entry()

2013-06-12 Thread Casey Schaufler
On 6/12/2013 6:40 AM, Tomasz Stanislawski wrote:
> Hi Casey,
> Thank you for your review.
> Please refer to comments below.
>
> On 06/12/2013 07:11 AM, Casey Schaufler wrote:
>> On 6/11/2013 5:55 AM, Tomasz Stanislawski wrote:
>>> This patch adds a hash table to quicken searching of a smack label by its 
>>> name.
>>>
>>> Basically, the patch improves performance of SMACK initialization.  Parsing 
>>> of
>>> rules involves translation from a string to a smack_known (aka label) entity
>>> which is done in smk_find_entry().
>> There is one place where this is done, and that is in smk_import_entry().
>>
>> There is another place where labels are looked up, and that is in
>> smack_from_secattr(). Can you apply this enhancement there, too?
>>
> The function mentioned above contains the following check:
>
> if (memcmp(sap->attr.mls.cat,
>  skp->smk_netlabel.attr.mls.cat,
>  SMK_CIPSOLEN) != 0)
>  continue;
>
> What does it mean?

The Smack label is encoded in the CIPSO categories in one
of two ways. When the label is imported (smk_import_entry())
the CIPSO encoding is generated and saved in the smack_known
entry for that label. When a packet is received the CIPSO
categories are not interpreted, instead it's a simple matter
of matching the saved category set in the label list.

> I expect that there was a reason not to use smk_find_entry()
> in smack_from_secattr().

Yes. While the dirty little secret is that for short labels
the CIPSO encoding is the Smack label, you can't count on the
bitwise representation matching because, well, networking code
is just like that.


> Do you want me to introduce a new hash table for searching
> skp->smk_netlabel.attr.mls.cat using up to SMK_CIPSOLEN character
> to produce a hash?

If we're going down the hashing path this would be an obvious
step to take. And we're always looking for ways to improve
network performance.

> If it is true then this change is rather orthogonal to the current patch
> and it can go to a separate patch.

I can go either way.

> Moreover, smack_from_secid() may need some hashing improvement.

Yes. The hash function is going to be pretty trivial. smack_from_secid
is a fortunately rare call.

>
>>> The current implementation of the function iterates over a global list of
>>> smack_known, resulting in O(N) complexity for smk_find_entry().  The total
>>> complexity of SMACK initialization becomes O(rules * labels).  Therefore it
>>> scales quadratically with the complexity of the system.
>>>
>>> Applying the patch reduced the complexity of smk_find_entry() to O(1) as
>>> long as the number of labels is in the hundreds.  If the number of labels
>>> is increased, please update the SMACK_HASH_SLOTS constant defined in
>>> security/smack/smack.h.  Making this constant configurable via Kconfig or
>>> the command line might be a good idea.
>>>
>>> The size of the hash table was adjusted experimentally.  The rule set used
>>> by TIZEN contains circa 17K rules for 500 labels.  The table below contains
>>> results of SMACK initialization using the 'time smackctl apply' bash
>>> command.  'Ref' is a kernel without this patch applied.  The consecutive
>>> columns refer to the value of SMACK_HASH_SLOTS.  Every measurement was
>>> repeated three times to reduce noise.
>>>
>>>      |  Ref  |   1   |   2   |   4   |   8   |   16  |   32  |   64  |  128  |  256  |  512
>>> ------------------------------------------------------------------------------------------
>>> Run1 | 1.156 | 1.096 | 0.883 | 0.764 | 0.692 | 0.667 | 0.649 | 0.633 | 0.634 | 0.629 | 0.620
>>> Run2 | 1.156 | 1.111 | 0.885 | 0.764 | 0.694 | 0.661 | 0.649 | 0.651 | 0.634 | 0.638 | 0.623
>>> Run3 | 1.160 | 1.107 | 0.886 | 0.764 | 0.694 | 0.671 | 0.661 | 0.638 | 0.631 | 0.624 | 0.638
>>> AVG  | 1.157 | 1.105 | 0.885 | 0.764 | 0.693 | 0.666 | 0.653 | 0.641 | 0.633 | 0.630 | 0.627
>>          4%      20%     14%     9%      4%      2%      2%      1%     <1%
>>
>>
>> You get 4% by going to the hlist. The improvement is trivial after 16 slots.
>> If you have 500 labels and 128 slots that's an average of 4 labels per slot.
>> That's an awfully big hash table for so few labels and so little performance
>> gain over what you get with 16 slots. Plus, 500 labels is a huge number of
>> labels. Most Smack systems should be using far fewer than that. 
>>
> Ok. 16 slots might be considered as 'good-enough'.
>
> However, the cost of a single slot in a hash table is only 4 bytes (on ARM).
> For example, using 64 slots instead of 16 costs only 4 * (64 - 16) = 192 bytes.
> It is a very small value in comparison to other structures used in SMACK.
> For example, sizeof(struct smack_rule) == 20. Notice that 32 bytes are consumed
> due to the kmalloc alignment policy (refer to include/linux/kmalloc_sizes.h).
> So the cost is equal to the cost of 6 additional rules.
>
> The p
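The bucketed lookup under discussion can be sketched in plain C as follows. The structures, hash function, and slot count are simplified stand-ins for the actual Smack hlist-based code, chosen only to show why the average-case lookup drops from O(N) to O(1).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a hashed label table: names are bucketed by a
 * string hash modulo a fixed slot count, so finding a label walks only
 * one short chain instead of the whole global list. */
#define HASH_SLOTS 16

struct label {
	const char *name;
	struct label *next;	/* bucket chain; the kernel would use hlist */
};

static struct label *slots[HASH_SLOTS];

static unsigned int label_hash(const char *s)
{
	unsigned int h = 0;

	while (*s)
		h = h * 31 + (unsigned char)*s++;
	return h % HASH_SLOTS;
}

static void label_add(struct label *l)
{
	unsigned int h = label_hash(l->name);

	l->next = slots[h];
	slots[h] = l;
}

static struct label *label_find(const char *name)
{
	struct label *l;

	for (l = slots[label_hash(name)]; l; l = l->next)
		if (strcmp(l->name, name) == 0)
			return l;
	return NULL;
}
```

With ~500 labels and 16 slots, each chain averages around 30 entries; raising the slot count shortens the chains further, which is exactly the diminishing-returns trade-off the benchmark table above measures.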

Re: [PATCH] PCI: Remove not needed check in disable aspm link

2013-06-12 Thread Jiang Liu (Gerry)

Hi Bjorn,
I'm working on several acpiphp-related bugfixes, and feel some
are material for 3.10 too. Actually we have identified four bugs
related to dock station support on the Sony VAIO VPCZ23A4R laptop.
I will try to send out a patchset to address these bugs tonight.
Seems we really need to rethink acpiphp and pciehp now.
Regards!
Gerry
On 2013/6/13 11:50, Bjorn Helgaas wrote:

On Wed, Jun 12, 2013 at 1:41 PM, Yinghai Lu  wrote:

On Wed, Jun 12, 2013 at 10:05 AM, Bjorn Helgaas  wrote:

current code from acpi_pci_root_add we have
1. pci_acpi_scan_root
   ==> pci devices enumeration and bus scanning
   ==> pci_alloc_child_bus
   ==> pcibios_add_bus
   ==> acpi_pci_add_bus
   ==> acpiphp_enumerate_slots
   ==> ... ==> register_slot
       ==> device_is_managed_by_native_pciehp
           ==> check osc_set with OSC_PCI_EXPRESS_NATIVE_HP_CONTROL
2. _OSC set request

so we always have the acpiphp hotplug slot registered first.

so either we need to
A. revert the revert about _OSC
B. move pcibios_add_bus down to pci_bus_add_devices(),
   as acpiphp and the acpi pci slot driver are some kind of drivers for pci_bus
C. A+B


It doesn't surprise me at all that there are problems in the _OSC code
and the acpiphp/pciehp interaction.  That whole area is a complete
disaster.  It'd really be nice if somebody stepped up and reworked it
so it makes sense.

But this report is useless to me.  I don't have time to work out what
the problem is and how it affects users and come up with a fix.


effects: without fixing the problem, users cannot use PCIe native hotplug
if their system's firmware supports both acpiphp and pciehp.
To make it worse, acpiphp has to be built-in, so they have no
way to blacklist acpiphp in the config.



My advice is to simplify the path first, and worry about fixing the
bug afterwards.  We've already done several iterations of fiddling
with things, and I think all we're doing is playing "whack-a-mole" and
pushing the bugs around from one place to another.


We need to address the regression first.
My suggestion is: revert the revert and apply my -v3 version, which will fix
the regression that Roman Yepishev met.

Please check the two attached patches; hope they save you some time.


OK, you're right.  It's not reasonable to do anything more than a
minimal fix when we're at -rc5.

Sigh.  I'll spend tomorrow trying to understand your patches and write
changelogs for you.

I think you're saying that in systems that support both acpiphp and
pciehp, we should be using pciehp, but we currently use acpiphp.  If
so, that's certainly a bug.  How serious is it?  Is it a disaster if
we use acpiphp until we can resolve this cleanly?  Are there a lot of
systems that claim to support acpiphp but it doesn't actually work?

Bjorn

.






RE: [PATCH v4] ARM: dts: OMAP5: Add palmas MFD node and regulator nodes

2013-06-12 Thread J, KEERTHY
Hi Stephen,

> -Original Message-
> From: Stephen Warren [mailto:swar...@wwwdotorg.org]
> Sent: Wednesday, June 12, 2013 9:22 PM
> To: J, KEERTHY
> Cc: Cousson, Benoit; devicetree-disc...@lists.ozlabs.org; linux-
> o...@vger.kernel.org; linux-kernel@vger.kernel.org;
> ldewan...@nvidia.com; grant.lik...@secretlab.ca; swar...@nvidia.com;
> sa...@linux.intel.com; g...@slimlogic.co.uk; lee.jo...@linaro.org
> Subject: Re: [PATCH v4] ARM: dts: OMAP5: Add palmas MFD node and
> regulator nodes
> 
> On 06/12/2013 02:19 AM, J Keerthy wrote:
> > This patch adds Palmas MFD node and the regulator nodes for OMAP5.
> >
> > The node definitions are based on: https://lkml.org/lkml/2013/6/6/25
> >
> > Boot tested on omap5-uevm board.
> 
> Reviewed-by: Stephen Warren 

Thanks..

> 
> > diff --git a/arch/arm/boot/dts/omap5-uevm.dts
> > b/arch/arm/boot/dts/omap5-uevm.dts
> 
> > +   palmas: palmas@48 {
> > +   reg = <0x48>;
> > +   interrupts = ; /* IRQ_SYS_1N
> */
> > +   interrupt-parent = <&gic>;
> > +   compatible = "ti,palmas";
> 
> ... although personally I prefer the compatible property to be right at
> the top of each node, since it's what implies the schema for the rest
> of the node. That's a tiny nit though; feel free to ignore it if the
> OMAP files don't follow that convention.

Sure. I will hopefully send the last version incorporating this :-).

Regards,
Keerthy


[PATCH 01/11] cgroup: remove now unused css_depth()

2013-06-12 Thread Tejun Heo
Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup.c| 12 
 2 files changed, 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d0ad379..5830592 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -848,7 +848,6 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
 
 /* Get id and depth of css */
 unsigned short css_id(struct cgroup_subsys_state *css);
-unsigned short css_depth(struct cgroup_subsys_state *css);
 struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
 
 #else /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3ad2a9c..dc8997a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5230,18 +5230,6 @@ unsigned short css_id(struct cgroup_subsys_state *css)
 }
 EXPORT_SYMBOL_GPL(css_id);
 
-unsigned short css_depth(struct cgroup_subsys_state *css)
-{
-   struct css_id *cssid;
-
-   cssid = rcu_dereference_check(css->id, css_refcnt(css));
-
-   if (cssid)
-   return cssid->depth;
-   return 0;
-}
-EXPORT_SYMBOL_GPL(css_depth);
-
 /**
  *  css_is_ancestor - test "root" css is an ancestor of "child"
  * @child: the css to be tested.
-- 
1.8.2.1



[PATCHSET v2 cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref

2013-06-12 Thread Tejun Heo
Hello,

The changes from the last take[L] are,

* Rebased on top of further percpu-refcount updates.

* 0003: Broken patch description updated.

* 0004: Stupid list_del_init() conversions from the last decade
dropped.

* 0005: Typo fix.

* 0011: percpu_ref_init() error handling fixed.  Premature and
duplicate base css ref puts fixed.

This patchset does a lot of cleanup and then updates css
(cgroup_subsys_state) refcnt to use the new percpu reference counter.
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For high-iops configurations which try to do as much per-cpu as
possible, the frequent global refcnting can be very expensive.

In general, given the various hugely diverse paths css's end up being
used from, we need to make it cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patchset contains the following 11 patches.

 0001-cgroup-remove-now-unused-css_depth.patch
 0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch
 0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch
 0004-cgroup-use-kzalloc-instead-of-kmalloc.patch
 0005-cgroup-clean-up-css_-try-get-and-css_put.patch
 0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch
 0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch
 0008-cgroup-remove-cgroup-count-and-use.patch
 0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch
 0010-cgroup-split-cgroup-destruction-into-two-steps.patch
 0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch

0001-0007 are cleanups, many of them cosmetic.  They clean up the code
paths which are related to the refcnting changes.  If you're only
interested in the percpu usage, you can probably skip these.

0008-0010 update the cgroup destruction path so that percpu refcnting
can be used.

0011 updates css refcnting to use percpu_ref.

This patchset is on top of

  cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base 
cgroup files")
+ [1] percpu/review-percpu-ref-tryget

and available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git 
review-css-percpu-ref

diffstat follows.  Thanks.

 include/linux/cgroup.h |   87 ++---
 kernel/cgroup.c|  723 ++---
 2 files changed, 420 insertions(+), 390 deletions(-)

--
tejun

[L] http://thread.gmane.org/gmane.linux.kernel.containers/26167

[1] http://thread.gmane.org/gmane.linux.kernel/1507258
http://thread.gmane.org/gmane.linux.kernel/1507437
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 02/11] cgroup: consistently use @cset for struct css_set variables

2013-06-12 Thread Tejun Heo
cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.

It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough.  Let's give it some sanity by
uniformly using @cset for all struct css_set variables.

* s/cg/cset/ for all css_set variables.

* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/.  The same for the ones
  prefixed with "new".

* s/cg/cgrp/ for cgroup variables in compare_css_sets().

* s/css/cset/ for the cgroup variable in task_cgroup_from_root().

* Whitespace adjustments.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo 
---
 kernel/cgroup.c | 216 
 1 file changed, 109 insertions(+), 107 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dc8997a..e7f6845 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -376,30 +376,32 @@ static unsigned long css_set_hash(struct 
cgroup_subsys_state *css[])
  * compiled into their kernel but not actually in use */
 static int use_task_css_set_links __read_mostly;
 
-static void __put_css_set(struct css_set *cg, int taskexit)
+static void __put_css_set(struct css_set *cset, int taskexit)
 {
struct cg_cgroup_link *link;
struct cg_cgroup_link *saved_link;
+
/*
 * Ensure that the refcount doesn't hit zero while any readers
 * can see it. Similar to atomic_dec_and_lock(), but for an
 * rwlock
 */
-   if (atomic_add_unless(&cg->refcount, -1, 1))
+   if (atomic_add_unless(&cset->refcount, -1, 1))
return;
write_lock(&css_set_lock);
-   if (!atomic_dec_and_test(&cg->refcount)) {
+   if (!atomic_dec_and_test(&cset->refcount)) {
write_unlock(&css_set_lock);
return;
}
 
/* This css_set is dead. unlink it and release cgroup refcounts */
-   hash_del(&cg->hlist);
+   hash_del(&cset->hlist);
css_set_count--;
 
-   list_for_each_entry_safe(link, saved_link, &cg->cg_links,
+   list_for_each_entry_safe(link, saved_link, &cset->cg_links,
 cg_link_list) {
struct cgroup *cgrp = link->cgrp;
+
list_del(&link->cg_link_list);
list_del(&link->cgrp_link_list);
 
@@ -421,45 +423,45 @@ static void __put_css_set(struct css_set *cg, int 
taskexit)
}
 
write_unlock(&css_set_lock);
-   kfree_rcu(cg, rcu_head);
+   kfree_rcu(cset, rcu_head);
 }
 
 /*
  * refcounted get/put for css_set objects
  */
-static inline void get_css_set(struct css_set *cg)
+static inline void get_css_set(struct css_set *cset)
 {
-   atomic_inc(&cg->refcount);
+   atomic_inc(&cset->refcount);
 }
 
-static inline void put_css_set(struct css_set *cg)
+static inline void put_css_set(struct css_set *cset)
 {
-   __put_css_set(cg, 0);
+   __put_css_set(cset, 0);
 }
 
-static inline void put_css_set_taskexit(struct css_set *cg)
+static inline void put_css_set_taskexit(struct css_set *cset)
 {
-   __put_css_set(cg, 1);
+   __put_css_set(cset, 1);
 }
 
 /*
  * compare_css_sets - helper function for find_existing_css_set().
- * @cg: candidate css_set being tested
- * @old_cg: existing css_set for a task
+ * @cset: candidate css_set being tested
+ * @old_cset: existing css_set for a task
  * @new_cgrp: cgroup that's being entered by the task
  * @template: desired set of css pointers in css_set (pre-calculated)
  *
  * Returns true if "cg" matches "old_cg" except for the hierarchy
  * which "new_cgrp" belongs to, for which it should match "new_cgrp".
  */
-static bool compare_css_sets(struct css_set *cg,
-struct css_set *old_cg,
+static bool compare_css_sets(struct css_set *cset,
+struct css_set *old_cset,
 struct cgroup *new_cgrp,
 struct cgroup_subsys_state *template[])
 {
struct list_head *l1, *l2;
 
-   if (memcmp(template, cg->subsys, sizeof(cg->subsys))) {
+   if (memcmp(template, cset->subsys, sizeof(cset->subsys))) {
/* Not all subsystems matched */
return false;
}
@@ -473,28 +475,28 @@ static bool compare_css_sets(struct css_set *cg,
 * candidates.
 */
 
-   l1 = &cg->cg_links;
-   l2 = &old_cg->cg_links;
+   l1 = &cset->cg_links;
+   l2 = &old_cset->cg_links;
while (1) {
struct cg_cgroup_link *cgl1, *cgl2;
-   struct cgroup *cg1, *cg2;
+   struct cgroup *cgrp1, *cgrp2;
 
l1 = l1->next;
l2 = l2->next;
/* See if we reached the end - both lists are equal length. */
-   if 

[PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link

2013-06-12 Thread Tejun Heo
cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides.  The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.

From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list.  From css_set side, it starts off ->cg_links and
runs through ->cg_link_list.  This is rather reversed, as
cgrp_link_list is used to iterate css_sets and cg_link_list to
iterate cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets.  This patch cleans it up a bit.

* s/cgroup->css_sets/cgroup->cset_links/
  s/css_set->cg_links/css_set->cgrp_links/
  s/cgroup_iter->cg_link/cgroup_iter->cset_link/

* s/cg_cgroup_link/cgrp_cset_link/

* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
  s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
  s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/

* s/init_css_set_link/init_cgrp_cset_link/
  s/free_cg_links/free_cgrp_cset_links/
  s/allocate_cg_links/allocate_cgrp_cset_links/

* s/cgl[12]/link[12]/ in compare_css_sets()

* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple of similar
  adjustments.

* Comment and whitespace adjustments.

After the changes, we have

list_for_each_entry(link, &cont->cset_links, cset_link) {
struct css_set *cset = link->cset;

instead of

list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
struct css_set *cset = link->cg;

This patch is purely cosmetic.

v2: Fix broken sentences in the patch description.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  15 ++--
 kernel/cgroup.c| 226 -
 2 files changed, 120 insertions(+), 121 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5830592..0e32855 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -215,10 +215,10 @@ struct cgroup {
struct cgroupfs_root *root;
 
/*
-* List of cg_cgroup_links pointing at css_sets with
-* tasks in this cgroup. Protected by css_set_lock
+* List of cgrp_cset_links pointing at css_sets with tasks in this
+* cgroup.  Protected by css_set_lock.
 */
-   struct list_head css_sets;
+   struct list_head cset_links;
 
struct list_head allcg_node;/* cgroupfs_root->allcg_list */
struct list_head cft_q_node;/* used during cftype add/rm */
@@ -365,11 +365,10 @@ struct css_set {
struct list_head tasks;
 
/*
-* List of cg_cgroup_link objects on link chains from
-* cgroups referenced from this css_set. Protected by
-* css_set_lock
+* List of cgrp_cset_links pointing at cgroups referenced from this
+* css_set.  Protected by css_set_lock.
 */
-   struct list_head cg_links;
+   struct list_head cgrp_links;
 
/*
 * Set of subsystem states, one for each subsystem. This array
@@ -792,7 +791,7 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup 
*pos,
 
 /* A cgroup_iter should be treated as an opaque object */
 struct cgroup_iter {
-   struct list_head *cg_link;
+   struct list_head *cset_link;
struct list_head *task;
 };
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e7f6845..0b5cd0e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -315,20 +315,24 @@ static void cgroup_release_agent(struct work_struct 
*work);
 static DECLARE_WORK(release_agent_work, cgroup_release_agent);
 static void check_for_release(struct cgroup *cgrp);
 
-/* Link structure for associating css_set objects with cgroups */
-struct cg_cgroup_link {
-   /*
-* List running through cg_cgroup_links associated with a
-* cgroup, anchored on cgroup->css_sets
-*/
-   struct list_head cgrp_link_list;
-   struct cgroup *cgrp;
-   /*
-* List running through cg_cgroup_links pointing at a
-* single css_set object, anchored on css_set->cg_links
-*/
-   struct list_head cg_link_list;
-   struct css_set *cg;
+/*
+ * A cgroup can be associated with multiple css_sets as different tasks may
+ * belong to different cgroups on different hierarchies.  In the other
+ * direction, a css_set is naturally associated with multiple cgroups.
+ * This M:N relationship is represented by the following link structure
+ * which exists for each association and allows traversing the associations
+ * from both sides.
+ */
+struct cgrp_cset_link {
+   /* the cgroup and css_set this link associates */
+   struct cgroup   *cgrp;
+   struct css_set  *cset;
+
+   /* list of cgrp_cset_links anchored at cgrp->cset_links */
+   struct list_headcset_link;
+
+   /* list of cgrp_cset_links anchored at css_set->cgrp_links */
+   struct list_headcgrp_link;
 };
 
 /* The default css_set - used by init 

[PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD

2013-06-12 Thread Tejun Heo
We will add another flag indicating that the cgroup is in the process
of being killed.  REMOVING / REMOVED is more difficult to distinguish
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward.  Also,
later percpu_ref usage will involve "kill"ing the refcnt.

 s/CGRP_REMOVED/CGRP_DEAD/
 s/cgroup_is_removed()/cgroup_is_dead()/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  2 +-
 kernel/cgroup.c| 30 ++
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a494636..c86a93a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -143,7 +143,7 @@ static inline void css_put(struct cgroup_subsys_state *css)
 /* bits in struct cgroup flags field */
 enum {
/* Control Group is dead */
-   CGRP_REMOVED,
+   CGRP_DEAD,
/*
 * Control Group has previously had a child cgroup or a task,
 * but no longer (only if CGRP_NOTIFY_ON_RELEASE is set)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e308ee7..8f5fbcb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -226,9 +226,9 @@ static int css_refcnt(struct cgroup_subsys_state *css)
 }
 
 /* convenient tests for these bits */
-static inline bool cgroup_is_removed(const struct cgroup *cgrp)
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
 {
-   return test_bit(CGRP_REMOVED, &cgrp->flags);
+   return test_bit(CGRP_DEAD, &cgrp->flags);
 }
 
 /**
@@ -300,7 +300,7 @@ static inline struct cftype *__d_cft(struct dentry *dentry)
 static bool cgroup_lock_live_group(struct cgroup *cgrp)
 {
mutex_lock(&cgroup_mutex);
-   if (cgroup_is_removed(cgrp)) {
+   if (cgroup_is_dead(cgrp)) {
mutex_unlock(&cgroup_mutex);
return false;
}
@@ -892,7 +892,7 @@ static void cgroup_diput(struct dentry *dentry, struct 
inode *inode)
if (S_ISDIR(inode->i_mode)) {
struct cgroup *cgrp = dentry->d_fsdata;
 
-   BUG_ON(!(cgroup_is_removed(cgrp)));
+   BUG_ON(!(cgroup_is_dead(cgrp)));
call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
} else {
struct cfent *cfe = __d_cfe(dentry);
@@ -2366,7 +2366,7 @@ static ssize_t cgroup_file_write(struct file *file, const 
char __user *buf,
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 
-   if (cgroup_is_removed(cgrp))
+   if (cgroup_is_dead(cgrp))
return -ENODEV;
if (cft->write)
return cft->write(cgrp, cft, file, buf, nbytes, ppos);
@@ -2411,7 +2411,7 @@ static ssize_t cgroup_file_read(struct file *file, char 
__user *buf,
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);
 
-   if (cgroup_is_removed(cgrp))
+   if (cgroup_is_dead(cgrp))
return -ENODEV;
 
if (cft->read)
@@ -2834,7 +2834,7 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss,
 
mutex_lock(&inode->i_mutex);
mutex_lock(&cgroup_mutex);
-   if (!cgroup_is_removed(cgrp))
+   if (!cgroup_is_dead(cgrp))
cgroup_addrm_files(cgrp, ss, cfts, is_add);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&inode->i_mutex);
@@ -3002,14 +3002,14 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos)
/*
 * @pos could already have been removed.  Once a cgroup is removed,
 * its ->sibling.next is no longer updated when its next sibling
-* changes.  As CGRP_REMOVED is set on removal which is fully
+* changes.  As CGRP_DEAD is set on removal which is fully
 * serialized, if we see it unasserted, it's guaranteed that the
 * next sibling hasn't finished its grace period even if it's
 * already removed, and thus safe to dereference from this RCU
 * critical section.  If ->sibling.next is inaccessible,
-* cgroup_is_removed() is guaranteed to be visible as %true here.
+* cgroup_is_dead() is guaranteed to be visible as %true here.
 */
-   if (likely(!cgroup_is_removed(pos))) {
+   if (likely(!cgroup_is_dead(pos))) {
next = list_entry_rcu(pos->sibling.next, struct cgroup, 
sibling);
if (&next->sibling != &pos->parent->children)
return next;
@@ -4386,7 +4386,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 * attempts fail thus maintaining the removal conditions verified
 * above.
 *
-* Note that CGRP_REMVOED clearing is depended upon by
+* Note that CGRP_DEAD assertion is depended upon by
 * cgroup_next_sibling() to resume iteration after dropping RCU
 * read lock.  See cgroup_next_sibling() for details.
 */
@@ -4

[PATCH 04/11] cgroup: use kzalloc() instead of kmalloc()

2013-06-12 Thread Tejun Heo
There's no point in using kmalloc() instead of the clearing variant
for trivial stuff.  We can live dangerously elsewhere.  Use kzalloc()
instead and drop 0 inits.

While at it, do trivial code reorganization in cgroup_file_open().

This patch doesn't introduce any functional changes.

v2: I was caught in the very distant past where list_del() didn't
poison and the initial version converted list_del()s to
list_del_init()s too.  Li and Kent took me out of the stasis
chamber.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Kent Overstreet 
---
 kernel/cgroup.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0b5cd0e..e308ee7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -597,7 +597,7 @@ static int allocate_cgrp_cset_links(int count, struct 
list_head *tmp_links)
INIT_LIST_HEAD(tmp_links);
 
for (i = 0; i < count; i++) {
-   link = kmalloc(sizeof(*link), GFP_KERNEL);
+   link = kzalloc(sizeof(*link), GFP_KERNEL);
if (!link) {
free_cgrp_cset_links(tmp_links);
return -ENOMEM;
@@ -658,7 +658,7 @@ static struct css_set *find_css_set(struct css_set 
*old_cset,
if (cset)
return cset;
 
-   cset = kmalloc(sizeof(*cset), GFP_KERNEL);
+   cset = kzalloc(sizeof(*cset), GFP_KERNEL);
if (!cset)
return NULL;
 
@@ -2478,10 +2478,12 @@ static int cgroup_file_open(struct inode *inode, struct 
file *file)
cft = __d_cft(file->f_dentry);
 
if (cft->read_map || cft->read_seq_string) {
-   struct cgroup_seqfile_state *state =
-   kzalloc(sizeof(*state), GFP_USER);
+   struct cgroup_seqfile_state *state;
+
+   state = kzalloc(sizeof(*state), GFP_USER);
if (!state)
return -ENOMEM;
+
state->cft = cft;
state->cgroup = __d_cgrp(file->f_dentry->d_parent);
file->f_op = &cgroup_seqfile_operations;
@@ -3514,7 +3516,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct 
cgroup *cgrp,
}
}
/* entry not found; create a new one */
-   l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+   l = kzalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
if (!l) {
mutex_unlock(&cgrp->pidlist_mutex);
return l;
@@ -3523,8 +3525,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct 
cgroup *cgrp,
down_write(&l->mutex);
l->key.type = type;
l->key.ns = get_pid_ns(ns);
-   l->use_count = 0; /* don't increment here */
-   l->list = NULL;
l->owner = cgrp;
list_add(&l->links, &cgrp->pidlists);
mutex_unlock(&cgrp->pidlist_mutex);
-- 
1.8.2.1



[PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set()

2013-06-12 Thread Tejun Heo
__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced.  Given that both
sides - __css_put() and cgroup_destroy_locked() - are cold paths, this
is unnecessary.  Just making cgroup_destroy_locked() grab css_set_lock
while checking @cgrp->count is enough.

Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count.  This will also allow removing @cgrp->count.

Signed-off-by: Tejun Heo 
---
 kernel/cgroup.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8f5fbcb..22f9729 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -407,19 +407,13 @@ static void __put_css_set(struct css_set *cset, int 
taskexit)
list_del(&link->cset_link);
list_del(&link->cgrp_link);
 
-   /*
-* We may not be holding cgroup_mutex, and if cgrp->count is
-* dropped to 0 the cgroup can be destroyed at any time, hence
-* rcu_read_lock is used to keep it alive.
-*/
-   rcu_read_lock();
+   /* @cgrp can't go away while we're holding css_set_lock */
if (atomic_dec_and_test(&cgrp->count) &&
notify_on_release(cgrp)) {
if (taskexit)
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
}
-   rcu_read_unlock();
 
kfree(link);
}
@@ -4373,11 +4367,19 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
struct cgroup *parent = cgrp->parent;
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;
+   bool empty;
 
lockdep_assert_held(&d->d_inode->i_mutex);
lockdep_assert_held(&cgroup_mutex);
 
-   if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children))
+   /*
+* css_set_lock prevents @cgrp from being removed while
+* __put_css_set() is in progress.
+*/
+   read_lock(&css_set_lock);
+   empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children);
+   read_unlock(&css_set_lock);
+   if (!empty)
return -EBUSY;
 
/*
@@ -5054,8 +5056,6 @@ void cgroup_exit(struct task_struct *tsk, int 
run_callbacks)
 
 static void check_for_release(struct cgroup *cgrp)
 {
-   /* All of these checks rely on RCU to keep the cgroup
-* structure alive */
if (cgroup_is_releasable(cgrp) &&
!atomic_read(&cgrp->count) && list_empty(&cgrp->children)) {
/*
-- 
1.8.2.1



[PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked()

2013-06-12 Thread Tejun Heo
This patch reorders the operations in cgroup_destroy_locked() such
that the userland visible parts happen before css offlining and
removal from the ->sibling list.  This will be used to make css use
percpu refcnt.

While at it, split out CGRP_DEAD related comment from the refcnt
deactivation one and correct / clarify how different guarantees are
met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo 
---
 kernel/cgroup.c | 61 +
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 062653f..9261acc 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4382,13 +4382,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
/*
 * Block new css_tryget() by deactivating refcnt and mark @cgrp
-* removed.  This makes future css_tryget() and child creation
-* attempts fail thus maintaining the removal conditions verified
-* above.
-*
-* Note that CGRP_DEAD assertion is depended upon by
-* cgroup_next_sibling() to resume iteration after dropping RCU
-* read lock.  See cgroup_next_sibling() for details.
+* removed.  This makes future css_tryget() attempts fail which we
+* guarantee to ->css_offline() callbacks.
 */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4396,8 +4391,41 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
WARN_ON(atomic_read(&css->refcnt) < 0);
atomic_add(CSS_DEACT_BIAS, &css->refcnt);
}
+
+   /*
+* Mark @cgrp dead.  This prevents further task migration and child
+* creation by disabling cgroup_lock_live_group().  Note that
+* CGRP_DEAD assertion is depended upon by cgroup_next_sibling() to
+* resume iteration after dropping RCU read lock.  See
+* cgroup_next_sibling() for details.
+*/
set_bit(CGRP_DEAD, &cgrp->flags);
 
+   /* CGRP_DEAD is set, remove from ->release_list for the last time */
+   raw_spin_lock(&release_list_lock);
+   if (!list_empty(&cgrp->release_list))
+   list_del_init(&cgrp->release_list);
+   raw_spin_unlock(&release_list_lock);
+
+   /*
+* Remove @cgrp directory.  The removal puts the base ref but we
+* aren't quite done with @cgrp yet, so hold onto it.
+*/
+   dget(d);
+   cgroup_d_remove_dir(d);
+
+   /*
+* Unregister events and notify userspace.
+* Notify userspace about cgroup removing only after rmdir of cgroup
+* directory to avoid race between userspace and kernelspace.
+*/
+   spin_lock(&cgrp->event_list_lock);
+   list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+   list_del_init(&event->list);
+   schedule_work(&event->remove);
+   }
+   spin_unlock(&cgrp->event_list_lock);
+
/* tell subsystems to initate destruction */
for_each_subsys(cgrp->root, ss)
offline_css(ss, cgrp);
@@ -4412,34 +4440,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
for_each_subsys(cgrp->root, ss)
css_put(cgrp->subsys[ss->subsys_id]);
 
-   raw_spin_lock(&release_list_lock);
-   if (!list_empty(&cgrp->release_list))
-   list_del_init(&cgrp->release_list);
-   raw_spin_unlock(&release_list_lock);
-
/* delete this cgroup from parent->children */
list_del_rcu(&cgrp->sibling);
list_del_init(&cgrp->allcg_node);
 
-   dget(d);
-   cgroup_d_remove_dir(d);
dput(d);
 
set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);
 
-   /*
-* Unregister events and notify userspace.
-* Notify userspace about cgroup removing only after rmdir of cgroup
-* directory to avoid race between userspace and kernelspace.
-*/
-   spin_lock(&cgrp->event_list_lock);
-   list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
-   list_del_init(&event->list);
-   schedule_work(&event->remove);
-   }
-   spin_unlock(&cgrp->event_list_lock);
-
return 0;
 }
 
-- 
1.8.2.1



[PATCH 10/11] cgroup: split cgroup destruction into two steps

2013-06-12 Thread Tejun Heo
Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn() which is executed from a work item.  The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating release notification to
the parent.  The separation is to allow using percpu refcnt for css.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name.  As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine.  A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both.  This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes.  INIT_WORK() is now performed right before queueing the work
item.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  2 +-
 kernel/cgroup.c| 38 +++---
 2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81bfd02..e345d8b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -233,7 +233,7 @@ struct cgroup {
 
/* For RCU-protected deletion */
struct rcu_head rcu_head;
-   struct work_struct free_work;
+   struct work_struct destroy_work;
 
/* List of events which userspace want to receive */
struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9261acc..ebbfc04 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -208,6 +208,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" 
};
  */
 static int need_forkexit_callback __read_mostly;
 
+static void cgroup_offline_fn(struct work_struct *work);
 static int cgroup_destroy_locked(struct cgroup *cgrp);
 static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys 
*subsys,
  struct cftype cfts[], bool is_add);
@@ -830,7 +831,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry 
*dentry)
 
 static void cgroup_free_fn(struct work_struct *work)
 {
-   struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+   struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
struct cgroup_subsys *ss;
 
mutex_lock(&cgroup_mutex);
@@ -875,7 +876,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
 {
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);
 
-   schedule_work(&cgrp->free_work);
+   INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+   schedule_work(&cgrp->destroy_work);
 }
 
 static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1407,7 +1409,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->allcg_node);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
-   INIT_WORK(&cgrp->free_work, cgroup_free_fn);
mutex_init(&cgrp->pidlist_mutex);
INIT_LIST_HEAD(&cgrp->event_list);
spin_lock_init(&cgrp->event_list_lock);
@@ -2994,12 +2995,13 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos)
/*
 * @pos could already have been removed.  Once a cgroup is removed,
 * its ->sibling.next is no longer updated when its next sibling
-* changes.  As CGRP_DEAD is set on removal which is fully
-* serialized, if we see it unasserted, it's guaranteed that the
-* next sibling hasn't finished its grace period even if it's
-* already removed, and thus safe to dereference from this RCU
-* critical section.  If ->sibling.next is inaccessible,
-* cgroup_is_dead() is guaranteed to be visible as %true here.
+* changes.  As CGRP_DEAD assertion is serialized and happens
+* before the cgroup is taken off the ->sibling list, if we see it
+* unasserted, it's guaranteed that the next sibling hasn't
+* finished its grace period even if it's already removed, and thus
+* safe to dereference from this RCU critical section.  If
+* ->sibling.next is inaccessible, cgroup_is_dead() is guaranteed
+* to be visible as %true here.
 */
if (likely(!cgroup_is_dead(pos))) {
next = list_entry_rcu(pos->sibling.next, struct cgroup, 
sibling);
@@ -4362,7 +4364,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
 {
struct dentry *d = cgrp->dentry;
-   struct cgroup *parent = cgrp->parent;
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;
bool empty;
@@ -4426,6 +4427,21 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
}
spin_unlock(&cgrp

[PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states

2013-06-12 Thread Tejun Heo
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller.  As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have
significant adverse impact on scalability.  For example, css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For high-throughput configurations which try to do as much per-cpu as
possible, the frequent global refcnting can be very expensive.

In general, given the hugely diverse paths css's end up being used
from, refcnting needs to be cheap and highly scalable.  In its
usage, css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref.  css_get/tryget/put() directly maps to the matching
percpu_ref operations and the deactivation logic is no longer
necessary as percpu_ref already has refcnt killing.

The only complication is that as the refcnt is per-cpu,
percpu_ref_kill() in itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking
->css_offline()'s.  This is resolved by collecting kill confirmation
using percpu_ref_kill_and_confirm() and initiating the offline phase
of destruction after all css refcnt's are confirmed to be seen as
killed on all CPUs.  The previous patches already split destruction
into two phases, so percpu_ref_kill_and_confirm() can be hooked up
easily.

This patch removes css_refcnt() which is used for rcu dereference
sanity check in css_id().  While we can add a percpu refcnt API to ask
the same question, css_id() itself is scheduled to be removed fairly
soon, so let's not bother with it.  Just drop the sanity check and use
rcu_dereference_raw() instead.

v2: - init_cgroup_css() was calling percpu_ref_init() without checking
  the return value.  This causes two problems - the obvious lack
  of error handling and percpu_ref_init() being called from
  cgroup_init_subsys() before the allocators are up, which
  triggers warnings but doesn't cause actual problems as the
  refcnt isn't used for roots anyway.  Fix both by moving
  percpu_ref_init() to cgroup_create().

- The base references were put too early by
  percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
  refs one extra time.  This wasn't noticeable because css's go
  through another RCU grace period before being freed.  Update
  cgroup_destroy_locked() to grab an extra reference before
  killing the refcnts.  This problem was noticed by Kent.

Signed-off-by: Tejun Heo 
Cc: Michal Hocko 
Cc: Mike Snitzer 
Cc: Vivek Goyal 
Cc: "Alasdair G. Kergon" 
Cc: Jens Axboe 
Cc: Mikulas Patocka 
Cc: Glauber Costa 
Cc: Kent Overstreet 
---
 include/linux/cgroup.h |  23 +++
 kernel/cgroup.c| 165 +++--
 2 files changed, 112 insertions(+), 76 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e345d8b..b7bd4be 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -72,13 +73,8 @@ struct cgroup_subsys_state {
 */
struct cgroup *cgroup;
 
-   /*
-* State maintained by the cgroup system to allow subsystems
-* to be "busy". Should be accessed via css_get(),
-* css_tryget() and css_put().
-*/
-
-   atomic_t refcnt;
+   /* reference count - access via css_[try]get() and css_put() */
+   struct percpu_ref refcnt;
 
unsigned long flags;
/* ID for this css, if possible */
@@ -104,11 +100,9 @@ static inline void css_get(struct cgroup_subsys_state *css)
 {
/* We don't need to reference count the root state */
if (!(css->flags & CSS_ROOT))
-   atomic_inc(&css->refcnt);
+   percpu_ref_get(&css->refcnt);
 }
 
-extern bool __css_tryget(struct cgroup_subsys_state *css);
-
 /**
  * css_tryget - try to obtain a reference on the specified css
  * @css: target css
@@ -123,11 +117,9 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
if (css->flags & CSS_ROOT)
return true;
-   return __css_tryget(css);
+   return percpu_ref_tryget(&css->refcnt);
 }
 
-extern void __css_put(struct cgroup_subsys_state *css);
-
 /**
  * css_put - put a css reference
  * @css: target css
@@ -137,7 +129,7 @@ extern void __css_put(struct cgroup_subsys_state *css);
 static inline void css_put(struct cgroup_subsys_state *css)
 {
if (!(css->flags & CSS_ROOT))
-   __css_put(css);
+   percpu_ref_put(&css->refcnt);
 }
 
 /* bits in struct cgroup flags field */
@@ -231,9 +223,10 @@ struct cgroup {
struct list_head pidlists;
struct mutex pidlist_mutex;
 
-   /* For RCU-protec

[PATCH 08/11] cgroup: remove cgroup->count and use

2013-06-12 Thread Tejun Heo
cgroup->count tracks the number of css_sets associated with the cgroup
and is used only to verify that no css_set is associated when the cgroup
is being destroyed.  It's superfluous, as the destruction path can
simply check whether cgroup->cset_links is empty instead.

Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  6 --
 kernel/cgroup.c        | 21 +
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c86a93a..81bfd02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -169,12 +169,6 @@ struct cgroup_name {
 struct cgroup {
unsigned long flags;/* "unsigned long" so bitops work */
 
-   /*
-* count users of this cgroup. >0 means busy, but doesn't
-* necessarily indicate the number of tasks in the cgroup
-*/
-   atomic_t count;
-
int id; /* ida allocated in-hierarchy ID */
 
/*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 22f9729..062653f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -408,8 +408,7 @@ static void __put_css_set(struct css_set *cset, int taskexit)
list_del(&link->cgrp_link);
 
/* @cgrp can't go away while we're holding css_set_lock */
-   if (atomic_dec_and_test(&cgrp->count) &&
-   notify_on_release(cgrp)) {
+   if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
if (taskexit)
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
@@ -616,7 +615,6 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
link->cset = cset;
link->cgrp = cgrp;
-   atomic_inc(&cgrp->count);
list_move(&link->cset_link, &cgrp->cset_links);
/*
 * Always add links to the tail of the list so that the list
@@ -4373,11 +4371,11 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
lockdep_assert_held(&cgroup_mutex);
 
/*
-* css_set_lock prevents @cgrp from being removed while
-* __put_css_set() is in progress.
+* css_set_lock synchronizes access to ->cset_links and prevents
+* @cgrp from being removed while __put_css_set() is in progress.
 */
read_lock(&css_set_lock);
-   empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children);
+   empty = list_empty(&cgrp->cset_links) && list_empty(&cgrp->children);
read_unlock(&css_set_lock);
if (!empty)
return -EBUSY;
@@ -5057,7 +5055,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
 static void check_for_release(struct cgroup *cgrp)
 {
if (cgroup_is_releasable(cgrp) &&
-   !atomic_read(&cgrp->count) && list_empty(&cgrp->children)) {
+   list_empty(&cgrp->cset_links) && list_empty(&cgrp->children)) {
/*
 * Control Group is currently removeable. If it's not
 * already queued for a userspace notification, queue
@@ -5425,11 +5423,6 @@ static void debug_css_free(struct cgroup *cont)
kfree(cont->subsys[debug_subsys_id]);
 }
 
-static u64 cgroup_refcount_read(struct cgroup *cont, struct cftype *cft)
-{
-   return atomic_read(&cont->count);
-}
-
 static u64 debug_taskcount_read(struct cgroup *cont, struct cftype *cft)
 {
return cgroup_task_count(cont);
@@ -5511,10 +5504,6 @@ static u64 releasable_read(struct cgroup *cgrp, struct cftype *cft)
 
 static struct cftype debug_files[] =  {
{
-   .name = "cgroup_refcount",
-   .read_u64 = cgroup_refcount_read,
-   },
-   {
.name = "taskcount",
.read_u64 = debug_taskcount_read,
},
-- 
1.8.2.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 05/11] cgroup: clean up css_[try]get() and css_put()

2013-06-12 Thread Tejun Heo
* __css_get() isn't used by anyone.  Fold it into css_get().

* Add proper function comments to all css reference functions.

This patch is purely cosmetic.

v2: Typo fix as per Li.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
---
 include/linux/cgroup.h | 48 
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0e32855..a494636 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -94,33 +94,31 @@ enum {
CSS_ONLINE  = (1 << 1), /* between ->css_online() and ->css_offline() */
 };
 
-/* Caller must verify that the css is not for root cgroup */
-static inline void __css_get(struct cgroup_subsys_state *css, int count)
-{
-   atomic_add(count, &css->refcnt);
-}
-
-/*
- * Call css_get() to hold a reference on the css; it can be used
- * for a reference obtained via:
- * - an existing ref-counted reference to the css
- * - task->cgroups for a locked task
+/**
+ * css_get - obtain a reference on the specified css
+ * @css: target css
+ *
+ * The caller must already have a reference.
  */
-
 static inline void css_get(struct cgroup_subsys_state *css)
 {
/* We don't need to reference count the root state */
if (!(css->flags & CSS_ROOT))
-   __css_get(css, 1);
+   atomic_inc(&css->refcnt);
 }
 
-/*
- * Call css_tryget() to take a reference on a css if your existing
- * (known-valid) reference isn't already ref-counted. Returns false if
- * the css has been destroyed.
- */
-
 extern bool __css_tryget(struct cgroup_subsys_state *css);
+
+/**
+ * css_tryget - try to obtain a reference on the specified css
+ * @css: target css
+ *
+ * Obtain a reference on @css if it's alive.  The caller naturally needs to
+ * ensure that @css is accessible but doesn't have to be holding a
+ * reference on it - IOW, RCU protected access is good enough for this
+ * function.  Returns %true if a reference count was successfully obtained;
+ * %false otherwise.
+ */
 static inline bool css_tryget(struct cgroup_subsys_state *css)
 {
if (css->flags & CSS_ROOT)
@@ -128,12 +126,14 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
return __css_tryget(css);
 }
 
-/*
- * css_put() should be called to release a reference taken by
- * css_get() or css_tryget()
- */
-
 extern void __css_put(struct cgroup_subsys_state *css);
+
+/**
+ * css_put - put a css reference
+ * @css: target css
+ *
+ * Put a reference obtained via css_get() and css_tryget().
+ */
 static inline void css_put(struct cgroup_subsys_state *css)
 {
if (!(css->flags & CSS_ROOT))
-- 
1.8.2.1



Re: [PATCH v5 3/5] vmcore: Introduce remap_oldmem_pfn_range()

2013-06-12 Thread HATAYAMA Daisuke

(2013/06/12 18:13), Michael Holzheu wrote:

On Tue, 11 Jun 2013 21:42:15 +0900
HATAYAMA Daisuke  wrote:


2013/6/11 Michael Holzheu :

On Mon, 10 Jun 2013 22:40:24 +0900
HATAYAMA Daisuke  wrote:


2013/6/8 Michael Holzheu :




Thanks for that hint! So together with your other comment regarding
error checking for __read_vmcore() the function would look like the
following:

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
 struct address_space *mapping = vma->vm_private_data;
 pgoff_t index = vmf->pgoff;
 struct page *page;
 loff_t src;
 char *buf;

 page = find_or_create_page(mapping, index, GFP_KERNEL);
 if (!page)
 return VM_FAULT_OOM;
 if (!PageUptodate(page)) {
 src = index << PAGE_CACHE_SHIFT;
 buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
 if (__read_vmcore(buf, PAGE_SIZE, &src, 0) < 0) {
 unlock_page(page);
 return VM_FAULT_SIGBUS;
 }
 SetPageUptodate(page);
 }
 unlock_page(page);
 vmf->page = page;
 return 0;
}

Perhaps one open issue remains:

Can we remove the page from the page cache if __read_vmcore() fails?


Sorry, I overlooked the case where __read_vmcore() returns -ENOMEM, since it uses 
ioremap() internally, in which page table allocation can happen. The fault handler 
should return VM_FAULT_OOM in that case.

--
Thanks.
HATAYAMA, Daisuke



Re: [PATCH percpu/for-3.11 2/2] percpu-refcount: implement percpu_ref_cancel_init()

2013-06-12 Thread Kent Overstreet
On Wed, Jun 12, 2013 at 08:58:31PM -0700, Tejun Heo wrote:
> At first I named it percpu_ref_free() but it looked too symmetric to
> init, more so than kill, so I was worried that people might get
> confused that this is the normal interface to shutdown a percpu
> refcnt, so the weird cancel_init name and further restriction on its
> usage.

...Yeah, confusion with _kill() is a good point. Ok, cancel_init() it
is.


Re: [PATCH percpu/for-3.11 2/2] percpu-refcount: implement percpu_ref_cancel_init()

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 08:56:36PM -0700, Kent Overstreet wrote:
> On Wed, Jun 12, 2013 at 08:52:35PM -0700, Tejun Heo wrote:
> > Normally, percpu_ref_init() initializes and percpu_ref_kill*()
> > initiates destruction which completes asynchronously.  The
> > asynchronous destruction can be problematic in init failure path where
> > the caller wants to destroy half-constructed object - distinguishing
> > half-constructed objects from the usual release method can be painful
> > for complex objects.
> > 
> > This patch implements percpu_ref_cancel_init() which synchronously
> > destroys the percpu_ref without invoking release.  To avoid
> > unintentional misuses, the function requires the ref to have finished
> > percpu_ref_init() but never used and triggers WARN otherwise.
> 
> That's a good idea, I should've implemented that for aio.
> 
> I probably would've just gone with percpu_ref_free() (if caller knows
> it's safe, they can do whatever they want) but I suppose I can live with
> percpu_ref_cancel_init().

At first I named it percpu_ref_free() but it looked too symmetric to
init, more so than kill, so I was worried that people might get
confused that this is the normal interface to shutdown a percpu
refcnt, so the weird cancel_init name and further restriction on its
usage.

Thanks.

-- 
tejun


Re: [PATCH percpu/for-3.11 2/2] percpu-refcount: implement percpu_ref_cancel_init()

2013-06-12 Thread Kent Overstreet
On Wed, Jun 12, 2013 at 08:52:35PM -0700, Tejun Heo wrote:
> Normally, percpu_ref_init() initializes and percpu_ref_kill*()
> initiates destruction which completes asynchronously.  The
> asynchronous destruction can be problematic in init failure path where
> the caller wants to destroy half-constructed object - distinguishing
> half-constructed objects from the usual release method can be painful
> for complex objects.
> 
> This patch implements percpu_ref_cancel_init() which synchronously
> destroys the percpu_ref without invoking release.  To avoid
> unintentional misuses, the function requires the ref to have finished
> percpu_ref_init() but never used and triggers WARN otherwise.

That's a good idea, I should've implemented that for aio.

I probably would've just gone with percpu_ref_free() (if caller knows
it's safe, they can do whatever they want) but I suppose I can live with
percpu_ref_cancel_init().

Acked-by: Kent Overstreet 


Re: [PATCH] Percpu tag allocator

2013-06-12 Thread Kent Overstreet
On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet  wrote:
> 
> > ...
> >
> > > Why can't we use ida_get_new_above()?
> > > 
> > >If it is "speed" then how bad is the problem and what efforts have
> > >been made to address them within the idr code?  (A per-cpu magazine
> > >frontend to ida_get_new_above() might suit).
> > > 
> > >If it is "functionality" then what efforts have been made to
> > >suitably modify the ida code?
> > 
> > Originally, it was because every time I opened idr.[ch] I was confronted
> > with an enormous pile of little functions, with very little indication
> > in the way of what they were all trying to do or which ones I might want
> > to start with.
> > 
> > Having finally read enough of the code to maybe get a handle on what
> > it's about - performance is a large part of it, but idr is also a more
> > complicated api that does more than what I wanted.
> 
> They all sound like pretty crappy reasons ;) If the idr/ida interface
> is nasty then it can be wrapped to provide the same interface as the
> percpu tag allocator.

Well, the way I see it right now, idr and this tag allocator are doing
two somewhat different things and I'm really not sure how one piece of
code could do both jobs.

Supposing idr could be made percpu and just as fast as my tag allocator:
the problem is, using idr where right now I'd use the tag allocator
forces you to do two allocations instead of one:

First, you allocate your idr slot - but now that has to point to
something, so you need a kmem_cache_alloc() too. Which is probably the
way to go if you don't want to preallocate everything, but for the kinds
of things I'm using the tag allocator for preallocating tag structs is
fine but that second allocation would really hurt.

So, suppose we reimplement idr on top of the tag allocator: you have a
parallel array (or array type thing) of pointers, that lets you map
integers -> pointers - that gets you the current idr functionality
(minus whatever I've missed about it).

But with idr you don't have to specify the maximum integer/tag up front
- it enlarges the mapping as needed, which is great but it's a fair
amount of complexity. So now we've got two parallel data structures -
the tag allocator, and this extensible array thing - idr, if I'm not
mistaken, rolls the allocation stuff into the extensible array data
structure.

For users of idr who care about scalability this might be the way to go,
but I'm sure there are plenty who don't, and this would be a (perhaps
small) regression in memory and runtime overhead.

So right now at least I honestly think letting the tag allocator and idr
be distinct things is probably the way to go.

> I could understand performance being an issue, but diligence demands
> that we test that, or at least provide a convincing argument.

Well, idr is going to be a lot heavier than the atomic bit vector this
tag allocator originally replaced in driver code, and that was a very
significant performance improvement.

(now someone will point out to me all the fancy percpu optimizations in
idr I missed).


> The nice thing about a lock per cpu is that the stealing CPU can grab
> it and then steal a bunch of tags without releasing the lock: less
> lock-taking.  Maybe the current code does that amortisation as well;
> my intent-reverse-engineering resources got depleted.

Yeah, the current code does do that.

> Also, with a lock-per-cpu the stolen-from CPU just spins, so the ugly
> TAG_CPU_STEALING fallback-to-global-allocator thing goes away.
> 
> > > Also, I wonder if this was all done in the incorrect order.  Why make
> > > alloc_local_tag() fail?  steal_tags() could have just noodled off and
> > > tried to steal from the next CPU if alloc_local_tag() was in progress
> > > on this CPU?
> > 
> > steal_tags() can't notice that alloc_local_tag() is in progress,
> 
> Yes it can - the stolen-from CPU sets a flag in its cpu-local object
> while in its critical section.  The stealing CPU sees that and skips.
> 
> I think all this could be done with test_and_set_bit() and friends,
> btw.  xchg() hurts my brain.

Ahh, you were talking about with a slightly bigger rework.

It's not xchg/cmpxchg that hurts my brain (I've used cmpxchg once or
twice, just for keeping some statistics up to date - something slightly
more interesting than a counter) - it's when there are pointers involved
and insane ordering that my brain melts.

ABA. Feh.

> > You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> > allocation, I suppose.
> > 
> > Did you have something else in mind that could be implemented? I don't
> > want to add code for a reserve to this code - IMO, if a reserve is
> > needed it should be done elsewhere (like how mempools work on top of
> > existing allocators).
> 
> Dunno, really - I'm just wondering what the implications of an
> allocation failure will be.  I suspect it's EIO?  Which in som

[PATCH percpu/for-3.11 2/2] percpu-refcount: implement percpu_ref_cancel_init()

2013-06-12 Thread Tejun Heo
Normally, percpu_ref_init() initializes and percpu_ref_kill*()
initiates destruction, which completes asynchronously.  The
asynchronous destruction can be problematic in the init failure path,
where the caller wants to destroy a half-constructed object -
distinguishing half-constructed objects from the usual release method
can be painful for complex objects.

This patch implements percpu_ref_cancel_init(), which synchronously
destroys the percpu_ref without invoking release.  To avoid
unintentional misuse, the function requires the ref to have finished
percpu_ref_init() but never been used, and triggers a WARN otherwise.

Signed-off-by: Tejun Heo 
---
 include/linux/percpu-refcount.h |    1 +
 lib/percpu-refcount.c           |   27 +++
 2 files changed, 28 insertions(+)

--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -69,6 +69,7 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 percpu_ref_func_t *release);
+void percpu_ref_cancel_init(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
 
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -54,6 +54,33 @@ int percpu_ref_init(struct percpu_ref *r
return 0;
 }
 
+/**
+ * percpu_ref_cancel_init - cancel percpu_ref_init()
+ * @ref: percpu_ref to cancel init for
+ *
+ * Once a percpu_ref is initialized, its destruction is initiated by
+ * percpu_ref_kill*() and completes asynchronously, which can be painful to
+ * do when destroying a half-constructed object in init failure path.
+ *
+ * This function destroys @ref without invoking @ref->release and the
+ * memory area containing it can be freed immediately on return.  To
+ * prevent accidental misuse, it's required that @ref has finished
+ * percpu_ref_init(), whether successful or not, but never used.
+ */
+void percpu_ref_cancel_init(struct percpu_ref *ref)
+{
+   unsigned __percpu *pcpu_count = ref->pcpu_count;
+   int cpu;
+
+   WARN_ON_ONCE(atomic_read(&ref->count) != 1 + PCPU_COUNT_BIAS);
+
+   if (pcpu_count) {
+   for_each_possible_cpu(cpu)
+   WARN_ON_ONCE(*per_cpu_ptr(pcpu_count, cpu));
+   free_percpu(ref->pcpu_count);
+   }
+}
+
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);


[PATCH percpu/for-3.11 1/2] percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()

2013-06-12 Thread Tejun Heo
Two small changes.

* Unlike most init functions, percpu_ref_init() allocates memory and
  may fail.  Let's mark it with __must_check in case the caller
  forgets.

* percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
  dereference @ref->pcpu_count, which can be misleading.  The pointer
  is guaranteed to be valid and visible and can't change underneath
  the function.  Drop ACCESS_ONCE().

Signed-off-by: Tejun Heo 
---
 include/linux/percpu-refcount.h |    3 ++-
 lib/percpu-refcount.c           |    4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -67,7 +67,8 @@ struct percpu_ref {
struct rcu_head rcu;
 };
 
-int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
+int __must_check percpu_ref_init(struct percpu_ref *ref,
+percpu_ref_func_t *release);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
 
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -57,12 +57,10 @@ int percpu_ref_init(struct percpu_ref *r
 static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 {
struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
-   unsigned __percpu *pcpu_count;
+   unsigned __percpu *pcpu_count = ref->pcpu_count;
unsigned count = 0;
int cpu;
 
-   pcpu_count = ACCESS_ONCE(ref->pcpu_count);
-
/* Mask out PCPU_REF_DEAD */
pcpu_count = (unsigned __percpu *)
(((unsigned long) pcpu_count) & ~PCPU_STATUS_MASK);


Re: [PATCH] PCI: Remove not needed check in disable aspm link

2013-06-12 Thread Bjorn Helgaas
On Wed, Jun 12, 2013 at 1:41 PM, Yinghai Lu  wrote:
> On Wed, Jun 12, 2013 at 10:05 AM, Bjorn Helgaas  wrote:
>>> current code from acpi_pci_root_add we have
>>>   1. pci_acpi_scan_root
>>>  ==> pci devices enumeration and bus scanning.
>>>  ==> pci_alloc_child_bus
>>>  ==> pcibios_add_bus
>>>  ==> acpi_pci_add_bus
>>>  ==> acpiphp_enumerate_slots
>>>  ==> ...==> register_slot
>>>   ==> device_is_managed_by_native_pciehp
>>> ==> check osc_set with
>>> OSC_PCI_EXPRESS_NATIVE_HP_CONTROL
>>>2. _OSC set request
>>>
>>> so we always have acpiphp hotplug slot registered at first.
>>>
>>> so either we need to
>>> A. revert reverting about _OSC
>>> B. move pcibios_add_bus down to pci_bus_add_devices()
>>> as acpiphp and the acpi pci slot driver are both drivers for pci_bus
>>> C. A+B
>>
>> It doesn't surprise me at all that there are problems in the _OSC code
>> and the acpiphp/pciehp interaction.  That whole area is a complete
>> disaster.  It'd really be nice if somebody stepped up and reworked it
>> so it makes sense.
>>
>> But this report is useless to me.  I don't have time to work out what
>> the problem is and how it affects users and come up with a fix.
>
> effects: without fixing the problem, users cannot use PCIe native hotplug
> if their system's firmware supports both acpiphp and pciehp.
> To make it worse, acpiphp has to be built-in, so they have no
> way to blacklist acpiphp in the config.
>
>>
>> My advice is to simplify the path first, and worry about fixing the
>> bug afterwards.  We've already done several iterations of fiddling
>> with things, and I think all we're doing is playing "whack-a-mole" and
>> pushing the bugs around from one place to another.
>
> We need to address the regression first.
> My suggestion is: revert the revert and apply my -v3 version, which will fix
> the regression that Roman Yepishev hit.
>
> please check attached two patches, hope it could save your some time.

OK, you're right.  It's not reasonable to do anything more than a
minimal fix when we're at -rc5.

Sigh.  I'll spend tomorrow trying to understand your patches and write
changelogs for you.

I think you're saying that in systems that support both acpiphp and
pciehp, we should be using pciehp, but we currently use acpiphp.  If
so, that's certainly a bug.  How serious is it?  Is it a disaster if
we use acpiphp until we can resolve this cleanly?  Are there a lot of
systems that claim to support acpiphp but it doesn't actually work?

Bjorn


[PATCH v2 2/2] percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()

2013-06-12 Thread Tejun Heo
Implement percpu_ref_tryget(), which stops giving out references once
the percpu_ref is visible as killed.  Because the refcnt is per-cpu,
different CPUs will start to see the refcnt as killed at different
points in time, and tryget() may continue to succeed on a subset of
CPUs for a while after percpu_ref_kill() returns.

For use cases where it's necessary to know when all CPUs start to see
the refcnt as dead, percpu_ref_kill_and_confirm() is added.  The new
function takes an extra argument @confirm_kill which is invoked when
the refcnt is guaranteed to be viewed as killed on all CPUs.

While this isn't the prettiest interface, it doesn't force synchronous
wait and is much safer than requiring the caller to do its own
call_rcu().

v2: Patch description rephrased to emphasize that tryget() may
continue to succeed on some CPUs after kill() returns as suggested
by Kent.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h |   50 +++-
 lib/percpu-refcount.c   |   23 +-
 2 files changed, 66 insertions(+), 7 deletions(-)

--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -63,11 +63,28 @@ struct percpu_ref {
 */
unsigned __percpu   *pcpu_count;
percpu_ref_func_t   *release;
+   percpu_ref_func_t   *confirm_kill;
struct rcu_head rcu;
 };
 
 int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release);
-void percpu_ref_kill(struct percpu_ref *ref);
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+percpu_ref_func_t *confirm_kill);
+
+/**
+ * percpu_ref_kill - drop the initial ref
+ * @ref: percpu_ref to kill
+ *
+ * Must be used to drop the initial ref on a percpu refcount; must be called
+ * precisely once before shutdown.
+ *
+ * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
+ * percpu counters and dropping the initial ref.
+ */
+static inline void percpu_ref_kill(struct percpu_ref *ref)
+{
+   return percpu_ref_kill_and_confirm(ref, NULL);
+}
 
 #define PCPU_STATUS_BITS   2
 #define PCPU_STATUS_MASK   ((1 << PCPU_STATUS_BITS) - 1)
@@ -99,6 +116,37 @@ static inline void percpu_ref_get(struct
 }
 
 /**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless it has already been killed.  Returns
+ * %true on success; %false on failure.
+ *
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
+ * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
+ * used.  After the confirm_kill callback is invoked, it's guaranteed that
+ * no new reference will be given out by percpu_ref_tryget().
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+   unsigned __percpu *pcpu_count;
+   int ret = false;
+
+   rcu_read_lock();
+
+   pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+   if (likely(REF_STATUS(pcpu_count) == PCPU_REF_PTR)) {
+   __this_cpu_inc(*pcpu_count);
+   ret = true;
+   }
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
+/**
  * percpu_ref_put - decrement a percpu refcount
  * @ref: percpu_ref to put
  *
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -89,6 +89,10 @@ static void percpu_ref_kill_rcu(struct r
 
atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
 
+   /* @ref is viewed as dead on all CPUs, send out kill confirmation */
+   if (ref->confirm_kill)
+   ref->confirm_kill(ref);
+
/*
 * Now we're in single atomic_t mode with a consistent refcount, so it's
 * safe to drop our initial ref:
@@ -97,22 +101,29 @@ static void percpu_ref_kill_rcu(struct r
 }
 
 /**
- * percpu_ref_kill - safely drop initial ref
+ * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation
  * @ref: percpu_ref to kill
+ * @confirm_kill: optional confirmation callback
  *
- * Must be used to drop the initial ref on a percpu refcount; must be called
- * precisely once before shutdown.
+ * Equivalent to percpu_ref_kill() but also schedules kill confirmation if
+ * @confirm_kill is not NULL.  @confirm_kill, which may not block, will be
+ * called after @ref is seen as dead from all CPUs - all further
+ * invocations of percpu_ref_tryget() will fail.  See percpu_ref_tryget()
+ * for more details.
  *
- * Puts @ref in non percpu mode, then does a call_rcu() before gathering up the
- * percpu counters and dropping the initial ref.
+ * It's guaranteed that there will be at least one full RCU grace period
+ * between the invocation of this function and @confirm_kill and the caller
+ * can piggy-back their RCU release on the callback.
  */
-void percpu_ref_kill(struct percpu_ref *ref)
+void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
+percpu_ref_func_t *confirm_k

Re: [PATCH percpu/for-3.11 1/2] percpu-refcount: cosmetic updates

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 01:45:32PM -0700, Tejun Heo wrote:
> From 49d1e1a6561f1e4b190695f36d2594c7c04b8e76 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Wed, 12 Jun 2013 13:37:42 -0700
> 
> * s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have _t
>   postfix for types and the type is gonna be used for a different type
>   of callback too.
> 
> * Add @ARG to function comments.
> 
> * Drop unnecessary and unaligned indentation from percpu_ref_init()
>   function comment.
> 
> Signed-off-by: Tejun Heo 

Applied to percpu/for-3.11.  Thanks.

-- 
tejun


Re: [PATCH percpu/for-3.11] percpu-refcount: consistently use plain (non-sched) RCU

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 01:40:32PM -0700, Tejun Heo wrote:
> From e96262150a513ce3d54ff221d4ace8aeec96e0bf Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Wed, 12 Jun 2013 13:37:42 -0700
> 
> percpu_ref_get/put() are using preempt_disable/enable() while
> percpu_ref_kill() is using plain call_rcu() instead of
> call_rcu_sched().  This is buggy as grace periods of the two may not
> match.  Fix it by using plain RCU in percpu_ref_get/put().
> 
> (I suggested using sched RCU in the first place but there's no actual
>  benefit in doing so unless we're gonna introduce different variants
>  of get/put to be called while preemption is already disabled, which we
>  definitely shouldn't.)
> 
> Signed-off-by: Tejun Heo 
> Reported-by: Rusty Russell 

Applied to percpu/for-3.11.

Thanks.

-- 
tejun
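
The flavor-matching rule being fixed here can be illustrated with a toy model in plain C. The RCU names below are mocks of the kernel API, not its implementation: each flavor's grace period only waits for its own readers, so a read side entered with preempt_disable() (sched RCU) is invisible to a plain-RCU grace period.

```c
#include <stdbool.h>

/* Toy model of mismatched RCU flavors.  A plain-RCU grace period only
 * has to wait for plain-RCU readers; if get/put used the sched flavor
 * (preempt_disable/enable) while kill used plain call_rcu(), the plain
 * grace period could end while a sched-flavor reader is still inside
 * its critical section. */

static int plain_readers;
static int sched_readers;

void rcu_read_lock(void)   { plain_readers++; }   /* plain read side */
void rcu_read_unlock(void) { plain_readers--; }
void preempt_disable(void) { sched_readers++; }   /* sched-RCU read side */
void preempt_enable(void)  { sched_readers--; }

/* A plain-RCU grace period may end once all *plain* readers are done. */
bool plain_grace_period_may_end(void)
{
	return plain_readers == 0;
}
```

With mismatched flavors the "grace period" can end while a reader is active, which is exactly the bug the patch closes by making get/put use plain rcu_read_lock/unlock.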


[GIT PULL] pstore/ram for 3.11

2013-06-12 Thread Rob Herring
Not sure who takes this, but please pull these 2 changes for pstore for
3.11. These are necessary to get pstore to work with on-chip RAM on
Calxeda highbank platform.

I dropped the write-combine change from this series. While write-combine
is technically the correct mapping needed for ARM, it doesn't solve any
problem for me.

Rob

The following changes since commit c7788792a5e7b0d5d7f96d0766b4cb6112d47d75:

  Linux 3.10-rc2 (2013-05-20 14:37:38 -0700)

are available in the git repository at:

  git://sources.calxeda.com/kernel/linux.git tags/pstore-for-3.11

for you to fetch changes up to da98a8882b3b388e9580821c8ad281c85f94d025:

  pstore/ram: avoid atomic accesses for ioremapped regions (2013-05-21 15:32:44 -0500)


2 changes for pstore ram:

- Don't use atomic ops on ioremapped RAM which is not supported on ARM
- Allow non-power of 2 sized RAM buffers


Rob Herring (2):
  pstore/ram: remove the power of buffer size limitation
  pstore/ram: avoid atomic accesses for ioremapped regions

 fs/pstore/ram.c  |  2 --
 fs/pstore/ram_core.c | 54 ++--
 2 files changed, 52 insertions(+), 4 deletions(-)


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Bjorn Helgaas
On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh  wrote:
> (2013/06/12 13:45), Bjorn Helgaas wrote:
>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>
>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>  wrote:
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>
 I'm not sure you need to reset legacy devices (or non-PCI devices)
 yet, but the current hook isn't anchored anywhere -- it's just an
 fs_initcall() that doesn't give the reader any clue about the
 connection between the reset and the problem it's solving.

 If we do something like this patch, I think it needs to be done at the
 point where we enable or disable the IOMMU.  That way, it's connected
 to the important event, and there's a clue about how to make
 corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>
>> I looked at various IOMMU init places today, and it's far more
>> complicated and varied than I had hoped.
>>
>> This reset scheme depends on enumerating PCI devices before we
>> initialize the IOMMU used by those devices.  x86 works that way today,
>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>
> Sorry, could you tell me which part depends on architecture?

Your patch works if PCIe devices are reset before the kdump kernel
enables the IOMMU.  On x86, this is possible because PCI enumeration
happens before the IOMMU initialization.  On sparc, the IOMMU is
initialized before PCI devices are enumerated, so there would still be
a window where ongoing DMA could cause an IOMMU error.

Of course, it might be possible to reorganize the sparc code to do the
IOMMU init *after* it enumerates PCI devices.  But I think that change
would be hard to justify.

And I think even on x86, it would be better if we did the IOMMU init
before PCI enumeration -- the PCI devices depend on the IOMMU, so
logically the IOMMU should be initialized first so the PCI devices can
be associated with it as they are enumerated.

>> example).  And I think conceptually, the IOMMU should be enumerated
>> and initialized *before* the devices that use it.
>>
>> So I'm uncomfortable with that aspect of this scheme.
>>
>> It would be at least conceivable to reset the devices in the system
>> kernel, before the kexec.  I know we want to do as little as possible
>> in the crashing kernel, but it's at least a possibility, and it might
>> be cleaner.
>
> I bet this will not be accepted by the kdump maintainer. Everything in
> the panic kernel is unreliable.

kdump is inherently unreliable.  The kdump kernel doesn't start from
an arbitrary machine state.  We don't expect it to tolerate all CPUs
running, for example.  Maybe it shouldn't be expected to tolerate PCI
devices running, either.

Bjorn


Re: [PATCH] kernel: timer: looping issue, need reset variable 'found'

2013-06-12 Thread Chen Gang

First of all, sorry for the late reply these past few days.


On 06/10/2013 10:12 PM, Thomas Gleixner wrote:
> On Sun, 9 Jun 2013, Chen Gang wrote:
> 
>> > 
>> > According to __internal_add_timer(), in _next_timer_interrupt(), when
>> > 'tv1.vec' find one, but need 'cascade bucket(s)', we still need find
>> > each slot of 'tv*.vec'.
> No, we do not. We only need to scan the first cascade array after the
> enqueued timer. If there is nothing in tv2 which might come before the
> found timer, then any timer in tv3 will be later than the one we found
> in the primary wheel.
> 

OK, thanks. I should read the details to confirm it (at least for now, I
am still not quite sure about it).


>> > So need reset variable 'found', so can fully scan ''do {...} while()''
>> > for 'tv*.vec'.
> And thereby lose the information, that we already found a timer in the
> scan of the primary array.

Hmm..., I should read the details.


And excuse me, I also have other things to do during these days, so I
may reply late.


Thanks.
-- 
Chen Gang

Asianux Corporation


Re: [PATCH] USB: serial/ftdi_sio.c Fix kernel oops

2013-06-12 Thread Ben Hutchings
On Fri, 2013-06-07 at 15:14 +0200, Lotfi Manseur wrote:
> Handle null termios in ftdi_set_termios(), introduced in
> commit 552f6bf1bb0eda0011c0525dd587aa9e7ba5b846
> This has been corrected in the mainline by
> commits c515598e0f5769916c31c00392cc2bfe6af74e55 and
> a816e3113b63753c330ca4751ea1d208e93e3015.
> 
> This is to be fixed in longterm 2.6.32.60 and 3.4.47.
> This bug has been found with coccinelle.
> 
> Signed-off-by: Lotfi Manseur 
> Signed-off-by: Nicolas Palix 

I've queued up those changes for 3.2.  This backported version seems
nicer, but we generally prefer to use patches that are as close as
possible to those in mainline.

Ben.

> ---
>  drivers/usb/serial/ftdi_sio.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/usb/serial/ftdi_sio.c b/drivers/usb/serial/ftdi_sio.c
> index c374beb..615bd9e 100644
> --- a/drivers/usb/serial/ftdi_sio.c
> +++ b/drivers/usb/serial/ftdi_sio.c
> @@ -2364,7 +2364,8 @@ static void ftdi_set_termios(struct tty_struct *tty,
>  
>   cflag = termios->c_cflag;
>  
> - if (old_termios->c_cflag == termios->c_cflag
> + if (old_termios
> + && old_termios->c_cflag == termios->c_cflag
>   && old_termios->c_ispeed == termios->c_ispeed
>   && old_termios->c_ospeed == termios->c_ospeed)
>   goto no_c_cflag_changes;
> @@ -2373,7 +2374,8 @@ static void ftdi_set_termios(struct tty_struct *tty,
>  ftdi_sio_read_bulk_callback  - need to examine what this means -
>  don't see any problems yet */
>  
> - if ((old_termios->c_cflag & (CSIZE|PARODD|PARENB|CMSPAR|CSTOPB)) ==
> + if (old_termios &&
> + (old_termios->c_cflag & (CSIZE|PARODD|PARENB|CMSPAR|CSTOPB)) ==
>   (termios->c_cflag & (CSIZE|PARODD|PARENB|CMSPAR|CSTOPB)))
>   goto no_data_parity_stop_changes;
>  

-- 
Ben Hutchings
friends: People who know you well, but like you anyway.




Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Takao Indoh
(2013/06/12 22:19), Don Dutile wrote:
> On 06/11/2013 07:19 PM, Sumner, William wrote:
>>
>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
 On Fri, Jun 7, 2013 at 2:46 AM, Takao Indoh  
 wrote:
> (2013/06/07 13:14), Bjorn Helgaas wrote:

>> One thing I'm not sure about is that you are only resetting PCIe
>> devices, but I don't think the problem is actually specific to PCIe,
>> is it?  I think the same issue could occur on any system with an
>> IOMMU.  In the x86 PC world, most IOMMUs are connected with PCIe, but
>> there are systems with IOMMUs for plain old PCI devices, e.g.,
>> PA-RISC.
>
> Right, this is not specific to PCIe. The reason why the target is only
> PCIe is just to keep the reset algorithm simple. It is possible to reset
> legacy PCI devices in my patch, but code becomes somewhat complicated. I
> thought recently most systems used PCIe and there was little demand for
> resetting legacy PCI. Therefore I decided not to reset legacy PCI
> devices, but I'll do if there are requests :-)

 I'm not sure you need to reset legacy devices (or non-PCI devices)
 yet, but the current hook isn't anchored anywhere -- it's just an
 fs_initcall() that doesn't give the reader any clue about the
 connection between the reset and the problem it's solving.

 If we do something like this patch, I think it needs to be done at the
 point where we enable or disable the IOMMU.  That way, it's connected
 to the important event, and there's a clue about how to make
 corresponding fixes for other IOMMUs.
>>>
>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
 We already have a "reset_devices" boot option.  This is for the same
 purpose, as far as I can tell, and I'm not sure there's value in
 having both "reset_devices" and "pci=pcie_reset_endpoint_devices".  In
 fact, there's nothing specific even to PCI here.  The Intel VT-d docs
 seem carefully written so they could apply to either PCIe or non-PCI
 devices.
>>>
>>> Yeah, I can integrate this option into reset_devices. The reason why
>>> I separate them is to avoid regression.
>>>
>>> I have tested my patch on many machines and basically it worked, but I
>>> found two machines where this reset patch caused problem. The first one
>>> was caused by bug of raid card firmware. After updating firmware, this
>>> patch worked. The second one was due to bug of PCIe switch chip. I
>>> reported this bug to the vendor but it is not fixed yet.
>>>
>>> Anyway, this patch may cause problems on such a buggy machine, so I
>>> introduced a new boot parameter so that the user can enable or disable this
>>> function aside from reset_devices.
>>>
>>> Actually Vivek Goyal, kdump maintainer said same thing. He proposed using
>>> reset_devices instead of adding a new one, and using a quirk or something to
>>> avoid such buggy devices.
>>
>> With respect to "and using quirk or something to avoid such buggy devices",
>> I believe that it will be necessary to provide a mechanism for devices that
>> need special handling to do the reset -- perhaps something like a list
>> of tuples: (device_type, function_to_call) with a default function_to_call
>> when the device_type is not found in the list.  These functions would
>> need to be physically separate from the device driver because if the device
>> is present it needs to be reset even if the crash kernel chooses not to load
>> the driver for that device.
>>
>>>
>>> So, basically I agree with using reset_devices, but I want to prepare
>>> workaround in case this reset causes something wrong.
>>>
>> I like the ability to specify the original "reset_devices" separately from
>> invoking this new mechanism.  With so many different uses for Linux in
>> so many different environments and with so many different device drivers
>> it seems reasonable to keep the ability to tell the device drivers to
>> reset their devices -- instead of pulling the reset line on all devices.
>>
>> I also like the ability to invoke the new reset feature separately from
>> telling the device drivers to do it.
>>
>>>

>> I tried to make a list of the interesting scenarios and the events
>> that are relevant to this problem:
>>
>>Case 1: IOMMU off in system, off in kdump kernel
>>  system kernel leaves IOMMU off
>>DMA targets system-kernel memory
>>  kexec to kdump kernel (IOMMU off, devices untouched)
>>DMA targets system-kernel memory (harmless)
>>  kdump kernel re-inits device
>>DMA targets kdump-kernel memory
>>
>>Case 2: IOMMU off in system kernel, on in kdump kernel
>>  system kernel leaves IOMMU off
>>DMA targets system-kernel memory
>>  kexec to kdump kernel (IOMMU off, devices untouched)
>>DMA targets system-kernel memory (harmless)
>>  kdu

Re: [RFC PATCH v3 2/2] drivers: mfd: vexpress: add Serial Power Controller (SPC) support

2013-06-12 Thread Nicolas Pitre
On Thu, 13 Jun 2013, Samuel Ortiz wrote:

> > +static struct vexpress_spc_drvdata *info;
> > +static u32 *vexpress_spc_config_data;
> > +static struct vexpress_config_bridge *vexpress_spc_config_bridge;
> > +static struct vexpress_config_func *opp_func, *perf_func;
> > +
> > +static int vexpress_spc_load_result = -EAGAIN;
> As I said, quite static...
> 
> > +static bool __init __vexpress_spc_check_loaded(void);
> > +/*
> > + * Pointer spc_check_loaded is swapped after init hence it is safe
> > + * to initialize it to a function in the __init section
> > + */
> > +static bool (*spc_check_loaded)(void) __refdata = 
> > &__vexpress_spc_check_loaded;
> > +
> > +static bool __init __vexpress_spc_check_loaded(void)
> > +{
> > +   if (vexpress_spc_load_result == -EAGAIN)
> > +   vexpress_spc_load_result = vexpress_spc_init();
> > +   spc_check_loaded = &vexpress_spc_initialized;
> > +   return vexpress_spc_initialized();
> > +}
> > +
> > +/*
> > + * Function exported to manage early_initcall ordering.
> > + * SPC code is needed very early in the boot process
> > + * to bring CPUs out of reset and initialize power
> > + * management back-end. After boot swap pointers to
> > + * make the functionality check available to loadable
> > + * modules, when early boot init functions have been
> > + * already freed from kernel address space.
> > + */
> > +bool vexpress_spc_check_loaded(void)
> > +{
> > +   return spc_check_loaded();
> > +}
> > +EXPORT_SYMBOL_GPL(vexpress_spc_check_loaded);
> That one and the previous function look really nasty to me.
> The simple fact that you need a static variable in your code to check if
> your module is loaded sounds really fishy.

Maybe a bit of context might explain why things are done this way.

This code is, amongst other things, needed to properly bring up 
secondary CPUs during boot.  This means that it has to be initialized at 
the early_initcall level before SMP is initialized.  So that rules out 
most of the device driver niceties which are not initialized yet, and 
that also means that this can't be made into a loadable module.

To make things even more subtle, this code is needed to implement the 
backend for the MCPM layer through which the actual bringup of secondary 
CPUs is done.  So that MCPM backend is itself also initialized at the 
early_initcall level, but we don't know and can't enforce the order in 
which those different things are initialized.  So the MCPM backend calls 
vexpress_spc_check_loaded() to initialize it if it has not been 
initialized yet, or return false if there is no SPC on the booted 
system.

Yet this code is also the entry point for some operating point changes 
requested by the cpufreq driver, and that one can be modular.  And the 
cpufreq driver also has to test for the presence of a SPC via 
vexpress_spc_check_loaded().

So that's the main reason why there is a static state variable, and why 
there is a switch of function pointed by spc_check_loaded to let the 
init code go after boot and still be able to be referenced by modules 
after boot.
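
The pointer-swap pattern Nicolas describes can be modelled in plain userspace C. Everything below is illustrative (the names are not the kernel's): the first caller runs the one-time init path, which then swaps the pointer so later callers take the cheap check, mirroring how the real __init function can be freed after boot while modules keep a safe entry point.

```c
#include <stdbool.h>

static bool spc_initialized;

/* Cheap post-boot path: just report whether init succeeded. */
static bool check_initialized(void)
{
	return spc_initialized;
}

static bool check_loaded_first_time(void);

/* Starts out pointing at the init path; swapped after the first call. */
static bool (*spc_check_loaded_ptr)(void) = check_loaded_first_time;

static bool check_loaded_first_time(void)
{
	spc_initialized = true;                   /* stands in for vexpress_spc_init() */
	spc_check_loaded_ptr = check_initialized; /* init path never runs again */
	return check_initialized();
}

/* Stable entry point; safe for late callers such as loadable modules. */
bool spc_check_loaded(void)
{
	return spc_check_loaded_ptr();
}
```

Callers, including late-loaded modules, only ever invoke spc_check_loaded(); the init path runs at most once and is never referenced again afterward.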


Nicolas


Re: [josef-btrfs:master 61/69] fs/btrfs/delayed-inode.c:393:1: warning: control reaches end of non-void function

2013-06-12 Thread Fengguang Wu
On Wed, Jun 12, 2013 at 12:32:11PM -0700, Zach Brown wrote:
> >fs/btrfs/delayed-inode.c: In function 'get_ins_del_root':
> > >> fs/btrfs/delayed-inode.c:393:1: warning: control reaches end of non-void 
> > >> function [-Wreturn-type]
> > 
> > vim +393 fs/btrfs/delayed-inode.c
> > 
> >385  static struct rb_root *get_ins_del_root(struct 
> > btrfs_delayed_node *delayed_node,
> >386  int ins_del)
> >387  {
> >388  if (ins_del == BTRFS_DELAYED_INSERTION_ITEM)
> >389  return &delayed_node->ins_root;
> >390  if (ins_del == BTRFS_DELAYED_DELETION_ITEM)
> >391  return &delayed_node->del_root;
> >392  BUG();
> >  > 393  }
> 
> Hrmph.
> 
> Is this false positive worth working around?  What version of gcc is
> this from?

gcc (Debian 4.7.1-6) 4.7.1

> I'd expect the unreachable() in BUG() to silence this
> warning on modern gcc.

Yeah, it's strange that gcc still complains about that. Anyway I'll
detect and ignore this kind of false positive in my report scripts.

Thanks,
Fengguang
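
For reference, the pattern at issue can be reproduced standalone (illustrative names, not the btrfs code): marking the statically impossible tail with __builtin_unreachable(), which the kernel's unreachable(), and hence BUG() on annotated configs, expands to, is what normally silences -Wreturn-type.

```c
/* Sketch: the function's tail is statically impossible, and
 * __builtin_unreachable() tells gcc so, silencing
 * "control reaches end of non-void function". */

enum ins_del { INSERTION_ITEM = 1, DELETION_ITEM = 2 };

int pick_root(enum ins_del which)
{
	if (which == INSERTION_ITEM)
		return 1;
	if (which == DELETION_ITEM)
		return 2;
	__builtin_unreachable();	/* callers never pass anything else */
}
```

Whether a given gcc version propagates the annotation through the BUG() macro expansion is exactly the question raised above.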


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Li Zefan
On 2013/6/13 10:38, Kent Overstreet wrote:
> On Thu, Jun 13, 2013 at 10:36:40AM +0800, Li Zefan wrote:
>> On 2013/6/13 5:03, Tejun Heo wrote:
>>> There's no point in using kmalloc() and list_del() instead of the
>>> clearing variants for trivial stuff.  We can live dangerously
>>> elsewhere.  Use kzalloc() and list_del_init() instead and drop 0
>>> inits.
>>>
>>
>> Do you mean we prefer list_del_init() than list_del() in general? Then
>> in which cases do we prefer list_del()?
> 
> IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> it gets taken off a list and then it's freed).

yeah, this is what I have in mind. I would wonder why list_del_init() was
used if I know that the object won't be used anymore.

> list_del_init() could
> hide bugs.
> 

Same here. I do worry a bit about this.
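
The semantic difference under discussion can be sketched in userspace C. The poison values here are illustrative stand-ins for the kernel's LIST_POISON1/2, and the helpers are simplified models of list.h:

```c
struct list_head { struct list_head *next, *prev; };

#define LIST_POISON1 ((struct list_head *)0x100)
#define LIST_POISON2 ((struct list_head *)0x200)

void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

static void __list_del_entry(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* list_del(): the entry must not be reused; poisoned pointers make
 * accidental reuse crash loudly instead of silently corrupting lists. */
void list_del(struct list_head *e)
{
	__list_del_entry(e);
	e->next = LIST_POISON1;
	e->prev = LIST_POISON2;
}

/* list_del_init(): the entry remains a valid empty list and may be
 * re-added later -- which is what can hide a bug when the object was
 * never meant to be reused. */
void list_del_init(struct list_head *e)
{
	__list_del_entry(e);
	INIT_LIST_HEAD(e);
}
```

This is why list_del() is preferred for objects about to be freed: with DEBUG_LIST or slab debugging, touching the poisoned pointers faults immediately.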




Re: [RFC PATCH] sched: smart wake-affine

2013-06-12 Thread Michael Wang
Hi, Peter

Would you like to give some comments on this approach? or may be just
some hint on what's the concern? the damage on pgbench is still there...

Regards,
Michael Wang


On 05/28/2013 01:05 PM, Michael Wang wrote:
> wake-affine stuff always tries to pull the wakee close to the waker; in
> theory, this brings benefit if the waker's cpu has cached hot data for the
> wakee, or in the extreme ping-pong case.
> 
> And testing shows it can benefit hackbench by up to 15%.
> 
> However, the whole mechanism is somewhat blind and time-consuming, so some
> workloads suffer.
> 
> And testing shows it can damage pgbench by up to 50%.
> 
> Thus, wake-affine stuff should be smarter, and realise when to stop
> its thankless effort.
> 
> This patch introduces a per-task 'nr_wakee_switch', which is increased
> each time the task switches its wakee.
> 
> So a high 'nr_wakee_switch' means the task has more than one wakee, and
> the fewer the wakees, the higher the wakeup frequency.
> 
> Now when making the decision on whether to pull or not, pay attention to
> a wakee with a high 'nr_wakee_switch': pulling such a task may benefit the
> wakee, but it implies that the waker will face cruel competition later; it
> could be very cruel or very fast depending on the story behind
> 'nr_wakee_switch', and either way the waker suffers.
> 
> Furthermore, if the waker also has a high 'nr_wakee_switch', that implies
> multiple tasks rely on it; the waker's higher latency will damage all those
> tasks, so pulling the wakee in such cases seems to be a bad deal.
> 
> Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes higher
> and higher, the deal becomes worse and worse.
> 
> This patch therefore helps the wake-affine stuff to stop its work when:
> 
>   wakee->nr_wakee_switch > factor &&
>   waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
> 
> The factor here is the online cpu number, so more cpus will lead to more
> pulls since the trial becomes more severe.
> 
> After applying the patch, pgbench shows up to a 42% improvement.
> 
> Test:
>   Test with 12 cpu X86 server and tip 3.10.0-rc1.
> 
>   basesmart
> 
>   | db_size | clients |  tps  | |  tps  |
>   +-+-+---+ +---+
>   | 21 MB   |   1 | 10749 | | 10337 |
>   | 21 MB   |   2 | 21382 | | 21391 |
>   | 21 MB   |   4 | 41570 | | 41808 |
>   | 21 MB   |   8 | 52828 | | 58792 |
>   | 21 MB   |  12 | 48447 | | 54553 |
>   | 21 MB   |  16 | 46246 | | 56726 | +22.66%
>   | 21 MB   |  24 | 43850 | | 56853 | +29.65%
>   | 21 MB   |  32 | 43455 | | 55846 | +28.51%
>   | 7483 MB |   1 |  9290 | |  8848 |
>   | 7483 MB |   2 | 19347 | | 19351 |
>   | 7483 MB |   4 | 37135 | | 37511 |
>   | 7483 MB |   8 | 47310 | | 50210 |
>   | 7483 MB |  12 | 42721 | | 49396 |
>   | 7483 MB |  16 | 41016 | | 51826 | +26.36%
>   | 7483 MB |  24 | 37540 | | 52579 | +40.06%
>   | 7483 MB |  32 | 36756 | | 51332 | +39.66%
>   | 15 GB   |   1 |  8758 | |  8670 |
>   | 15 GB   |   2 | 19204 | | 19249 |
>   | 15 GB   |   4 | 36997 | | 37199 |
>   | 15 GB   |   8 | 46578 | | 50681 |
>   | 15 GB   |  12 | 42141 | | 48671 |
>   | 15 GB   |  16 | 40518 | | 51280 | +26.56%
>   | 15 GB   |  24 | 36788 | | 52329 | +42.24%
>   | 15 GB   |  32 | 36056 | | 50350 | +39.64%
> 
> 
> 
> CC: Ingo Molnar 
> CC: Peter Zijlstra 
> CC: Mike Galbraith 
> Signed-off-by: Michael Wang 
> ---
>  include/linux/sched.h |3 +++
>  kernel/sched/fair.c   |   45 +
>  2 files changed, 48 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 178a8d9..1c996c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1041,6 +1041,9 @@ struct task_struct {
>  #ifdef CONFIG_SMP
>   struct llist_node wake_entry;
>   int on_cpu;
> + struct task_struct *last_wakee;
> + unsigned long nr_wakee_switch;
> + unsigned long last_switch_decay;
>  #endif
>   int on_rq;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f62b16d..eaaceb7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3127,6 +3127,45 @@ static inline unsigned long effective_load(struct 
> task_group *tg, int cpu,
>  
>  #endif
>  
> +static void record_wakee(struct task_struct *p)
> +{
> + /*
> +  * Rough decay, don't worry about the boundary, really active
> +  * task won't care the loose.
> +  */
> + if (jiffies > current->last_switch_decay + HZ) {
> + current->nr_wakee_switch = 0;
> + current->last_switch_decay = jiffies;
> + }
> +
> + if (current->last_wakee != p) {
> + current->last_wakee = p;
> + current->nr_wakee_switch++;
> + }
> +}
> +
> +static int nasty_pull(struct task_struct *p)
> +{
> + in
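
The stop condition described in the changelog above can be sketched as a standalone predicate. This is a sketch with illustrative names, not the kernel's code; the factor is the online CPU count, as described:

```c
#include <stdbool.h>

struct task_stats {
	unsigned long nr_wakee_switch;
};

/* Sketch of the throttle from the changelog: skip the affine pull when
 * the wakee switches wakees often and the waker does so proportionally
 * even more, since pulling would put the waker into cruel competition. */
bool wake_affine_throttled(const struct task_stats *waker,
			   const struct task_stats *wakee,
			   unsigned long factor /* online CPUs */)
{
	return wakee->nr_wakee_switch > factor &&
	       waker->nr_wakee_switch > factor * wakee->nr_wakee_switch;
}
```

On a 12-CPU box, a waker with 300 recent wakee switches waking a wakee with 20 would be throttled; a quiet wakee would still be pulled.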

Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 07:56:23PM -0700, Tejun Heo wrote:
> poison.  Maybe it was something which got stuck in my brain from
> before the git history or I'm just hallucinating.  Anyways, yeap,

Just checked 2.4.  It didn't poison then.  Somehow I never got that
out of my brain all these years.  #feelingancient

-- 
tejun


Re: Linux 3.2.46

2013-06-12 Thread Ben Hutchings
On Fri, 2013-06-07 at 17:20 +0800, Li Zefan wrote:
> On 2013/5/31 21:06, Ben Hutchings wrote:
> > I'm announcing the release of the 3.2.46 kernel.
> > 
> > All users of the 3.2 kernel series should upgrade.
> > 
> > The updated 3.2.y git tree can be found at:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git 
> > linux-3.2.y
> > and can be browsed at the normal kernel.org git web browser:
> > http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git
> > 
> 
> I guess you'll get compile error if you enable ftrace but disable 
> CONFIG_DYNAMIC_FTRACE,
> and you need to backport 7f49ef69db6bbf756c0abca7e9b65b32e999eec8.

Thanks, I've added that to the queue for the next update.

Ben.

-- 
Ben Hutchings
friends: People who know you well, but like you anyway.




Re: [PATCH] Percpu tag allocator

2013-06-12 Thread Andrew Morton
On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet  
wrote:

> ...
>
> > Why can't we use ida_get_new_above()?
> > 
> >If it is "speed" then how bad is the problem and what efforts have
> >been made to address them within the idr code?  (A per-cpu magazine
> >frontend to ida_get_new_above() might suit).
> > 
> >If it is "functionality" then what efforts have been made to
> >suitably modify the ida code?
> 
> Originally, it was because every time I opened idr.[ch] I was confronted
> with an enormous pile of little functions, with very little indication
> in the way of what they were all trying to do or which ones I might want
> to start with.
> 
> Having finally read enough of the code to maybe get a handle on what
> it's about - performance is a large part of it, but idr is also a more
> complicated api that does more than what I wanted.

They all sound like pretty crappy reasons ;) If the idr/ida interface
is nasty then it can be wrapped to provide the same interface as the
percpu tag allocator.

I could understand performance being an issue, but diligence demands
that we test that, or at least provide a convincing argument.

>
> ...
>
> > > + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > > +
> > > + if (remote == tags)
> > > + continue;
> > > +
> > > + clear_bit(cpu, pool->cpus_have_tags);
> > > +
> > > + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> > 
> > (looks to see what TAG_CPU_STEALING does)
> > 
> > OK, this looks pretty lame.  It adds a rare fallback to the central
> > allocator, which makes that path hard to test.  And it does this at
> > basically the same cost of plain old spin_lock().  I do think it would
> > be better to avoid the nondeterministic code and use plain old
> > spin_lock().  I appreciate the lock ranking concern, but you're a
> > clever chap ;)
> 
> Yeah, the cmpxchg() stuff turned out trickier than I'd hoped - it's
> actually the barriers (guarding against a race with percpu_tag_free())
> that concern me more than that fallback.
> 
> I did torture test this code quite a bit and I'm not terribly eager to
> futz with it more, but I may try switching to spinlocks for the percpu
> freelists to see how it works out - I had the idea that I might be able
> to avoid some writes to shared cachelines if I can simplify that stuff,
> which would probably make it worth it.

The nice thing about a lock per cpu is that the stealing CPU can grab
it and then steal a bunch of tags without releasing the lock: less
lock-taking.  Maybe the current code does that amortisation as well;
my intent-reverse-engineering resources got depleted.

Also, with a lock-per-cpu the stolen-from CPU just spins, so the ugly
TAG_CPU_STEALING fallback-to-global-allocator thing goes away.

> > Also, I wonder if this was all done in the incorrect order.  Why make
> > alloc_local_tag() fail?  steal_tags() could have just noodled off and
> > tried to steal from the next CPU if alloc_local_tag() was in progress
> > on this CPU?
> 
> steal_tags() can't notice that alloc_local_tag() is in progress,

Yes it can - the stolen-from CPU sets a flag in its cpu-local object
while in its critical section.  The stealing CPU sees that and skips.

I think all this could be done with test_and_set_bit() and friends,
btw.  xchg() hurts my brain.

> alloc_local_tag() just uses cmpxchg to update nr_free - there's no
> sentinel value or anything.
> 
> OTOH, if alloc_local_tag() sees TAG_CPU_STEALING - the next thing that's
> going to happen is steal_tags() is going to set nr_free to 0, and the
> only way tags are going to end up back on our cpu's freelist is if we
> fail and do something else - we've got irqs disabled so
> percpu_tag_free() can't do it.
> 
> ...
>
> > What actions can a driver take if an irq-time tag allocation attempt
> > fails?  How can the driver author test those actions?
> 
> You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> allocation, I suppose.
> 
> Did you have something else in mind that could be implemented? I don't
> want to add code for a reserve to this code - IMO, if a reserve is
> needed it should be done elsewhere (like how mempools work on top of
> existing allocators).

Dunno, really - I'm just wondering what the implications of an
allocation failure will be.  I suspect it's EIO?  Which in some
circumstances could be as serious as total system failure (loss of
data), so reliability/robustness is a pretty big deal.


Another thing: CPU hot-unplug.  If it's "don't care" then the
forthcoming changelog should thoroughly explain why, please.  Otherwise
it will need a notifier to spill back any reservation.
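
The "per-cpu magazine frontend" idea mentioned above can be sketched like this. It is a single-CPU model with illustrative names; central_alloc() stands in for the shared, locked allocator such as ida_get_new_above():

```c
#define MAG_SIZE 4

struct magazine {
	unsigned int tags[MAG_SIZE];
	unsigned int nr_free;
};

/* Stand-in for the shared, locked allocator. */
static unsigned int next_central_tag;

static unsigned int central_alloc(void)
{
	return next_central_tag++;
}

/* Fast path: pop from the per-cpu magazine; fall back to the central
 * allocator only when the magazine is empty. */
unsigned int tag_alloc(struct magazine *m)
{
	if (m->nr_free)
		return m->tags[--m->nr_free];
	return central_alloc();
}

/* Free into the local magazine while there is room; a full magazine
 * would be spilled back to the central allocator (omitted here). */
void tag_free(struct magazine *m, unsigned int tag)
{
	if (m->nr_free < MAG_SIZE)
		m->tags[m->nr_free++] = tag;
}
```

A real frontend would also need the CPU hot-unplug spill that Andrew raises, plus locking on the central path.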



Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 07:52:02PM -0700, Kent Overstreet wrote:
> On Wed, Jun 12, 2013 at 07:48:59PM -0700, Tejun Heo wrote:
> > On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> > > list_del() does do poisoning - and list debugging is cheaper to enable
> > > than full slab debugging.
> > 
> > Ah, right, now we have DEBUG_LIST.  Completely forgot about that.  I
> > don't think the cost difference matters that much as long as there are
> > enough people running with slab debugging, but, yeah, with DEBUG_LIST,
> > leaving list_del() alone would actually be better.  I'll drop that
> > part.
> 
> I can't remember if it was Fedora or RH (or both?) but in one of those
> they actually leave it enabled in their production kernels. Someone was
> blogging about the bugs it found...

And we poison regardless of DEBUG_LIST and looks like have been doing
that forever.  I have no idea why I was thinking list_del() didn't
poison.  Maybe it was something which got stuck in my brain from
before the git history or I'm just hallucinating.  Anyways, yeap,
list_del() is better.

Thanks.

-- 
tejun


Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
> 
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
> 
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
> 
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]


Hi, Paul,

I simplified the code and removed the search by encoding the index of struct
tkt_q_head into lock->tickets.head.

A) lock->tickets.head(when queued-lock):
-
 index of struct tkt_q_head |0|1|
-

Bit 0 = 1 means queued, and bit 1 = 0 is reserved for __ticket_spin_unlock(),
so __ticket_spin_unlock() will not change the higher bits of
lock->tickets.head.

B) tqhp->head holds the real value of lock->tickets.head.
If the last bit of tqhp->head is 1, it is the head ticket from when queuing
started.

Thanks,
Lai

 kernel/tktqlock.c |   51 +--
 1 files changed, 13 insertions(+), 38 deletions(-)

diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c
index 912817c..1329d0f 100644
--- a/kernel/tktqlock.c
+++ b/kernel/tktqlock.c
@@ -33,7 +33,7 @@ struct tkt_q {
 
 struct tkt_q_head {
arch_spinlock_t *ref;   /* Pointer to spinlock if in use. */
-   s64 head_tkt;   /* Head ticket when started queuing. */
+   __ticket_t head;/* Real head when queued. */
struct tkt_q *spin; /* Head of queue. */
struct tkt_q **spin_tail;   /* Tail of queue. */
 };
@@ -77,15 +77,8 @@ static unsigned long tkt_q_hash(arch_spinlock_t *lock)
  */
 static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock)
 {
-   int i;
-   int start;
-
-   start = i = tkt_q_hash(lock);
-   do
-   if (ACCESS_ONCE(tkt_q_heads[i].ref) == lock)
-   return &tkt_q_heads[i];
-   while ((i = tkt_q_next_slot(i)) != start);
-   return NULL;
+   BUILD_BUG_ON(TKT_Q_NQUEUES > (1 << (TICKET_SHIFT - 2)));
+   return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2];
 }
 
 /*
@@ -101,11 +94,11 @@ static bool tkt_q_try_unqueue(arch_spinlock_t *lock, 
struct tkt_q_head *tqhp)
 
/* Pick up the ticket values. */
asold = ACCESS_ONCE(*lock);
-   if ((asold.tickets.head & ~0x1) == asold.tickets.tail) {
+   if (tqhp->head == asold.tickets.tail) {
 
/* Attempt to mark the lock as not having a queue. */
asnew = asold;
-   asnew.tickets.head &= ~0x1;
+   asnew.tickets.head = tqhp->head;
if (cmpxchg(&lock->head_tail,
asold.head_tail,
asnew.head_tail) == asold.head_tail) {
@@ -128,12 +121,9 @@ void tkt_q_do_wake(arch_spinlock_t *lock)
struct tkt_q_head *tqhp;
struct tkt_q *tqp;
 
-   /*
-* If the queue is still being set up, wait for it.  Note that
-* the caller's xadd() provides the needed memory ordering.
-*/
-   while ((tqhp = tkt_q_find_head(lock)) == NULL)
-   cpu_relax();
+   tqhp = tkt_q_find_head(lock);
+   ACCESS_ONCE(lock->tickets.head) -= __TKT_SPIN_INC;
+   ACCESS_ONCE(tqhp->head) = (tqhp

Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Kent Overstreet
On Wed, Jun 12, 2013 at 07:48:59PM -0700, Tejun Heo wrote:
> On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> > list_del() does do poisoning - and list debugging is cheaper to enable
> > than full slab debugging.
> 
> Ah, right, now we have DEBUG_LIST.  Completely forgot about that.  I
> don't think the cost difference matters that much as long as there are
> enough people running with slab debugging, but, yeah, with DEBUG_LIST,
> leaving list_del() alone would actually be better.  I'll drop that
> part.

I can't remember if it was Fedora or RH (or both?) but in one of those
they actually leave it enabled in their production kernels. Someone was
blogging about the bugs it found...


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Tejun Heo
On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> list_del() does do poisoning - and list debugging is cheaper to enable
> than full slab debugging.

Ah, right, now we have DEBUG_LIST.  Completely forgot about that.  I
don't think the cost difference matters that much as long as there are
enough people running with slab debugging, but, yeah, with DEBUG_LIST,
leaving list_del() alone would actually be better.  I'll drop that
part.

Thanks.

-- 
tejun


Re: [PATCH V2 0/5] input: pxa27x-keypad: enhancement and device tree support

2013-06-12 Thread Chao Xie
hi, Dmitry
What is the status of these patches? Thanks.

On Wed, May 15, 2013 at 2:40 PM, Dmitry Torokhov
 wrote:
> Hi Chao,
>
> On Mon, May 13, 2013 at 04:02:07PM +0800, Chao Xie wrote:
>> hi, dmitry
>> What is your idea about these patches?
>> Do i need add someone else to review them?
>>
>
> No, they look good, give me a couple more days but I should apply them.
> I am going to fold the first 4 into one patch though.
>
>> On Mon, May 6, 2013 at 11:04 AM, Chao Xie  wrote:
>> > The patches include 2 parts
>> > 1. use matrix_keypad for matrix keyes support
>> > 2. add device tree support for pxa27x-keypad
>> >
>> > V2->V1:
>> > Do not copy the members from pdata. For device tree support,
>> > directly allocate the pdata structure.
>> >
>> > Chao Xie (5):
>> >   input: pxa27x-keypad: use matrix_keymap for matrix keyes
>> >   arm: mmp: use matrix_keymap for aspenite
>> >   arm: mmp: use matrix_keymap for teton_bga
>> >   input: pxa27x-keypad: remove the unused members at platform data
>> >   input: pxa27x-keypad: add device tree support
>> >
>> >  .../devicetree/bindings/input/pxa27x-keypad.txt|   60 +
>> >  arch/arm/mach-mmp/aspenite.c   |   10 +-
>> >  arch/arm/mach-mmp/teton_bga.c  |8 +-
>> >  drivers/input/keyboard/Kconfig |1 +
>> >  drivers/input/keyboard/pxa27x_keypad.c |  266 
>> > ++--
>> >  include/linux/platform_data/keypad-pxa27x.h|3 +-
>> >  6 files changed, 325 insertions(+), 23 deletions(-)
>> >  create mode 100644 
>> > Documentation/devicetree/bindings/input/pxa27x-keypad.txt
>> >
>> > --
>> > 1.7.4.1
>> >
>
> --
> Dmitry


Re: [PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

2013-06-12 Thread Takao Indoh
(2013/06/12 13:45), Bjorn Helgaas wrote:
> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
> 
> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>  wrote:
>> (2013/06/11 11:20), Bjorn Helgaas wrote:
> 
>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>> fs_initcall() that doesn't give the reader any clue about the
>>> connection between the reset and the problem it's solving.
>>>
>>> If we do something like this patch, I think it needs to be done at the
>>> point where we enable or disable the IOMMU.  That way, it's connected
>>> to the important event, and there's a clue about how to make
>>> corresponding fixes for other IOMMUs.
>>
>> Ok. pci_iommu_init() is appropriate place to add this hook?
> 
> I looked at various IOMMU init places today, and it's far more
> complicated and varied than I had hoped.
> 
> This reset scheme depends on enumerating PCI devices before we
> initialize the IOMMU used by those devices.  x86 works that way today,
> but not all architectures do (see the sparc pci_fire_pbm_init(), for

Sorry, could you tell me which part depends on the architecture?

> example).  And I think conceptually, the IOMMU should be enumerated
> and initialized *before* the devices that use it.
> 
> So I'm uncomfortable with that aspect of this scheme.
> 
> It would be at least conceivable to reset the devices in the system
> kernel, before the kexec.  I know we want to do as little as possible
> in the crashing kernel, but it's at least a possibility, and it might
> be cleaner.

I bet this will not be accepted by the kdump maintainer. Everything in the
panicking kernel is unreliable.

Thanks,
Takao Indoh



Re: Please backport bee980d9e9642e96351fa3ca9077b853ecf62f57 (xen/events: Handle VIRQ_TIMER before any other hardirq in event loop.) to earlier kernels v3.8...and so on

2013-06-12 Thread Ben Hutchings
On Mon, 2013-06-03 at 08:02 -0400, Konrad Rzeszutek Wilk wrote:
> Hey Greg,
> 
> I hadn't (by mistake) put the CC: sta...@vger.kernel.org on said patch
> (Which is in the Linux kernel).
> 
> If possible please back-port said patch to the existing stable trees.
> Attached is a version that applies to some of the earlier trees easily.

Queued up for 3.2, thanks.

Ben.

-- 
Ben Hutchings
friends: People who know you well, but like you anyway.




Re: [scsi] WARNING: at fs/sysfs/dir.c:530 sysfs_add_one()

2013-06-12 Thread Fengguang Wu
On Wed, Jun 12, 2013 at 07:34:43PM -0700, James Bottomley wrote:
> On Thu, 2013-06-13 at 09:30 +0800, Fengguang Wu wrote:
> > Greetings,
> > 
> > I got the below dmesg and the first bad commit is
> > 
> > commit a256ba092ec57213f96059d41ac5473ff92f5e7c
> > Author: Nicholas Bellinger 
> > Date:   Sat May 18 02:40:43 2013 -0700
> > 
> > scsi: Split scsi_dispatch_cmd into __scsi_dispatch_cmd setup
> > 
> > Signed-off-by: Nicholas Bellinger 
> 
> You may be evaluating a bogus branch because no such commit has darkened
> the SCSI list, so it obviously can't be on upstream track.

Yeah, sorry I should probably not CC linux-scsi.

The bug is still in a private tree:

git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git  scsi-mq

Thanks,
Fengguang


Re: [Suggestion] arch/*/include/asm/bitops.h: about __set_bit() API.

2013-06-12 Thread Chen Gang
On 06/11/2013 03:32 PM, Geert Uytterhoeven wrote:
> On Tue, Jun 11, 2013 at 12:26 AM, Tony Luck  wrote:
>> > On Sat, Jun 8, 2013 at 3:08 AM, Chen Gang  wrote:
>>> >> using 'unsigned int *', implicitly:
>>> >>   ./ia64/include/asm/bitops.h:63:__set_bit (int nr, volatile void *addr)
>> >
>> > There is some downside on ia64 to your suggestion.  If "addr" is properly
>> > aligned for an "int", but misaligned for a long ... i.e. addr%8 == 4, then 
>> > I'll
>> > take an unaligned reference trap if I work with long* where the current 
>> > code
>> > working with int* does not.
>> >
>> > Now perhaps all the callers do guarantee long* alignment?  But I don't 
>> > know.
>> >
>> > Apart from uniformity, there doesn't see to be any upside to changing this.
> The address pointers have been supposed to be "long *" for a very long time.
> Probably alpha (the second official Linux platform) did it
> differently, and never
> standardized to "long *".

Excuse me, I am not quite familiar with the details, but I guess it is
about functional feature issues, not (or not only) about bugs.

All architectures that fully support a 64-bit OS, except ia64 and alpha,
can set all 64 bits (0 to 63) on a 64-bit machine.

I am not quite sure whether any subsystem already sets a higher bit (> 31)
on a 64-bit machine, but in the future it seems they could (at least our
API supposes so).


Thanks.
-- 
Chen Gang

Asianux Corporation


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Kent Overstreet
On Wed, Jun 12, 2013 at 07:41:15PM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jun 12, 2013 at 7:38 PM, Kent Overstreet  
> wrote:
> > IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> > it gets taken off a list and then it's freed). list_del_init() could
> > hide bugs.
> 
> Nah... use-after-frees are detected much more reliably by poisoning
> anyway. Using list_del() instead of list_del_init() to hunt down
> use-after-free isn't a good idea because you're likely to corrupt the
> memory of unrelated area. I really don't see much point in using
> list_del().

list_del() does do poisoning - and list debugging is cheaper to enable
than full slab debugging.


Re: [PATCH RFC 3/3] clk: dt: binding for basic divider clock

2013-06-12 Thread Mike Turquette
Quoting Heiko Stübner (2013-06-03 15:18:32)
> Am Montag, 3. Juni 2013, 19:53:10 schrieb Mike Turquette:
> > Devicetree binding for the basic clock divider, plus the setup function
> > to register the clock.  Based on the existing fixed-clock binding.
> > 
> > Signed-off-by: Mike Turquette 
> > ---
> 
> [...]
> 
> > +/**
> > + * of_div_clk_setup() - Setup function for simple div rate clock
> > + */
> > +void of_divider_clk_setup(struct device_node *node)
> > +{
> > + struct clk *clk;
> > + const char *clk_name = node->name;
> > + void __iomem *reg;
> > + const char *parent_name;
> > + u8 clk_divider_flags = 0;
> > + u8 mask = 0;
> > + u8 shift = 0;
> 
> in the mux-clock these 3 are unsigned long and u32 types ... what is correct?
> 

Good catch.  Shifts should be u8, masks should be u32.  I've left the
flags as u8.  The binding doesn't dictate this and if we end up having
more flags (hopefully not!) then this value can be embiggened.

> 
> > + struct clk_div_table *table;
> > +
> > + of_property_read_string(node, "clock-output-names", &clk_name);
> > +
> > + parent_name = of_clk_get_parent_name(node, 0);
> > +
> > + reg = of_iomap(node, 0);
> > +
> > + if (of_property_read_u8(node, "mask", &mask)) {
> > + pr_err("%s: missing mask property for %s\n", __func__, 
> > node->name);
> > + return;
> > + }
> > +
> > + if (of_property_read_u8(node, "shift", &shift))
> > + pr_debug("%s: missing shift property defaults to zero for 
> > %s\n",
> > + __func__, node->name);
> 
> same here ... mux reads u32
> 
> > + if (of_property_read_bool(node, "index_one"))
> > + clk_divider_flags |= CLK_DIVIDER_ONE_BASED;
> > +
> > + if (of_property_read_bool(node, "index_power_of_two"))
> > + clk_divider_flags |= CLK_DIVIDER_POWER_OF_TWO;
> > +
> > + if (of_property_read_bool(node, "index_allow_zero"))
> > + clk_divider_flags |= CLK_DIVIDER_ALLOW_ZERO;
> > +
> > + table = of_clk_get_div_table(node);
> > + if (IS_ERR(table))
> > + return;
> > +
> > + clk = clk_register_divider_table(NULL, clk_name,
> > + parent_name, 0,
> > + reg, shift, mask,
> > + clk_divider_flags, table,
> > + NULL);
> 
> this causes trouble, as the divider clock code above still requires a width 
> instead of a mask. I remember talk about this going to change separately, but 
> couldn't find anything of the sort in linux-next.

Right.  I viewed creation of the DT bindings a bit separately from the
CCF code, which I think is the right approach.  A mask is definitely a
more useful structure than a width value, and the DT bindings need to be
at least a little future-proof, so I chose a mask.

I'll update the divider code to use a mask and post that as part of the
v2 series.

Regards,
Mike

> 
> 
> 
> > +
> > + if (!IS_ERR(clk))
> > + of_clk_add_provider(node, of_clk_src_simple_get, clk);
> > +}
> > +EXPORT_SYMBOL_GPL(of_divider_clk_setup);
> > +CLK_OF_DECLARE(divider_clk, "divider-clock", of_divider_clk_setup);
> > +#endif


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Tejun Heo
Hello,

On Wed, Jun 12, 2013 at 7:38 PM, Kent Overstreet  wrote:
> IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> it gets taken off a list and then it's freed). list_del_init() could
> hide bugs.

Nah... use-after-frees are detected much more reliably by poisoning
anyway. Using list_del() instead of list_del_init() to hunt down
use-after-free isn't a good idea because you're likely to corrupt the
memory of unrelated area. I really don't see much point in using
list_del().

Thanks.

--
tejun


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Tejun Heo
Hello,

On Wed, Jun 12, 2013 at 7:36 PM, Li Zefan  wrote:
> Do you mean we prefer list_del_init() over list_del() in general? Then

Yes.

> in which cases do we prefer list_del()?

Nowadays, I don't think we ever prefer list_del(). Maybe if it can be
shown that the extra init part is noticeably expensive?

--
tejun


Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Kent Overstreet
On Thu, Jun 13, 2013 at 10:36:40AM +0800, Li Zefan wrote:
> On 2013/6/13 5:03, Tejun Heo wrote:
> > There's no point in using kmalloc() and list_del() instead of the
> > clearing variants for trivial stuff.  We can live dangerously
> > elsewhere.  Use kzalloc() and list_del_init() instead and drop 0
> > inits.
> > 
> 
> Do you mean we prefer list_del_init() over list_del() in general? Then
> in which cases do we prefer list_del()?

IMO, list_del() is preferred when the object shouldn't be reused (i.e.
it gets taken off a list and then it's freed). list_del_init() could
hide bugs.


Re: [PATCH 05/11] cgroup: clean up css_[try]get() and css_put()

2013-06-12 Thread Li Zefan
On 2013/6/13 5:03, Tejun Heo wrote:
> * __css_get() isn't used by anyone.  Fold it into css_get().
> 
> * Add proper function comments to all css reference functions.
> 
> This patch is purely cosmetic.
> 
> Signed-off-by: Tejun Heo 
> ---
>  include/linux/cgroup.h | 48 
>  1 file changed, 24 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0e32855..7b16882 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -94,33 +94,31 @@ enum {
>   CSS_ONLINE  = (1 << 1), /* between ->css_online() and 
> ->css_offline() */
>  };
>  
> -/* Caller must verify that the css is not for root cgroup */
> -static inline void __css_get(struct cgroup_subsys_state *css, int count)
> -{
> - atomic_add(count, &css->refcnt);
> -}
> -
> -/*
> - * Call css_get() to hold a reference on the css; it can be used
> - * for a reference obtained via:
> - * - an existing ref-counted reference to the css
> - * - task->cgroups for a locked task
> +/**
> + * css_get - obtain a reference on the specified css
> + * @css: taget css

s/taget/target

> + *
> + * The caller must already have a reference.
>   */
> -
>  static inline void css_get(struct cgroup_subsys_state *css)
>  {
>   /* We don't need to reference count the root state */
>   if (!(css->flags & CSS_ROOT))
> - __css_get(css, 1);
> + atomic_inc(&css->refcnt);
>  }
>  



Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

2013-06-12 Thread Li Zefan
On 2013/6/13 5:03, Tejun Heo wrote:
> There's no point in using kmalloc() and list_del() instead of the
> clearing variants for trivial stuff.  We can live dangerously
> elsewhere.  Use kzalloc() and list_del_init() instead and drop 0
> inits.
> 

Do you mean we prefer list_del_init() over list_del() in general? Then
in which cases do we prefer list_del()?



Commit f9afbd45b0d0 broke mips r4k.

2013-06-12 Thread Rob Landley
My aboriginal linux project builds tiny linux systems to run under  
qemu, producing as close to the same system as possible across a bunch  
of different architectures. The above change broke the mips r4k build  
I've been running under qemu.


Here's a toolchain and reproduction sequence:

  wget http://landley.net/aboriginal/bin/cross-compiler-mips.tar.bz2
  tar xvjf cross-compiler-mips.tar.bz2
  export PATH=$PWD/cross-compiler-mips/bin:$PATH
  make ARCH=mips allnoconfig KCONFIG_ALLCONFIG=miniconfig.mips
  make CROSS_COMPILE=mips- ARCH=mips

(The file miniconfig.mips is attached.)

It ends:

  CC  init/version.o
  LD  init/built-in.o
arch/mips/built-in.o: In function `local_r4k_flush_cache_page':
c-r4k.c:(.text+0xe278): undefined reference to `kvm_local_flush_tlb_all'
c-r4k.c:(.text+0xe278): relocation truncated to fit: R_MIPS_26 against  
`kvm_local_flush_tlb_all'

arch/mips/built-in.o: In function `local_flush_tlb_range':
(.text+0xe938): undefined reference to `kvm_local_flush_tlb_all'
arch/mips/built-in.o: In function `local_flush_tlb_range':
(.text+0xe938): relocation truncated to fit: R_MIPS_26 against  
`kvm_local_flush_tlb_all'

arch/mips/built-in.o: In function `local_flush_tlb_mm':
(.text+0xed38): undefined reference to `kvm_local_flush_tlb_all'
arch/mips/built-in.o: In function `local_flush_tlb_mm':
(.text+0xed38): relocation truncated to fit: R_MIPS_26 against  
`kvm_local_flush_tlb_all'

kernel/built-in.o: In function `__schedule':
core.c:(.sched.text+0x16a0): undefined reference to  
`kvm_local_flush_tlb_all'
core.c:(.sched.text+0x16a0): relocation truncated to fit: R_MIPS_26  
against `kvm_local_flush_tlb_all'

mm/built-in.o: In function `use_mm':
(.text+0x182c8): undefined reference to `kvm_local_flush_tlb_all'
mm/built-in.o: In function `use_mm':
(.text+0x182c8): relocation truncated to fit: R_MIPS_26 against  
`kvm_local_flush_tlb_all'
fs/built-in.o:(.text+0x7b50): more undefined references to  
`kvm_local_flush_tlb_all' follow

fs/built-in.o: In function `flush_old_exec':
(.text+0x7b50): relocation truncated to fit: R_MIPS_26 against  
`kvm_local_flush_tlb_all'


Revert the above commit and it builds to the end.

Rob

CONFIG_EXPERIMENTAL=y
CONFIG_NO_HZ=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_PCI=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_MAGIC_SYSRQ=y

CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_LOOP=y

CONFIG_IDE=y
CONFIG_IDE_GD=y
CONFIG_IDE_GD_ATA=y
CONFIG_BLK_DEV_IDECD=y

CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_BLK_DEV_SR=y
CONFIG_SCSI_LOWLEVEL=y

CONFIG_NET=y
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_INET=y

CONFIG_NETDEVICES=y
CONFIG_NET_ETHERNET=y
CONFIG_NET_PCI=y
CONFIG_8139CP=y

CONFIG_HW_RANDOM=y

CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_DEV=y

CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_TMPFS=y
CONFIG_MISC_FILESYSTEMS=y
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_ZLIB=y
CONFIG_DEVTMPFS=y

CONFIG_VIRTUALIZATION=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_NET=y
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_9P_FS=y
CONFIG_9P_FS_POSIX_ACL=y

# More random (inexplicable) guard symbols added in 3.2.  TODO: write
# miniconfig expander that automatically sets guard symbols when setting a
# dependent symbol.

CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_NET_VENDOR_REALTEK=y
CONFIG_NET_VENDOR_AMD=y
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NET_VENDOR_8390=y

CONFIG_MIPS_MALTA=y
CONFIG_CPU_MIPS32_R2=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
#CONFIG_PM=y
CONFIG_PCNET32=y
CONFIG_BLK_DEV_PIIX=y



Re: [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link

2013-06-12 Thread Li Zefan
On 2013/6/13 5:03, Tejun Heo wrote:
> cgroups and css_sets are mapped M:N and this M:N mapping is
> represented by struct cg_cgroup_link which points to forms linked
> lists on both sides.  The naming around this already confusing struct
> is pretty bad.
> 
> From cgroup side, it starts off ->css_sets and runs through
> ->cgrp_link_list.  From css_set side, it starts off ->cg_links and
> runs through ->cg_link_list.  This is rather reversed as
> cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
> Also, this is the only place which is still using the confusing "cg"
> for css_sets.  This patch cleans it up a bit.
> 
> * s/cgroup->css_sets/cgroup->cset_links/
>   s/css_set->cg_links/css_set->cgrp_links/
>   s/cgroup_iter->cg_link/cgroup_iter->cset_link/
> 
> * s/cg_cgroup_link/cgrp_cset_link/
> 
> * s/cgrp_cset_link->cg/cgrp_cset_link->cset/
>   s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
>   s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
> 
> * s/init_css_set_link/init_cgrp_cset_link/
>   s/free_cg_links/free_cgrp_cset_links/
>   s/allocate_cg_links/allocate_cgrp_cset_links/
> 
> * s/cgl[12]/link[12]/ in compare_css_sets()
> 
> * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
>   adustments.
> 
> * Comment and whiteline adjustments.
> 
> After the changes, we have
> 
>   list_for_each_entry(link, &cont->cset_links, cset_link) {
>   struct css_set *cset = link->cset;
> 
> instead of
> 
>   list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
>   struct css_set *cset = link->cg;
> 

Those renames really make the code more readable! I have to think for a while
every time I see those names.



Re: [scsi] WARNING: at fs/sysfs/dir.c:530 sysfs_add_one()

2013-06-12 Thread James Bottomley
On Thu, 2013-06-13 at 09:30 +0800, Fengguang Wu wrote:
> Greetings,
> 
> I got the below dmesg and the first bad commit is
> 
> commit a256ba092ec57213f96059d41ac5473ff92f5e7c
> Author: Nicholas Bellinger 
> Date:   Sat May 18 02:40:43 2013 -0700
> 
> scsi: Split scsi_dispatch_cmd into __scsi_dispatch_cmd setup
> 
> Signed-off-by: Nicholas Bellinger 

You may be evaluating a bogus branch because no such commit has darkened
the SCSI list, so it obviously can't be on upstream track.

James




Re: [PATCH] HID: fold ps3remote driver into generic Sony driver

2013-06-12 Thread David Dillow
On Sat, 2013-06-01 at 00:55 -0400, David Dillow wrote:
> On Tue, 2013-05-28 at 15:45 +0200, Jiri Kosina wrote:
> > I'd be glad if you could provide your Tested-by: for the patch below. 
> > Thanks a lot.
> 
> I tried to test this tonight, but was thwarted by Fedora 16 on the only
> readily available test platform -- old bluetoothd w/ the internal
> driver. I'll need to upgrade that box, recreate the updated bluetooth
> environment I had from some time ago, or try it on my myth frontend
> (Ubuntu 13.04). I'll get another chance on the 9th; I won't have access
> to the HW again until then. :/

Ok, finally wrestled Myth away from family long enough to test. All
looks good, sorry about the delay.

Tested-by: David Dillow 




Re: [Suggestion] arch/*/include/asm/bitops.h: about __set_bit() API.

2013-06-12 Thread Chen Gang
On 06/08/2013 06:08 PM, Chen Gang wrote:
> Several architectures have different __set_bit() API to others, in
> standard API, 2nd param of __set_bit() is 'unsigned long *', but:
> 
>   in 'min10300', it is 'unsigned char *',

Oh, it is my fault; for 'mn10300' there is no issue, it is not 'unsigned
char *' (it is a generic one which can match any type).

Also 'min10300' --> 'mn10300'.



Thanks.
-- 
Chen Gang

Asianux Corporation


Re: [PATCH 1/1] thermal: consider emul_temperature while computing trend

2013-06-12 Thread Zhang Rui
On Wed, 2013-05-29 at 17:37 -0400, Eduardo Valentin wrote:
> In case emulated temperature is in use, using the trend
> provided by driver layer can lead to bogus situation.
> In this case, debugger user would set a temperature value,
> but the trend would be from driver computation.
> 
> To avoid this situation, this patch changes the get_tz_trend()
> to consider the emulated temperature whenever that is in use.
> 
> Cc: Zhang Rui 
> Cc: Amit Daniel Kachhap 
> Cc: Durgadoss R 
> Cc: linux...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Eduardo Valentin 

patch applied.

thanks,
rui
> ---
>  drivers/thermal/thermal_core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index d755440..c00dc92 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -155,7 +155,8 @@ int get_tz_trend(struct thermal_zone_device *tz, int trip)
>  {
>   enum thermal_trend trend;
>  
> - if (!tz->ops->get_trend || tz->ops->get_trend(tz, trip, &trend)) {
> + if (tz->emul_temperature || !tz->ops->get_trend ||
> + tz->ops->get_trend(tz, trip, &trend)) {
>   if (tz->temperature > tz->last_temperature)
>   trend = THERMAL_TREND_RAISING;
>   else if (tz->temperature < tz->last_temperature)
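The fallback the patch routes emulated temperatures through is just a comparison of the current reading against the previous one. A stand-alone sketch of that logic (simplified from the get_tz_trend() body quoted above; enum names shortened, temperatures in millidegrees):

```c
#include <assert.h>

enum trend { TREND_RAISING, TREND_DROPPING, TREND_STABLE };

/* Fallback trend computation, used when the driver's get_trend is
 * absent or fails, and, after this patch, whenever an emulated
 * temperature is in use (driver trends would be bogus then). */
static enum trend fallback_trend(int temperature, int last_temperature)
{
	if (temperature > last_temperature)
		return TREND_RAISING;
	else if (temperature < last_temperature)
		return TREND_DROPPING;
	return TREND_STABLE;
}
```

Since the emulated value feeds tz->temperature, computing the trend from consecutive readings keeps it consistent with what the debugger user injected.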




Re: [PATCH buf-fix] kernel, range: fix broken mtrr_cleanup(Internet mail)

2013-06-12 Thread 廖生苗
I forgot to change final_start/final_end to start/end.  The old code adds an
empty range in the following case:

Before mtrr cleanup:
1. got initial MTRR range: 0-3G.
2. MTRR code try to merge 0-1M
3. result is 0-0, 0-3G,  total 2 range count

After cleanup:
1. got final MTRR range: 0-3G, total 1 range count

the cleanup code failed because the range counts mismatched.
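The count mismatch comes from an empty range (start == end) consuming a slot. A hypothetical simplification of the add-range path (not the actual kernel/range.c code) showing how dropping empty ranges keeps the before/after counts comparable:

```c
#include <assert.h>

struct range { unsigned long start, end; };

/* Append a range; the fix: an empty range (start >= end) must not
 * consume a slot, otherwise the entry count before cleanup (0-0 plus
 * 0-3G, count 2) never matches the count after cleanup (0-3G, count 1). */
static int add_range(struct range *r, int nr,
		     unsigned long start, unsigned long end)
{
	if (start >= end)	/* drop empty ranges */
		return nr;
	r[nr].start = start;
	r[nr].end = end;
	return nr + 1;
}
```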

Re: [PATCH 0/7] thermal: ti-soc-thermal: fixes and DRA752 support

2013-06-12 Thread Zhang Rui
On Wed, 2013-05-29 at 11:07 -0400, Eduardo Valentin wrote:
> Hello Rui,
> 
> Here is a patch set for your consideration on ti-soc-thermal driver.
> 

the whole patch set has been applied.

thanks,
rui
> This patch set is mix of:
> 
> (a) patches that were added to the staging tree post 3.10-rc1 but
> were not included after the move from staging to thermal. This is
> because your thermal tree is based of 3.10-rc1 and not staging/next
> or linux-next/master. Thus the first two patches make sure the driver
> under drivers/thermal/ti-soc-thermal has also the needed fixes.
> 
> (b) patches containing fixes found during the addition of DRA752
> chip support.
> 
> (c) patches adding the support to DRA752 chips.
> 
> Please consider these too for 3.11.
> 
> For those people interested in testing this patch set, it is based
> on thermal/next and can also be found here:
> https://git.gitorious.org/thermal-framework/thermal-framework.git 
> thermal_work/ti-soc-thermal/fixes+dra752
> 
> All best,
> 
> Eduardo Valentin (7):
>   thermal: ti-soc-thermal: update DT reference for OMAP5430
>   thermal: ti-soc-thermal: remove external heat while extrapolating
> hotspot
>   thermal: ti-soc-thermal: freeze FSM while computing trend
>   thermal: ti-soc-thermal: remove usage of IS_ERR_OR_NULL
>   thermal: ti-soc-thermal: add thermal data for DRA752 chips
>   thermal: ti-soc-thermal: add dra752 chip to device table
>   thermal: ti-soc-thermal: add DT example for DRA752 chip
> 
>  .../devicetree/bindings/thermal/ti_soc_thermal.txt |  13 +
>  drivers/staging/ti-soc-thermal/dra752-bandgap.h| 280 
>  .../staging/ti-soc-thermal/dra752-thermal-data.c   | 476 
> +
>  drivers/thermal/ti-soc-thermal/Kconfig |  12 +
>  drivers/thermal/ti-soc-thermal/Makefile|   1 +
>  drivers/thermal/ti-soc-thermal/ti-bandgap.c|  29 +-
>  drivers/thermal/ti-soc-thermal/ti-bandgap.h|   5 +
>  drivers/thermal/ti-soc-thermal/ti-thermal-common.c |  43 +-
>  drivers/thermal/ti-soc-thermal/ti-thermal.h|   6 +
>  9 files changed, 842 insertions(+), 23 deletions(-)
>  create mode 100644 drivers/staging/ti-soc-thermal/dra752-bandgap.h
>  create mode 100644 drivers/staging/ti-soc-thermal/dra752-thermal-data.c
> 




Re: [PATCH net-next 1/2] net: remove NET_LL_RX_POLL config menue

2013-06-12 Thread Eliezer Tamir

On 13/06/2013 05:01, Stephen Hemminger wrote:

On Wed, 12 Jun 2013 15:12:05 -0700 (PDT)
David Miller  wrote:


From: Eliezer Tamir 
Date: Tue, 11 Jun 2013 17:24:28 +0300


depends on X86_TSC


Wait a second, I didn't notice this before.  There needs to be a better
way to test for the accuracy you need, or if the issue is lack of a proper
API for cycle counter reading, fix that rather than add ugly arch
specific dependencies to generic networking code.


This should be sched_clock(), rather than direct TSC access.
Also any code using TSC or sched_clock has to be carefully audited to deal with
clocks running at different rates on different CPUs. Basically the value is only
meaningful on the same CPU.


OK,

If we convert to sched_clock(), would adding a define such as 
HAVE_HIGH_PRECISION_CLOCK to architectures that have both a high 
precision clock and a 64-bit cycles_t be a good solution?


(if not any other suggestion?)




[PATCH] ARM: convert max_pfn and max_low_pfn to be relative to PFN0

2013-06-12 Thread Colin Cross
On ARM max_pfn and max_low_pfn have always been relative to the
first valid PFN, apparently due to ancient kernels being unable
to properly handle physical memory at addresses other than 0.
A comment was added:
   Note: max_low_pfn and max_pfn reflect the number of _pages_ in
   the system, not the maximum PFN.
which conflicts with the comment in include/linux/bootmem.h that
says max_pfn is the highest page.  Since then, the number of users
of max_pfn has proliferated, and they all seem to assume that
max_pfn is the highest valid pfn.  The only user of max_low_pfn
as a number of pfns instead of the highest pfn is kcore, which
conflates max_low_pfn with the size of low memory, but it is not
supported on ARM.

Remove the PHYS_PFN_OFFSET subtraction from max_pfn and max_low_pfn,
and fix up the rest of ARM mm init that adds it back again.

This fixes reading page map counts and flags from /proc/kpagecount
and /proc/kpageflags, which will return a short read when reading
pfns that overlap with max_pfn, and return 0 when reading pfn
max_pfn, making it impossible to read the flags and count for pfn
max_pfn.

From code inspection, I believe this will also improve block device
performance where the bounce limit was set to BLK_BOUNCE_HIGH, which
was bouncing unnecessarily for the top PHYS_PFN_OFFSET pages of low
memory.

Signed-off-by: Colin Cross 
---
 arch/arm/mm/init.c | 21 ++---
 1 file changed, 6 insertions(+), 15 deletions(-)

Boot tested on 3.4 and filled up all of memory without any issues.

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 9a5cdc0..b4e4051 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -415,16 +415,8 @@ void __init bootmem_init(void)
 */
arm_bootmem_free(min, max_low, max_high);
 
-   /*
-* This doesn't seem to be used by the Linux memory manager any
-* more, but is used by ll_rw_block.  If we can get rid of it, we
-* also get rid of some of the stuff above as well.
-*
-* Note: max_low_pfn and max_pfn reflect the number of _pages_ in
-* the system, not the maximum PFN.
-*/
-   max_low_pfn = max_low - PHYS_PFN_OFFSET;
-   max_pfn = max_high - PHYS_PFN_OFFSET;
+   max_low_pfn = max_low;
+   max_pfn = max_high;
 }
 
 /*
@@ -530,7 +522,6 @@ static inline void free_area_high(unsigned long pfn, 
unsigned long end)
 static void __init free_highpages(void)
 {
 #ifdef CONFIG_HIGHMEM
-   unsigned long max_low = max_low_pfn + PHYS_PFN_OFFSET;
struct memblock_region *mem, *res;
 
/* set highmem page free */
@@ -539,12 +530,12 @@ static void __init free_highpages(void)
unsigned long end = memblock_region_memory_end_pfn(mem);
 
/* Ignore complete lowmem entries */
-   if (end <= max_low)
+   if (end <= max_low_pfn)
continue;
 
/* Truncate partial highmem entries */
-   if (start < max_low)
-   start = max_low;
+   if (start < max_low_pfn)
+   start = max_low_pfn;
 
/* Find and exclude any reserved regions */
for_each_memblock(reserved, res) {
@@ -591,7 +582,7 @@ void __init mem_init(void)
extern u32 itcm_end;
 #endif
 
-   max_mapnr   = pfn_to_page(max_pfn + PHYS_PFN_OFFSET) - mem_map;
+   max_mapnr   = pfn_to_page(max_pfn) - mem_map;
 
/* this will put all unused low memory onto the freelists */
free_unused_memmap(&meminfo);
-- 
1.8.3
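The semantic change can be illustrated numerically. Assuming a hypothetical PHYS_PFN_OFFSET of 0x80000 (RAM starting at 2 GiB with 4 KiB pages) and top of low memory at pfn 0xA0000, this sketch contrasts the old page-count convention with the new highest-pfn convention that generic users such as the /proc/kpageflags bounds check assume:

```c
#include <assert.h>

#define PHYS_PFN_OFFSET_EX 0x80000ul	/* hypothetical: RAM starts at 2 GiB */

/* Old ARM convention: number of pages, i.e. relative to the first pfn. */
static unsigned long old_max_pfn(unsigned long max_high)
{
	return max_high - PHYS_PFN_OFFSET_EX;
}

/* New convention (matching include/linux/bootmem.h): highest pfn bound. */
static unsigned long new_max_pfn(unsigned long max_high)
{
	return max_high;
}

/* Generic code validates pfns as pfn < max_pfn. */
static int generic_pfn_in_range(unsigned long pfn, unsigned long max_pfn)
{
	return pfn < max_pfn;
}
```

Under the old convention a perfectly valid RAM pfn such as 0x90000 fails the generic pfn < max_pfn check, which is exactly the short-read symptom the patch describes.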



Re: [PATCH V3 1/1] kvm: Add a tracepoint write_tsc_offset

2013-06-12 Thread Marcelo Tosatti
On Wed, Jun 12, 2013 at 04:43:44PM +0900, Yoshihiro YUNOMAE wrote:
> Add a tracepoint write_tsc_offset for tracing TSC offset change.
> We want to merge ftrace's trace data of guest OSs and the host OS using
> TSC for timestamp in chronological order. We need "TSC offset" values for
> each guest when merging those because the TSC value on a guest is always the
> host TSC plus guest's TSC offset. If we get the TSC offset values, we can
> calculate the host TSC value for each guest events from the TSC offset and
> the event TSC value. The host TSC values of the guest events are used when we
> want to merge trace data of guests and the host in chronological order.
> (Note: the trace_clock of both the host and the guest must be set x86-tsc in
> this case)
> 
> This tracepoint also records vcpu_id which can be used to merge trace data for
> SMP guests. A merge tool will read TSC offset for each vcpu, then the tool
> converts guest TSC values to host TSC values for each vcpu.
> 
> TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
> vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
> The latter function is executed when kvm clock is updated. Only host can read
> TSC offset value from VMCS, so a host needs to output TSC offset value
> when TSC offset is changed.
> 
> Since the TSC offset is not often changed, it could be overwritten by other
> frequent events while tracing. To avoid that, I recommend to use a special
> instance for getting this event:
> 
> 1. set an instance before booting a guest
>  # cd /sys/kernel/debug/tracing/instances
>  # mkdir tsc_offset
>  # cd tsc_offset
>  # echo x86-tsc > trace_clock
>  # echo 1 > events/kvm/kvm_write_tsc_offset/enable
> 
> 2. boot a guest
> 
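A merge tool along the lines described would invert that relation per vcpu: guest TSC = host TSC + offset, so host TSC = guest TSC - offset. A tiny sketch of the conversion step (the function name is mine, not from the patch; the offset is whatever kvm_write_tsc_offset last recorded for that vcpu):

```c
#include <stdint.h>

/* Map a guest-side TSC timestamp back onto the host TSC clock using
 * the per-vcpu offset captured by the kvm_write_tsc_offset tracepoint. */
static uint64_t guest_to_host_tsc(uint64_t guest_tsc, uint64_t tsc_offset)
{
	return guest_tsc - tsc_offset;
}
```

With each vcpu's events rebased this way, host and guest ftrace data sort into a single chronological stream on the host TSC timeline.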
> Signed-off-by: Yoshihiro YUNOMAE 
> Cc: Joerg Roedel 
> Cc: Marcelo Tosatti 
> Cc: Gleb Natapov 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> ---
>  arch/x86/kvm/svm.c   |   10 +-
>  arch/x86/kvm/trace.h |   21 +
>  arch/x86/kvm/vmx.c   |7 ++-
>  arch/x86/kvm/x86.c   |1 +
>  4 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index a14a6ea..c0bc803 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1026,7 +1026,10 @@ static void svm_write_tsc_offset(struct kvm_vcpu 
> *vcpu, u64 offset)
>   g_tsc_offset = svm->vmcb->control.tsc_offset -
>  svm->nested.hsave->control.tsc_offset;
>   svm->nested.hsave->control.tsc_offset = offset;
> - }
> + } else
> + trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> +svm->vmcb->control.tsc_offset,
> +offset);
>  
>   svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
>  
> @@ -1044,6 +1047,11 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu 
> *vcpu, s64 adjustment, bool ho
>   svm->vmcb->control.tsc_offset += adjustment;
>   if (is_guest_mode(vcpu))
>   svm->nested.hsave->control.tsc_offset += adjustment;
> + else
> + trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> +  svm->vmcb->control.tsc_offset - adjustment,
> +  svm->vmcb->control.tsc_offset);
> +
>   mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
>  }
>  
> diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> index fe5e00e..6c82cf1 100644
> --- a/arch/x86/kvm/trace.h
> +++ b/arch/x86/kvm/trace.h
> @@ -815,6 +815,27 @@ TRACE_EVENT(kvm_track_tsc,
> __print_symbolic(__entry->host_clock, host_clocks))
>  );
>  
> +TRACE_EVENT(kvm_write_tsc_offset,
> + TP_PROTO(unsigned int vcpu_id, __u64 previous_tsc_offset,
> +  __u64 next_tsc_offset),
> + TP_ARGS(vcpu_id, previous_tsc_offset, next_tsc_offset),
> +
> + TP_STRUCT__entry(
> + __field( unsigned int,  vcpu_id )
> + __field(__u64,  previous_tsc_offset )
> + __field(__u64,  next_tsc_offset )
> + ),
> +
> + TP_fast_assign(
> + __entry->vcpu_id= vcpu_id;
> + __entry->previous_tsc_offset= previous_tsc_offset;
> + __entry->next_tsc_offset= next_tsc_offset;
> + ),
> +
> + TP_printk("vcpu=%u prev=%llu next=%llu", __entry->vcpu_id,
> +   __entry->previous_tsc_offset, __entry->next_tsc_offset)
> +);
> +
>  #endif /* CONFIG_X86_64 */
>  
>  #endif /* _TRACE_KVM_H */
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 25a791e..eb11856 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2096,6 +2096,8 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, 
> u64 offset)
>   (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
>vmcs12->tsc_offset : 0));
>   } else {
> + 

Re: [PATCH] kernel/softirq.c: delete 'while' looping to improve a little performance and beautify code

2013-06-12 Thread Chen Gang
On 06/10/2013 10:15 PM, Thomas Gleixner wrote:
> On Sun, 9 Jun 2013, Chen Gang wrote:
>> > After finish the internal 'while', need not test TASKLET_STATE_SCHED
>> > again, so looping back to outside 'while' is only for set_bit().
>> > 
>> > When use 'if' and set_bit() instead of 'while', it will save at least
>> > one running conditional instruction, and also will be clearer for readers
>> > (although the binary size will be a little bigger).
> And by doing that you break the atomicity of test_and_set_bit. There
> is a good reason why this is an atomic operation and why the code is
> written as is.
>  

OK, thanks. It is my fault (and also sorry for replying late).

>> > The related patch is "1da177e Linux-2.6.12-rc2"
> How is that patch related to the problem?

Because of the modifications since Linux-2.6.12-rc2.


Thanks.
-- 
Chen Gang

Asianux Corporation

