Re: [GIT PULL] percpu changes for v3.8

2012-12-11 Thread Tejun Heo
On Tue, Dec 11, 2012 at 06:24:53AM -0800, Tejun Heo wrote:
> Hello, Linus.
> 
> Percpu changes for v3.8.  Nothing exciting here either.  Joonsoo's is
> almost cosmetic.  Cyrill's patch fixes "percpu_alloc" early kernel
> param handling so that the kernel doesn't crash when the parameter is
> specified w/o any argument.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next

Oops, for-next and for-3.8 point to the same commit, so either should
work, but this should have been:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-3.8

Thanks.

-- 
tejun


Re: [PATCH 00/49] Automatic NUMA Balancing v10

2012-12-11 Thread Mel Gorman
On Tue, Dec 11, 2012 at 10:18:07AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > > This is a prototype only, but it is what I was using as a reference 
> > > to see whether I could spot a problem in yours. It has not even been 
> > > boot tested but avoids remote->remote copies, contending on 
> > > PTL or holding it longer than necessary (should anyway)
> > 
> > So ... because time is running out and it would be nice to 
> > progress with this for v3.8, I'd suggest the following 
> > approach:
> > 
> >  - Please send your current tree to Linus as-is. You already 
> >have my Acked-by/Reviewed-by for its scheduler bits, and my
> >testing found your tree to have no regression to mainline,
> >plus it's a nice win in a number of NUMA-intense workloads.
> >So it's a good, monotonic step forward in terms of NUMA
> >balancing, very close to what the bits I'm working on need as
> >infrastructure.
> > 
> >  - I'll rebase all my devel bits on top of it. Instead of
> >removing the migration bandwidth I'll simply increase it for
> >testing - this should trigger similarly aggressive behavior.
> >I'll try to touch as little of the mm/ code as possible, to
> >keep things debuggable.
> 
> One minor last-minute request/nit before you send it to Linus, 
> would you mind doing a:
> 
>CONFIG_BALANCE_NUMA => CONFIG_NUMA_BALANCING
> 
> rename please? (I can do it for you if you don't have the time.)
> 
> CONFIG_NUMA_BALANCING is really what fits into our existing NUMA 
> namespace, CONFIG_NUMA, CONFIG_NUMA_EMU - and, more importantly, 
> the ordering of words follows the common generic -> less generic 
> ordering we do in the kernel for config names and methods.
> 
> So it would fit nicely into existing Kconfig naming schemes:
> 
>CONFIG_TRACING
>CONFIG_FILE_LOCKING
>CONFIG_EVENT_TRACING
> 
> etc.
> 

Yes, that makes sense. I should have spotted the rationale. I also took
the liberty of renaming the command-line parameter and the variables to
be consistent with this.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/4 v2] gpio/mvebu: convert to use irq_domain_add_simple()

2012-12-11 Thread Thomas Petazzoni
Dear Linus Walleij,

On Fri, 19 Oct 2012 12:54:02 +0200, Linus Walleij wrote:
> The MVEBU driver probably just wants a few IRQs. Using the simple
> domain has the upside of allocating IRQ descriptors if need be,
> especially in a SPARSE_IRQ environment.

Unfortunately, this creates the following warning at boot time for each
GPIO bank:

------------[ cut here ]------------
WARNING: at /home/thomas/projets/linux-2.6/kernel/irq/irqdomain.c:181 
irq_domain_add_simple+0x8c/0x9c()
Cannot allocate irq_descs @ IRQ33, assuming pre-allocated
Modules linked in:
[] (unwind_backtrace+0x0/0xf8) from [] 
(warn_slowpath_common+0x4c/0x64)
[] (warn_slowpath_common+0x4c/0x64) from [] 
(warn_slowpath_fmt+0x30/0x40)
[] (warn_slowpath_fmt+0x30/0x40) from [] 
(irq_domain_add_simple+0x8c/0x9c)
[] (irq_domain_add_simple+0x8c/0x9c) from [] 
(mvebu_gpio_probe+0x3cc/0x49c)
[] (mvebu_gpio_probe+0x3cc/0x49c) from [] 
(platform_drv_probe+0x18/0x1c)
[] (platform_drv_probe+0x18/0x1c) from [] 
(driver_probe_device+0x70/0x1f0)
[] (driver_probe_device+0x70/0x1f0) from [] 
(bus_for_each_drv+0x5c/0x88)
[] (bus_for_each_drv+0x5c/0x88) from [] 
(device_attach+0x74/0x80)
[] (device_attach+0x74/0x80) from [] 
(bus_probe_device+0x84/0xa8)
[] (bus_probe_device+0x84/0xa8) from [] 
(device_add+0x4ac/0x57c)
[] (device_add+0x4ac/0x57c) from [] 
(of_platform_device_create_pdata+0x5c/0x80)
[] (of_platform_device_create_pdata+0x5c/0x80) from [] 
(of_platform_bus_create+0xcc/0x270)
[] (of_platform_bus_create+0xcc/0x270) from [] 
(of_platform_bus_create+0x12c/0x270)
[] (of_platform_bus_create+0x12c/0x270) from [] 
(of_platform_populate+0x60/0x98)
[] (of_platform_populate+0x60/0x98) from [] 
(armada_370_xp_dt_init+0x18/0x24)
[] (armada_370_xp_dt_init+0x18/0x24) from [] 
(customize_machine+0x1c/0x28)
[] (customize_machine+0x1c/0x28) from [] 
(do_one_initcall+0x34/0x174)
[] (do_one_initcall+0x34/0x174) from [] 
(kernel_init+0x100/0x2b4)
[] (kernel_init+0x100/0x2b4) from [] 
(ret_from_fork+0x14/0x3c)
---[ end trace 1b75b31a2719ed1c ]---

Of course, the fix should be to remove the irq_alloc_descs() call from the
driver prior to calling irq_domain_add_simple(). But the thing is that
our gpio-mvebu driver uses the
irq_alloc_generic_chip()/irq_setup_generic_chip() infrastructure, which
apparently requires a legacy IRQ domain (it needs the base IRQ number).

Or maybe there's a proper fix I'm missing?
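
To illustrate the conflict, a rough sketch (paraphrased from memory, not the
actual gpio-mvebu code -- names and call order are assumptions):

	/* current pattern: pre-allocate descs, the generic chip needs the base */
	mvchip->irqbase = irq_alloc_descs(-1, 0, ngpios, -1);
	gc = irq_alloc_generic_chip("mvebu_gpio_irq", 2, mvchip->irqbase,
				    mvchip->membase, handle_level_irq);

	/* irq_domain_add_simple() then tries to allocate the same descs again
	 * on a SPARSE_IRQ kernel, which triggers the warning above */
	domain = irq_domain_add_simple(np, ngpios, mvchip->irqbase,
				       &irq_domain_simple_ops, mvchip);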

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread J. Bruce Fields
On Tue, Dec 11, 2012 at 07:07:00PM +0400, Stanislav Kinsbursky wrote:
> 11.12.2012 18:56, J. Bruce Fields пишет:
> >On Tue, Dec 11, 2012 at 06:12:40PM +0400, Stanislav Kinsbursky wrote:
> >>
> >>11.12.2012 18:00, Stanislav Kinsbursky пишет:
> >>>11.12.2012 00:28, J. Bruce Fields пишет:
> On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:
> >NFSd does lookups. Lookups are done starting from current->fs->root.
> >NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
> >unshared) root.
> >So we have to swap the root to the one of the process that started NFSd,
> >because that process can be in a container with its own root.
> 
> This doesn't sound right to me.
> 
> Which lookups exactly do you see being done relative to
> current->fs->root ?
> 
> >>>
> >>>Ok, you are right. I was mistaken here.
> >>>This is not exactly a lookup, but a d_path() problem in svc_export_request().
> >>>I.e. without root swapping, d_path() will give not the local export path
> >>>(like "/export") but something like "/root/containers_root/export".
> >>>
> >>
> >>We can actually do it in a less aggressive way.
> >>I.e. instead of the root swap and the current svc_export_request() implementation:
> >>
> >>void svc_export_request(...)
> >>{
> >>
> >> pth = d_path(&exp->ex_path, *bpp, *blen);
> >>
> >>}
> >>
> >>we can do something like this:
> >>
> >>void svc_export_request(...)
> >>{
> >>struct nfsd_net *nn = ...
> >>
> >>spin_lock(&dcache_lock);
> >> pth = __d_path(&exp->ex_path, &nn->root, *bpp, *blen);
> >>spin_unlock(&dcache_lock);
> >>
> >>}
> >
> >That looks simpler, but I still don't understand why we need it.
> >
> >I'm confused about how d_path works; I would have thought that
> >filesystem namespaces would have their own vfsmount trees and hence that
> >the (vfsmount, dentry) would be enough to specify the path.  Is the root
> >argument for the case of chroot?  Do we care about that?
> >
> 
> It works very simply: it just traverses the tree from the specified dentry
> up to current->fs->root.dentry.
> Having the container in some fully separated mount point is great, of course. But:
> 1) this is a limitation we really want to avoid; i.e. the container can be
> chrooted into some path like "/root/containers_root/" as in the example above.
> 2) the NFSd kthread works in the init root environment, but we still want the
> proper path string in the container's root, not in the kthread's root.
> 
> >Also, svc_export_request is called from mountd's read of
> >/proc/net/rpc/nfsd.export/channel.  If mountd's root is wrong, then
> >nothing's going to work anyway.
> >
> 
> I don't really understand how mountd's root can be wrong. I.e.
> it's always right as I see it. NFSd kthreads have to swap/use a
> relative path/whatever to communicate with the proper mountd.
> Or am I missing something?

Ugh, I see the problem: I thought svc_export_request was called at the
time mountd does the read, but instead it's done at the time nfsd does
the upcall.

I suspect that's wrong, and we really want this done in the context of
the mountd process when it does the read call.  If d_path is called
there then we have no problem.

--b.


Re: performance drop after using blkcg

2012-12-11 Thread Tejun Heo
Hello, Vivek.

On Tue, Dec 11, 2012 at 10:02:34AM -0500, Vivek Goyal wrote:
> cfq_group_served() {
> 	if (iops_mode(cfqd))
> 		charge = cfqq->slice_dispatch;
> 	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
> }
> 
> Isn't it effectively IOPS scheduling? One should get an IOPS rate in
> proportion to their weight (as long as they can throw enough traffic at the
> device to keep it busy). If not, can you please give more details about
> your proposal.

The problem is that we lose a lot of isolation w/o idling between
queues or groups.  This is because we switch between slices, and while
a slice is in progress only IOs belonging to that slice can be issued.
I.e. higher priority cfqgs / cfqqs, after dispatching the IOs they have
ready, lose their slice immediately.  A lower priority slice takes over,
and when the higher priority ones get ready, they have to wait for the
lower priority one before submitting the new IOs.  In many cases, they
end up not being able to generate IOs any faster than the ones in
lower priority cfqqs/cfqgs.

This is because we switch slices rather than iops.  We can make cfq
essentially switch iops by implementing very aggressive preemption but
I really don't see much point in that.  cfq is way too heavy and
ill-suited for high speed non-rot devices which are becoming more and
more consistent in terms of iops they can handle.

I think we need something better suited for the maturing non-rot
devices.  They're becoming very different from what cfq was built for
and we really shouldn't be maintaining several rb trees which need
full synchronization for each IO.  We're doing way too much and it
just isn't scalable.

Thanks.

-- 
tejun


[PATCH] f2fs: fix up f2fs_get_parent issue to retrieve correct parent inode number

2012-12-11 Thread Namjae Jeon
From: Namjae Jeon 

Test Case:
[NFS Client]
ls -lR .

[NFS Server]
while [ 1 ]
do
echo 3 > /proc/sys/vm/drop_caches
done

Error on NFS Client: "No such file or directory"

When the cache is dropped at the server, lookups fail at the NFS client
because the entry is no longer connected with its parent. The default path
initiates a lookup by calculating the hash value for the name, even though
the hash values stored on disk for "." and ".." are maintained as zero,
which makes find_in_block fail due to mismatching hash values.
Fix this by using the correct hash values for these entries when
initiating the lookup request.

Signed-off-by: Namjae Jeon 
Signed-off-by: Amit Sahrawat 
---
 fs/f2fs/dir.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index b4e24f3..6cf39db 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -184,7 +184,7 @@ struct f2fs_dir_entry *f2fs_find_entry(struct inode *dir,
int namelen = child->len;
unsigned long npages = dir_blocks(dir);
struct f2fs_dir_entry *de = NULL;
-   f2fs_hash_t name_hash;
+   f2fs_hash_t name_hash = 0;
unsigned int max_depth;
unsigned int level;
 
@@ -193,7 +193,8 @@ struct f2fs_dir_entry *f2fs_find_entry(struct inode *dir,
 
*res_page = NULL;
 
-   name_hash = f2fs_dentry_hash(name, namelen);
+   if (strcmp(name, ".") && (strcmp(name, "..")))
+   name_hash = f2fs_dentry_hash(name, namelen);
max_depth = F2FS_I(dir)->i_current_depth;
 
for (level = 0; level < max_depth; level++) {
-- 
1.7.9.5



Re: [PATCH 5/6] ACPI: Replace struct acpi_bus_ops with enum type

2012-12-11 Thread Jiang Liu
Hi Rafael,
I have worked out a patch set to clean up ACPI/PCI related notifications,
please refer to
http://www.spinics.net/lists/linux-pci/msg17822.html
The patchset doesn't apply cleanly to Bjorn's latest pci-next tree. I will
help to rebase it if needed.
Regards!
Gerry

On 12/11/2012 10:26 AM, Yinghai Lu wrote:
> On Mon, Dec 10, 2012 at 5:28 PM, Rafael J. Wysocki  wrote:
>>>
>>> OK, thanks for the pointers.  I actually see more differences between our
>>> patchsets.  For one example, you seem to have left the parent->ops.bind()
>>> stuff in acpi_add_single_object() which calls it even drivers_autoprobe is
>>> set.
>>
>> Sorry, that should have been "which calls it even when drivers_autoprobe is
>> not set".  I need to be more careful.
>>
> 
> oh,  Jiang Liu had one patch to remove that workaround.
> 
> http://git.kernel.org/?p=linux/kernel/git/yinghai/linux-yinghai.git;a=commitdiff;h=b40dba80c2b8395570d8357e6b3f417c27c84504
> 
> ACPI/pci-bind: remove bind/unbind callbacks from acpi_device_ops
> 
> Maybe you can review those patches in my for-pci-next2...
> those are ACPI related anyway.
> 
> those patches have been there for a while, and Bjorn did not have time
> to digest them.
> 
> or you prefer I resend updated version as huge whole patchset?
> 
> Thanks
> 
> Yinghai



Re: WARNING: at drivers/gpu/drm/i915/i915_gem.c:3437 i915_gem_object_pin+0x151/0x1a0()

2012-12-11 Thread Daniel Vetter
On Tue, Dec 11, 2012 at 1:31 PM, Nikola Pajkovsky  wrote:
> looks like we still have some oops in i915. i915 maintainers, do you have
> any idea what's going on? I will try to trigger that oops later today
> and provide more information.

The infamous pin leak. Should be fixed with

commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson 
Date:   Thu Nov 1 09:26:26 2012 +

drm/i915: Flush outstanding unpin tasks before pageflipping

available from the drm-next (or linux-next) trees. If it holds up once
merged, we'll backport it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Stanislav Kinsbursky

11.12.2012 18:56, J. Bruce Fields пишет:

On Tue, Dec 11, 2012 at 06:12:40PM +0400, Stanislav Kinsbursky wrote:


11.12.2012 18:00, Stanislav Kinsbursky пишет:

11.12.2012 00:28, J. Bruce Fields пишет:

On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:

NFSd does lookups. Lookups are done starting from current->fs->root.
NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
unshared) root.
So we have to swap the root to the one of the process that started NFSd,
because that process can be in a container with its own root.


This doesn't sound right to me.

Which lookups exactly do you see being done relative to
current->fs->root ?



Ok, you are right. I was mistaken here.
This is not exactly a lookup, but a d_path() problem in svc_export_request().
I.e. without root swapping, d_path() will give not the local export path
(like "/export") but something like "/root/containers_root/export".



We can actually do it in a less aggressive way.
I.e. instead of the root swap and the current svc_export_request() implementation:

void svc_export_request(...)
{

 pth = d_path(&exp->ex_path, *bpp, *blen);

}

we can do something like this:

void svc_export_request(...)
{
struct nfsd_net *nn = ...

spin_lock(&dcache_lock);
 pth = __d_path(&exp->ex_path, &nn->root, *bpp, *blen);
spin_unlock(&dcache_lock);

}


That looks simpler, but I still don't understand why we need it.

I'm confused about how d_path works; I would have thought that
filesystem namespaces would have their own vfsmount trees and hence that
the (vfsmount, dentry) would be enough to specify the path.  Is the root
argument for the case of chroot?  Do we care about that?



It works very simply: it just traverses the tree from the specified dentry up to
current->fs->root.dentry.
Having the container in some fully separated mount point is great, of course. But:
1) this is a limitation we really want to avoid; i.e. the container can be chrooted
into some path like "/root/containers_root/" as in the example above.
2) the NFSd kthread works in the init root environment, but we still want the
proper path string in the container's root, not in the kthread's root.


Also, svc_export_request is called from mountd's read of
/proc/net/rpc/nfsd.export/channel.  If mountd's root is wrong, then
nothing's going to work anyway.



I don't really understand how mountd's root can be wrong. I.e. it's always right
as I see it. NFSd kthreads have to swap/use a relative path/whatever to
communicate with the proper mountd.

Or am I missing something?


--b.




--
Best regards,
Stanislav Kinsbursky


null dereference at r100_debugfs_cp_ring_info+0x115/0x140

2012-12-11 Thread Dave Jones
(Taint comes from previous r600 bug reported here 
https://lkml.org/lkml/2012/12/8/131)

[35662.070628] BUG: unable to handle kernel NULL pointer dereference at (null)
[35662.071719] IP: [] r100_debugfs_cp_ring_info+0x115/0x140
[35662.072652] PGD b4c17067 PUD b69d1067 PMD 0 
[35662.073243] Oops:  [#1] PREEMPT SMP 
[35662.073809] Modules linked in: nfnetlink ipt_ULOG binfmt_misc sctp libcrc32c 
scsi_transport_iscsi nfc caif_socket caif phonet bluetooth rfkill can llc2 
pppoe pppox ppp_generic slhc irda crc_ccitt rds af_key decnet rose x25 atm 
netrom appletalk ipx p8023 psnap p8022 llc ax25 nfsv3 nfs_acl nfs fscache lockd 
sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack 
ip6table_filter ip6_tables snd_hda_codec_realtek microcode snd_hda_intel 
snd_hda_codec usb_debug serio_raw pcspkr snd_pcm snd_page_alloc edac_core 
snd_timer snd i2c_piix4 soundcore r8169 mii vhost_net tun macvtap macvlan 
kvm_amd kvm
[35662.082589] CPU 0 
[35662.082852] Pid: 28200, comm: trinity-child1 Tainted: GW
3.7.0-rc8+ #10 Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
[35662.084465] RIP: 0010:[]  [] 
r100_debugfs_cp_ring_info+0x115/0x140
[35662.085656] RSP: 0018:8800b6b27e58  EFLAGS: 00010202
[35662.086343] RAX:  RBX: 0001 RCX: 
[35662.087252] RDX:  RSI: 81a504ee RDI: 8800af0936c0
[35662.088163] RBP: 8800b6b27e88 R08: 1000 R09: fffe
[35662.089071] R10:  R11: 000f R12: 8800af0936c0
[35662.089980] R13:  R14:  R15: 88012444c000
[35662.090891] FS:  7f1baff14740() GS:88012ae0() 
knlGS:
[35662.091918] CS:  0010 DS:  ES:  CR0: 80050033
[35662.092656] CR2:  CR3: b6be9000 CR4: 07f0
[35662.093567] DR0:  DR1:  DR2: 
[35662.094475] DR3:  DR6: 0ff0 DR7: 0400
[35662.095383] Process trinity-child1 (pid: 28200, threadinfo 8800b6b26000, 
task 8800b05a48a0)
[35662.096525] Stack:
[35662.096782]   8800b2b0f180  
8800b6b27f50
[35662.097768]  0001 8800af0936c0 8800b6b27ef8 
811ddbdc
[35662.098826]  8800b6b27ec8 00da2c90 8800af0936f8 
0001
[35662.099882] Call Trace:
[35662.100221]  [] seq_read+0xcc/0x450
[35662.100884]  [] vfs_read+0xac/0x180
[35662.101545]  [] sys_read+0x55/0xa0
[35662.102195]  [] system_call_fastpath+0x16/0x1b
[35662.102969] Code: 1f 80 00 00 00 00 41 8d 14 1e 41 23 97 3c 13 00 00 49 8b 
87 e0 12 00 00 48 c7 c6 ee 04 a5 81 4c 89 e7 48 ff c3 89 d1 48 8d 04 88 <8b> 08 
31 c0 e8 b2 71 d6 ff 41 39 dd 73 cd 48 83 c4 08 31 c0 5b 
[35662.106677] RIP  [] r100_debugfs_cp_ring_info+0x115/0x140
[35662.107602]  RSP 
[35662.108065] CR2: 
[35662.108837] ---[ end trace 77a9a4397cec5a9d ]---
(09:57:30:davej@demonseed:~)$ 



[PATCH] dma_buf: Cleanup dma_buf_fd

2012-12-11 Thread Borislav Petkov
Remove redundant 'error' variable.

Signed-off-by: Borislav Petkov 
---
 drivers/base/dma-buf.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 460e22dee36d..a384b63be757 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -134,15 +134,14 @@ EXPORT_SYMBOL_GPL(dma_buf_export);
  */
 int dma_buf_fd(struct dma_buf *dmabuf, int flags)
 {
-   int error, fd;
+   int fd;
 
if (!dmabuf || !dmabuf->file)
return -EINVAL;
 
-   error = get_unused_fd_flags(flags);
-   if (error < 0)
-   return error;
-   fd = error;
+   fd = get_unused_fd_flags(flags);
+   if (fd < 0)
+   return fd;
 
fd_install(fd, dmabuf->file);
 
-- 
1.8.0



Re: performance drop after using blkcg

2012-12-11 Thread Vivek Goyal
On Tue, Dec 11, 2012 at 06:47:18AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Dec 11, 2012 at 09:43:36AM -0500, Vivek Goyal wrote:
> > I think if one sets slice_idle=0 and group_idle=0 in CFQ, for all practical
> > purposes it should become an IOPS-based group scheduling.
> 
> No, I don't think it is.  You can't achieve isolation without idling
> between group switches.  We're measuring slices in terms of iops but
> what cfq actually schedules are still time slices, not IOs.

I think I have not been able to understand your proposal. Can you explain
a bit more.

This is what CFQ does in iops_mode(). It will calculate the number of
requests dispatched from a group and scale that number based on weight
and put the group back on service tree. So if you have not got your
fair share in terms of number of requests dispatched to the device,
you will be put ahead in the queue and given a chance to dispatch 
requests first. 

Now couple of things.

- There is no idling here. If device is asking for more requests (deep
  queue depth) then this group will be removed from service tree and
  CFQ will move on to serve other queued group. So if there is a dependent
  reader it will lose its share.

  If we try to idle here, then we have solved nothing in terms of
  performance problems.  Device is faster but your workload can't cope
  with it so you are artificially slowing down the device.

- But if all the contending workloads/groups are throwing enough IO
  traffic at the device and don't get expired, then they should be able
  to dispatch a number of requests to the device in proportion to their weight.

So this is effectively trying to keep track of the number of requests
dispatched from the group instead of the time slice consumed by the group,
and then doing the scheduling.

cfq_group_served() {
	if (iops_mode(cfqd))
		charge = cfqq->slice_dispatch;
	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
}
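
For context, cfq_scale_slice() converts that charge into vdisktime by scaling
it with the group's weight -- a sketch from memory of the 3.7-era
block/cfq-iosched.c, so treat the exact constants and helpers as unverified:

	static inline u64 cfq_scale_slice(unsigned long delta, struct cfq_group *cfqg)
	{
		u64 d = delta << CFQ_SERVICE_SHIFT;

		d = d * CFQ_WEIGHT_DEFAULT;
		do_div(d, cfqg->weight);	/* higher weight => vdisktime grows slower */
		return d;
	}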

Isn't it effectively IOPS scheduling? One should get an IOPS rate in proportion to
their weight (as long as they can throw enough traffic at the device to keep
it busy). If not, can you please give more details about your proposal.

Thanks
Vivek


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Al Viro
On Tue, Dec 11, 2012 at 09:56:21AM -0500, J. Bruce Fields wrote:

> That looks simpler, but I still don't understand why we need it.
> 
> I'm confused about how d_path works; I would have thought that
> filesystem namespaces would have their own vfsmount trees and hence that
> the (vfsmount, dentry) would be enough to specify the path.  Is the root
> argument for the case of chroot?  Do we care about that?

__d_path() gives the relative pathname from here to there.  Whether (and what for)
it is wanted in the case of the nfsd patches is a separate question...
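
For readers following the thread, the prototype in question (as found in
fs/dcache.c around this time; worth double-checking against the tree) is:

	/* relative path from @root to @path; returns NULL when @path is
	 * not reachable from @root */
	char *__d_path(const struct path *path, const struct path *root,
		       char *buf, int buflen);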

> Also, svc_export_request is called from mountd's read of
> /proc/net/rpc/nfsd.export/channel.  If mountd's root is wrong, then
> nothing's going to work anyway.


Re: [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Stanislav Kinsbursky

11.12.2012 18:54, Al Viro пишет:

On Tue, Dec 11, 2012 at 06:00:00PM +0400, Stanislav Kinsbursky wrote:

11.12.2012 00:28, J. Bruce Fields ??:

On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:

NFSd does lookups. Lookups are done starting from current->fs->root.
NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
unshared) root.
So we have to swap the root to the one of the process that started NFSd,
because that process can be in a container with its own root.


This doesn't sound right to me.

Which lookups exactly do you see being done relative to
current->fs->root ?



Ok, you are right. I was mistaken here.
This is not exactly a lookup, but a d_path() problem in svc_export_request().
I.e. without root swapping, d_path() will give not the local export path
(like "/export") but something like "/root/containers_root/export".


Now, *that* is a different story (and makes some sense).  Take a look
at __d_path(), please.  You don't need to set ->fs->root to get a d_path()
equivalent relative to a given point.



Thanks, Al.
But __d_path() is not exported, and this code is called from the NFSd module.
Would it be acceptable to you to export __d_path()?

--
Best regards,
Stanislav Kinsbursky


Re: signed size_t ?

2012-12-11 Thread Måns Rullgård
Christoph Lameter  writes:

> On Tue, 11 Dec 2012, kbuild test robot wrote:
>
>> mm/slab_common.c: In function 'create_boot_cache':
>> mm/slab_common.c:219:6: warning: format '%zd' expects argument of type 
>> 'signed size_t', but argument 3 has type 'size_t' [-Wformat]
>
> We already changed that once from %td to %zd so that the size_t works
> correctly.

If the argument is correctly of type size_t, the format should be '%zu'
since size_t is unsigned.

> Does a signed size_t make any sense?

A signed type corresponding to size_t sometimes makes sense and it's
called ssize_t.
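
A minimal userspace example of the matching format specifiers (illustrative
only, not from the kernel sources):

	#include <stdio.h>
	#include <sys/types.h>	/* ssize_t */

	int main(void)
	{
		size_t n = sizeof(int);	/* unsigned: print with %zu */
		ssize_t d = -1;		/* signed counterpart: %zd */

		printf("%zu %zd\n", n, d);
		return 0;
	}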

-- 
Måns Rullgård
m...@mansr.com


Re: [tip:x86/microcode] x86/microcode_intel_early.c: Early update ucode on Intel's CPU

2012-12-11 Thread Borislav Petkov
On Mon, Dec 10, 2012 at 11:07:38PM -0800, Yinghai Lu wrote:
> BTW, do we really need to update microcode so early?

Yes we do. Normally ucode gets applied by the BIOS - this early approach
is for those cases where OEMs don't release new BIOS anymore but we
still need to apply a ucode patch as early as possible.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread J. Bruce Fields
On Tue, Dec 11, 2012 at 06:12:40PM +0400, Stanislav Kinsbursky wrote:
> 
> 11.12.2012 18:00, Stanislav Kinsbursky пишет:
> >11.12.2012 00:28, J. Bruce Fields пишет:
> >>On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:
> >>>NFSd does lookups. Lookups are done starting from current->fs->root.
> >>>NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
> >>>unshared) root.
> >>>So we have to swap the root to the one of the process that started NFSd,
> >>>because that process can be in a container with its own root.
> >>
> >>This doesn't sound right to me.
> >>
> >>Which lookups exactly do you see being done relative to
> >>current->fs->root ?
> >>
> >
> >Ok, you are right. I was mistaken here.
> >This is not exactly a lookup, but a d_path() problem in svc_export_request().
> >I.e. without root swapping, d_path() will give not the local export path
> >(like "/export") but something like "/root/containers_root/export".
> >
> 
> We can actually do it in a less aggressive way.
> I.e. instead of the root swap and the current svc_export_request() implementation:
> 
> void svc_export_request(...)
> {
>   
> pth = d_path(&exp->ex_path, *bpp, *blen);
>   
> }
> 
> we can do something like this:
> 
> void svc_export_request(...)
> {
>   struct nfsd_net *nn = ...
>   
>   spin_lock(&dcache_lock);
> pth = __d_path(&exp->ex_path, &nn->root, *bpp, *blen);
>   spin_unlock(&dcache_lock);
>   
> }

That looks simpler, but I still don't understand why we need it.

I'm confused about how d_path works; I would have thought that
filesystem namespaces would have their own vfsmount trees and hence that
the (vfsmount, dentry) would be enough to specify the path.  Is the root
argument for the case of chroot?  Do we care about that?

Also, svc_export_request is called from mountd's read of
/proc/net/rpc/nfsd.export/channel.  If mountd's root is wrong, then
nothing's going to work anyway.

--b.


Re: [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Al Viro
On Tue, Dec 11, 2012 at 06:00:00PM +0400, Stanislav Kinsbursky wrote:
> 11.12.2012 00:28, J. Bruce Fields ??:
> >On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:
> >>NFSd does lookups. Lookups are done starting from current->fs->root.
> >>NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
> >>unshared) root.
> >>So we have to swap the root to the one of the process that started NFSd,
> >>because that process can be in a container with its own root.
> >
> >This doesn't sound right to me.
> >
> >Which lookups exactly do you see being done relative to
> >current->fs->root ?
> >
> 
> Ok, you are right. I was mistaken here.
> This is not exactly a lookup, but a d_path() problem in svc_export_request().
> I.e. without root swapping, d_path() will give not the local export path
> (like "/export") but something like "/root/containers_root/export".

Now, *that* is a different story (and makes some sense).  Take a look
at __d_path(), please.  You don't need to set ->fs->root to get a d_path()
equivalent relative to a given point.


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Stanislav Kinsbursky

11.12.2012 18:12, Stanislav Kinsbursky пишет:

11.12.2012 18:00, Stanislav Kinsbursky пишет:

11.12.2012 00:28, J. Bruce Fields пишет:

On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:

NFSd does lookups. Lookups are done starting from current->fs->root.
NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
unshared) root.
So we have to swap the root to the one of the process that started NFSd,
because that process can be in a container with its own root.


This doesn't sound right to me.

Which lookups exactly do you see being done relative to
current->fs->root ?



Ok, you are right. I was mistaken here.
This is not exactly a lookup, but a d_path() problem in svc_export_request().
I.e. without root swapping, d_path() will give not the local export path
(like "/export") but something like "/root/containers_root/export".



We can actually do it in a less aggressive way.
I.e. instead of the root swap and the current svc_export_request() implementation:

void svc_export_request(...)
{
 
 pth = d_path(&exp->ex_path, *bpp, *blen);
 
}

we can do something like this:

void svc_export_request(...)
{
 struct nfsd_net *nn = ...
 
 spin_lock(&dcache_lock);
 pth = __d_path(&exp->ex_path, &nn->root, *bpp, *blen);
 spin_unlock(&dcache_lock);
 
}



No, this won't work. Sorry for the noise.


--b.



Signed-off-by: Stanislav Kinsbursky 
---
  fs/nfsd/netns.h  |1 +
  fs/nfsd/nfssvc.c |   33 -
  2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index abfc97c..5c423c6 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -101,6 +101,7 @@ struct nfsd_net {
 	struct timeval nfssvc_boot;
 
 	struct svc_serv *nfsd_serv;
+	struct path root;
 };
 
 extern int nfsd_net_id;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cee62ab..177bb60 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -392,6 +392,7 @@ int nfsd_create_serv(struct net *net)
 
 	set_max_drc();
 	do_gettimeofday(&nn->nfssvc_boot);	/* record boot time */
+	get_fs_root(current->fs, &nn->root);
 	return 0;
 }
 
@@ -426,8 +427,10 @@ void nfsd_destroy(struct net *net)
 	if (destroy)
 		svc_shutdown_net(nn->nfsd_serv, net);
 	svc_destroy(nn->nfsd_serv);
-	if (destroy)
+	if (destroy) {
+		path_put(&nn->root);
 		nn->nfsd_serv = NULL;
+	}
 }
 
 int nfsd_set_nrthreads(int n, int *nthreads, struct net *net)
@@ -533,6 +536,25 @@ out:
 	return error;
 }
 
+/*
+ * This function is actually slightly modified set_fs_root()
+ */
+static void nfsd_swap_root(struct net *net)
+{
+	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+	struct fs_struct *fs = current->fs;
+	struct path old_root;
+
+	path_get(&nn->root);
+	spin_lock(&fs->lock);
+	write_seqcount_begin(&fs->seq);
+	old_root = fs->root;
+	fs->root = nn->root;
+	write_seqcount_end(&fs->seq);
+	spin_unlock(&fs->lock);
+	if (old_root.dentry)
+		path_put(&old_root);
+}
 
 /*
  * This is the NFS server kernel thread
@@ -559,6 +581,15 @@ nfsd(void *vrqstp)
 	current->fs->umask = 0;
 
 	/*
+	 * We have to swap NFSd kthread's fs->root.
+	 * Why so? Because NFSd can be started in container, which has it's own
+	 * root.
+	 * And so what? NFSd lookup files, and lookup start from
+	 * current->fs->root.
+	 */
+	nfsd_swap_root(net);
+
+	/*
 	 * thread is spawned with all signals set to SIG_IGN, re-enable
 	 * the ones that will bring down the thread
 	 */










--
Best regards,
Stanislav Kinsbursky


Re: [PATCH v2 04/10] net: smc911x: use io{read,write}*_rep accessors

2012-12-11 Thread Will Deacon
Hi David,

On Mon, Dec 10, 2012 at 08:47:05PM +, David Miller wrote:
> From: Will Deacon 
> Date: Mon, 10 Dec 2012 19:12:36 +
> 
> > From: Matthew Leach 
> > 
> > The {read,write}s{b,w,l} operations are not defined by all
> > architectures and are being removed from the asm-generic/io.h
> > interface.
> > 
> > This patch replaces the usage of these string functions in the smc911x
> > accessors with io{read,write}{8,16,32}_rep calls instead.
> > 
> > Cc: Arnd Bergmann 
> > Cc: Ben Herrenschmidt 
> > Cc: net...@vger.kernel.org
> > Signed-off-by: Matthew Leach 
> > Signed-off-by: Will Deacon 
> 
> This misses the two uses in smsc911x_tx_writefifo and
> smsc911x_rx_readfifo.

Well spotted, updated patch below.

Cheers,

Will

--->8

From b46e33465e755e945136d19938c9a8331cbafce7 Mon Sep 17 00:00:00 2001
From: Matthew Leach 
Date: Tue, 6 Nov 2012 14:51:11 +
Subject: [PATCH v3] net: smc911x: use io{read,write}*_rep accessors

The {read,write}s{b,w,l} operations are not defined by all
architectures and are being removed from the asm-generic/io.h
interface.

This patch replaces the usage of these string functions in the smc911x
accessors with io{read,write}{8,16,32}_rep calls instead.

Cc: Arnd Bergmann 
Cc: Ben Herrenschmidt 
Cc: net...@vger.kernel.org
Signed-off-by: Matthew Leach 
Signed-off-by: Will Deacon 
---
 drivers/net/ethernet/smsc/smc911x.h  | 16 
 drivers/net/ethernet/smsc/smsc911x.c |  8 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/smsc/smc911x.h b/drivers/net/ethernet/smsc/smc911x.h
index 3269292..d51261b 100644
--- a/drivers/net/ethernet/smsc/smc911x.h
+++ b/drivers/net/ethernet/smsc/smc911x.h
@@ -159,12 +159,12 @@ static inline void SMC_insl(struct smc911x_local *lp, int reg,
void __iomem *ioaddr = lp->base + reg;
 
if (lp->cfg.flags & SMC911X_USE_32BIT) {
-   readsl(ioaddr, addr, count);
+   ioread32_rep(ioaddr, addr, count);
return;
}
 
if (lp->cfg.flags & SMC911X_USE_16BIT) {
-   readsw(ioaddr, addr, count * 2);
+   ioread16_rep(ioaddr, addr, count * 2);
return;
}
 
@@ -177,12 +177,12 @@ static inline void SMC_outsl(struct smc911x_local *lp, int reg,
void __iomem *ioaddr = lp->base + reg;
 
if (lp->cfg.flags & SMC911X_USE_32BIT) {
-   writesl(ioaddr, addr, count);
+   iowrite32_rep(ioaddr, addr, count);
return;
}
 
if (lp->cfg.flags & SMC911X_USE_16BIT) {
-   writesw(ioaddr, addr, count * 2);
+   iowrite16_rep(ioaddr, addr, count * 2);
return;
}
 
@@ -196,14 +196,14 @@ static inline void SMC_outsl(struct smc911x_local *lp, int reg,
 writew(v & 0x, (lp)->base + (r));   \
 writew(v >> 16, (lp)->base + (r) + 2); \
 } while (0)
-#define SMC_insl(lp, r, p, l)   readsw((short*)((lp)->base + (r)), p, l*2)
-#define SMC_outsl(lp, r, p, l)  writesw((short*)((lp)->base + (r)), p, l*2)
+#define SMC_insl(lp, r, p, l)   ioread16_rep((short*)((lp)->base + (r)), p, l*2)
+#define SMC_outsl(lp, r, p, l)  iowrite16_rep((short*)((lp)->base + (r)), p, l*2)
 
 #elif  SMC_USE_32BIT
 #define SMC_inl(lp, r)  readl((lp)->base + (r))
 #define SMC_outl(v, lp, r)  writel(v, (lp)->base + (r))
-#define SMC_insl(lp, r, p, l)   readsl((int*)((lp)->base + (r)), p, l)
-#define SMC_outsl(lp, r, p, l)  writesl((int*)((lp)->base + (r)), p, l)
+#define SMC_insl(lp, r, p, l)   ioread32_rep((int*)((lp)->base + (r)), p, l)
+#define SMC_outsl(lp, r, p, l)  iowrite32_rep((int*)((lp)->base + (r)), p, l)
 
 #endif /* SMC_USE_16BIT */
 #endif /* SMC_DYNAMIC_BUS_CONFIG */
diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index c53c0f4..9d46167 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -253,7 +253,7 @@ smsc911x_tx_writefifo(struct smsc911x_data *pdata, unsigned int *buf,
}
 
if (pdata->config.flags & SMSC911X_USE_32BIT) {
-   writesl(pdata->ioaddr + TX_DATA_FIFO, buf, wordcount);
+   iowrite32_rep(pdata->ioaddr + TX_DATA_FIFO, buf, wordcount);
goto out;
}
 
@@ -285,7 +285,7 @@ smsc911x_tx_writefifo_shift(struct smsc911x_data *pdata, unsigned int *buf,
}
 
if (pdata->config.flags & SMSC911X_USE_32BIT) {
-   writesl(pdata->ioaddr + __smsc_shift(pdata,
+   iowrite32_rep(pdata->ioaddr + __smsc_shift(pdata,
TX_DATA_FIFO), buf, wordcount);
goto out;
}
@@ -319,7 +319,7 @@ smsc911x_rx_readfifo(struct smsc911x_data *pdata, unsigned int *buf,
}
 
if (pdata->config.flags & SMSC911X_USE_32BIT) {
-   readsl(pdata->ioaddr + RX_DATA_FIFO, buf, wordcount);
+   ioread32_rep(pdata-

Re: performance drop after using blkcg

2012-12-11 Thread Tejun Heo
Hello,

On Tue, Dec 11, 2012 at 09:43:36AM -0500, Vivek Goyal wrote:
> I think if one sets slice_idle=0 and group_idle=0 in CFQ, for all practical
> purposes it should become an IOPS-based group scheduling.

No, I don't think it is.  You can't achieve isolation without idling
between group switches.  We're measuring slices in terms of iops but
what cfq actually schedules are still time slices, not IOs.

> For group accounting then CFQ uses number of requests from each cgroup
> and uses that information to schedule groups.
> 
> I have not been able to figure out the practical benefits of that
> approach. At least not for the simple workloads I played with. This
> approach will not work for simple things like trying to improve dependent
> read latencies in presence of heavy writers. That's the single biggest
> use case CFQ solves, IMO.

As I wrote above, it's not about accounting.  It's about scheduling
unit.

> And that happens because we stop writes and don't let them go to device
> and device is primarily dealing with reads. If some process is doing
> dependent reads and we want to improve read latencies, then either
> we need to stop flow of writes or devices are good and they always
> prioritize READs over WRITEs. If devices are good then we probably
> don't even need blkcg.
> 
> So yes, the iops-based approach is fine; it's just that the number of cases
> where you will see any service differentiation should be significantly less.

No, using iops to schedule time slices would lead to that.  We just
need to be allocating and scheduling iops, and I don't think we should
be doing that from cfq.

Thanks.

-- 
tejun


[PATCH] remove unused code from do_wp_page

2012-12-11 Thread dingel
From: Dominik Dingel 

page_mkwrite is initialized to zero and only set once; from that point on there
is no way to reach the oom or oom_free_new labels.

Signed-off-by: Dominik Dingel 
---
 mm/memory.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 221fc9f..c322708 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2795,10 +2795,6 @@ oom_free_new:
page_cache_release(new_page);
 oom:
if (old_page) {
-   if (page_mkwrite) {
-   unlock_page(old_page);
-   page_cache_release(old_page);
-   }
page_cache_release(old_page);
}
return VM_FAULT_OOM;
-- 
1.7.12.4



Re: performance drop after using blkcg

2012-12-11 Thread Vivek Goyal
On Tue, Dec 11, 2012 at 06:27:42AM -0800, Tejun Heo wrote:
> On Tue, Dec 11, 2012 at 09:25:18AM -0500, Vivek Goyal wrote:
> > In general, do not use blkcg on faster storage. In current form it
> > is at best suitable for single rotational SATA/SAS disk. I have not
> > been able to figure out how to provide fairness without group idling.
> 
> I think cfq is just the wrong approach for faster non-rotational
> devices.  We should be allocating iops instead of time slices.

I think if one sets slice_idle=0 and group_idle=0 in CFQ, for all practical
purposes it should become an IOPS-based group scheduling.

For group accounting then CFQ uses number of requests from each cgroup
and uses that information to schedule groups.

I have not been able to figure out the practical benefits of that
approach. At least not for the simple workloads I played with. This
approach will not work for simple things like trying to improve dependent
read latencies in presence of heavy writers. That's the single biggest
use case CFQ solves, IMO.

And that happens because we stop writes and don't let them go to device
and device is primarily dealing with reads. If some process is doing
dependent reads and we want to improve read latencies, then either
we need to stop flow of writes or devices are good and they always
prioritize READs over WRITEs. If devices are good then we probably
don't even need blkcg.

So yes, the iops-based approach is fine; it's just that the number of cases
where you will see any service differentiation should be significantly less.

Thanks
Vivek


Re: [RFC PATCH 0/8] remove vm_struct list management

2012-12-11 Thread Dave Anderson


- Original Message -

> > Can we get the same information from this rb-tree of vmap_area? Is
> > the ->va_start field communicating the same information as vmlist was
> > communicating? What's the difference between vmap_area_root and vmlist?
> 
> Thanks for comment.
> 
> Yes. vmap_area's va_start field represents the same information as
> vm_struct's addr.
> vmap_area_root is a data structure for fast searching of an area.
> vmap_area_list is an address-sorted list, so we can use it like vmlist.
> 
> There is a little difference between vmap_area_list and vmlist.
> vmlist lacks information about some areas in the vmalloc address space.
> For example, vm_map_ram() allocates an area in the vmalloc address space,
> but it doesn't link it into vmlist. To provide full information
> about the vmalloc address space, using vmap_area_list is more adequate.
> 
> > So without knowing details of both the data structures, I think if vmlist
> > is going away, then user space tools should be able to traverse 
> > vmap_area_root
> > rb tree. I am assuming it is sorted using ->addr field and we should be
> > able to get vmalloc area start from there. It will just be a matter of
> > exporting right fields to user space (instead of vmlist).
> 
> There is an address-sorted list of vmap_area, vmap_area_list.
> So we can use it for traversing vmalloc areas if necessary.
> But, as I mentioned before, kexec writes *just* the address of vmlist and
> the offset of vm_struct's address field.  That implies that they don't
> traverse vmlist, because they didn't write vm_struct's next field, which
> would be needed for traversing.
> Without vm_struct's next field, they have no method for traversing.
> So, IMHO, assigning a dummy vm_struct to vmlist, which is implemented by
> [7/8], is a safe way to maintain compatibility of the userspace tool. :)

Why bother keeping vmlist around?  kdump's makedumpfile command would not
even need to traverse the vmap_area rbtree, because it could simply look
at the first vmap_area in the sorted vmap_area_list, correct?
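
A minimal in-kernel sketch of that idea (field names taken from mm/vmalloc.c
as modified by the patchset, so treat them as assumptions):

	struct vmap_area *va;

	/* vmap_area_list is sorted by address, so the first entry
	 * yields the lowest mapped vmalloc address */
	va = list_first_entry(&vmap_area_list, struct vmap_area, list);
	pr_info("first vmalloc area starts at 0x%lx\n", va->va_start);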

Dave Anderson




Re: livelock in __writeback_inodes_wb ?

2012-12-11 Thread Dave Jones
On Tue, Dec 11, 2012 at 04:23:27PM +0800, Fengguang Wu wrote:
 > On Wed, Nov 28, 2012 at 09:55:15AM -0500, Dave Jones wrote:
 > > We had a user report the soft lockup detector kicked after 22
 > > seconds of no progress, with this trace..
 > 
 > Where is the original report? The reporter may help provide some clues
 > on the workload that triggered the bug.

https://bugzilla.redhat.com/show_bug.cgi?id=880949 

 > The bug reporter should know best whether there are heavy IO.
 > 
 > However I suspect it's not directly caused by heavy IO: we will
 > release &wb->list_lock before each __writeback_single_inode() call,
 > which starts writeback IO for each inode.
 > 
 > > Should there be something in this loop periodically poking
 > > the watchdog perhaps ?
 > 
 > It seems we failed to release &wb->list_lock in wb_writeback() for
 > long time (dozens of seconds). That is, the inode_sleep_on_writeback()
 > is somehow not called. However it's not obvious to me how come this
 > can happen..

Right, it seems that we only drop the lock when there is more work to do.
And if there is no work to do, then we would have bailed from the loop.

mysterious.

Dave



Re: [RFC PATCH v3 0/3] acpi: Introduce prepare_remove device operation

2012-12-11 Thread Jiang Liu
On 12/08/2012 09:08 AM, Toshi Kani wrote:
> On Fri, 2012-12-07 at 13:57 +0800, Jiang Liu wrote:
>> On 2012-12-7 10:57, Toshi Kani wrote:
>>> On Fri, 2012-12-07 at 00:40 +0800, Jiang Liu wrote:
 On 12/04/2012 08:10 AM, Toshi Kani wrote:
> On Mon, 2012-12-03 at 12:25 +0800, Hanjun Guo wrote:
>> On 2012/11/30 6:27, Toshi Kani wrote:
>>> On Thu, 2012-11-29 at 12:48 +0800, Hanjun Guo wrote:
>  :
>
> If I read the code right, the framework calls ACPI drivers differently
> at boot-time and hot-add as follows.  That is, the new entry points are
> called at hot-add only, but .add() is called at both cases.  This
> requires .add() to work differently.
>
> Boot: .add()
> Hot-Add : .add(), .pre_configure(), configure(), etc.
>
> I think the boot-time and hot-add initialization should be done
> consistently.  While there is difficulty with the current boot sequence,
> the framework should be designed to allow them consistent, not make them
> diverged.
 Hi Toshi,
We have separated hotplug operations from the driver binding/unbinding
interface due to the following considerations.
1) Physical CPU and memory devices are initialized/used before the ACPI
subsystem is initialized. So in the normal case, .add() of processor and
acpi_memhotplug only figures out information about a device already in
working state instead of starting the device.
>>>
>>> I agree that the current boot sequence is not very hot-plug friendly...
>>>
2) It's impossible to rmmod the processor and acpi_memhotplug drivers at
runtime if .remove() of the CPU and memory drivers really removed the
CPU/memory device from the system. And the ACPI processor driver also
implements CPU PM functionality other than hotplug.
>>>
>>> Agreed.
>>>
And recently Rafael has mentioned that he has a long-term view to get rid
of the concept of "ACPI device". If that happens, we could easily move the
hotplug logic from ACPI device drivers into the hotplug framework if the
hotplug logic is separated from the .add()/.remove() callbacks. Actually we
could even move all hotplug-only logic into the hotplug framework and not
rely on any ACPI device driver any more. So we could get rid of all these
messy things. We could achieve that by:
1) moving code shared by ACPI device drivers and the hotplug framework
into the core.
2) moving hotplug-only code to the framework.
>>>
>>> Yes, the framework should allow such future work.  I also think that the
>>> framework itself should be independent from such ACPI issue.  Ideally,
>>> it should be able to support non-ACPI platforms.
>> The same point here. The ACPI based hotplug framework is designed as:
>> 1) an ACPI based hotplug slot driver to handle platform specific logic.
>>Platform may provide platform specific slot drivers to discover, manage
>>hotplug slots. We have provided a default implementation of slot driver
>>according to the ACPI spec.
> 
> The ACPI spec does not define that _EJ0 is required to receive a hot-add
> request, i.e. bus/device check.  This is a major issue.  Since Windows
> only supports hot-add, I think there are platforms that only support
> hot-add today.
> 
>> 2) an ACPI based hotplug manager driver, which is a platform independent
>>driver and manages all hotplug slot created by the slot driver.
> 
> It is surely impressive work, but I think it is a bit overdone.  I
> expect hot-pluggable servers come with management console and/or GUI
> where a user can manage hardware units and initiate hot-plug operations.
> I do not think the kernel needs to step into such area since it tends to
> be platform-specific. 
One of the major usages of this feature is for testing. 
It will be hard for OSVs and OEMs to verify hotplug functionality if it can
only be tested by physical hotplug or through a management console. So to pave
the way for hotplug, we need to provide a mechanism for OEMs and OSVs to run
automated stress tests of the hotplug functionality.

> 
>> We haven't gone far enough to provide an ACPI-independent hotplug 
>> framework
>> because we only have experience with x86 and Itanium, both are ACPI based.
>> We may try to implement an ACPI independent hotplug framework by pushing all
>> ACPI specific logic into the slot driver, I think it's doable. But we need
>> suggestions from experts of other architectures, such as SPARC and Power.
>> But seems Power already have some sorts of hotplug framework, right?
> 
> I do not know about the Linux hot-plug support on other architectures.
> PA-RISC SuperDome also supports Node hot-plug, but it is not supported
> by Linux.  Since ARM is getting used in servers, I would not be surprised if
> there will be an ARM-based server with hot-plug support in the future.
Seems ARM is on the way to adopt ACPI, so m

signed size_t ?

2012-12-11 Thread Christoph Lameter
On Tue, 11 Dec 2012, kbuild test robot wrote:

> mm/slab_common.c: In function 'create_boot_cache':
> mm/slab_common.c:219:6: warning: format '%zd' expects argument of type 
> 'signed size_t', but argument 3 has type 'size_t' [-Wformat]

We already changed that once from %td to %zd so that the size_t works
correctly.

Does a signed size_t make any sense?



Re: performance drop after using blkcg

2012-12-11 Thread Tejun Heo
On Tue, Dec 11, 2012 at 09:25:18AM -0500, Vivek Goyal wrote:
> In general, do not use blkcg on faster storage. In current form it
> is at best suitable for single rotational SATA/SAS disk. I have not
> been able to figure out how to provide fairness without group idling.

I think cfq is just the wrong approach for faster non-rotational
devices.  We should be allocating iops instead of time slices.

Thanks.

-- 
tejun


Re: performance drop after using blkcg

2012-12-11 Thread Vivek Goyal
On Mon, Dec 10, 2012 at 08:28:54PM +0800, Zhao Shuai wrote:
> Hi,
> 
> I plan to use blkcg(proportional BW) in my system. But I encounter
> great performance drop after enabling blkcg.
> The testing tool is fio(version 2.0.7) and both the BW and IOPS fields
> are recorded. Two instances of fio program are carried out simultaneously,
> each opearting on a separate disk file (say /data/testfile1,
> /data/testfile2).
> System environment:
> kernel: 3.7.0-rc5
> CFQ's slice_idle is disabled(slice_idle=0) while group_idle is
> enabled(group_idle=8).
> 
> FIO configuration (e.g. "read") for the first fio program (say FIO1):
> 
> [global]
> description=Emulation of Intel IOmeter File Server Access Pattern
> 
> [iometer]
> bssplit=4k/30:8k/40:16k/30
> rw=read
> direct=1
> time_based
> runtime=180s
> ioengine=sync
> filename=/data/testfile1
> numjobs=32
> group_reporting
> 
> 
> result before using blkcg: (the value of BW is KB/s)
> 
>             FIO1 BW/IOPS      FIO2 BW/IOPS
> -------------------------------------------
> read        26799/2911        25861/2810
> write       138618/15071      138578/15069
> rw          72159/7838(r)     71851/7811(r)
>             72171/7840(w)     71799/7805(w)
> randread    4982/543          5370/585
> randwrite   5192/566          6010/654
> randrw      2369/258(r)       3027/330(r)
>             2369/258(w)       3016/328(w)
> 
> result after using blkcg (create two blkio cgroups with the
> default blkio.weight (500) and put FIO1 and FIO2 into these
> cgroups respectively)

These results are with slice_idle=0?

> 
>             FIO1 BW/IOPS      FIO2 BW/IOPS
> -------------------------------------------
> read        36651/3985        36470/3943
> write       75738/8229        75641/8221
> rw          49169/5342(r)     49168/5346(r)
>             49200/5348(w)     49140/5341(w)
> randread    4876/532          4905/534
> randwrite   5535/603          5497/599
> randrw      2521/274(r)       2527/275(r)
>             2510/273(w)       2532/274(w)
> 
> Comparing these results, we found a great performance drop
> (30%-40%) in some test cases (especially the "write" and "rw" cases).
> Is it normal to see write/rw bandwidth decrease by 40% after using
> blkio-cgroup? If not, is there any way to improve or tune the performance?

What's the storage you are using? Looking at the speed of the IO I would
guess it is not one of those rotational disks.

blkcg does cause a drop in performance (due to idling at the group level).
The faster the storage, or the more cgroups there are, the more visible
the drop becomes.

The only optimization I could think of was disabling slice_idle, and you
have already done that.

There might be some opportunities to cut down on the group idling in
some cases, at the cost of some fairness, but we will have to identify
those and modify the code.

In general, do not use blkcg on faster storage. In its current form it
is at best suitable for a single rotational SATA/SAS disk. I have not
been able to figure out how to provide fairness without group idling.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] percpu changes for v3.8

2012-12-11 Thread Tejun Heo
Hello, Linus.

Percpu changes for v3.8.  Nothing exciting here either.  Joonsoo's is
almost cosmetic.  Cyrill's patch fixes "percpu_alloc" early kernel
param handling so that the kernel doesn't crash when the parameter is
specified w/o any argument.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-next

Thanks.


Cyrill Gorcunov (1):
  mm, percpu: Make sure percpu_alloc early parameter has an argument

Joonsoo Kim (1):
  percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()

 mm/percpu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index ddc5efb..8c8e08f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -631,7 +631,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
if (!chunk)
return;
pcpu_mem_free(chunk->map, chunk->map_alloc * sizeof(chunk->map[0]));
-   kfree(chunk);
+   pcpu_mem_free(chunk, pcpu_chunk_struct_size);
 }
 
 /*
@@ -1380,6 +1380,9 @@ enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
 
 static int __init percpu_alloc_setup(char *str)
 {
+   if (!str)
+   return -EINVAL;
+
if (0)
/* nada */;
 #ifdef CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] workqueue changes for v3.8

2012-12-11 Thread Tejun Heo
Hello, Linus.

Please pull from the following branch to receive workqueue changes for
v3.8.  Nothing exciting.  Just two trivial changes.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-3.8

Thanks.


Joonsoo Kim (2):
  workqueue: trivial fix for return statement in work_busy()
  workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up()

 kernel/workqueue.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 084aa47..ae9a056 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -739,8 +739,10 @@ void wq_worker_waking_up(struct task_struct *task, 
unsigned int cpu)
 {
struct worker *worker = kthread_data(task);
 
-   if (!(worker->flags & WORKER_NOT_RUNNING))
+   if (!(worker->flags & WORKER_NOT_RUNNING)) {
+   WARN_ON_ONCE(worker->pool->gcwq->cpu != cpu);
atomic_inc(get_pool_nr_running(worker->pool));
+   }
 }
 
 /**
@@ -3485,7 +3487,7 @@ unsigned int work_busy(struct work_struct *work)
unsigned int ret = 0;
 
if (!gcwq)
-   return false;
+   return 0;
 
spin_lock_irqsave(&gcwq->lock, flags);
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2012-12-11 Thread Steven Rostedt
On Tue, 2012-12-11 at 09:02 -0500, Steven Rostedt wrote:

> Currently, what we have is huge contention on the pulled CPU's rq
> lock. We've measured over 500us latencies due to it. This hurts even the
> CPU that has the overloaded task, as the contention is on its lock.

The 500us latency was one of the max ones I saw, and I believe it was
combined with the load balancer going off at the same time as the pulls
were happening. I've seen larger latencies with limited function tracing
enabled, where the load balancer was heavily involved.

I believe we have users that would very much want this in, so if you
are still not sure you like this "feature" I can make it into a
SCHED_FEAT() like TTWU_QUEUE. Then you can keep it off and the big
boxes can enable it.

Maybe enable the feature by default if the box has more than 16 cpus?
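
Roughly like this (the feature name is invented and this is untested;
and since sysctl_sched_features is only writable with CONFIG_SCHED_DEBUG,
a real patch would need a cleaner way to do the boot-time flip):

/* kernel/sched/features.h */
SCHED_FEAT(RT_PUSH_IPI, false)

/* kernel/sched/core.c, somewhere in sched_init() */
if (num_possible_cpus() > 16)
	sysctl_sched_features |= (1UL << __SCHED_FEAT_RT_PUSH_IPI);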

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Devel] [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Stanislav Kinsbursky

11.12.2012 18:00, Stanislav Kinsbursky wrote:

11.12.2012 00:28, J. Bruce Fields wrote:

On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:

NFSd does lookups. Lookups are done starting from current->fs->root.
NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
unshared) root.
So we have to swap its root to that of the process which started NFSd,
because that process can be in a container with its own root.


This doesn't sound right to me.

Which lookups exactly do you see being done relative to
current->fs->root ?



Ok, you are right. I was mistaken here.
This is not exactly a lookup, but a d_path() problem in svc_export_request().
I.e. without root swapping, d_path() will give not the local export path
(like "/export")
but something like "/root/containers_root/export".


We can actually do it in a less aggressive way.
I.e. instead of the root swap and the current svc_export_request() implementation:

void svc_export_request(...)
{

pth = d_path(&exp->ex_path, *bpp, *blen);

}

we can do something like this:

void svc_export_request(...)
{
struct nfsd_net *nn = ...

spin_lock(&dcache_lock);
pth = __d_path(&exp->ex_path, &nn->root, *bpp, *blen);
spin_unlock(&dcache_lock);

}


--b.



Signed-off-by: Stanislav Kinsbursky 
---
  fs/nfsd/netns.h  |1 +
  fs/nfsd/nfssvc.c |   33 ++++++++++++++++++++++++++++++++-
  2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index abfc97c..5c423c6 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -101,6 +101,7 @@ struct nfsd_net {
  struct timeval nfssvc_boot;

  struct svc_serv *nfsd_serv;
+   struct path root;
  };

  extern int nfsd_net_id;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cee62ab..177bb60 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -392,6 +392,7 @@ int nfsd_create_serv(struct net *net)

  set_max_drc();
  do_gettimeofday(&nn->nfssvc_boot);/* record boot time */
+get_fs_root(current->fs, &nn->root);
  return 0;
  }

@@ -426,8 +427,10 @@ void nfsd_destroy(struct net *net)
  if (destroy)
  svc_shutdown_net(nn->nfsd_serv, net);
  svc_destroy(nn->nfsd_serv);
-if (destroy)
+if (destroy) {
+path_put(&nn->root);
  nn->nfsd_serv = NULL;
+}
  }

  int nfsd_set_nrthreads(int n, int *nthreads, struct net *net)
@@ -533,6 +536,25 @@ out:
  return error;
  }

+/*
+ * This function is actually slightly modified set_fs_root()
+ */
+static void nfsd_swap_root(struct net *net)
+{
+struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+struct fs_struct *fs = current->fs;
+struct path old_root;
+
+path_get(&nn->root);
+spin_lock(&fs->lock);
+write_seqcount_begin(&fs->seq);
+old_root = fs->root;
+fs->root = nn->root;
+write_seqcount_end(&fs->seq);
+spin_unlock(&fs->lock);
+if (old_root.dentry)
+path_put(&old_root);
+}

  /*
   * This is the NFS server kernel thread
@@ -559,6 +581,15 @@ nfsd(void *vrqstp)
  current->fs->umask = 0;

  /*
+ * We have to swap NFSd kthread's fs->root.
+ * Why so? Because NFSd can be started in container, which has it's own
+ * root.
+ * And so what? NFSd lookup files, and lookup start from
+ * current->fs->root.
+ */
+nfsd_swap_root(net);
+
+/*
   * thread is spawned with all signals set to SIG_IGN, re-enable
   * the ones that will bring down the thread
   */







--
Best regards,
Stanislav Kinsbursky
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] vfs: new super block feature flags attribute

2012-12-11 Thread Mimi Zohar
On Thu, 2012-11-22 at 14:49 +0200, Dmitry Kasatkin wrote:
> This patch introduces new super block attribute flag s_feature_flags
> and SF_IMA_DISABLED flag. This flag will be used by Integrity Measurement
> Architecture (IMA). Name suggested by Bruce Fields.

The patch looks good.  The patch description should reflect the
discussion with Al https://lkml.org/lkml/2012/9/19/9, explaining 'why'
a new flag is needed.

> Certain file system types and partitions will never be measured or
> appraised by IMA depending on the policy. For example, pseudo file
> systems are never measured and appraised. In the current implementation
> the policy will be checked again and again. This happens thousands of times
> per second. That is an absolute waste of CPU and possibly battery resources.
> 
> IMA will set the SF_IMA_DISABLED flag when the file system will not be
> measured and appraised, and will test this flag during subsequent calls
> to skip the policy search.

This explanation belongs in the subsequent patch, which makes use of the
flag.
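
For reference, the intended fast path (my paraphrase of patch 2/2; the
function shape here is made up for illustration) is a one-bit test up
front so that the per-call policy walk is skipped entirely:

static int ima_get_action(struct inode *inode, int mask, int function)
{
	/* this superblock was already found to be exempt */
	if (inode->i_sb->s_feature_flags & SF_IMA_DISABLED)
		return 0;

	/* otherwise fall through to the expensive policy search */
	return ima_match_policy(inode, function, mask);
}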

> Signed-off-by: Dmitry Kasatkin 


> ---
>  include/linux/fs.h |4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b33cfc9..0bef2b2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1321,6 +1321,8 @@ struct super_block {
> 
>   /* Being remounted read-only */
>   int s_readonly_remount;
> +
> + unsigned long s_feature_flags;
>  };
> 
>  /* superblock cache pruning functions */
> @@ -1746,6 +1748,8 @@ struct super_operations {
> 
>  #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
> 

Comment needed here before the start of the feature flag definitions.

> +#define SF_IMA_DISABLED  0x0001
> +
>  extern void __mark_inode_dirty(struct inode *, int);
>  static inline void mark_inode_dirty(struct inode *inode)
>  {

thanks,

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] ima: policy search speedup

2012-12-11 Thread Mimi Zohar
On Tue, 2012-12-11 at 14:51 +0200, Kasatkin, Dmitry wrote:

> >> Here are two patches for policy search speedup.
> >>
> >> First patch adds additional features flags to superblock.
> >> Second - implementation for IMA.
> >>
> >> Two months ago I was asking about it on mailing lists.
> >> Suggestion was not to use s_flags, but e.g. s_feature_flags.
> >>
> >> Any objections about such approach?

First of all my apologies for not responding earlier.  Based on the
previous discussion https://lkml.org/lkml/2012/9/19/9, this looks like
the right direction.  Specific comments will be posted on the individual
patches.

thanks,

Mimi

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 2/9] CPU hotplug: Convert preprocessor macros to static inline functions

2012-12-11 Thread Srivatsa S. Bhat
On 12/05/2012 06:10 AM, Andrew Morton wrote:
"static inline C functions would be preferred if possible.  Feel free to
fix up the wrong crufty surrounding code as well ;-)"

Convert the macros in the CPU hotplug code to static inline C functions.

Signed-off-by: Srivatsa S. Bhat 
---

 include/linux/cpu.h |8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index cf24da1..eb79f47 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -198,10 +198,10 @@ static inline void cpu_hotplug_driver_unlock(void)
 
 #else  /* CONFIG_HOTPLUG_CPU */
 
-#define get_online_cpus()  do { } while (0)
-#define put_online_cpus()  do { } while (0)
-#define get_online_cpus_atomic()   do { } while (0)
-#define put_online_cpus_atomic()   do { } while (0)
+static inline void get_online_cpus(void) {}
+static inline void put_online_cpus(void) {}
+static inline void get_online_cpus_atomic(void) {}
+static inline void put_online_cpus_atomic(void) {}
 #define hotcpu_notifier(fn, pri)   do { (void)(fn); } while (0)
 /* These aren't inline functions due to a GCC bug. */
 #define register_hotcpu_notifier(nb)   ({ (void)(nb); 0; })

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 4/9] smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly

2012-12-11 Thread Srivatsa S. Bhat
Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs to prevent CPUs from going offline,
while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat 
---

 kernel/smp.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index ce1a866..0031000 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -688,12 +688,12 @@ int on_each_cpu(void (*func) (void *info), void *info, 
int wait)
unsigned long flags;
int ret = 0;
 
-   preempt_disable();
+   get_online_cpus_atomic();
ret = smp_call_function(func, info, wait);
local_irq_save(flags);
func(info);
local_irq_restore(flags);
-   preempt_enable();
+   put_online_cpus_atomic();
return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
@@ -715,7 +715,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
void *info, bool wait)
 {
-   int cpu = get_cpu();
+   int cpu;
+
+   get_online_cpus_atomic();
+
+   cpu = smp_processor_id();
 
smp_call_function_many(mask, func, info, wait);
if (cpumask_test_cpu(cpu, mask)) {
@@ -723,7 +727,7 @@ void on_each_cpu_mask(const struct cpumask *mask, 
smp_call_func_t func,
func(info);
local_irq_enable();
}
-   put_cpu();
+   put_online_cpus_atomic();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
@@ -748,8 +752,9 @@ EXPORT_SYMBOL(on_each_cpu_mask);
  * The function might sleep if the GFP flags indicates a non
  * atomic allocation is allowed.
  *
- * Preemption is disabled to protect against CPUs going offline but not online.
- * CPUs going online during the call will not be seen or sent an IPI.
+ * We use get/put_online_cpus_atomic() to prevent CPUs from going
+ * offline in-between our operation. CPUs coming online during the
+ * call will not be seen or sent an IPI.
  *
  * You must not call this function with disabled interrupts or
  * from a hardware interrupt handler or from a bottom half handler.
@@ -764,26 +769,26 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void 
*info),
might_sleep_if(gfp_flags & __GFP_WAIT);
 
if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN {
-   preempt_disable();
+   get_online_cpus_atomic();
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
cpumask_set_cpu(cpu, cpus);
on_each_cpu_mask(cpus, func, info, wait);
-   preempt_enable();
+   put_online_cpus_atomic();
free_cpumask_var(cpus);
} else {
/*
 * No free cpumask, bother. No matter, we'll
 * just have to IPI them one by one.
 */
-   preempt_disable();
+   get_online_cpus_atomic();
for_each_online_cpu(cpu)
if (cond_func(cpu, info)) {
ret = smp_call_function_single(cpu, func,
info, wait);
WARN_ON_ONCE(!ret);
}
-   preempt_enable();
+   put_online_cpus_atomic();
}
 }
 EXPORT_SYMBOL(on_each_cpu_cond);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 3/9] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly

2012-12-11 Thread Srivatsa S. Bhat
Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on preempt_disable() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs to prevent CPUs from going offline,
while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat 
---

 kernel/smp.c |   38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 29dd40a..ce1a866 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -310,7 +310,8 @@ int smp_call_function_single(int cpu, smp_call_func_t func, 
void *info,
 * prevent preemption and reschedule on another processor,
 * as well as CPU removal
 */
-   this_cpu = get_cpu();
+   get_online_cpus_atomic();
+   this_cpu = smp_processor_id();
 
/*
 * Can deadlock when called with interrupts disabled.
@@ -342,7 +343,7 @@ int smp_call_function_single(int cpu, smp_call_func_t func, 
void *info,
}
}
 
-   put_cpu();
+   put_online_cpus_atomic();
 
return err;
 }
@@ -371,8 +372,10 @@ int smp_call_function_any(const struct cpumask *mask,
const struct cpumask *nodemask;
int ret;
 
+   get_online_cpus_atomic();
/* Try for same CPU (cheapest) */
-   cpu = get_cpu();
+   cpu = smp_processor_id();
+
if (cpumask_test_cpu(cpu, mask))
goto call;
 
@@ -388,7 +391,7 @@ int smp_call_function_any(const struct cpumask *mask,
cpu = cpumask_any_and(mask, cpu_online_mask);
 call:
ret = smp_call_function_single(cpu, func, info, wait);
-   put_cpu();
+   put_online_cpus_atomic();
return ret;
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
@@ -409,14 +412,17 @@ void __smp_call_function_single(int cpu, struct 
call_single_data *data,
unsigned int this_cpu;
unsigned long flags;
 
-   this_cpu = get_cpu();
+   get_online_cpus_atomic();
+
+   this_cpu = smp_processor_id();
+
/*
 * Can deadlock when called with interrupts disabled.
 * We allow cpu's that are not yet online though, as no one else can
 * send smp call function interrupt to this cpu and as such deadlocks
 * can't happen.
 */
-   WARN_ON_ONCE(cpu_online(smp_processor_id()) && wait && irqs_disabled()
+   WARN_ON_ONCE(cpu_online(this_cpu) && wait && irqs_disabled()
 && !oops_in_progress);
 
if (cpu == this_cpu) {
@@ -427,7 +433,7 @@ void __smp_call_function_single(int cpu, struct 
call_single_data *data,
csd_lock(data);
generic_exec_single(cpu, data, wait);
}
-   put_cpu();
+   put_online_cpus_atomic();
 }
 
 /**
@@ -451,6 +457,8 @@ void smp_call_function_many(const struct cpumask *mask,
unsigned long flags;
int refs, cpu, next_cpu, this_cpu = smp_processor_id();
 
+   get_online_cpus_atomic();
+
/*
 * Can deadlock when called with interrupts disabled.
 * We allow cpu's that are not yet online though, as no one else can
@@ -467,17 +475,18 @@ void smp_call_function_many(const struct cpumask *mask,
 
/* No online cpus?  We're done. */
if (cpu >= nr_cpu_ids)
-   return;
+   goto out_unlock;
 
/* Do we have another CPU which isn't us? */
next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
if (next_cpu == this_cpu)
-   next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
+   next_cpu = cpumask_next_and(next_cpu, mask,
+   cpu_online_mask);
 
/* Fastpath: do that cpu by itself. */
if (next_cpu >= nr_cpu_ids) {
smp_call_function_single(cpu, func, info, wait);
-   return;
+   goto out_unlock;
}
 
data = &__get_cpu_var(cfd_data);
@@ -523,7 +532,7 @@ void smp_call_function_many(const struct cpumask *mask,
/* Some callers race with other cpus changing the passed mask */
if (unlikely(!refs)) {
csd_unlock(&data->csd);
-   return;
+   goto out_unlock;
}
 
raw_spin_lock_irqsave(&call_function.lock, flags);
@@ -554,6 +563,9 @@ void smp_call_function_many(const struct cpumask *mask,
/* Optionally wait for the CPUs to complete */
if (wait)
csd_lock_wait(&data->csd);
+
+out_unlock:
+   put_online_cpus_atomic();
 }
 EXPORT_SYMBOL(smp_call_function_many);
 
@@ -574,9 +586,9 @@ EXPORT_SYMBOL(smp_call_function_many);
  */
 int smp_call_function(smp_call_func_t func, void *info, int wait)
 {
-   preempt_disable();
+   get_online_cpus_atomic();
smp_call_function_many(cpu_online_mask, func, info, wait);
-   preempt_enable();
+   put_online_cpus_atomic();
 
return 0;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH v4 9/9] cpu: No more __stop_machine() in _cpu_down()

2012-12-11 Thread Srivatsa S. Bhat
From: Paul E. McKenney 

The _cpu_down() function invoked as part of the CPU-hotplug offlining
process currently invokes __stop_machine(), which is slow and inflicts
substantial real-time latencies on the entire system.  This patch
substitutes stop_one_cpu() for __stop_machine() in order to improve
both performance and real-time latency.

This is currently unsafe, because there are a number of uses of
preempt_disable() that are intended to block CPU-hotplug offlining.
These will be fixed by using get/put_online_cpus_atomic(), but in the
meantime, this commit is one way to help locate them.

Signed-off-by: Paul E. McKenney 
Signed-off-by: Paul E. McKenney 
[ srivatsa.b...@linux.vnet.ibm.com: Refer to the new sync primitives for
  readers (in the changelog), and s/stop_cpus/stop_one_cpu ]
Signed-off-by: Srivatsa S. Bhat 
---

 kernel/cpu.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 5a63296..3f9498e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -484,7 +484,7 @@ static int __ref _cpu_down(unsigned int cpu, int 
tasks_frozen)
}
smpboot_park_threads(cpu);
 
-   err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
+   err = stop_one_cpu(cpu, take_cpu_down, &tcd_param);
if (err) {
/* CPU didn't die: tell everyone.  Can't complain. */
smpboot_unpark_threads(cpu);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Tejun Heo
Hello,

On Tue, Dec 11, 2012 at 07:32:13PM +0530, Srivatsa S. Bhat wrote:
> On 12/11/2012 07:17 PM, Tejun Heo wrote:
> > Hello, Srivatsa.
> > 
> > On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> >> This approach (of using synchronize_sched()) also looks good. It is simple,
> >> yet effective, but unfortunately inefficient at the writer side (because
> >> he'll have to wait for a full synchronize_sched()).
> > 
> > While synchronize_sched() is heavier on the writer side than the
> > originally posted version, it doesn't stall the whole machine and
> > wouldn't introduce latencies to others.  Shouldn't that be enough?
> > 
> 
> Short answer: Yes. But we can do better, with almost comparable code
> complexity. So I'm tempted to try that out.
> 
> Long answer:
> Even in the synchronize_sched() approach, we still have to identify the
> readers who need to be converted to use the new get/put_online_cpus_atomic()
> APIs and convert them. Then, if we can come up with a scheme such that
> the writer has to wait only for those readers to complete, then why not?
> 
> If such a scheme ends up becoming too complicated, then I agree, we
> can use synchronize_sched() itself. (That's what I meant by saying that
> we'll use this as a fallback).
> 
> But even in this scheme which uses synchronize_sched(), we are
> already half-way through (we already use 2 types of sync schemes -
> counters and rwlocks). Just a little more logic can get rid of the
> unnecessary full-wait too. So why not give it a shot?

It's not really about the code complexity but about making the reader
side as light as possible.  Please keep in mind that the reader side is
still *way* hotter than the writer side.  Before, the writer side was
heavy to the extent that it caused noticeable disruptions on the whole
system, and I think that's what we're trying to hunt down here.  If we
can shave off memory barriers from the reader side by using
synchronize_sched() on the writer side, that is the *better* result, not
worse.
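
For the archives, the scheme I'm suggesting boils down to roughly this
(a sketch only, not a patch; the real writer side must of course also
publish its state before waiting):

/* reader: no shared writes, no barriers, as cheap as it gets */
void get_online_cpus_atomic(void)
{
	preempt_disable();	/* an RCU-sched read-side section */
}

void put_online_cpus_atomic(void)
{
	preempt_enable();
}

/* writer, in the CPU offline path */
static void cpu_hotplug_wait_for_readers(void)
{
	/* waits for every preempt-disabled section in flight */
	synchronize_sched();
}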

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 8/9] kvm, vmx: Add atomic synchronization with CPU Hotplug

2012-12-11 Thread Srivatsa S. Bhat
preempt_disable() will no longer help prevent CPUs from going offline, once
stop_machine() gets removed from the CPU offline path. So use
get/put_online_cpus_atomic() in vmx_vcpu_load() to prevent CPUs from
going offline while clearing vmcs.

Reported-by: Michael Wang 
Debugged-by: Xiao Guangrong 
Signed-off-by: Srivatsa S. Bhat 
---

 arch/x86/kvm/vmx.c |8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f858159..d8a4cf1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1519,10 +1519,14 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
cpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
 
-   if (!vmm_exclusive)
+   if (!vmm_exclusive) {
kvm_cpu_vmxon(phys_addr);
-   else if (vmx->loaded_vmcs->cpu != cpu)
+   } else if (vmx->loaded_vmcs->cpu != cpu) {
+   /* Prevent any CPU from going offline */
+   get_online_cpus_atomic();
loaded_vmcs_clear(vmx->loaded_vmcs);
+   put_online_cpus_atomic();
+   }
 
if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 7/9] yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly

2012-12-11 Thread Srivatsa S. Bhat
Once stop_machine() is gone from the CPU offline path, we won't be able to
depend on local_irq_save() to prevent CPUs from going offline from under us.

Use the get/put_online_cpus_atomic() APIs to prevent CPUs from going offline,
while invoking from atomic context.

Signed-off-by: Srivatsa S. Bhat 
---

 kernel/sched/core.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cff7656..4b982bf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4312,6 +4312,7 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
unsigned long flags;
bool yielded = 0;
 
+   get_online_cpus_atomic();
local_irq_save(flags);
rq = this_rq();
 
@@ -4339,13 +4340,14 @@ again:
 * Make p's CPU reschedule; pick_next_entity takes care of
 * fairness.
 */
-   if (preempt && rq != p_rq)
+   if (preempt && rq != p_rq && cpu_online(task_cpu(p)))
resched_task(p_rq->curr);
}
 
 out:
double_rq_unlock(rq, p_rq);
local_irq_restore(flags);
+   put_online_cpus_atomic();
 
if (yielded)
schedule();

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH v4 0/9] CPU hotplug: stop_machine()-free CPU hotplug

2012-12-11 Thread Srivatsa S. Bhat
Hi,

This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
offline path and provides an alternative (set of APIs) to preempt_disable() to
prevent CPUs from going offline, which can be invoked from atomic context.

This is an RFC patchset with only a few call-sites of preempt_disable()
converted to the new APIs for now, and the main goal is to get feedback on the
design of the new atomic APIs and see if it serves as a viable replacement for
stop_machine()-free CPU hotplug. A brief description of the algorithm is
available in the "Changes in vN" section.

Overview of the patches:
---

Patch 1 introduces the new APIs that can be used from atomic context, to
prevent CPUs from going offline.

Patch 2 is a cleanup; it converts preprocessor macros to static inline
functions.

Patches 3 to 8 convert various call-sites to use the new APIs.

Patch 9 is the one which actually removes stop_machine() from the CPU
offline path.

Changes in v4:
--
  The synchronization scheme has been simplified quite a bit, which makes it
  look a lot less complex than before. Some highlights:

* Implicit ACKs:

  The earlier design required the readers to explicitly ACK the writer's
  signal. The new design uses implicit ACKs instead. The reader switching
  over to rwlock implicitly tells the writer to stop waiting for that reader.

* No atomic operations:

  Since we got rid of explicit ACKs, we no longer have the need for a reader
  and a writer to update the same counter. So we can get rid of atomic ops
  too.

Changes in v3:
--
* Dropped the _light() and _full() variants of the APIs. Provided a single
  interface: get/put_online_cpus_atomic().

* Completely redesigned the synchronization mechanism again, to make it
  fast and scalable at the reader-side in the fast-path (when no hotplug
  writers are active). This new scheme also ensures that there is no
  possibility of deadlocks due to circular locking dependency.
  In summary, this provides the scalability and speed of per-cpu rwlocks
  (without actually using them), while avoiding the downside (deadlock
  possibilities) which is inherent in any per-cpu locking scheme that is
  meant to compete with preempt_disable()/enable() in terms of flexibility.

  The problem with using per-cpu locking to replace preempt_disable()/enable
  was explained here:
  https://lkml.org/lkml/2012/12/6/290

  Basically we use per-cpu counters (for scalability) when no writers are
  active, and then switch to global rwlocks (for lock-safety) when a writer
  becomes active. It is a slightly complex scheme, but it is based on
  standard principles of distributed algorithms.

Changes in v2:
-
* Completely redesigned the synchronization scheme to avoid using any extra
  cpumasks.

* Provided APIs for 2 types of atomic hotplug readers: "light" (for
  light-weight) and "full". We wish to have more "light" readers than
  the "full" ones, to avoid indirectly inducing the "stop_machine effect"
  without even actually using stop_machine().

  And the patches show that it _is_ generally true: 5 patches deal with
  "light" readers, whereas only 1 patch deals with a "full" reader.

  Also, the "light" readers happen to be in very hot paths. So it makes a
  lot of sense to have such a distinction and a corresponding light-weight
  API.

Links to previous versions:
v3: https://lkml.org/lkml/2012/12/7/287
v2: https://lkml.org/lkml/2012/12/5/322
v1: https://lkml.org/lkml/2012/12/4/88

Comments and suggestions welcome!

--
 Paul E. McKenney (1):
  cpu: No more __stop_machine() in _cpu_down()

Srivatsa S. Bhat (8):
  CPU hotplug: Provide APIs to prevent CPU offline from atomic context
  CPU hotplug: Convert preprocessor macros to static inline functions
  smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline 
properly
  smp, cpu hotplug: Fix on_each_cpu_*() to prevent CPU offline properly
  sched, cpu hotplug: Use stable online cpus in try_to_wake_up() & 
select_task_rq()
  kick_process(), cpu-hotplug: Prevent offlining of target CPU properly
  yield_to(), cpu-hotplug: Prevent offlining of other CPUs properly
  kvm, vmx: Add atomic synchronization with CPU Hotplug


  arch/x86/kvm/vmx.c  |8 +-
 include/linux/cpu.h |8 +-
 kernel/cpu.c|  206 ++-
 kernel/sched/core.c |   22 +
 kernel/smp.c|   63 ++--
 5 files changed, 273 insertions(+), 34 deletions(-)



Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Srivatsa S. Bhat
On 12/11/2012 07:17 PM, Tejun Heo wrote:
> Hello, Srivatsa.
> 
> On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
>> This approach (of using synchronize_sched()) also looks good. It is simple,
>> yet effective, but unfortunately inefficient at the writer side (because
>> he'll have to wait for a full synchronize_sched()).
> 
> While synchronize_sched() is heavier on the writer side than the
> originally posted version, it doesn't stall the whole machine and
> wouldn't introduce latencies to others.  Shouldn't that be enough?
> 

Short answer: Yes. But we can do better, with almost comparable code
complexity. So I'm tempted to try that out.

Long answer:
Even in the synchronize_sched() approach, we still have to identify the
readers who need to be converted to use the new get/put_online_cpus_atomic()
APIs and convert them. Then, if we can come up with a scheme such that
the writer has to wait only for those readers to complete, then why not?

If such a scheme ends up becoming too complicated, then I agree, we
can use synchronize_sched() itself. (That's what I meant by saying that
we'll use this as a fallback).

But even in this scheme which uses synchronize_sched(), we are
already half-way through (we already use 2 types of sync schemes -
counters and rwlocks). Just a little more logic can get rid of the
unnecessary full-wait too. So why not give it a shot?

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2012-12-11 Thread Steven Rostedt
On Tue, 2012-12-11 at 13:43 +0100, Thomas Gleixner wrote:
> On Mon, 10 Dec 2012, Steven Rostedt wrote:
> > On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
> > 
> > > I should have also mentioned some previous experience using IPIs to
> > > avoid runq lock contention on wake up.  Someone encountered IPI
> > > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > > for CONFIG_PREEMPT_RT_FULL:
> > > 
> > >   #ifndef CONFIG_PREEMPT_RT_FULL
> > >   /*
> > >* Queue remote wakeups on the target CPU and process them
> > >* using the scheduler IPI. Reduces rq->lock contention/bounces.
> > >*/
> > >   SCHED_FEAT(TTWU_QUEUE, true)
> > >   #else
> > >   SCHED_FEAT(TTWU_QUEUE, false)
> > > 
> > 
> > Interesting, but I'm wondering if this also does it for every wakeup? If
> > you have 1000 tasks waking up on another CPU, this could potentially
> > send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> > waking up, and perhaps more than that, as there could be multiple
> > instances that try to wake up the same task.
> 
> Not using the TTWU_QUEUE feature limits the IPIs to a single one,
> which is only sent if the newly woken task preempts the current task
> on the remote cpu and the NEED_RESCHED flag was not yet set.
>  
> With TTWU_QUEUE you can induce massive latencies just by starting
> hackbench. You get a herd wakeup on CPU0 which then enqueues hundreds
> of tasks to the remote pull list and sends IPIs. The remote CPUs pull
> the tasks and activate them on their runqueues in hard interrupt
> context. That can easily accumulate to hundreds of microseconds when
> you do a mass push of newly woken tasks.
> 
> Of course it avoids fiddling with the remote rq lock, but it becomes
> massivly non deterministic.

Agreed. I never suggested to use TTWU_QUEUE. I was just stating the
difference between that and my patches.

> 
> > Now this patch set, the # of IPIs is limited to the # of CPUs. If you
> > have 4 CPUs, you'll get a storm of 3 IPIs. That's a big difference.
> 
> Yeah, the big difference is that you offload the double lock to the
> IPI. So in the worst case you interrupt the most latency sensitive
> task running on the remote CPU. Not sure if I really like that
> "feature".
>  

First, the pulled CPU isn't necessarily running the most latency
sensitive task. It just happens to be running more than one RT task, and
the waiting RT task can migrate. The running task may be of the same
priority as the waiting task. And they both may be the lowest priority
RT tasks in the system, and a CPU just went idle.

Currently, what we have is huge contention on the pulled CPU's rq
lock. We've measured over 500us latencies due to it. This hurts even the
CPU that has the overloaded task, as the contention is on its lock.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V3 1/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Xishi Qiu
On 2012/12/11 20:42, Wanpeng Li wrote:

> On Tue, Dec 11, 2012 at 08:18:27PM +0800, Xishi Qiu wrote:
>> 1) move poisoned page check at the beginning of the function.
>> 2) add page_lock to avoid unpoison clear the flag.
>>
>> Signed-off-by: Xishi Qiu 
>> Signed-off-by: Jiang Liu 
>> ---
>> mm/memory-failure.c |   43 ++++++++++++++++++++++---------------------
>> 1 files changed, 22 insertions(+), 21 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 8b20278..9b74983 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1419,18 +1419,17 @@ static int soft_offline_huge_page(struct page *page, 
>> int flags)
>>  unsigned long pfn = page_to_pfn(page);
>>  struct page *hpage = compound_head(page);
>>
>> +if (PageHWPoison(hpage)) {
>> +pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
>> +return -EBUSY;
>> +}
>> +
>>  ret = get_any_page(page, pfn, flags);
>>  if (ret < 0)
>>  return ret;
>>  if (ret == 0)
>>  goto done;
>>
>> -if (PageHWPoison(hpage)) {
>> -put_page(hpage);
>> -pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
>> -return -EBUSY;
>> -}
>> -
>>  /* Keep page count to indicate a given hugepage is isolated. */
>>  ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false,
>>  MIGRATE_SYNC);
>> @@ -1441,12 +1440,14 @@ static int soft_offline_huge_page(struct page *page, 
>> int flags)
>>  return ret;
>>  }
>> done:
>> -if (!PageHWPoison(hpage))
>> -atomic_long_add(1 << compound_trans_order(hpage),
>> -&mce_bad_pages);
>> +/* keep elevated page count for bad page */
>> +lock_page(hpage);
>> +atomic_long_add(1 << compound_trans_order(hpage), &mce_bad_pages);
>>  set_page_hwpoison_huge_page(hpage);
>> +unlock_page(hpage);
>> +
>>  dequeue_hwpoisoned_huge_page(hpage);
>> -/* keep elevated page count for bad page */
>> +
>>  return ret;
>> }
>>
>> @@ -1488,6 +1489,11 @@ int soft_offline_page(struct page *page, int flags)
>>  }
>>  }
>>
>> +if (PageHWPoison(page)) {
>> +pr_info("soft offline: %#lx page already poisoned\n", pfn);
>> +return -EBUSY;
>> +}
>> +
>>  ret = get_any_page(page, pfn, flags);
>>  if (ret < 0)
>>  return ret;
>> @@ -1519,19 +1525,11 @@ int soft_offline_page(struct page *page, int flags)
>>  return -EIO;
>>  }
>>
>> -lock_page(page);
>> -wait_on_page_writeback(page);
>> -
>>  /*
>>   * Synchronized using the page lock with memory_failure()
>>   */
>> -if (PageHWPoison(page)) {
>> -unlock_page(page);
>> -put_page(page);
>> -pr_info("soft offline: %#lx page already poisoned\n", pfn);
>> -return -EBUSY;
>> -}
>> -
>> +lock_page(page);
>> +wait_on_page_writeback(page);
>>  /*
>>   * Try to invalidate first. This should work for
>>   * non dirty unmapped page cache pages.
>> @@ -1582,8 +1580,11 @@ int soft_offline_page(struct page *page, int flags)
>>  return ret;
>>
>> done:
>> +/* keep elevated page count for bad page */
>> +lock_page(page);
>>  atomic_long_add(1, &mce_bad_pages);
>>  SetPageHWPoison(page);
>> -/* keep elevated page count for bad page */
>> +unlock_page(page);
>> +
> 
> Hi Xishi,
> 
> Why add lock_page here, the comment in function unpoison_memory tell us
> we don't need it.
> 

Hi Wanpeng,

Since unpoison is only for debugging, it's unnecessary to add the lock
here, right?

> /*
>  * This test is racy because PG_hwpoison is set outside of page lock.
>  * That's acceptable because that won't trigger kernel panic. Instead,
>  * the PG_hwpoison page will be caught and isolated on the entrance to
>  * the free buddy page pool.
>  */
> 
> Futhermore, Andrew didn't like a variable called "mce_bad_pages".
> 
> - Why do we have a variable called "mce_bad_pages"?  MCE is an x86
>   concept, and this code is in mm/.  Lights are flashing, bells are
>   ringing and a loudspeaker is blaring "layering violation" at us!
>

Just change the name?

Thanks
Xishi Qiu

> Regards,
> Wanpeng Li 
> 
>>  return ret;
>> }
>> -- 
>> 1.7.1
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majord...@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: em...@kvack.org
> 
> 
> .
> 



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] EDAC fixes for 3.8

2012-12-11 Thread Borislav Petkov
Hi Linus,

please pull the 'for-linus' branch below to receive the following
changes to EDAC:

* EDAC core error path fix, from Denis Kirjanov.

* Generalization of AMD MCE bank names and some minor error reporting
improvements.

* EDAC core cleanups and simplifications, from Wei Yongjun.

* amd64_edac fixes for sysfs-reported values, from Josh Hunt.

* some heavy amd64_edac error reporting path shaving, leading to
removing a bunch of code.

* amd64_edac error injection method improvements.

* EDAC core cleanups and fixes.

Thanks.

The following changes since commit 9489e9dcae718d5fde988e4a684a0f55b5f94d17:

  Linux 3.7-rc7 (2012-11-25 17:59:19 -0800)

are available in the git repository at:

  ssh://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git for-linus

for you to fetch changes up to 3bfe5aae8edd8436d26cddfeab783492d8950821:

  EDAC, pci_sysfs: Use for_each_pci_dev to simplify the code (2012-12-04 
08:27:39 +0100)


Borislav Petkov (19):
  EDAC: Respect operational state in edac_pci.c
  EDAC: Boundary-check edac_debug_level
  EDAC: Remove useless assignment of error type
  EDAC: Handle empty msg strings when reporting errors
  amd64_edac: Small fixlets and cleanups
  amd64_edac: Cleanup error injection code
  amd64_edac: Improve error injection
  amd64_edac: Do not check whether error address is valid
  amd64_edac: Reorganize error reporting path
  amd64_edac: Fix K8 chip select reporting
  amd64_edac: Use DBAM_DIMM macro
  amd64_edac: Fix csrows size and pages computation
  EDAC: Add memory controller flags
  EDAC: Pass mci parent
  EDAC: Fix csrow size reported in sysfs
  MCE, AMD: Remove functional unit references
  MCE, AMD: Dump CPU f/m/s triple with the error
  MCE, AMD: Report decoded error type first
  MCE, AMD: Dump error status

Denis Kirjanov (1):
  EDAC: Handle error path in edac_mc_sysfs_init() properly

Josh Hunt (1):
  EDAC: Fix mc size reported in sysfs

Wei Yongjun (3):
  EDAC, Calxeda highbank: Convert to use simple_open()
  EDAC: Convert to use simple_open()
  EDAC, pci_sysfs: Use for_each_pci_dev to simplify the code

 drivers/edac/Kconfig|   8 +-
 drivers/edac/amd64_edac.c   | 297 +++-
 drivers/edac/amd64_edac.h   |  59 ++--
 drivers/edac/amd64_edac_inj.c   | 128 -
 drivers/edac/edac_mc.c  |  51 +++
 drivers/edac/edac_mc_sysfs.c|  39 --
 drivers/edac/edac_module.c  |  27 +++-
 drivers/edac/edac_pci.c |   3 +-
 drivers/edac/edac_pci_sysfs.c   |  12 +-
 drivers/edac/highbank_mc_edac.c |   8 +-
 drivers/edac/mce_amd.c  | 254 ++
 drivers/edac/mce_amd.h  |  11 +-
 include/linux/edac.h|   3 +
 13 files changed, 454 insertions(+), 446 deletions(-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/6] nfsd: swap fs root in NFSd kthreads

2012-12-11 Thread Stanislav Kinsbursky

11.12.2012 00:28, J. Bruce Fields wrote:

On Thu, Dec 06, 2012 at 06:34:47PM +0300, Stanislav Kinsbursky wrote:

NFSd does lookups. Lookups are done starting from current->fs->root.
NFSd is a kthread, cloned by kthreadd, and thus has a global (but luckily
unshared) root.
So we have to swap its root to that of the process which started NFSd,
because that process can be in a container with its own root.


This doesn't sound right to me.

Which lookups exactly do you see being done relative to
current->fs->root ?



Ok, you are right. I was mistaken here.
This is not exactly a lookup, but a d_path() problem in svc_export_request().
I.e. without root swapping, d_path() will give not the local export path
(like "/export")
but something like "/root/containers_root/export".


--b.



Signed-off-by: Stanislav Kinsbursky 
---
  fs/nfsd/netns.h  |1 +
  fs/nfsd/nfssvc.c |   33 ++++++++++++++++++++++++++++++++-
  2 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index abfc97c..5c423c6 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -101,6 +101,7 @@ struct nfsd_net {
struct timeval nfssvc_boot;

struct svc_serv *nfsd_serv;
+   struct path root;
  };

  extern int nfsd_net_id;
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index cee62ab..177bb60 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -392,6 +392,7 @@ int nfsd_create_serv(struct net *net)

set_max_drc();
do_gettimeofday(&nn->nfssvc_boot);   /* record boot time */
+   get_fs_root(current->fs, &nn->root);
return 0;
  }

@@ -426,8 +427,10 @@ void nfsd_destroy(struct net *net)
if (destroy)
svc_shutdown_net(nn->nfsd_serv, net);
svc_destroy(nn->nfsd_serv);
-   if (destroy)
+   if (destroy) {
+   path_put(&nn->root);
nn->nfsd_serv = NULL;
+   }
  }

  int nfsd_set_nrthreads(int n, int *nthreads, struct net *net)
@@ -533,6 +536,25 @@ out:
return error;
  }

+/*
+ * This function is actually slightly modified set_fs_root()
+ */
+static void nfsd_swap_root(struct net *net)
+{
+   struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+   struct fs_struct *fs = current->fs;
+   struct path old_root;
+
+   path_get(&nn->root);
+   spin_lock(&fs->lock);
+   write_seqcount_begin(&fs->seq);
+   old_root = fs->root;
+   fs->root = nn->root;
+   write_seqcount_end(&fs->seq);
+   spin_unlock(&fs->lock);
+   if (old_root.dentry)
+   path_put(&old_root);
+}

  /*
   * This is the NFS server kernel thread
@@ -559,6 +581,15 @@ nfsd(void *vrqstp)
current->fs->umask = 0;

/*
+* We have to swap NFSd kthread's fs->root.
+* Why so? Because NFSd can be started in container, which has it's own
+* root.
+* And so what? NFSd lookup files, and lookup start from
+* current->fs->root.
+*/
+   nfsd_swap_root(net);
+
+   /*
 * thread is spawned with all signals set to SIG_IGN, re-enable
 * the ones that will bring down the thread
 */




--
Best regards,
Stanislav Kinsbursky
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] pinctrl/nomadik: adopt pinctrl sleep mode management

2012-12-11 Thread Linus Walleij
From: Julien Delacou 

This fix makes pinctrl-nomadik able to handle suspend/resume
events and change the states of hogged pins accordingly.

Signed-off-by: Julien Delacou 
Signed-off-by: Linus Walleij 
---
 drivers/pinctrl/pinctrl-nomadik.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/drivers/pinctrl/pinctrl-nomadik.c 
b/drivers/pinctrl/pinctrl-nomadik.c
index 8ef3e85..1068faa 100644
--- a/drivers/pinctrl/pinctrl-nomadik.c
+++ b/drivers/pinctrl/pinctrl-nomadik.c
@@ -1847,6 +1847,28 @@ static const struct of_device_id nmk_pinctrl_match[] = {
{},
 };
 
+static int nmk_pinctrl_suspend(struct platform_device *pdev, pm_message_t 
state)
+{
+   struct nmk_pinctrl *npct;
+
+   npct = platform_get_drvdata(pdev);
+   if (!npct)
+   return -EINVAL;
+
+   return pinctrl_force_sleep(npct->pctl);
+}
+
+static int nmk_pinctrl_resume(struct platform_device *pdev)
+{
+   struct nmk_pinctrl *npct;
+
+   npct = platform_get_drvdata(pdev);
+   if (!npct)
+   return -EINVAL;
+
+   return pinctrl_force_default(npct->pctl);
+}
+
 static int __devinit nmk_pinctrl_probe(struct platform_device *pdev)
 {
const struct platform_device_id *platid = platform_get_device_id(pdev);
@@ -1955,6 +1977,10 @@ static struct platform_driver nmk_pinctrl_driver = {
},
.probe = nmk_pinctrl_probe,
.id_table = nmk_pinctrl_id,
+#ifdef CONFIG_PM
+   .suspend = nmk_pinctrl_suspend,
+   .resume = nmk_pinctrl_resume,
+#endif
 };
 
 static int __init nmk_gpio_init(void)
-- 
1.7.11.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] pinctrl: add sleep mode management for hogs

2012-12-11 Thread Linus Walleij
From: Julien Delacou 

This fix allows handling sleep mode for hogged
pins in pinctrl. It provides functions that individual pinctrl
drivers can use to set pins to their sleep/default configurations
according to their current state.

Signed-off-by: Julien Delacou 
Signed-off-by: Linus Walleij 
---
 drivers/pinctrl/core.c          | 35 ++++++++++++++++++++++++++++++++---
 drivers/pinctrl/core.h          |  4 ++++
 include/linux/pinctrl/pinctrl.h |  2 ++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index 5cdee86..f9c50bb 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -1055,6 +1055,28 @@ void pinctrl_unregister_map(struct pinctrl_map const 
*map)
}
 }
 
+/**
+ * pinctrl_force_sleep() - turn a given controller device into sleep state
+ * @pctldev: pin controller device
+ */
+int pinctrl_force_sleep(struct pinctrl_dev *pctldev)
+{
+   if (!IS_ERR(pctldev->p) && !IS_ERR(pctldev->hog_sleep))
+   return pinctrl_select_state(pctldev->p, pctldev->hog_sleep);
+   return 0;
+}
+
+/**
+ * pinctrl_force_default() - turn a given controller device into default state
+ * @pctldev: pin controller device
+ */
+int pinctrl_force_default(struct pinctrl_dev *pctldev)
+{
+   if (!IS_ERR(pctldev->p) && !IS_ERR(pctldev->hog_default))
+   return pinctrl_select_state(pctldev->p, pctldev->hog_default);
+   return 0;
+}
+
 #ifdef CONFIG_DEBUG_FS
 
 static int pinctrl_pins_show(struct seq_file *s, void *what)
@@ -1500,16 +1522,23 @@ struct pinctrl_dev *pinctrl_register(struct 
pinctrl_desc *pctldesc,
 
pctldev->p = pinctrl_get_locked(pctldev->dev);
if (!IS_ERR(pctldev->p)) {
-   struct pinctrl_state *s =
+   pctldev->hog_default =
pinctrl_lookup_state_locked(pctldev->p,
PINCTRL_STATE_DEFAULT);
-   if (IS_ERR(s)) {
+   if (IS_ERR(pctldev->hog_default)) {
dev_dbg(dev, "failed to lookup the default state\n");
} else {
-   if (pinctrl_select_state_locked(pctldev->p, s))
+   if (pinctrl_select_state_locked(pctldev->p,
+   pctldev->hog_default))
dev_err(dev,
"failed to select default state\n");
}
+
+   pctldev->hog_sleep =
+   pinctrl_lookup_state_locked(pctldev->p,
+   PINCTRL_STATE_SLEEP);
+   if (IS_ERR(pctldev->hog_sleep))
+   dev_dbg(dev, "failed to lookup the sleep state\n");
}
 
mutex_unlock(&pinctrl_mutex);
diff --git a/drivers/pinctrl/core.h b/drivers/pinctrl/core.h
index 12f5694..b5bbb09 100644
--- a/drivers/pinctrl/core.h
+++ b/drivers/pinctrl/core.h
@@ -30,6 +30,8 @@ struct pinctrl_gpio_range;
  * @driver_data: driver data for drivers registering to the pin controller
  * subsystem
  * @p: result of pinctrl_get() for this device
+ * @hog_default: default state for pins hogged by this device
+ * @hog_sleep: sleep state for pins hogged by this device
  * @device_root: debugfs root for this device
  */
 struct pinctrl_dev {
@@ -41,6 +43,8 @@ struct pinctrl_dev {
struct module *owner;
void *driver_data;
struct pinctrl *p;
+   struct pinctrl_state *hog_default;
+   struct pinctrl_state *hog_sleep;
 #ifdef CONFIG_DEBUG_FS
struct dentry *device_root;
 #endif
diff --git a/include/linux/pinctrl/pinctrl.h b/include/linux/pinctrl/pinctrl.h
index 04d6700..814cb56 100644
--- a/include/linux/pinctrl/pinctrl.h
+++ b/include/linux/pinctrl/pinctrl.h
@@ -155,6 +155,8 @@ struct pinctrl_dev *of_pinctrl_get(struct device_node *np)
 
 extern const char *pinctrl_dev_get_name(struct pinctrl_dev *pctldev);
 extern void *pinctrl_dev_get_drvdata(struct pinctrl_dev *pctldev);
+extern int pinctrl_force_sleep(struct pinctrl_dev *pctldev);
+extern int pinctrl_force_default(struct pinctrl_dev *pctldev);
 #else
 
 struct pinctrl_dev;
-- 
1.7.11.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] vfs: remove unneeded permission check from path_init

2012-12-11 Thread Jeff Layton
When path_init is called with a valid dfd, that code checks permissions
on the open directory fd and returns an error if the check fails. This
permission check is redundant, however.

Both callers of path_init immediately call link_path_walk afterward. The
first thing that link_path_walk does is to check for exec permissions
at the starting point of the path walk.

In most cases, these checks are very quick, but when the dfd is for a
file on an NFS mount with actimeo=0, each permission check goes
out onto the wire. The result is 2 identical ACCESS calls.

Given that these codepaths are fairly "hot", I think it makes sense to
eliminate the permission check in path_init and simply assume that the
caller will eventually check the permissions before proceeding.

Reported-by: Dave Wysochanski 
Signed-off-by: Jeff Layton 
---
 fs/namei.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 40d864a..deefbc3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1894,6 +1894,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
get_fs_pwd(current->fs, &nd->path);
}
} else {
+   /* Caller must check execute permissions on the starting path component */
struct fd f = fdget_raw(dfd);
struct dentry *dentry;
 
@@ -1907,12 +1908,6 @@ static int path_init(int dfd, const char *name, unsigned int flags,
fdput(f);
return -ENOTDIR;
}
-
-   retval = inode_permission(dentry->d_inode, MAY_EXEC);
-   if (retval) {
-   fdput(f);
-   return retval;
-   }
}
 
nd->path = f.file->f_path;
-- 
1.7.11.7



Re: [PATCH 3/3 v2] iio: add rtc-driver for HID sensors of type time

2012-12-11 Thread Jonathan Cameron

On 11/12/12 12:39, Alexander Holler wrote:

On 11.12.2012 10:40, Lars-Peter Clausen wrote:


Yes, move the header or merge into existing one as makes sense.
I'm not pulling this driver into the IIO tree (unless for some
reason Alessandro wants me to and I can't think why he would...).



Alessandro has been pretty quiet for quite some time now. Luckily Andrew
Morton usually picks up the stuff for orphaned subsystems. So put him
on Cc
for v4.


Will do it. Thanks a lot for your review.

I will post the whole series (4 patches including the merge of
hid-sensor-attributes.h) again, when I've finished v3 of the driver
(hopefully this evening), marking some patches as RESEND. So 3 out of
those 4 patches will be for iio (as hid-sensor-hub is part of it), and
the last one, the rtc driver itself, will be for the rtc subsystem. I
don't know if they have to be pulled by different maintainers. ;)

Regards,

Alexander

We'll see what Andrew says.  We can take the lot through IIO
(doubt Greg will mind) as long as we have the correct acks to do so
(or at least statements of not caring ;)



Re: [PATCH] regulator: vexpress: Add missing n_voltages setting

2012-12-11 Thread Pawel Moll
On Tue, 2012-12-11 at 08:04 +, Axel Lin wrote:
> Otherwise regulator_can_change_voltage() return 0 for this driver.
> 
> Signed-off-by: Axel Lin 

We've been here before, haven't we? ;-) So I'll just repeat myself -
this regulator does _not_ have operating points. What I believe should
be fixed is the mentioned function itself, something like the patch
below (untested)...
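
For context, a consumer would typically gate voltage changes on this
helper, roughly like so (illustrative, not taken from a real driver):

	struct regulator *reg = regulator_get(dev, "vcc");

	if (!IS_ERR(reg) && regulator_can_change_voltage(reg))
		ret = regulator_set_voltage(reg, min_uV, max_uV);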

Pawel

8<
From 1cafb644747c276a6c601096b8dc0972d10daac7 Mon Sep 17 00:00:00 2001
From: Pawel Moll 
Date: Tue, 11 Dec 2012 13:44:07 +
Subject: [PATCH] regulator: core: Fix regulator_can_change_voltage() for
 continuous case

Function regulator_can_change_voltage() didn't take regulators with a
continuous voltage range into consideration. Fixed now.

Signed-off-by: Pawel Moll 
---
 drivers/regulator/core.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index cd1b201..92768c3 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -1886,7 +1886,8 @@ int regulator_can_change_voltage(struct regulator *regulator)
 
if (rdev->constraints &&
rdev->constraints->valid_ops_mask & REGULATOR_CHANGE_VOLTAGE &&
-   rdev->desc->n_voltages > 1)
+   (rdev->desc->n_voltages > 1 ||
+rdev->desc->continuous_voltage_range))
return 1;
 
return 0;
-- 
1.7.10.4





Re: [PATCH v3 2/4] videobuf2-dma-streaming: new videobuf2 memory allocator

2012-12-11 Thread Federico Vaga
Sorry for the late answer to this.

> > This allocator is needed because some device (like STA2X11 VIP) cannot
> > work
> > with DMA sg or DMA coherent. Some other device (like the one used by
> > Jonathan when he proposes vb2-dma-nc allocator) can obtain much better
> > performance with DMA streaming than coherent.
> 
> Ok, please add such explanations at the patch's descriptions, as it is
> important not only for me, but to others that may need to use it..

OK

> >>2) why vb2-dma-config can't be patched to use dma_map_single
> >> 
> >> (eventually using a different vb2_io_modes bit?);
> > 
> > I did not modify vb2-dma-contig because I was thinking that each DMA
> > memory
> > allocator should reflect a DMA API.
> 
> The basic reason for having more than one VB low-level handling (vb2 was
> inspired on this concept) is that some DMA APIs are very different than
> the other ones (see vmalloc x DMA S/G for example).
> 
> I didn't make a diff between videobuf2-dma-streaming and
> videobuf2-dma-contig, so I can't tell if it makes sense to merge them or
> not, but the above argument seems too weak. I was expecting for a technical
> reason why it wouldn't make sense for merging them.

I cannot work on this now. But I think that I can do an integration like the
one that I pushed some months ago (a8f3c203e19b702fa5e8e83a9b6fb3c5a6d1cce4).
Wind River made those changes to videobuf-contig and I tested, fixed and
pushed them.

> >>3) what are the usecases for it.
> >> 
> >> Could you please detail it? Without that, one that would be needing to
> >> write a driver will have serious doubts about what would be the right
> >> driver for its usage. Also, please document it at the driver itself.

I don't have a full understanding of the board, so I don't know exactly why
dma_alloc_coherent does not work. I focused my development on previous work by
Wind River. I asked Wind River (which did all the work on this board) for a
technical explanation of why coherent doesn't work, but they do not know.
That's why I made the new allocator: coherent doesn't work and the HW doesn't
support SG.
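
For the record, the streaming pattern the new allocator relies on boils
down to something like this (illustrative sketch of the DMA API usage,
not the actual vb2 code):

	/* plain kernel memory, mapped for the device per buffer */
	buf->vaddr = kzalloc(size, GFP_KERNEL);
	buf->dma_addr = dma_map_single(dev, buf->vaddr, size,
				       DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, buf->dma_addr))
		goto err_free;

	/* before the CPU reads a captured frame: */
	dma_sync_single_for_cpu(dev, buf->dma_addr, size, DMA_FROM_DEVICE);

This avoids the uncached mappings that dma_alloc_coherent() implies on
systems without hardware coherency.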

> I'm not a DMA performance expert. As such, from that comment, it sounded to
> me that replacing dma-config/dma-sg by dma streaming will always give
> "performance optimizations the hardware allow".

Me too, I'm not a DMA performance expert; I'm just a user of the DMA API. On
my hardware it simply works only with that interface; it is not a performance
problem.

> On a separate but related issue, while doing DMABUF tests with an Exynos4
> hardware, using a s5p sensor, sending data to s5p-tv, I noticed a CPU
> consumption of about 42%, which seems too high. Could it be related to
> not using the DMA streaming API?

As I wrote above, I'm not a DMA performance expert, so I'll skip this one.

-- 
Federico Vaga


Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Tejun Heo
Hello, Srivatsa.

On Tue, Dec 11, 2012 at 06:43:54PM +0530, Srivatsa S. Bhat wrote:
> This approach (of using synchronize_sched()) also looks good. It is simple,
> yet effective, but unfortunately inefficient at the writer side (because
> he'll have to wait for a full synchronize_sched()).

While synchronize_sched() is heavier on the writer side than the
originally posted version, it doesn't stall the whole machine and
wouldn't introduce latencies to others.  Shouldn't that be enough?
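
Something like this on the writer side, I mean (sketch only):

	writer_pending = true;
	/*
	 * After this returns, every preempt-disabled section that
	 * started earlier has finished, and all new readers see
	 * writer_pending == true.
	 */
	synchronize_sched();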

Thanks.

-- 
tejun


Re: [PATCH V3 1/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Andi Kleen
On Tue, Dec 11, 2012 at 08:18:27PM +0800, Xishi Qiu wrote:
> 1) move poisoned page check at the beginning of the function.
> 2) add page_lock to avoid unpoison clear the flag.

That doesn't make sense, obviously you would need to recheck
inside the lock again to really protect against unpoison.
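
IOW you would need the usual pattern, roughly (illustrative only, not
the exact mm code):

	lock_page(p);
	if (!PageHWPoison(p)) {
		/* lost the race against unpoison, bail out */
		unlock_page(p);
		return 0;
	}
	/* ... adjust the counters under the page lock ... */
	unlock_page(p);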

But unpoison is only for debugging anyways, so it doesn't matter
if the count is 100% correct.

-Andi


[PATCH/RESEND v3 2/2] row: Add support for urgent request handling

2012-12-11 Thread Tanya Brokhman
This patch adds support for handling urgent requests.
A ROW queue can be marked as "urgent". If an urgent queue was
un-served in a previous dispatch cycle and a request is added
to it, this triggers issuing an urgent-request notification to
the device driver.

Signed-off-by: Tatyana Brokhman 

diff --git a/block/row-iosched.c b/block/row-iosched.c
index b3204d6..41cc028 100644
--- a/block/row-iosched.c
+++ b/block/row-iosched.c
@@ -58,6 +58,17 @@ static const bool queue_idling_enabled[] = {
false,  /* ROWQ_PRIO_LOW_SWRITE */
 };
 
+/* Flags indicating whether the queue can notify on urgent requests */
+static const bool urgent_queues[] = {
+   true,   /* ROWQ_PRIO_HIGH_READ */
+   true,   /* ROWQ_PRIO_REG_READ */
+   false,  /* ROWQ_PRIO_HIGH_SWRITE */
+   false,  /* ROWQ_PRIO_REG_SWRITE */
+   false,  /* ROWQ_PRIO_REG_WRITE */
+   false,  /* ROWQ_PRIO_LOW_READ */
+   false,  /* ROWQ_PRIO_LOW_SWRITE */
+};
+
 /* Default values for row queues quantums in each dispatch cycle */
 static const int queue_quantum[] = {
100,/* ROWQ_PRIO_HIGH_READ */
@@ -271,7 +282,13 @@ static void row_add_request(struct request_queue *q,
 
rqueue->idle_data.last_insert_time = ktime_get();
}
-   row_log_rowq(rd, rqueue->prio, "added request");
+   if (urgent_queues[rqueue->prio] &&
+   row_rowq_unserved(rd, rqueue->prio)) {
+   row_log_rowq(rd, rqueue->prio,
+"added urgent req curr_queue = %d",
+rd->curr_queue);
+   } else
+   row_log_rowq(rd, rqueue->prio, "added request");
 }
 
 /**
@@ -306,6 +323,29 @@ static int row_reinsert_req(struct request_queue *q,
return 0;
 }
 
+/*
+ * row_urgent_pending() - Return TRUE if there is an urgent
+ *   request on scheduler
+ * @q: requests queue
+ *
+ */
+static bool row_urgent_pending(struct request_queue *q)
+{
+   struct row_data *rd = q->elevator->elevator_data;
+   int i;
+
+   for (i = 0; i < ROWQ_MAX_PRIO; i++)
+   if (urgent_queues[i] && row_rowq_unserved(rd, i) &&
+   !list_empty(&rd->row_queues[i].rqueue.fifo)) {
+   row_log_rowq(rd, i,
+"Urgent request pending (curr=%i)",
+rd->curr_queue);
+   return true;
+   }
+
+   return false;
+}
+
 /**
  * row_remove_request() -  Remove given request from scheduler
  * @q: requests queue
@@ -697,6 +737,7 @@ static struct elevator_type iosched_row = {
.elevator_dispatch_fn   = row_dispatch_requests,
.elevator_add_req_fn= row_add_request,
.elevator_reinsert_req_fn   = row_reinsert_req,
+   .elevator_is_urgent_fn  = row_urgent_pending,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_set_req_fn= row_set_request,
-- 
1.7.6
--
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. 
Is a member of Code Aurora Forum, hosted by the Linux Foundation


[PATCH v3 1/2] row: Adding support for reinsert already dispatched req

2012-12-11 Thread Tanya Brokhman
Add support for reinserting an already dispatched request back to the
scheduler's internal data structures.
The request will be reinserted at the head of the queue it was
dispatched from, as if it was never dispatched.

Signed-off-by: Tatyana Brokhman 
---
v3: Update error handling when row queue is not set for a request  
v2: Nothing changed. Resend
v1: Initial version

diff --git a/block/row-iosched.c b/block/row-iosched.c
index 1f50180..b3204d6 100644
--- a/block/row-iosched.c
+++ b/block/row-iosched.c
@@ -274,7 +274,39 @@ static void row_add_request(struct request_queue *q,
row_log_rowq(rd, rqueue->prio, "added request");
 }
 
-/*
+/**
+ * row_reinsert_req() - Reinsert request back to the scheduler
+ * @q: requests queue
+ * @rq:request to add
+ *
+ * Reinsert the given request back to the queue it was
+ * dispatched from as if it was never dispatched.
+ *
+ * Returns 0 on success, error code otherwise
+ */
+static int row_reinsert_req(struct request_queue *q,
+   struct request *rq)
+{
+   struct row_data*rd = q->elevator->elevator_data;
+   struct row_queue   *rqueue = RQ_ROWQ(rq);
+
+   /* Verify rqueue is legitimate */
+   if (rqueue->prio >= ROWQ_MAX_PRIO) {
+   pr_err("\n\nROW BUG: row_reinsert_req() rqueue->prio = %d\n",
+  rqueue->prio);
+   blk_dump_rq_flags(rq, "");
+   return -EIO;
+   }
+
+   list_add(&rq->queuelist, &rqueue->fifo);
+   rd->nr_reqs[rq_data_dir(rq)]++;
+
+   row_log_rowq(rd, rqueue->prio, "request reinserted");
+
+   return 0;
+}
+
+/**
  * row_remove_request() -  Remove given request from scheduler
  * @q: requests queue
  * @rq:request to remove
@@ -664,6 +696,7 @@ static struct elevator_type iosched_row = {
.elevator_merge_req_fn  = row_merged_requests,
.elevator_dispatch_fn   = row_dispatch_requests,
.elevator_add_req_fn= row_add_request,
+   .elevator_reinsert_req_fn   = row_reinsert_req,
.elevator_former_req_fn = elv_rb_former_request,
.elevator_latter_req_fn = elv_rb_latter_request,
.elevator_set_req_fn= row_set_request,
-- 
1.7.6
--
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. 
Is a member of Code Aurora Forum, hosted by the Linux Foundation


[PATCH v4 1/2] block: Adding ROW scheduling algorithm

2012-12-11 Thread Tanya Brokhman
This patch adds the implementation of a new scheduling algorithm - ROW.
The policy of this algorithm is to prioritize READ requests over WRITE
as much as possible without starving the WRITE requests.
The requests are kept in queues according to their priority. The dispatch
is done in a Round Robin manner with a different slice for each queue.
READ request queues get a bigger dispatch quantum than the WRITE queues.

Signed-off-by: Tatyana Brokhman 
---
v4: 
  - Fix bug in idling mechanism:
-- Fix the delay passed to schedule_delayed_work
-- Change the idle delay and idling-trigger frequency to be HZ dependent
  - Destroy idle_workqueue on queue exit
  - Update the documentation
v3: 
  - Fix bug in forced dispatch: Don't idle if forced dispatch
v2: 
  No changes. Resend.
v1: Initial version

diff --git a/Documentation/block/row-iosched.txt b/Documentation/block/row-iosched.txt
new file mode 100644
index 000..0d794ee
--- /dev/null
+++ b/Documentation/block/row-iosched.txt
@@ -0,0 +1,134 @@
+Introduction
+============
+
+The ROW scheduling algorithm will be used in mobile devices as the default
+block layer IO scheduling algorithm. ROW stands for "READ Over WRITE",
+which is the main request dispatch policy of this algorithm.
+
+The ROW IO scheduler was developed with the needs of mobile devices in
+mind. In mobile devices we favor user experience above everything else,
+thus we want to give READ IO requests as much priority as possible.
+The main idea of the ROW scheduling policy is just that:
+- If there are READ requests in the pipe, dispatch them, while keeping
+write starvation in check.
+
+Software description
+====================
+The elevator defines a registration mechanism for different IO schedulers
+to implement. This makes implementing a new algorithm quite
+straightforward and requires almost no changes to the block/elevator framework. A
+new IO scheduler just has to implement a set of callback functions
+defined by the elevator.
+These callbacks cover all the required IO operations such as
+adding/removing request to/from the scheduler, merging two requests,
+dispatching a request etc.
+
+Design
+==
+
+The requests are kept in queues according to their priority. The
+dispatching of requests is done in a Round Robin manner with a
+different slice for each queue. The dispatch quantum for a specific
+queue is set according to the queue's priority. READ queues are
+given bigger dispatch quantum than the WRITE queues, within a dispatch
+cycle.
+
+At the moment there are 6 types of queues the requests are
+distributed to:
+-  High priority READ queue
+-  High priority Synchronous WRITE queue
+-  Regular priority READ queue
+-  Regular priority Synchronous WRITE queue
+-  Regular priority WRITE queue
+-  Low priority READ queue
+
+The marking of request as high/low priority will be done by the
+application adding the request and not the scheduler. See TODO section.
+If the request is not marked in any way (high/low) the scheduler
+assigns it to one of the regular priority queues:
+read/write/sync write.
+
+If in a certain dispatch cycle one of the queues was empty and didn't
+use its quantum, that queue will be marked as "un-served". If we're in
+the middle of a dispatch cycle dispatching from queue Y, and a request
+arrives for queue X that was un-served in the previous cycle, and X's
+priority is higher than Y's, queue Y will be preempted in favor of
+queue X.
+
+For READ request queues the ROW IO scheduler allows idling within a
+dispatch quantum in order to give the application a chance to insert
+more requests. Idling means adding some extra time for serving a
+certain queue even if the queue is empty. The idling is enabled if
+the ROW IO scheduler identifies the application is inserting requests
+in a high frequency.
+Not all queues can idle. The ROW scheduler exposes an enablement struct
+for idling.
+For idling on READ queues, the ROW IO scheduler uses a timer mechanism.
+When the timer expires we schedule a delayed work that will signal the
+device driver to fetch another request for dispatch.
+
+The ROW scheduler will support additional services for block devices that
+support Urgent Requests. That is, the scheduler may inform the
+device driver upon urgent requests using a newly defined callback.
+In addition it will support rescheduling of requests that were
+interrupted. For example if the device driver issues a long write
+request and a sudden urgent request is received by the scheduler.
+The scheduler will inform the device driver about the urgent request,
+so the device driver can stop the current write request and serve the
+urgent request. In such a case the device driver may also insert back
+to the scheduler the remainder of the interrupted write request, such
+that the scheduler may continue sending urgent requests without the
+need to interrupt the ongoing write again and again. The write
+remainder will be sent later on according to the scheduler policy.
+
+SMP/multi-core

[PATCH/RESEND v4 2/2] block: compile ROW statically into the kernel

2012-12-11 Thread Tanya Brokhman
From: Tatyana Brokhman 

ROW is a new scheduling algorithm. Similar to the existing scheduling
algorithms, it should be compiled into the kernel statically, giving the user
the ability to switch to it without kernel recompilation.

Signed-off-by: Tatyana Brokhman 

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 5a747e2..401f42d 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -23,6 +23,7 @@ config IOSCHED_DEADLINE
 
 config IOSCHED_ROW
tristate "ROW I/O scheduler"
+   default y
---help---
  The ROW I/O scheduler gives priority to READ requests over the
  WRITE requests when dispatching, without starving WRITE requests.
-- 
1.7.6
--
QUALCOMM ISRAEL, on behalf of Qualcomm Innovation Center, Inc. 
Is a member of Code Aurora Forum, hosted by the Linux Foundation


[PATCH] drivers/pinctrl: grab default handles from device core

2012-12-11 Thread Linus Walleij
From: Linus Walleij 

This makes the device core auto-grab the pinctrl handle and set
the "default" (PINCTRL_STATE_DEFAULT) state for every device
that is present in the device model right before probe. This will
account for the lion's share of embedded silicon devices.

A modification of the semantics for pinctrl_get() is also done:
previously if the pinctrl handle for a certain device was already
taken, the pinctrl core would return an error. Now, since the
core may have already default-grabbed the handle and set its
state to "default", if the handle was already taken, this will
be disregarded and the located, previously instantiated handle
will be returned to the caller.

This way all code in drivers explicitly requesting their pinctrl
handlers will still be functional, and drivers that want to
explicitly retrieve and switch their handles can still do that.
But if the desired functionality is just boilerplate of this
type in the probe() function:

struct pinctrl  *p;

p = devm_pinctrl_get_select_default(&dev);
if (IS_ERR(p)) {
   if (PTR_ERR(p) == -EPROBE_DEFER)
return -EPROBE_DEFER;
dev_warn(&dev, "no pinctrl handle\n");
}

The discussion began with the addition of such boilerplate
to the omap4 keypad driver:
http://marc.info/?l=linux-input&m=135091157719300&w=2

A previous approach using notifiers was discussed:
http://marc.info/?l=linux-kernel&m=135263661110528&w=2
This failed because it could not handle deferred probes.

This patch alone does not solve the entire dilemma faced:
whether code should be distributed into the drivers or
if it should be centralized to e.g. a PM domain. But it
solves the immediate issue of the addition of boilerplate
to a lot of drivers that just want to grab the default
state. As mentioned, they can later explicitly retrieve
the handle and set different states, and this could as
well be done by e.g. PM domains as it is only related
to a certain struct device * pointer.
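
To be explicit, a driver that wants more than the default state can
still do something along these lines (illustrative):

	struct pinctrl *p = devm_pinctrl_get(dev);
	struct pinctrl_state *s = pinctrl_lookup_state(p, "sleep");

	if (!IS_ERR(p) && !IS_ERR(s))
		ret = pinctrl_select_state(p, s);

since the core only grabs and activates the "default" state on its
behalf.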

Cc: Felipe Balbi 
Cc: Benoit Cousson 
Cc: Dmitry Torokhov 
Cc: Thomas Petazzoni 
Cc: Mitch Bradley 
Cc: Mark Brown 
Cc: Ulf Hansson 
Cc: Rafael J. Wysocki 
Cc: Kevin Hilman 
Cc: Jean-Christophe PLAGNIOL-VILLARD 
Cc: Rickard Andersson 
Cc: Greg Kroah-Hartman 
Cc: Russell King 
Signed-off-by: Linus Walleij 
---
 Documentation/pinctrl.txt   | 24 +--
 drivers/base/Makefile   |  1 +
 drivers/base/dd.c   |  7 +
 drivers/base/pinctrl.c  | 66 +
 drivers/pinctrl/core.c  | 11 +--
 include/linux/device.h  |  7 +
 include/linux/pinctrl/devinfo.h | 45 
 7 files changed, 156 insertions(+), 5 deletions(-)
 create mode 100644 drivers/base/pinctrl.c
 create mode 100644 include/linux/pinctrl/devinfo.h

diff --git a/Documentation/pinctrl.txt b/Documentation/pinctrl.txt
index da40efb..68836e5 100644
--- a/Documentation/pinctrl.txt
+++ b/Documentation/pinctrl.txt
@@ -972,6 +972,18 @@ pinmux core.
 Pin control requests from drivers
 =
 
+When a device driver is about to probe, the device core will automatically
+attempt to issue pinctrl_get_select_default() on these devices.
+This way driver writers do not need to add any of the boilerplate code
+of the type found below. However when doing fine-grained state selection
+and not using the "default" state, you may have to do some device driver
+handling of the pinctrl handles and states.
+
+So if you just want to put the pins for a certain device into the default
+state and be done with it, there is nothing you need to do besides
+providing the proper mapping table. The device core will take care of
+the rest.
+
 Generally it is discouraged to let individual drivers get and enable pin
 control. So if possible, handle the pin control in platform code or some other
 place where you have access to all the affected struct device * pointers. In
@@ -1097,9 +1109,9 @@ situations that can be electrically unpleasant, you will certainly want to
 mux in and bias pins in a certain way before the GPIO subsystems starts to
 deal with them.
 
-The above can be hidden: using pinctrl hogs, the pin control driver may be
-setting up the config and muxing for the pins when it is probing,
-nevertheless orthogonal to the GPIO subsystem.
+The above can be hidden: using the device core, the pinctrl core may be
+setting up the config and muxing for the pins right before the device is
+probing, nevertheless orthogonal to the GPIO subsystem.
 
 But there are also situations where it makes sense for the GPIO subsystem
 to communicate directly with the pinctrl subsystem, using the latter
@@ -1144,6 +1156,12 @@ PIN_MAP_MUX_GROUP_HOG_DEFAULT("pinctrl-foo", NULL /* group */, "power_func")
 
 This gives the exact same result as the above construction.
 
+This should not be used for any kind of device which is represented in
+the device model, as the pinctrl core will attempt to do the equivalent of
+pinctrl_get_select_default()

Re: [PATCH v3 2/2] scripts/tags.sh: Support compiled source

2012-12-11 Thread Michal Marek
On 10.12.2012 16:11, Joonsoo Kim wrote:
> We usually have interst in compiled files only,
> because they are strongly related to individual's work.
> Current tags.sh can't select compiled files, so support it.
> 
> We can use this functionality like below.
> "make cscope O=. SRCARCH= COMPILED_SOURCE=compiled"
> 
> It must be executed after building the kernel.
> 
> Signed-off-by: Joonsoo Kim 
> ---
> v2: change bash specific '[[]]' to 'case in' statement.
> use COMPILED_SOURCE env var, instead of abusing SUBARCH
> v3: change [ "$COMPILED_SOURCE"="compiled" ] to [ -n $COMPILED_SOURCE" ]

Thanks, I applied both patches to kbuild.git#misc now.

Michal



Re: livelock in __writeback_inodes_wb ?

2012-12-11 Thread Jan Kara
On Tue 11-12-12 16:23:27, Wu Fengguang wrote:
> On Wed, Nov 28, 2012 at 09:55:15AM -0500, Dave Jones wrote:
> > We had a user report the soft lockup detector kicked after 22
> > seconds of no progress, with this trace..
> 
> Where is the original report? The reporter may help provide some clues
> on the workload that triggered the bug.
> 
> > :BUG: soft lockup - CPU#1 stuck for 22s! [flush-8:16:3137]
> > :Pid: 3137, comm: flush-8:16 Not tainted 3.6.7-4.fc17.x86_64 #1
> > :RIP: 0010:[]  [] __list_del_entry+0x2c/0xd0
> > :Call Trace:
> > : [] redirty_tail+0x5e/0x80
> > : [] __writeback_inodes_wb+0x72/0xd0
> > : [] wb_writeback+0x23b/0x2d0
> > : [] wb_do_writeback+0xac/0x1f0
> > : [] ? __internal_add_timer+0x130/0x130
> > : [] bdi_writeback_thread+0x8b/0x230
> > : [] ? wb_do_writeback+0x1f0/0x1f0
> > : [] kthread+0x93/0xa0
> > : [] kernel_thread_helper+0x4/0x10
> > : [] ? kthread_freezable_should_stop+0x70/0x70
> > : [] ? gs_change+0x13/0x13
> > 
> > Looking over the code, is it possible that something could be
> > dirtying pages faster than writeback can get them written out,
> > keeping us in this loop indefitely ?
> 
> The bug reporter should know best whether there are heavy IO.
> 
> However I suspect it's not directly caused by heavy IO: we will
> release &wb->list_lock before each __writeback_single_inode() call,
> which starts writeback IO for each inode.
  Umm, it's not about releasing wb->list_lock, I think. A softlockup will
trigger whenever we are looping in the kernel for more than a given timeout
(e.g. those 22 s) without sleeping.

> > Should there be something in this loop periodically poking
> > the watchdog perhaps ?
> 
> It seems we failed to release &wb->list_lock in wb_writeback() for
> long time (dozens of seconds). That is, the inode_sleep_on_writeback()
> is somehow not called. However it's not obvious to me how come this
> can happen..
  Maybe, progress is always non-zero but small and nr_pages is high (e.g.
when writeback is triggered by wakeup_flusher_threads()). What filesystem
is the guy using? I remember e.g. btrfs used to have always-dirty inodes
which could confuse us.

From the backtrace it is clear there's some superblock which has s_umount
locked and we cannot writeback inodes there. So if this superblock contains
most of the dirty pages we need to write and there's another superblock
with an always-dirty inode, we would livelock as observed... So my questions
would be: what filesystems are there in the system (/proc/mounts), what load
triggers this, and please trigger sysrq-w when the lockup happens.
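
(If no console is at hand, "echo w > /proc/sysrq-trigger" produces the
same dump, assuming sysrq is enabled.)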

Honza
-- 
Jan Kara 
SUSE Labs, CR


[PATCH 2/2] ASoC: aic: Support for AIC family DSPs

2012-12-11 Thread Mehar Bajwa
The AIC family of audio CODECs from TI features a programmable miniDSP for
performing signal processing operations. Due to the commonality of functions
across the CODECs, a common library will be used to provide support for them.

Signed-off-by: Mehar Bajwa 
---
 sound/soc/codecs/Kconfig   |5 +
 sound/soc/codecs/Makefile  |2 +
 sound/soc/codecs/aic3xxx_cfw.h |  529 +++
 sound/soc/codecs/aic3xxx_cfw_ops.c |  989 
 sound/soc/codecs/aic3xxx_cfw_ops.h |   98 
 5 files changed, 1623 insertions(+), 0 deletions(-)
 create mode 100644 sound/soc/codecs/aic3xxx_cfw.h
 create mode 100644 sound/soc/codecs/aic3xxx_cfw_ops.c
 create mode 100644 sound/soc/codecs/aic3xxx_cfw_ops.h

diff --git a/sound/soc/codecs/Kconfig b/sound/soc/codecs/Kconfig
index 73d8cea..52b3601 100644
--- a/sound/soc/codecs/Kconfig
+++ b/sound/soc/codecs/Kconfig
@@ -188,6 +188,11 @@ config SND_SOC_ADAV80X
 config SND_SOC_ADS117X
tristate
 
+config SND_SOC_AIC_CFW
+   tristate
+   default y if SND_SOC_TLV320AIC3262=y
+   default m if SND_SOC_TLV320AIC3262=m
+
 config SND_SOC_AK4104
tristate
 
diff --git a/sound/soc/codecs/Makefile b/sound/soc/codecs/Makefile
index a3f8f4f..ffda11b 100644
--- a/sound/soc/codecs/Makefile
+++ b/sound/soc/codecs/Makefile
@@ -9,6 +9,7 @@ snd-soc-adau1701-objs := adau1701.o
 snd-soc-adau1373-objs := adau1373.o
 snd-soc-adav80x-objs := adav80x.o
 snd-soc-ads117x-objs := ads117x.o
+snd-soc-aic3xxx-cfw-ops-objs := aic3xxx_cfw_ops.o
 snd-soc-ak4104-objs := ak4104.o
 snd-soc-ak4535-objs := ak4535.o
 snd-soc-ak4641-objs := ak4641.o
@@ -132,6 +133,7 @@ obj-$(CONFIG_SND_SOC_ADAU1373)  += snd-soc-adau1373.o
 obj-$(CONFIG_SND_SOC_ADAU1701)  += snd-soc-adau1701.o
 obj-$(CONFIG_SND_SOC_ADAV80X)  += snd-soc-adav80x.o
 obj-$(CONFIG_SND_SOC_ADS117X)  += snd-soc-ads117x.o
+obj-$(CONFIG_SND_SOC_AIC_CFW)  += snd-soc-aic3xxx-cfw-ops.o
 obj-$(CONFIG_SND_SOC_AK4104)   += snd-soc-ak4104.o
 obj-$(CONFIG_SND_SOC_AK4535)   += snd-soc-ak4535.o
 obj-$(CONFIG_SND_SOC_AK4641)   += snd-soc-ak4641.o
diff --git a/sound/soc/codecs/aic3xxx_cfw.h b/sound/soc/codecs/aic3xxx_cfw.h
new file mode 100644
index 000..b88485d
--- /dev/null
+++ b/sound/soc/codecs/aic3xxx_cfw.h
@@ -0,0 +1,529 @@
+/*
+ *  aic3xxx_cfw.h  --  SoC audio for TI OMAP44XX SDP
+ *  Codec Firmware Declarations
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ *
+ */
+
+#ifndef AIC_FIRMWARE_H_
+#define AIC_FIRMWARE_H_
+
+
+#define AIC_FW_MAGIC 0xC0D1F1ED
+
+
+/** \defgroup pd Arbitrary Limitations */
+/* @{ */
+#ifndef AIC_MAX_ID
+#define AIC_MAX_ID  (64)   /**= AIC_CMD_DELAY)
+
+/**
+ * AIC Block Type
+ *
+ * Block identifier
+ *
+ */
+enum __attribute__ ((__packed__)) aic_block_t {
+   AIC_BLOCK_SYSTEM_PRE,
+   AIC_BLOCK_A_INST,
+   AIC_BLOCK_A_A_COEF,
+   AIC_BLOCK_A_B_COEF,
+   AIC_BLOCK_A_F_COEF,
+   AIC_BLOCK_D_INST,
+   AIC_BLOCK_D_A1_COEF,
+   AIC_BLOCK_D_B1_COEF,
+   AIC_BLOCK_D_A2_COEF,
+   AIC_BLOCK_D_B2_COEF,
+   AIC_BLOCK_D_F_COEF,
+   AIC_BLOCK_SYSTEM_POST,
+   AIC_BLOCK_N,
+   AIC_BLOCK_INVALID,
+};
+#define AIC_BLOCK_D_A_COEF AIC_BLOCK_D_A1_COEF
+#define AIC_BLOCK_D_B_COEF AIC_BLOCK_D_B1_COEF
+
+/**
+ * AIC Block
+ *
+ * A block of logically grouped sequences/commands/cmd-commands
+ *
+ */
+struct aic_block {
+   enum aic_block_t type;
+   int ncmds;
+   union aic_cmd cmd[];
+};
+#define AIC_BLOCK_SIZE(ncmds) (sizeof(struct aic_block) + \
+   ((ncmds)*sizeof(union aic_cmd)))
+
+/**
+ * AIC Image
+ *
+ * A downloadable image
+ */
+struct aic_image {
+   char name[AIC_MAX_ID];  /**< Name of the pfw/overlay/configuration*/
+   char *desc; /**< User string*/
+   int mute_flags;
+   struct aic_block *block[AIC_BLOCK_N];
+};
+
+
+
+/**
+ * AIC PLL
+ *
+ * PLL configuration sequence and match critirea
+ */
+struct aic_pll {
+   char name[AIC_MAX_ID];  /**< Name of the PLL sequence*/
+   char *desc; /**< User string*/
+   struct aic_block *seq;
+};
+
+/**
+ * AIC Control
+ *
+ * Run-time control for a process flow
+ */
+struct aic_control {
+   char name[AIC_MAX_ID];  /**< Control identifier*/
+   char *desc; /**< User string*/
+   int mute_flags;
+
+   int min;   

[PATCH v3 1/2] block: Add support for reinsert a dispatched req

2012-12-11 Thread Tanya Brokhman
From: Tatyana Brokhman 

Add support for reinserting a dispatched request back to the
scheduler's internal data structures.
This capability is used by the device driver when it chooses to
interrupt the current request transmission and execute another (more
urgent) pending request. For example: interrupting long write in order
to handle pending read. The device driver re-inserts the
remaining write request back to the scheduler, to be rescheduled
for transmission later on.

Add an API for verifying whether the current scheduler
supports the reinsert-request mechanism. If the reinsert mechanism isn't
supported by the scheduler, this code path will never be activated.
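
For illustration, a driver that interrupted an ongoing transfer would
use the pair roughly like this (sketch, not from a real driver):

	if (blk_reinsert_req_sup(q)) {
		/* hand the remainder back to the scheduler */
		if (blk_reinsert_request(q, rq))
			blk_requeue_request(q, rq); /* fallback */
	}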

Signed-off-by: Tatyana Brokhman 
---
v3: 
  - Added some error checking
  - In elv_reinsert_request(): update counters only if reinsert succeeded
v1: Initial version

diff --git a/block/blk-core.c b/block/blk-core.c
index a8b1527..f8f6762 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1177,6 +1177,50 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 }
 EXPORT_SYMBOL(blk_requeue_request);
 
+/**
+ * blk_reinsert_request() - Insert a request back to the scheduler
+ * @q: request queue
+ * @rq:request to be inserted
+ *
+ * This function inserts the request back to the scheduler as if
+ * it was never dispatched.
+ *
+ * Return: 0 on success, error code on fail
+ */
+int blk_reinsert_request(struct request_queue *q, struct request *rq)
+{
+   if (unlikely(!rq) || unlikely(!q))
+   return -EIO;
+
+   blk_delete_timer(rq);
+   blk_clear_rq_complete(rq);
+   trace_block_rq_requeue(q, rq);
+
+   if (blk_rq_tagged(rq))
+   blk_queue_end_tag(q, rq);
+
+   BUG_ON(blk_queued_rq(rq));
+
+   return elv_reinsert_request(q, rq);
+}
+EXPORT_SYMBOL(blk_reinsert_request);
+
+/**
+ * blk_reinsert_req_sup() - check whether the scheduler supports
+ *  reinsertion of requests
+ * @q: request queue
+ *
+ * Returns true if the current scheduler supports reinserting
+ * request. False otherwise
+ */
+bool blk_reinsert_req_sup(struct request_queue *q)
+{
+   if (unlikely(!q))
+   return false;
+   return q->elevator->type->ops.elevator_reinsert_req_fn ? true : false;
+}
+EXPORT_SYMBOL(blk_reinsert_req_sup);
+
 static void add_acct_request(struct request_queue *q, struct request *rq,
 int where)
 {
diff --git a/block/elevator.c b/block/elevator.c
index 9edba1b..0ff6c5a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -547,6 +547,41 @@ void elv_requeue_request(struct request_queue *q, struct request *rq)
__elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE);
 }
 
+/**
+ * elv_reinsert_request() - Insert a request back to the scheduler
+ * @q: request queue where request should be inserted
+ * @rq:request to be inserted
+ *
+ * This function returns the request back to the scheduler to be
+ * inserted as if it was never dispatched
+ *
+ * Return: 0 on success, error code on failure
+ */
+int elv_reinsert_request(struct request_queue *q, struct request *rq)
+{
+   int res;
+
+   if (!q->elevator->type->ops.elevator_reinsert_req_fn)
+   return -EPERM;
+
+   res = q->elevator->type->ops.elevator_reinsert_req_fn(q, rq);
+   if (!res) {
+   /*
+* it already went through dequeue, we need to decrement the
+* in_flight count again
+*/
+   if (blk_account_rq(rq)) {
+   q->in_flight[rq_is_sync(rq)]--;
+   if (rq->cmd_flags & REQ_SORTED)
+   elv_deactivate_rq(q, rq);
+   }
+   rq->cmd_flags &= ~REQ_STARTED;
+   q->nr_sorted++;
+   }
+
+   return res;
+}
+
 void elv_drain_elevator(struct request_queue *q)
 {
static int printed;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1756001..e725303 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -722,6 +722,8 @@ extern struct request *blk_get_request(struct request_queue *, int, gfp_t);
 extern struct request *blk_make_request(struct request_queue *, struct bio *,
gfp_t);
 extern void blk_requeue_request(struct request_queue *, struct request *);
+extern int blk_reinsert_request(struct request_queue *q, struct request *rq);
+extern bool blk_reinsert_req_sup(struct request_queue *q);
 extern void blk_add_request_payload(struct request *rq, struct page *page,
unsigned int len);
 extern int blk_rq_check_limits(struct request_queue *q, struct request *rq);
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index c03af76..f70d05d 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -22,6 +22,8 @@ typedef void (elevator_bio_merged_fn) (struct request_queue *,
 typedef int (elevator_dispatch_fn) (struct requ

[PATCH v3 2/2] block: Add API for urgent request handling

2012-12-11 Thread Tanya Brokhman
From: Tatyana Brokhman 

This patch adds support in the block & elevator layers for handling
urgent requests. The decision if a request is urgent or not is taken
by the scheduler. Urgent request notification is passed to the underlying
block device driver (eMMC for example). Block device driver may decide to
interrupt the currently running low priority request to serve the new
urgent request. By doing so READ latency is greatly reduced in read&write
collision scenarios.

Note that if the current scheduler doesn't implement the urgent request
mechanism, this code path is never activated.
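
A driver opts in by registering a handler at init time, e.g.
(illustrative, the handler name is made up):

	blk_urgent_request(q, my_urgent_request_fn);

after which __blk_run_queue() will call the handler instead of
request_fn() whenever the scheduler reports an urgent request pending.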

Signed-off-by: Tatyana Brokhman 
---
v3:
  - Commenting update
  - Add boolean flag indicating if an urgent request was dispatched,
instead of a pointer to the request itself
  - In elv_completed_request() use test_bit() to verify if the completed
request was urgent or not
v1: Initial version

diff --git a/block/blk-core.c b/block/blk-core.c
index f8f6762..eb0f2ae 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -300,13 +300,26 @@ EXPORT_SYMBOL(blk_sync_queue);
  * Description:
  *See @blk_run_queue. This variant must be called with the queue lock
  *held and interrupts disabled.
+ *Device driver will be notified of an urgent request
+ *pending under the following conditions:
+ *1. The driver and the current scheduler support urgent request handling
+ *2. There is an urgent request pending in the scheduler
+ *3. There isn't already an urgent request in flight, meaning previously
+ *   notified urgent request completed (!q->notified_urgent)
  */
 void __blk_run_queue(struct request_queue *q)
 {
if (unlikely(blk_queue_stopped(q)))
return;
 
-   q->request_fn(q);
+   if (!q->notified_urgent &&
+   q->elevator->type->ops.elevator_is_urgent_fn &&
+   q->urgent_request_fn &&
+   q->elevator->type->ops.elevator_is_urgent_fn(q)) {
+   q->notified_urgent = true;
+   q->urgent_request_fn(q);
+   } else
+   q->request_fn(q);
 }
 EXPORT_SYMBOL(__blk_run_queue);
 
@@ -2232,8 +2245,17 @@ struct request *blk_fetch_request(struct request_queue *q)
struct request *rq;
 
rq = blk_peek_request(q);
-   if (rq)
+   if (rq) {
+   /*
+* Assumption: the next request fetched from scheduler after we
+* notified "urgent request pending" - will be the urgent one
+*/
+   if (q->notified_urgent && !q->dispatched_urgent) {
+   q->dispatched_urgent = true;
+   (void)blk_mark_rq_urgent(rq);
+   }
blk_start_request(rq);
+   }
return rq;
 }
 EXPORT_SYMBOL(blk_fetch_request);
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 779bb76..6d594c4 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -100,6 +100,18 @@ void blk_queue_lld_busy(struct request_queue *q, lld_busy_fn *fn)
 EXPORT_SYMBOL_GPL(blk_queue_lld_busy);
 
 /**
+ * blk_urgent_request() - Set an urgent_request handler function for queue
+ * @q: queue
+ * @fn:handler for urgent requests
+ *
+ */
+void blk_urgent_request(struct request_queue *q, request_fn_proc *fn)
+{
+   q->urgent_request_fn = fn;
+}
+EXPORT_SYMBOL(blk_urgent_request);
+
+/**
  * blk_set_default_limits - reset limits to default values
  * @lim:  the queue_limits structure to reset
  *
diff --git a/block/blk.h b/block/blk.h
index ca51543..5fba856 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -42,6 +42,7 @@ void blk_add_timer(struct request *);
  */
 enum rq_atomic_flags {
REQ_ATOM_COMPLETE = 0,
+   REQ_ATOM_URGENT = 1,
 };
 
 /*
@@ -58,6 +59,16 @@ static inline void blk_clear_rq_complete(struct request *rq)
clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
 }
 
+static inline int blk_mark_rq_urgent(struct request *rq)
+{
+   return test_and_set_bit(REQ_ATOM_URGENT, &rq->atomic_flags);
+}
+
+static inline void blk_clear_rq_urgent(struct request *rq)
+{
+   clear_bit(REQ_ATOM_URGENT, &rq->atomic_flags);
+}
+
 /*
  * Internal elevator interface
  */
diff --git a/block/elevator.c b/block/elevator.c
index 0ff6c5a..2104368 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -756,6 +756,11 @@ void elv_completed_request(struct request_queue *q, struct request *rq)
 {
struct elevator_queue *e = q->elevator;
 
+   if (test_bit(REQ_ATOM_URGENT, &rq->atomic_flags)) {
+   q->notified_urgent = false;
+   q->dispatched_urgent = false;
+   blk_clear_rq_urgent(rq);
+   }
/*
 * request is released from the driver, io must be done
 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e725303..db5e1bc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -303,6 +303,7 @@ struct request_queue {
struct request_list root_rl;
 

[PATCH 2/2] pinctrl: skip deferral of hogs

2012-12-11 Thread Linus Walleij
From: Linus Walleij 

Up until now, as hogs were always taken at the end of the
pin control device registration, this didn't cause any problem.
But when starting to hog pins from the device core, it will
cause deferral of the pin controller device itself, since the
default pin fetch is done *before* the device probes. So
let's fix this annoyance (which is also aesthetically ugly).

Signed-off-by: Linus Walleij 
---
 drivers/pinctrl/core.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index 59f5a96..55d67bf 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -609,13 +609,16 @@ static int add_setting(struct pinctrl *p, struct pinctrl_map const *map)
 
setting->pctldev = get_pinctrl_dev_from_devname(map->ctrl_dev_name);
if (setting->pctldev == NULL) {
-   dev_info(p->dev, "unknown pinctrl device %s in map entry, deferring probe",
-   map->ctrl_dev_name);
kfree(setting);
+   /* Do not defer probing of hogs (circular loop) */
+   if (!strcmp(map->ctrl_dev_name, map->dev_name))
+   return -ENODEV;
/*
 * OK let us guess that the driver is not there yet, and
 * let's defer obtaining this pinctrl handle to later...
 */
+   dev_info(p->dev, "unknown pinctrl device %s in map entry, deferring probe",
+   map->ctrl_dev_name);
return -EPROBE_DEFER;
}
 
-- 
1.7.11.3



[PATCH 1/2] pinctrl: fix comment mistake

2012-12-11 Thread Linus Walleij
From: Linus Walleij 

This variable pertains to pinctrl handles not muxes
specifically.

Signed-off-by: Linus Walleij 
---
 drivers/pinctrl/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pinctrl/core.c b/drivers/pinctrl/core.c
index 5cdee86..59f5a96 100644
--- a/drivers/pinctrl/core.c
+++ b/drivers/pinctrl/core.c
@@ -700,7 +700,7 @@ static struct pinctrl *create_pinctrl(struct device *dev)
}
}
 
-   /* Add the pinmux to the global list */
+   /* Add the pinctrl handle to the global list */
list_add_tail(&p->node, &pinctrl_list);
 
return p;
-- 
1.7.11.3



Re: [RFC v2] Add mempressure cgroup

2012-12-11 Thread Bartlomiej Zolnierkiewicz
On Monday 10 December 2012 21:05:12 Anton Vorontsov wrote:
> On Mon, Dec 10, 2012 at 01:23:09PM +0100, Bartlomiej Zolnierkiewicz wrote:
> > On Monday 10 December 2012 10:58:38 Anton Vorontsov wrote:
> > 
> > > +static void consume_memory(void)
> > > +{
> > > + unsigned int i = 0;
> > > + unsigned int j = 0;
> > > +
> > > + puts("consuming memory...");
> > > +
> > > + while (1) {
> > > + pthread_mutex_lock(&locks[i]);
> > > + if (!chunks[i]) {
> > > + chunks[i] = malloc(CHUNK_SIZE);
> > > + pabort(!chunks[i], 0, "chunks alloc failed");
> > > + memset(chunks[i], 0, CHUNK_SIZE);
> > > + j++;
> > > + }
> > > + pthread_mutex_unlock(&locks[i]);
> > > +
> > > + if (j >= num_chunks / 10) {
> > > + add_reclaimable(num_chunks / 10);
> > 
> > Shouldn't it use j instead of num_chunks / 10 here?
> 
> Um.. They should be equal. Or am I missing the point?

Oh, ok.  You're right.

j > num_chunks / 10 condition should never happen and may be removed.

> > > + printf("added %d reclaimable chunks\n", j);
> > > + j = 0;
> 
> Here, we reset it.
> 
> > > + }
> > > +
> > > + i = (i + 1) % num_chunks;
> > > + }
> > > +}
> 
> Thanks!
> Anton.

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung Poland R&D Center


[PATCH] Linux 3.8-merge version information

2012-12-11 Thread Oliver Hartkopp
As the automatically generated git version information is misleading in the
merge window, name the kernel in the merge window 3.8-merge.

This 'merge' version information helps to avoid interfering with 3.7-stable git
versions in the bootloader (grub) selection until 3.8-rc1 is tagged.

Signed-off-by: Oliver Hartkopp 

---

diff --git a/Makefile b/Makefile
index 540f7b2..1fd4bf4 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
-PATCHLEVEL = 7
+PATCHLEVEL = 8
 SUBLEVEL = 0
-EXTRAVERSION =
+EXTRAVERSION = -merge
 NAME = Terrified Chipmunk

 # *DOCUMENTATION*






Re: [PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-11 Thread Simon Jeons
On Tue, 2012-12-11 at 20:41 +0800, Jianguo Wu wrote:
> On 2012/12/11 20:24, Simon Jeons wrote:
> 
> > On Tue, 2012-12-11 at 11:07 +0800, Jianguo Wu wrote:
> >> On 2012/12/11 10:33, Tang Chen wrote:
> >>
> >>> This patch introduces a new array zone_movable_limit[] to store the
> >>> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
> >>> The function sanitize_zone_movable_limit() will find out to which
> >>> node the ranges in movable_map.map[] belongs, and calculates the
> >>> low boundary of ZONE_MOVABLE for each node.
> >>>
> >>> Signed-off-by: Tang Chen 
> >>> Signed-off-by: Jiang Liu 
> >>> Reviewed-by: Wen Congyang 
> >>> Reviewed-by: Lai Jiangshan 
> >>> Tested-by: Lin Feng 
> >>> ---
> >>>  mm/page_alloc.c |   77 
> >>> +++
> >>>  1 files changed, 77 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 1c91d16..4853619 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -206,6 +206,7 @@ static unsigned long __meminitdata 
> >>> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >>>  static unsigned long __initdata required_kernelcore;
> >>>  static unsigned long __initdata required_movablecore;
> >>>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> >>> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
> >>>  
> >>>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from 
> >>> */
> >>>  int movable_zone;
> >>> @@ -4340,6 +4341,77 @@ static unsigned long __meminit 
> >>> zone_absent_pages_in_node(int nid,
> >>>   return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
> >>>  }
> >>>  
> >>> +/**
> >>> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
> >>> + *
> >>> + * zone_movable_limit is initialized as 0. This function will try to get
> >>> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
> >>> + * assigne them to zone_movable_limit.
> >>> + * zone_movable_limit[nid] == 0 means no limit for the node.
> >>> + *
> >>> + * Note: Each range is represented as [start_pfn, end_pfn)
> >>> + */
> >>> +static void __meminit sanitize_zone_movable_limit(void)
> >>> +{
> >>> + int map_pos = 0, i, nid;
> >>> + unsigned long start_pfn, end_pfn;
> >>> +
> >>> + if (!movablecore_map.nr_map)
> >>> + return;
> >>> +
> >>> + /* Iterate all ranges from minimum to maximum */
> >>> + for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> >>> + /*
> >>> +  * If we have found lowest pfn of ZONE_MOVABLE of the node
> >>> +  * specified by user, just go on to check next range.
> >>> +  */
> >>> + if (zone_movable_limit[nid])
> >>> + continue;
> >>> +
> >>> +#ifdef CONFIG_ZONE_DMA
> >>> + /* Skip DMA memory. */
> >>> + if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
> >>> + start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
> >>> +#endif
> >>> +
> >>> +#ifdef CONFIG_ZONE_DMA32
> >>> + /* Skip DMA32 memory. */
> >>> + if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
> >>> + start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
> >>> +#endif
> >>> +
> >>> +#ifdef CONFIG_HIGHMEM
> >>> + /* Skip lowmem if ZONE_MOVABLE is highmem. */
> >>> + if (zone_movable_is_highmem() &&
> >>
> >> Hi Tang,
> >>
> >> I think zone_movable_is_highmem() is not work correctly here.
> >>sanitize_zone_movable_limit
> >>zone_movable_is_highmem  <--using movable_zone here
> >>find_zone_movable_pfns_for_nodes
> >>find_usable_zone_for_movable <--movable_zone is specified here
> >>
> > 
> > Hi Jiangguo and Chen,
> > 
> > - What's the meaning of zone_movable_is_highmem(), does it mean all zone
> > highmem pages are zone movable pages or 
> 
> Hi Simon,
> 
> zone_movable_is_highmem() means whether zone pages in ZONE_MOVABLE are taken 
> from
> highmem.
> 
> > - dmesg 
> > 
> >> 0.00] Zone ranges:
> >> [0.00]   DMA  [mem 0x0001-0x00ff]
> >> [0.00]   Normal   [mem 0x0100-0x373fdfff]
> >> [0.00]   HighMem  [mem 0x373fe000-0xb6cf]
> >> [0.00] Movable zone start for each node
> >> [0.00]   Node 0: 0x9780
> > 
> > Why the start of zone movable is in the range of zone highmem, if all
> > the pages of zone movable are from zone highmem? If the answer is yes, 
> 
> > zone movable and zone highmem are in the equal status or not?
> 
> The pages of zone_movable can be taken from zone_highmem or zone_normal:
> if we have highmem, then zone_movable will be taken from zone_highmem,
> otherwise zone_movable will be taken from zone_normal.
> 
> you can refer to find_usable_zone_for_movable().

Hi Jiangguo,

I have 8G of memory and movablecore=5G, but dmesg looks strange. What
happened here?

> [0.00] Zone ranges:
> [0.00]   DMA  

Re: [PATCH] clk: mxs: Remove unneeded NULL pointer check

2012-12-11 Thread Shawn Guo
On Tue, Dec 11, 2012 at 08:33:34AM -0200, Fabio Estevam wrote:
> Shawn/Mike,
> 
> On Wed, Nov 21, 2012 at 7:33 PM, Fabio Estevam  wrote:
> > From: Fabio Estevam 
> >
> > mxs platform has been converted to device tree.
> >
> > There is no need to check if np is NULL after doing:
> >
> > np = of_find_compatible_node(NULL, NULL, "fsl,imx[23/28]-clkctrl");
> >
> > ,as it will always be non-NULL.
> >
> > Signed-off-by: Fabio Estevam 
> 
> Does this patch look good?

Hmm, technically the check is still valid, as np can be NULL if
the clkctrl node with the correct compatible string is not present.
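
I.e. (sketch):

	np = of_find_compatible_node(NULL, NULL, "fsl,imx28-clkctrl");
	if (!np)	/* no such node in the device tree */
		return;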

Shawn



Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Srivatsa S. Bhat
On 12/10/2012 10:54 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 01:52 AM, Oleg Nesterov wrote:
>>> On 12/10, Srivatsa S. Bhat wrote:

 On 12/10/2012 12:44 AM, Oleg Nesterov wrote:

> But yes, it is easy to blame somebody else's code ;) And I can't suggest
> something better at least right now. If I understand correctly, we can not
> use, say, synchronize_sched() in _cpu_down() path

 We can't sleep in that code.. so that's a no-go.
>>>
>>> But we can?
>>>
>>> Note that I meant _cpu_down(), not get_online_cpus_atomic() or 
>>> take_cpu_down().
>>>
>>
>> Maybe I'm missing something, but how would it help if we did a
>> synchronize_sched() so early (in _cpu_down())? Another bunch of 
>> preempt_disable()
>> sections could start immediately after our call to synchronize_sched() no?
>> How would we deal with that?
> 
> Sorry for confusion. Of course synchronize_sched() alone is not enough.
> But we can use it to synchronize with preempt-disabled section and avoid
> the barriers/atomic in the fast-path.
> 
> For example,
> 
>   bool writer_pending;
>   DEFINE_RWLOCK(writer_rwlock);
>   DEFINE_PER_CPU(int, reader_ctr);
> 
>   void get_online_cpus_atomic(void)
>   {
>   preempt_disable();
>   
>   if (likely(!writer_pending) || __this_cpu_read(reader_ctr)) {
>   __this_cpu_inc(reader_ctr);
>   return;
>   }
> 
>   read_lock(&writer_rwlock);
>   __this_cpu_inc(reader_ctr);
>   read_unlock(&writer_rwlock);
>   }
> 
>   // lacks release semantics, but we don't care
>   void put_online_cpus_atomic(void)
>   {
>   __this_cpu_dec(reader_ctr);
>   preempt_enable();
>   }
> 
> Now, _cpu_down() does
> 
>   writer_pending = true;
>   synchronize_sched();
> 
> before stop_one_cpu(). When synchronize_sched() returns, we know that
> every get_online_cpus_atomic() must see writer_pending == T. And, if
> any CPU incremented its reader_ctr we must see it is not zero.
> 
> take_cpu_down() does
> 
>   write_lock(&writer_rwlock);
> 
>   for_each_online_cpu(cpu) {
>   while (per_cpu(reader_ctr, cpu))
>   cpu_relax();
>   }
> 
> and takes the lock.
> 
> However. This can lead to the deadlock we already discussed. So
> take_cpu_down() should do
> 
>  retry:
>   write_lock(&writer_rwlock);
> 
>   for_each_online_cpu(cpu) {
>   if (per_cpu(reader_ctr, cpu)) {
>   write_unlock(&writer_rwlock);
>   goto retry;
>   }
>   }
> 
> to take the lock. But this is livelockable. However, I do not think it
> is possible to avoid the livelock.
> 
> Just in case, the code above is only for illustration, perhaps it is not
> 100% correct and perhaps we can do it better. cpu_hotplug.active_writer
> is ignored for simplicity, get/put should check current == active_writer.
> 

This approach (of using synchronize_sched()) also looks good. It is simple,
yet effective, but unfortunately inefficient on the writer side (because
it has to wait for a full synchronize_sched()).

So I would say that we should keep this approach as a fallback, if we don't
come up with any better synchronization scheme (in terms of efficiency) at
a comparable level of simplicity/complexity.

I have come up with a v4 that simplifies several aspects of the synchronization
and makes it look a lot simpler than v3. (Fewer race windows to take
care of, implicit ACKs, no atomic ops, etc.)

I'll post it soon. Let me know what you think...

Regards,
Srivatsa S. Bhat



Re: [PATCH V3 1/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Borislav Petkov
On Tue, Dec 11, 2012 at 08:42:39PM +0800, Wanpeng Li wrote:
> Furthermore, Andrew didn't like a variable called "mce_bad_pages".
>
> - Why do we have a variable called "mce_bad_pages"? MCE is an x86
> concept, and this code is in mm/. Lights are flashing, bells are
> ringing and a loudspeaker is blaring "layering violation" at us!

Yes, this should simply be called num_poisoned_pages because this is
what this thing counts.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH 0/3] Add O_DENY* flags to fcntl and cifs

2012-12-11 Thread Jeff Layton
On Mon, 10 Dec 2012 11:41:16 -0500
"J. Bruce Fields"  wrote:

> On Sat, Dec 08, 2012 at 12:43:14AM +0400, Pavel Shilovsky wrote:
> > The problem is the possibility of denial-of-service attacks here. We
> > can try to prevent them by:
> > 1) specifying an extra security bit on the file that indicates that
> > share flags are accepted (like we have for mandatory locks now) and
> > setting it for necessary files only, or
> > 2) adding a special mount option (but it probably makes sense only if
> > we decide to add this support for CIFS and NFS alone).
> 
> In the case of knfsd and samba exporting a common filesystem, you'd also
> want to be able to enforce it on the exported filesystem.
> 

Sorry for chiming in late on this, but I've been looking at this
problem from the other end as well, from the POV of a fileserver. For
that use case, you absolutely do want some mechanism for enforcing this
on local filesystems.

Currently, file servers generally enforce share reservations
internally. The upshot is that they're not aware when other tasks
outside the server modify a file. This is problematic in many
common fileserving situations -- when exporting files via both NFS and
SMB, for instance.

One thing that's important to note is that there is already some
support for this in the kernel. The LOCK_MAND flag and its associates
are intended for just this purpose. Samba even already calls into the
kernel to set LOCK_MAND locks, but the kernel just passes them through
to the fs. Rumor has it that GPFS does something with these flags, but
I can't confirm that.

Of course, LOCK_MAND is racy since you can't set it on open(), but it
might be nice to use that as a starting point for trying to add this
support.
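
For reference, with the existing interface an application would request
such a share reservation simply via flock() -- a sketch only; whether
anything is actually enforced is exactly what the patch below is about:

	#define _GNU_SOURCE		/* for LOCK_MAND/LOCK_READ */
	#include <sys/file.h>
	#include <stdio.h>

	/* deny other writers but allow other readers while we hold it */
	if (flock(fd, LOCK_MAND | LOCK_READ) == -1)
		perror("flock");
	/* ... do I/O ... */
	flock(fd, LOCK_UN);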

At the very least, if we're going to do this, we need to consider what
to do with the LOCK_MAND interfaces. As a starting point for
discussion, here's a patch that I was playing with a few months ago. I
haven't had much time to really spend on this project, but it may be
worthwhile to consider. It works, but I'm not sure about the
semantics...

-[snip]

locks: add enforcement of LOCK_MAND locks

The LOCK_MAND lock mechanism is currently a no-op for any in-tree
filesystem. The flags are passed to f_ops->flock, but the standard
flock routines basically ignore them.

Change this by adding enforcement against other LOCK_MAND locks. Also,
assume that LOCK_MAND also implies LOCK_NB.

Signed-off-by: Jeff Layton 
---
 fs/locks.c | 45 ++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 814c51d..736e38b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -625,6 +625,43 @@ static int posix_locks_conflict(struct file_lock *caller_fl, struct file_lock *s
return (locks_conflict(caller_fl, sys_fl));
 }
 
+/*
+ * locks_mand_conflict - Determine if there's a share reservation conflict
+ * @caller_fl: lock we're attempting to acquire
+ * @sys_fl: lock already present on system that we're checking against
+ *
+ * Check to see if there's a share_reservation conflict. LOCK_READ/LOCK_WRITE
+ * tell us whether the reservation allows other readers and writers.
+ *
+ * We only check against other LOCK_MAND locks, so applications that want to
+ * use share mode locking will only conflict against one another. "normal"
+ * applications that open files won't be affected by and won't themselves
+ * affect the share reservations.
+ */
+static int locks_mand_conflict(struct file_lock *caller_fl,
+   struct file_lock *sys_fl)
+{
+   unsigned char caller_type = caller_fl->fl_type;
+   unsigned char sys_type = sys_fl->fl_type;
+   fmode_t caller_fmode = caller_fl->fl_file->f_mode;
+   fmode_t sys_fmode = sys_fl->fl_file->f_mode;
+
+   /* they can only conflict if they're both LOCK_MAND */
+   if (!(caller_type & LOCK_MAND) || !(sys_type & LOCK_MAND))
+   return 0;
+
+   if (!(caller_type & LOCK_READ) && (sys_fmode & FMODE_READ))
+   return 1;
+   if (!(caller_type & LOCK_WRITE) && (sys_fmode & FMODE_WRITE))
+   return 1;
+   if (!(sys_type & LOCK_READ) && (caller_fmode & FMODE_READ))
+   return 1;
+   if (!(sys_type & LOCK_WRITE) && (caller_fmode & FMODE_WRITE))
+   return 1;
+
+   return 0;
+}
+
 /* Determine if lock sys_fl blocks lock caller_fl. FLOCK specific
  * checking before calling the locks_conflict().
  */
@@ -633,9 +670,11 @@ static int flock_locks_conflict(struct file_lock *caller_fl, struct file_lock *s
/* FLOCK locks referring to the same filp do not conflict with
 * each other.
 */
-   if (!IS_FLOCK(sys_fl) || (caller_fl->fl_file == sys_fl->fl_file))
-   return (0);
+   if (!IS_FLOCK(sys_fl))
+   return 0;
if ((caller_fl->fl_type & LOCK_MAND) || (sys_fl->fl_type & LOCK_MAND))
+

Re: [PATCH] drivers/pinctrl/pinctrl-at91.c: convert kfree to devm_kfree

2012-12-11 Thread Julia Lawall
On Tue, 11 Dec 2012, Sergei Shtylyov wrote:

> Hello.
>
> On 11-12-2012 14:58, Julia Lawall wrote:
>
> > From: Julia Lawall 
>
> > The function at91_dt_node_to_map is ultimately called by the function
> > pinctrl_get, which is an exported function.  Since it is possible that this
> > function is not called from within a probe function, for safety, the kfree
> > is converted to a devm_kfree, to both free the data and remove it from the
> > device in a failure situation.
>
> > A newline is added in the call to devm_kfree to avoid exceeding the 80
>
>devm_kzalloc() you mean?

Yes, sorry.  Should I send a new patch?

thanks,
julia

> > character limit.
>
> > Signed-off-by: Julia Lawall 
>
> > ---
> >   drivers/pinctrl/pinctrl-at91.c |5 +++--
> >   1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pinctrl/pinctrl-at91.c b/drivers/pinctrl/pinctrl-at91.c
> > index c5e7571..0da8a5f 100644
> > --- a/drivers/pinctrl/pinctrl-at91.c
> > +++ b/drivers/pinctrl/pinctrl-at91.c
> > @@ -255,7 +255,8 @@ static int at91_dt_node_to_map(struct pinctrl_dev *pctldev,
> > }
> >
> > map_num += grp->npins;
> > -   new_map = devm_kzalloc(pctldev->dev, sizeof(*new_map) * map_num, GFP_KERNEL);
> > +   new_map = devm_kzalloc(pctldev->dev, sizeof(*new_map) * map_num,
> > +  GFP_KERNEL);
>
> WBR, Sergei
>
>
>


Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Srivatsa S. Bhat
On 12/10/2012 10:58 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:43 AM, Oleg Nesterov wrote:
>>> Damn, sorry for noise. I missed this part...
>>>
>>> On 12/10, Srivatsa S. Bhat wrote:

 On 12/10/2012 12:44 AM, Oleg Nesterov wrote:
> the latency. And I guess something like kick_all_cpus_sync() is "too 
> heavy".

 I hadn't considered that. Thinking of it, I don't think it would help us..
 It won't get rid of the currently running preempt_disable() sections no?
>>>
>>> Sure. But (again, this is only my feeling so far) given that
>>> get_online_cpus_atomic() does cli/sti,
>>
>> Ah, that one! Actually, the only reason I do that cli/sti is because
>> interrupt handlers can potentially be hotplug readers too. So we need to
>> protect the portion of the get_online_cpus_atomic() code which is not
>> re-entrant.
> 
> Yes, I understand.
> 
>>> this can help to implement ensure-the-readers-must-see-the-pending-writer.
>>> IOW this might help to implement sync-with-readers.
>>>
>>
>> 2 problems:
>>
>> 1. It won't help with cases like this:
>>
>>preempt_disable()
>>  ...
>>  preempt_disable()
>>  ...
>>  <--- Here
>>  ...
>>  preempt_enable()
>>  ...
>>preempt_enable()
> 
> No, I meant that kick_all_cpus_sync() can be used to synchronize with
> cli/sti in get_online_cpus_atomic(), just like synchronize_sched() does
> in the code I posted a minute ago.
> 

Ah, OK.

>> 2. Part of the reason we want to get rid of stop_machine() is to avoid the
>> latency it induces on _all_ CPUs just to take *one* CPU offline. If we use
>> kick_all_cpus_sync(), we get into that territory again: we unfairly
>> interrupt every CPU, _even when_ that CPU's existing preempt_disable()
>> sections might not actually be hotplug readers! (i.e., not bothered
>> about CPU hotplug).
> 
> I agree, that is why I said it is "too heavy".
> 

Got it :)

Regards,
Srivatsa S. Bhat



Re: [RFC PATCH v3 1/9] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2012-12-11 Thread Srivatsa S. Bhat
On 12/10/2012 11:45 PM, Oleg Nesterov wrote:
> On 12/10, Srivatsa S. Bhat wrote:
>>
>> On 12/10/2012 02:27 AM, Oleg Nesterov wrote:
>>> However. If this is true, then compared to preempt_disable/stop_machine
>>> livelock is possible. Probably this is fine, we have the same problem with
>>> get_online_cpus(). But if we can accept this fact I feel we can simplify
>>> this somehow... Can't prove, only feel ;)
>>
>> Not sure I follow..
> 
> I meant that write_lock_irqsave(&hotplug_rwlock) in take_cpu_down()
> can spin "forever".
> 
> Suppose that reader_acked() == T on every CPU, so that
> get_online_cpus_atomic() always takes read_lock(&hotplug_rwlock).
> 
> It is possible that this lock will be never released by readers,
> 
>   CPU_0   CPU_1
> 
>   get_online_cpus_atomic()
>   get_online_cpus_atomic()
>   put_online_cpus_atomic()
> 
>   get_online_cpus_atomic()
>   put_online_cpus_atomic()
> 
>   get_online_cpus_atomic()
>   put_online_cpus_atomic()
> 
> and so on.
> 

Right, and we can't do anything about it :(
Regards,
Srivatsa S. Bhat



Re: [PATCH] mfd: vexpress-config: Export __vexpress_config_func_get and vexpress_config_func_put

2012-12-11 Thread Pawel Moll
On Tue, 2012-12-11 at 08:23 +, Axel Lin wrote:
> This fixes below build error:
> 
>   Building modules, stage 2.
>   MODPOST 17 modules
> ERROR: "__vexpress_config_func_get" [drivers/regulator/vexpress.ko] undefined!
> ERROR: "vexpress_config_func_put" [drivers/regulator/vexpress.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> 
> Signed-off-by: Axel Lin 

Thanks, I'll get this queued with other vexpress fixes.

Paweł




Re: [PATCH] drivers/pinctrl/pinctrl-at91.c: convert kfree to devm_kfree

2012-12-11 Thread Sergei Shtylyov

Hello.

On 11-12-2012 14:58, Julia Lawall wrote:

> From: Julia Lawall 
>
> The function at91_dt_node_to_map is ultimately called by the function
> pinctrl_get, which is an exported function.  Since it is possible that this
> function is not called from within a probe function, for safety, the kfree
> is converted to a devm_kfree, to both free the data and remove it from the
> device in a failure situation.
>
> A newline is added in the call to devm_kfree to avoid exceeding the 80

   devm_kzalloc() you mean?

> character limit.
>
> Signed-off-by: Julia Lawall 
>
> ---
>   drivers/pinctrl/pinctrl-at91.c |5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pinctrl/pinctrl-at91.c b/drivers/pinctrl/pinctrl-at91.c
> index c5e7571..0da8a5f 100644
> --- a/drivers/pinctrl/pinctrl-at91.c
> +++ b/drivers/pinctrl/pinctrl-at91.c
> @@ -255,7 +255,8 @@ static int at91_dt_node_to_map(struct pinctrl_dev *pctldev,
> }
>
> map_num += grp->npins;
> -   new_map = devm_kzalloc(pctldev->dev, sizeof(*new_map) * map_num, GFP_KERNEL);
> +   new_map = devm_kzalloc(pctldev->dev, sizeof(*new_map) * map_num,
> +  GFP_KERNEL);


WBR, Sergei




Re: [RFC PATCH 2/3] regulator: max77686: Add support for various operating modes

2012-12-11 Thread Mark Brown
On Mon, Dec 10, 2012 at 02:06:49PM +0530, Abhilash Kesavan wrote:
> On Mon, Dec 10, 2012 at 1:49 PM, Abhilash Kesavan
>  wrote:

> > Mark Brown  opensource.wolfsonmicro.com> writes:

> >> Binding documenation is mandatory for any new OF properties, please add
> >> this.

> > Patch 3/3 of this series adds documentation for the max77686-opmode 
> > property.

There is no point in splitting changes like this up, it just makes
review harder - in this case it caused me to not read your patch due
to the missing documentation.  Put the whole change together unless
things are getting too big to review.




RE: [PATCH v3 2/3] mtd: devices: elm: Add support for ELM error correction

2012-12-11 Thread Philip, Avinash
On Tue, Dec 11, 2012 at 14:33:56, Grant Likely wrote:
> On Thu, 29 Nov 2012 17:16:33 +0530, "Philip, Avinash"  
> wrote:
> > The ELM hardware module can be used to speedup BCH 4/8/16 ECC scheme
> > error correction.
> > For now only 4 & 8 bit support is added
> > 
> > Signed-off-by: Philip, Avinash 
> > Cc: Grant Likely 
> > Cc: Rob Herring 
> > Cc: Rob Landley 
> > ---
> > Changes since v2:
> > - Remove __devinit & __devexit annotations
> > 
> > Changes since v1:
> > - Change build attribute to CONFIG_MTD_NAND_OMAP_BCH
> > - Reduced indentation using by passing elm_info , offset
> >   to elm_read & elm_write
> > - Removed syndrome manipulation functions.
> > 
> > :00 100644 000... b88ee83... A  
> > Documentation/devicetree/bindings/mtd/elm.txt
> > :100644 100644 395733a... 369a194... M  drivers/mtd/devices/Makefile
> > :00 100644 000... d2667f3... A  drivers/mtd/devices/elm.c
> > :00 100644 000... d4fce31... A  
> > include/linux/platform_data/elm.h
> >  Documentation/devicetree/bindings/mtd/elm.txt |   17 +
> >  drivers/mtd/devices/Makefile  |4 +-
> >  drivers/mtd/devices/elm.c |  418 
> > +
> >  include/linux/platform_data/elm.h |   54 
> >  4 files changed, 493 insertions(+), 1 deletions(-)
> > 
> > diff --git a/Documentation/devicetree/bindings/mtd/elm.txt 
> > b/Documentation/devicetree/bindings/mtd/elm.txt
> > new file mode 100644
> > index 000..b88ee83
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/mtd/elm.txt
> > @@ -0,0 +1,17 @@
> > +Error location module
> > +
> > +Required properties:
> > +- compatible: Must be "ti,elm"
> 
> Compatible string is too generic. Need to specify a specific SoC here.
> ie: "ti,omap3430-elm"

I will change to "ti,am33xx-elm" in next version.

Thanks
Avinash


> 
> Otherwise the binding looks fine. I haven't reviewed the code though.
> 
> g.
> 
> 



Re: [PATCH 0/2] ima: policy search speedup

2012-12-11 Thread Kasatkin, Dmitry
Hello Linus,

Can you please comment on the feature flag in this patchset?
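
For readers without the patches at hand, the idea is roughly the
following (a sketch only; the flag name and bit value are hypothetical):

	/* include/linux/fs.h: new attribute on the superblock */
	struct super_block {
		/* ... */
		unsigned long	s_feature_flags;
	};

	#define FS_FEATURE_NO_IMA	0x0001	/* never measured/appraised */

	/* security/integrity/ima: fast path before the policy search */
	if (inode->i_sb->s_feature_flags & FS_FEATURE_NO_IMA)
		return 0;	/* skip walking the policy rules entirely */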

Thanks,
Dmitry


On Tue, Nov 27, 2012 at 3:42 PM, Kasatkin, Dmitry
 wrote:
> Hello,
>
> Any thoughts about this proposal?
>
> - Dmitry
>
> On Thu, Nov 22, 2012 at 11:54 PM, Dmitry Kasatkin
>  wrote:
>> Hello,
>>
>> Here is two patches for policy search speedup.
>>
>> First patch adds additional features flags to superblock.
>> Second - implementation for IMA.
>>
>> Two months ago I was asking about it on mailing lists.
>> Suggestion was not to use s_flags, but e.g. s_feature_flags.
>>
>> Any objections about such approach?
>>
>> Thanks,
>> Dmitry
>>
>> Dmitry Kasatkin (2):
>>   vfs: new super block feature flags attribute
>>   ima: skip policy search for never appraised or measured files
>>
>>  include/linux/fs.h  |4 
>>  security/integrity/ima/ima_api.c|8 ++--
>>  security/integrity/ima/ima_policy.c |   20 +---
>>  security/integrity/integrity.h  |3 +++
>>  4 files changed, 26 insertions(+), 9 deletions(-)
>>
>> --
>> 1.7.10.4
>>


Re: [PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-11 Thread Jianguo Wu
On 2012/12/11 20:24, Simon Jeons wrote:

> On Tue, 2012-12-11 at 11:07 +0800, Jianguo Wu wrote:
>> On 2012/12/11 10:33, Tang Chen wrote:
>>
>>> This patch introduces a new array zone_movable_limit[] to store the
>>> ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
>>> The function sanitize_zone_movable_limit() will find out to which
>>> node the ranges in movable_map.map[] belongs, and calculates the
>>> low boundary of ZONE_MOVABLE for each node.
>>>
>>> Signed-off-by: Tang Chen 
>>> Signed-off-by: Jiang Liu 
>>> Reviewed-by: Wen Congyang 
>>> Reviewed-by: Lai Jiangshan 
>>> Tested-by: Lin Feng 
>>> ---
>>>  mm/page_alloc.c |   77 
>>> +++
>>>  1 files changed, 77 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 1c91d16..4853619 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -206,6 +206,7 @@ static unsigned long __meminitdata 
>>> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>>>  static unsigned long __initdata required_kernelcore;
>>>  static unsigned long __initdata required_movablecore;
>>>  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
>>> +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
>>>  
>>>  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
>>>  int movable_zone;
>>> @@ -4340,6 +4341,77 @@ static unsigned long __meminit 
>>> zone_absent_pages_in_node(int nid,
>>> return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
>>>  }
>>>  
>>> +/**
>>> + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
>>> + *
>>> + * zone_movable_limit is initialized as 0. This function will try to get
>>> + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
>>> + * assigne them to zone_movable_limit.
>>> + * zone_movable_limit[nid] == 0 means no limit for the node.
>>> + *
>>> + * Note: Each range is represented as [start_pfn, end_pfn)
>>> + */
>>> +static void __meminit sanitize_zone_movable_limit(void)
>>> +{
>>> +   int map_pos = 0, i, nid;
>>> +   unsigned long start_pfn, end_pfn;
>>> +
>>> +   if (!movablecore_map.nr_map)
>>> +   return;
>>> +
>>> +   /* Iterate all ranges from minimum to maximum */
>>> +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>>> +   /*
>>> +* If we have found lowest pfn of ZONE_MOVABLE of the node
>>> +* specified by user, just go on to check next range.
>>> +*/
>>> +   if (zone_movable_limit[nid])
>>> +   continue;
>>> +
>>> +#ifdef CONFIG_ZONE_DMA
>>> +   /* Skip DMA memory. */
>>> +   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
>>> +   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
>>> +#endif
>>> +
>>> +#ifdef CONFIG_ZONE_DMA32
>>> +   /* Skip DMA32 memory. */
>>> +   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
>>> +   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
>>> +#endif
>>> +
>>> +#ifdef CONFIG_HIGHMEM
>>> +   /* Skip lowmem if ZONE_MOVABLE is highmem. */
>>> +   if (zone_movable_is_highmem() &&
>>
>> Hi Tang,
>>
>> I think zone_movable_is_highmem() is not work correctly here.
>>  sanitize_zone_movable_limit
>>  zone_movable_is_highmem  <--using movable_zone here
>>  find_zone_movable_pfns_for_nodes
>>  find_usable_zone_for_movable <--movable_zone is specified here
>>
> 
> Hi Jianguo and Chen,
> 
> - What's the meaning of zone_movable_is_highmem(), does it mean all zone
> highmem pages are zone movable pages, or something else?

Hi Simon,

zone_movable_is_highmem() tells whether the pages in ZONE_MOVABLE are
taken from highmem.

> - dmesg 
> 
>> 0.00] Zone ranges:
>> [0.00]   DMA  [mem 0x0001-0x00ff]
>> [0.00]   Normal   [mem 0x0100-0x373fdfff]
>> [0.00]   HighMem  [mem 0x373fe000-0xb6cf]
>> [0.00] Movable zone start for each node
>> [0.00]   Node 0: 0x9780
> 
> Why is the start of zone movable in the range of zone highmem, if all
> the pages of zone movable are from zone highmem? If the answer is yes,
> are zone movable and zone highmem of equal status or not?

The pages of zone_movable can be taken from zone_highmem or zone_normal:
if we have highmem, then zone_movable will be taken from zone_highmem,
otherwise zone_movable will be taken from zone_normal.

you can refer to find_usable_zone_for_movable().
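
For reference, the test itself is roughly the following (see
include/linux/mmzone.h):

	static inline int zone_movable_is_highmem(void)
	{
	#ifdef CONFIG_HIGHMEM
		return movable_zone == ZONE_HIGHMEM;
	#else
		return 0;
	#endif
	}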

Thanks,
Jianguo Wu

> 
>> I think Jiang Liu's patch works fine for highmem, please refer to:
>> http://marc.info/?l=linux-mm&m=135476085816087&w=2
>>
>> Thanks,
>> Jianguo Wu
>>
>>> +   start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
>>> +   start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
>>> +#endif
>>> +
>>> +   if (start_pfn >= end_pfn)
>>> + 

Re: [PATCH] regulator: core: Fix logic to determine if regulator can change voltage

2012-12-11 Thread Mark Brown
On Tue, Dec 11, 2012 at 08:36:37PM +0800, Axel Lin wrote:
> Having a linear_min_sel setting means the first linear_min_sel selectors are
> invalid. We need to subtract linear_min_sel when using n_voltages to determine
> if the regulator can change voltage.

Applied, thanks.




Re: [RFC][PATCH RT 3/4] sched/rt: Use IPI to trigger RT task push migration instead of pulling

2012-12-11 Thread Thomas Gleixner
On Mon, 10 Dec 2012, Steven Rostedt wrote:
> On Mon, 2012-12-10 at 17:15 -0800, Frank Rowand wrote:
> 
> > I should have also mentioned some previous experience using IPIs to
> > avoid runq lock contention on wake up.  Someone encountered IPI
> > storms when using the TTWU_QUEUE feature, thus it defaults to off
> > for CONFIG_PREEMPT_RT_FULL:
> > 
> >   #ifndef CONFIG_PREEMPT_RT_FULL
> >   /*
> >* Queue remote wakeups on the target CPU and process them
> >* using the scheduler IPI. Reduces rq->lock contention/bounces.
> >*/
> >   SCHED_FEAT(TTWU_QUEUE, true)
> >   #else
> >   SCHED_FEAT(TTWU_QUEUE, false)
> > 
> 
> Interesting, but I'm wondering if this also does it for every wakeup? If
> you have 1000 tasks waking up on another CPU, this could potentially
> send out 1000 IPIs. The number of IPIs here looks to be # of tasks
> waking up, and perhaps more than that, as there could be multiple
> instances that try to wake up the same task.

Not using the TTWU_QUEUE feature limits the IPIs to a single one,
which is only sent if the newly woken task preempts the current task
on the remote cpu and the NEED_RESCHED flag was not yet set.
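In pseudo code that is roughly (illustrative only; task_preempts() is a
stand-in for the sched-class preemption test, not a real kernel function):

	if (task_preempts(p, rq->curr) && !test_tsk_need_resched(rq->curr)) {
		set_tsk_need_resched(rq->curr);
		smp_send_reschedule(cpu_of(rq));	/* at most one IPI */
	}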
 
With TTWU_QUEUE you can induce massive latencies just by starting
hackbench. You get a herd wakeup on CPU0 which then enqueues hundreds
of tasks to the remote pull list and sends IPIs. The remote CPUs pull
the tasks and activate them on their runqueues in hard interrupt
context. That can easily accumulate to hundreds of microseconds when
you do a mass push of newly woken tasks.

Of course it avoids fiddling with the remote rq lock, but it becomes
massively non-deterministic.

> Now this patch set, the # of IPIs is limited to the # of CPUs. If you
> have 4 CPUs, you'll get a storm of 3 IPIs. That's a big difference.

Yeah, the big difference is that you offload the double lock to the
IPI. So in the worst case you interrupt the most latency sensitive
task running on the remote CPU. Not sure if I really like that
"feature".
 
Thanks,

tglx


Re: [PATCH net-next rfc 0/2] Allow unprivileged user to disable tuntap queue

2012-12-11 Thread Michael S. Tsirkin
On Tue, Dec 11, 2012 at 07:03:45PM +0800, Jason Wang wrote:
> This series is an rfc that tries to solve the issue that the queues of tuntap
> could not be disabled/enabled by an unprivileged user. This is needed for
> unprivileged userspace such as qemu, since the guest may change the number
> of queues at any time and qemu needs to configure the tuntap to
> disable/enable a specific queue.
> 
> Instead of introducing new flags/ioctls, this series tries to re-use the
> current TUNSETQUEUE and IFF_ATTACH_QUEUE/IFF_DETACH_QUEUE. After this change,
> IFF_DETACH_QUEUE is used to disable a specific queue instead of detaching all
> its state from tuntap. IFF_ATTACH_QUEUE is used to: 1) create a new queue on
> a tuntap device, in which case the previous DAC check is still done; 2)
> re-enable a queue previously disabled by IFF_DETACH_QUEUE, in which case we
> can bypass some of the checking we do during queue creation (which checks
> need to be done here needs discussion).
> 
> Management software (such as libvirt) then can do:
> - TUNSETIFF to creating device and queue 0
> - TUNSETQUEUE to create the rest of queues
> - Passing them to unprivileged userspace (such as qemu)

Sorry I find this somewhat confusing.
Why doesn't management call TUNSETIFF to create all queues -
seems cleaner, no? Also has the advantage that it works
without selinux changes.

So why don't we simply fix TUNSETQUEUE such that
1. It only works if already attached to device by TUNSETIFF
2. It does not attach/detach, instead simply enables/disables the queue

This way no new flags, just tweak the semantics of the
existing ones. Need to do this before 3.8 is out though
otherwise we'll end up maintaining the old semantics forever.
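
With those semantics, the qemu side would boil down to something like
this (a sketch only, assuming the fds were created by management via
TUNSETIFF with IFF_MULTI_QUEUE and passed in):

	#include <linux/if_tun.h>
	#include <net/if.h>
	#include <string.h>
	#include <sys/ioctl.h>

	static int tun_set_queue(int fd, int enable)
	{
		struct ifreq ifr;

		memset(&ifr, 0, sizeof(ifr));
		/* enable or disable this already-attached queue */
		ifr.ifr_flags = enable ? IFF_ATTACH_QUEUE : IFF_DETACH_QUEUE;
		return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
	}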

> Then the unprivileged userspace can enable and disable a specific queue
> through IFF_ATTACH_QUEUE and IFF_DETACH_QUEUE.
> 
> This is done by introducing an 'enabled' flag used to track whether the
> queue is enabled; tuntap only sends/receives packets on a queue when it
> is enabled.
> 
> Please comment, thanks!
> 
> Jason Wang (2):
>   tuntap: forbid calling TUNSETQUEUE for a persistent device with no
> queues
>   tuntap: allow unpriveledge user to enable and disable queues
> 
>  drivers/net/tun.c |   78 +---
>  1 files changed, 73 insertions(+), 5 deletions(-)


[PATCH] avoid entropy starvation due to stack protection

2012-12-11 Thread Stephan Mueller
Hi Ted, kernel hackers,

Some time ago, I noticed the fact that for every newly
executed process, the function create_elf_tables requests 16 bytes of
randomness from get_random_bytes. This is easily visible when calling

while [ 1 ]
do
	cat /proc/sys/kernel/random/entropy_avail
	sleep 1
done

You would expect that this number does not change significantly with the
call. But in fact, it does drop significantly -- it should drop by
256 bits per loop iteration (two execs, cat and sleep, each drawing
16 bytes = 128 bits) unless we hit the lower boundary where nothing
is copied from the input_pool. See
graph at http://www.eperm.de/entropy_estimator_time.png which indicates
the entropy counter on an Ubuntu with KDE installed (x-axis: time since
user space boot). Starting around 50 seconds, I log into lightdm which
spawns many processes.

Each request to get_random_bytes retrieves (good) entropy from the
input_pool, if available. The entire logic of getting 16 bytes per exec
depletes good-quality entropy at a fast rate, which also affects /dev/random.

This alone is already a burn of entropy that leaves the kernel starved
of entropy much more than we want it to.

The consumer of that entropy is the stack protector of glibc.
The patch that added this behavior is found in
http://mirror.lividpenguin.com/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/elf-implement-at_random-for-glibc-prng-seeding.patch

Even when considering an initial installation process, you assume it
generates massive entropy due to copying 100s of megabytes of data to
disk. That entropy can already be retained for the first reboot of the
system to ensure that already the first start has sufficient entropy
available (which for example may be used to generate missing
cryptographic keys, e.g. for OpenSSH).

However, analyzing the installation process and the entropy behavior
again showed very surprising results. See graph at
http://www.eperm.de/entropy_estimator_time_install.png (x-axis: time
since mount of root partition; red line: installed data in MB (rhs),
black/blue lines: estimated entropy in input_pool (lhs)). The graph is
for the entire installation process of the RHEL 6.2 minimal
installation. The spike of entropy at the end is caused *only* because
of the grub installation (see that there is no data added to the hard disk).

I would have expected that the entropy rises to the top of 4096 early on
in the install cycle and stayed there. But then, after thinking some
more, it is clear why this is not the case: when installing rpm packages
from anaconda, you exec many processes (rpm -Uhv, the pre/post install
scripts of the RPM packages).

So, if we would not have had the grub installation phase, our entropy
count would still be low.

Now my question to kernel hackers: may I propose the addition of a new
entropy pool solely used for purposes within the kernel? The entropy
pool has the following characteristics:

- it is non-blocking similarly to nonblocking_pool

- it has the same size as nonblocking_pool

- it draws from input_pool until the entropy counter in the kernel pool
reaches the maximum of poolinfo->poolwords. Starting at that point, we
assume that the pool is filled with entropy completely and the SHA-1
hash including the back-mixing ensures the entropy is preserved as much
as possible. The pulling of the entropy from input_pool is identical to
the nonblocking_pool.

- after reaching the entropy limit, it will never be seeded again.

Bottom line: before reaching the threshold, the kernel pool behaves
exactly like the nonblocking_pool. After reaching the threshold, it
decouples itself from the input_pool.
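
In pseudo code, the draw path would then look roughly like this (names
loosely following drivers/char/random.c; the helper and the "filled"
marker are made up for illustration):

	static struct entropy_store kernel_pool;  /* same size as nonblocking_pool */
	static bool kernel_pool_filled;

	void get_kernel_random_bytes(void *buf, int nbytes)
	{
		/* reseed from input_pool like nonblocking_pool does, but
		 * only until the pool has seen poolwords worth of entropy */
		if (!kernel_pool_filled)
			xfer_secondary_pool(&kernel_pool, nbytes);
		/* after that point: never reseed again, just extract */
		extract_entropy(&kernel_pool, buf, nbytes, 0, 0);
	}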

May I further propose to replace the get_random_bytes invocation from
create_elf_tables with a call to the retrieval of random numbers from
the kernel pool?

I think that approach satisfies the requirement for stack protection and
ASLR as there is still no prediction possible of the random number.
Maybe even other in-kernel users can use that for sufficient good random
numbers.

When testing the patch, the following behavior of the entropy estimator is 
seen: http://www.eperm.de/entropy_estimator_time_patch.png -- compare that to 
the initial graph! The first minute, you have the low numbers as the kernel 
pool needs to fill. Once it is filled, the entropy climbs. The sharp reductions 
are due to the start of Firefox which seems to pull entropy.

See patch attached. It applies against vanilla version 3.6.

Thanks
Stephan

Signed-off-by: Stephan Mueller 

---

diff -purN linux-3.6/drivers/char/random.c linux-3.6-sm/drivers/char/random.c
--- linux-3.6/drivers/char/random.c 2012-10-01 01:47:46.0 +0200
+++ linux-3.6-sm/drivers/char/random.c  2012-12-11 11:51:58.997172447 +0100
@@ -404,11 +404,12 @@ static bool debug;
 module_param(debug, bool, 0644);
 #define DEBUG_ENT(fmt, arg...) do { \
if (debug) \
-   printk(KERN_DEBUG "random %04d %04d %04d: " \
+   printk(KERN_DEB

Re: [PATCH 5/6] ACPI: Replace struct acpi_bus_ops with enum type

2012-12-11 Thread Rafael J. Wysocki
On Monday, December 10, 2012 06:26:08 PM Yinghai Lu wrote:
> On Mon, Dec 10, 2012 at 5:28 PM, Rafael J. Wysocki  wrote:
> >>
> >> OK, thanks for the pointers.  I actually see more differences between our
> >> patchsets.  For one example, you seem to have left the parent->ops.bind()
> >> stuff in acpi_add_single_object() which calls it even drivers_autoprobe is
> >> set.
> >
> > Sorry, that should have been "which calls it even when drivers_autoprobe is
> > not set".  I need to be more careful.
> >
> 
> oh,  Jiang Liu had one patch to remove that workaround.
> 
> http://git.kernel.org/?p=linux/kernel/git/yinghai/linux-yinghai.git;a=commitdiff;h=b40dba80c2b8395570d8357e6b3f417c27c84504
> 
> ACPI/pci-bind: remove bind/unbind callbacks from acpi_device_ops

OK, so I'm looking at the current code, which for me is the master branch of
the linux-pm.git tree (the linux-next branch is the same ATM), and I'm not
seeing acpi_pci_unbind_cb() in there.  So surely this patch applies to
something different, right?  In which case I wonder what reason there is for
me to look at it at all?

Besides, I think it may be done differently and in a more straightforward
way.  Namely, on top of my current patchset, it is guaranteed that not only
struct pci_dev objects will always be registered after the companion struct
acpi_device ones, but also they always will be *created* after those
companion objects have been registered.  So in principle we can populate
a new struct pci_dev's ACPI handle as soon as in pci_scan_device(),
next to pci_set_of_node().  Then, we can do something like acpi_pci_bind(),
although without the whole acpi_get_pci_dev() nonsense, in pci_setup_device(),
in which case we won't need to do it anywhere else.

As an added benefit, acpi_platform_notify() would then see a populated ACPI
handle in that struct pci_dev when finally registering the PCI device, so it
wouldn't need to do the whole acpi_find_bridge_device() and type->find_device()
dance.

> Maybe you can review that patches in my for-pci-next2...
> those are ACPI related anyway.

I can, provided that (1) they are based on top of my tree or v3.7 and (2)
they don't conflict with patches we're currently discussing.

> those patches have been there for a while, and Bjorn did not have time
> to digest them.

Well, Bjorn's review bandwidth is limited and we need to take that into
account.

> or you prefer I resend updated version as huge whole patchset?

No, no huge patchsets, please.  Let's take one step at a time, so that
everyone involved/interested can understand what's going on, OK?

My review capacity also is not unlimited, mind you.  I can't promise I'll
have the time to review more than a few patches a day (where "a few" is
rather less than "several").

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: [PATCH 3/3 v2] iio: add rtc-driver for HID sensors of type time

2012-12-11 Thread Alexander Holler

On 11.12.2012 10:40, Lars-Peter Clausen wrote:

>> Yes, move the header or merge into existing one as makes sense.
>> I'm not pulling this driver into the IIO tree (unless for some
>> reason Alessandro wants me to and I can't think why he would...).
>
> Alessandro has been pretty quiet for quite some time now. Luckily Andrew
> Morton usually picks up the stuff for orphaned subsystems. So put him on Cc
> for v4.

Will do it. Thanks a lot for your review.

I will post the whole series (4 patches including the merge of
hid-sensor-attributes.h) again, when I've finished v3 of the driver
(hopefully this evening), marking some patches as RESEND. So 3 out of
those 4 patches will be for iio (as hid-sensor-hub is part of it), and
the last one, the rtc driver itself, will be for the rtc subsystem. I
don't know if they have to be pulled by different maintainers. ;)


Regards,

Alexander


[PATCH] regulator: core: Fix logic to determine if regulator can change voltage

2012-12-11 Thread Axel Lin
Having a linear_min_sel setting means the first linear_min_sel selectors are
invalid. We need to subtract linear_min_sel when using n_voltages to determine
if the regulator can change voltage. For example, with n_voltages == 2 and
linear_min_sel == 1 there is only one usable selector, so the voltage cannot
be changed.

Signed-off-by: Axel Lin 
---
 drivers/regulator/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index cd1b201..0f65b24 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -1886,7 +1886,7 @@ int regulator_can_change_voltage(struct regulator *regulator)
 
if (rdev->constraints &&
rdev->constraints->valid_ops_mask & REGULATOR_CHANGE_VOLTAGE &&
-   rdev->desc->n_voltages > 1)
+   (rdev->desc->n_voltages - rdev->desc->linear_min_sel) > 1)
return 1;
 
return 0;
-- 
1.7.9.5





Re: [PATCH v4 1/2] iommu/shmobile: Add iommu driver for Renesas IPMMU modules

2012-12-11 Thread Laurent Pinchart
Hi Eiraku-san,

On Tuesday 11 December 2012 19:10:42 Hideki EIRAKU wrote:
> On Mon, 10 Dec 2012 16:55:58 +0100 Laurent Pinchart wrote:
> > On Monday 15 October 2012 17:34:52 Hideki EIRAKU wrote:
> >> This is the Renesas IPMMU driver and IOMMU API implementation.
> >> 
> >> The IPMMU module supports the MMU function and the PMB function.
> > 
> > That sentence make me believe that both MMU and PMB were supported by the
> > driver, as "module" often refers to Linux kernel modules in this context.
> > Maybe you could replace "module" by "hardware module".
> 
> OK,
> 
> >> The MMU function provides address translation by pagetable compatible
> >> with ARMv6. The PMB function provides address translation including
> >> tile-linear translation. This patch implements the MMU function.
> >> 
> >> The iommu driver does not register a platform driver directly because:
> >> - the register space of the MMU function and the PMB function
> >>   have a common register (used for settings flush), so they should
> >>   ideally have a way to appropriately share this register.
> >> - the MMU function uses the IOMMU API while the PMB function does not.
> >> - the two functions may be used independently.
> >> 
> >> Signed-off-by: Hideki EIRAKU 
> >> ---
> >> 
> >>  arch/arm/mach-shmobile/Kconfig  |6 +
> >>  arch/arm/mach-shmobile/Makefile |3 +
> >>  arch/arm/mach-shmobile/include/mach/ipmmu.h |   16 ++
> >>  arch/arm/mach-shmobile/ipmmu.c  |  150 
> >>  drivers/iommu/Kconfig   |   56 +
> >>  drivers/iommu/Makefile  |1 +
> >>  drivers/iommu/shmobile-iommu.c  |  352 +
> >>  7 files changed, 584 insertions(+), 0 deletions(-)
> >>  create mode 100644 arch/arm/mach-shmobile/include/mach/ipmmu.h
> >>  create mode 100644 arch/arm/mach-shmobile/ipmmu.c
> >>  create mode 100644 drivers/iommu/shmobile-iommu.c
> > 
> > What is the reason for splitting the driver in two files ? Can't you put
> > all the code in drivers/iommu/shmobile-iommu.c ? Storing driver code in
> > arch/* is discouraged.
> 
> The reason is the one I described in the text above. The PMB function is
> completely different from the MMU function, but both functions are on
> the same IPMMU hardware module and share the register space. I think
> that a driver using the PMB part, which is not yet released, should not
> depend on Linux's iommu interface, so I split the driver in two
> files: the IPMMU platform driver part (in arch/arm/mach-shmobile/) and
> Linux's iommu part (in drivers/iommu/). For the IPMMU platform driver part,
> do you have any suggestions other than arch/* where this should go? It is a
> generic platform device.

I think both parts should go to drivers/iommu/. You can keep the code split 
across two files, but I think you should register a single platform driver. 
The IPMMU is a single hardware module, so it should be handled by a single 
driver. That driver can expose two different APIs (IOMMU and whatever API will 
be used for PMB), and you can make those APIs selectable as Kconfig options, 
but they should in my opinion be implemented in a single driver.

> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program; if not, write to the Free Software
> >> + * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301
> >> USA
> > 
> > You can remove this last paragraph, we don't want to patch every file in
> > the kernel if the FSF moves to a new building :-)
> 
> OK, I will remove the paragraph.
> 
> >> +  for (dev = ipmmu_devices; dev; dev = dev->archdata.iommu) {
> >> +  if (arm_iommu_attach_device(dev, iommu_mapping))
> >> +  pr_err("arm_iommu_attach_device failed\n");
> >> +  }
> >> +err:
> >> +  spin_unlock(&lock_add);
> >> +  return 0;
> >> +}
> >> +
> >> +void ipmmu_add_device(struct device *dev)
> >> +{
> >> +  spin_lock(&lock_add);
> >> +  dev->archdata.iommu = ipmmu_devices;
> >> +  ipmmu_devices = dev;
> > 
> > That looks a bit hackish to me. I'd like to suggest a different approach,
> > that would be compatible with supporting multiple IPMMU instances.
> > 
> > dev->archdata.iommu should point to a new sh_ipmmu_arch_data structure
> > that would contain an IPMMU name (const char *) and a pointer to a struct
> > shmobile_iommu_priv.
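
A minimal sketch of that suggestion (the exact layout is of course up
to you, this is illustrative only):

	struct sh_ipmmu_arch_data {
		const char *ipmmu_name;			/* IPMMU instance name */
		struct shmobile_iommu_priv *priv;	/* NULL until attach_dev() */
	};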
> > 
> > ipmmu_add_device() would take a new IPMMU name argument, allocate an
> > sh_ipmmu_arch_data instance dynamically and initialize its name field to
> > the name passed to the function. The shmobile_iommu_priv pointer would be
> > set to NULL. No other operation would be performed (you will likely get
> > rid of the global ipmmu_devices and iommu_mapping variables).
> > 
> > Then, the attach_dev operation handler would retrieve the
> > dev->archdata.iommu pointer, cast that to an sh_ipmmu_arch_data, and
> > retrieve the IPMMU associated with the name (either by walking a
> > driver-global li

WARNING: at drivers/gpu/drm/i915/i915_gem.c:3437 i915_gem_object_pin+0x151/0x1a0()

2012-12-11 Thread Nikola Pajkovsky
Hey folks,

looks like we still have an oops in i915. i915 maintainers, do you have
any idea what's going on? I will try to trigger that oops later today
and provide more information.

[10733.442608] WARNING: at drivers/gpu/drm/i915/i915_gem.c:3437 i915_gem_object_pin+0x151/0x1a0()
[10733.442612] Hardware name: 4243BQ9
[10733.442615] Modules linked in: kvm_intel kvm
[10733.442632] Pid: 1361, comm: X Not tainted 3.7.0 #19
[10733.442635] Call Trace:
[10733.442650]  [] warn_slowpath_common+0x7a/0xb0
[10733.442659]  [] warn_slowpath_null+0x15/0x20
[10733.442669]  [] i915_gem_object_pin+0x151/0x1a0
[10733.442679]  [] i915_gem_object_pin_to_display_plane+0x5a/0x90
[10733.442689]  [] ? i915_mutex_lock_interruptible+0x35/0xc0
[10733.442700]  [] intel_pin_and_fence_fb_obj+0x50/0x130
[10733.442709]  [] intel_gen6_queue_flip+0x3e/0x160
[10733.442719]  [] intel_crtc_page_flip+0x1d2/0x370
[10733.442728]  [] drm_mode_page_flip_ioctl+0x273/0x300
[10733.442738]  [] drm_ioctl+0x2d0/0x520
[10733.442746]  [] ? drm_mode_gamma_get_ioctl+0x120/0x120
[10733.442757]  [] ? _raw_spin_unlock_irqrestore+0x5f/0x70
[10733.442768]  [] ? del_timer+0x61/0x90
[10733.442779]  [] ? rcu_cleanup_after_idle+0x24/0x40
[10733.442829]  [] ? rcu_prepare_for_idle+0x1b6/0x300
[10733.442838]  [] ? rcu_user_exit+0x69/0xc0
[10733.442847]  [] do_vfs_ioctl+0x8a/0x2f0
[10733.442859]  [] ? trace_hardirqs_on+0xd/0x10
[10733.442866]  [] sys_ioctl+0x91/0xa0
[10733.442876]  [] tracesys+0xe1/0xe6

-- 
Nikola


Re: [PATCH net-next rfc 2/2] tuntap: allow unprivileged user to enable and disable queues

2012-12-11 Thread Michael S. Tsirkin
On Tue, Dec 11, 2012 at 07:03:47PM +0800, Jason Wang wrote:
> Currently, when a file is attached to tuntap through TUNSETQUEUE, the uid/gid
> and CAP_NET_ADMIN were checked, and we use this ioctl to create and destroy
> queues. Sometimes userspace such as qemu needs the ability to enable and
> disable a specific queue without privilege, since the guest operating system
> may change the number of queues it wants to use.
> 
> To support this, the patch introduces a flag 'enabled' which is used to
> track whether the queue is enabled by userspace, and also restricts a queue
> to attaching to only one device. With this patch, the DAC checking when
> adding queues through IFF_ATTACH_QUEUE is still done, and after that,
> IFF_DETACH_QUEUE/IFF_ATTACH_QUEUE can be used to disable/enable the
> queue.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/tun.c |   81 +++-
>  1 files changed, 73 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index d593f56..43831a7 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -138,6 +138,7 @@ struct tun_file {
>   /* only used for fasnyc */
>   unsigned int flags;
>   u16 queue_index;
> + bool enabled;
>  };
>  
>  struct tun_flow_entry {
> @@ -345,9 +346,11 @@ unlock:
>  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
>  {
>   struct tun_struct *tun = netdev_priv(dev);
> + struct tun_file *tfile;
>   struct tun_flow_entry *e;
>   u32 txq = 0;
>   u32 numqueues = 0;
> + int i;
>  
>   rcu_read_lock();
>   numqueues = tun->numqueues;
> @@ -366,6 +369,19 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb)
>   txq -= numqueues;
>   }
>  
> + tfile = rcu_dereference(tun->tfiles[txq]);
> + if (unlikely(!tfile->enabled))

This unlikely tag is suspicious. It should be perfectly
legal to use fewer queues than created.

> + /* tun_detach() should make sure there's at least one queue
> +  * could be used to do the tranmission.
> +  */
> + for (i = 0; i < numqueues; i++) {
> + tfile = rcu_dereference(tun->tfiles[i]);
> + if (tfile->enabled) {
> + txq = i;
> + break;
> + }
> + }
> +

Worst case this will do a linear scan over all queues on each packet.
Instead, I think we need a list of all queues and only install
the active ones in the array.
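
Something along these lines, say (a sketch only; the all_queues list and
its 'next' member do not exist in the patch):

	/* mirror only the enabled queues into tfiles[], so that
	 * tun_select_queue() never hits a disabled slot */
	static void tun_rebuild_active(struct tun_struct *tun)
	{
		struct tun_file *tfile;
		int n = 0;

		list_for_each_entry(tfile, &tun->all_queues, next) {
			if (!tfile->enabled)
				continue;
			tfile->queue_index = n;
			rcu_assign_pointer(tun->tfiles[n++], tfile);
		}
		tun->numqueues = n;
	}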

>   rcu_read_unlock();
>   return txq;
>  }
> @@ -386,6 +402,36 @@ static void tun_set_real_num_queues(struct tun_struct *tun)
>   netif_set_real_num_rx_queues(tun->dev, tun->numqueues);
>  }
>  
> +static int tun_enable(struct tun_file *tfile)
> +{
> + if (tfile->enabled == true)

simply if (tfile->enabled)

> + return -EINVAL;

Actually it's better to have operations be
idempotent. If it's enabled, enabling should
be a NOP not an error.

> +
> + tfile->enabled = true;
> + return 0;
> +}
> +
> +static int tun_disable(struct tun_file *tfile)
> +{
> + struct tun_struct *tun = rcu_dereference_protected(tfile->tun,
> +
> lockdep_rtnl_is_held());
> + u16 index = tfile->queue_index;
> +
> + if (!tun)
> + return -EINVAL;
> +
> + if (tun->numqueues == 1)
> + return -EINVAL;

So if there's a single queue we can't disable it,
but if there are > 1 we can disable them all.
This seems arbitrary.

> +
> + BUG_ON(index >= tun->numqueues);
> + tfile->enabled = false;
> +
> + synchronize_net();
> + tun_flow_delete_by_queue(tun, index);
> +
> + return 0;
> +}
> +
>  static void __tun_detach(struct tun_file *tfile, bool clean)
>  {
>   struct tun_file *ntfile;
> @@ -446,6 +492,7 @@ static void tun_detach_all(struct net_device *dev)
>   BUG_ON(!tfile);
>   wake_up_all(&tfile->wq.wait);
>   rcu_assign_pointer(tfile->tun, NULL);
> + tfile->enabled = false;
>   --tun->numqueues;
>   }
>   BUG_ON(tun->numqueues != 0);
> @@ -490,6 +537,7 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
>   rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>   sock_hold(&tfile->sk);
>   tun->numqueues++;
> + tfile->enabled = true;
>  
>   tun_set_real_num_queues(tun);
>  
> @@ -672,6 +720,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>   if (txq >= tun->numqueues)
>   goto drop;
>  
> + /* Drop packet if the queue was not enabled */
> + if (!tfile->enabled)
> + goto drop;
> +
>   tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>  
>   BUG_ON(!tfile);
> @@ -1010,6 +1062,9 @@ static ssize_t tun_get_u

Re: [PATCH v3 3/5] page_alloc: Introduce zone_movable_limit[] to keep movable limit for nodes

2012-12-11 Thread Simon Jeons
On Tue, 2012-12-11 at 11:07 +0800, Jianguo Wu wrote:
> On 2012/12/11 10:33, Tang Chen wrote:
> 
> > This patch introduces a new array zone_movable_limit[] to store the
> > ZONE_MOVABLE limit from movablecore_map boot option for all nodes.
> > The function sanitize_zone_movable_limit() will find out to which
> > node the ranges in movable_map.map[] belongs, and calculates the
> > low boundary of ZONE_MOVABLE for each node.
> > 
> > Signed-off-by: Tang Chen 
> > Signed-off-by: Jiang Liu 
> > Reviewed-by: Wen Congyang 
> > Reviewed-by: Lai Jiangshan 
> > Tested-by: Lin Feng 
> > ---
> >  mm/page_alloc.c |   77 
> > +++
> >  1 files changed, 77 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1c91d16..4853619 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -206,6 +206,7 @@ static unsigned long __meminitdata 
> > arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >  static unsigned long __initdata required_kernelcore;
> >  static unsigned long __initdata required_movablecore;
> >  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> > +static unsigned long __meminitdata zone_movable_limit[MAX_NUMNODES];
> >  
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> > @@ -4340,6 +4341,77 @@ static unsigned long __meminit 
> > zone_absent_pages_in_node(int nid,
> > return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
> >  }
> >  
> > +/**
> > + * sanitize_zone_movable_limit - Sanitize the zone_movable_limit array.
> > + *
> > + * zone_movable_limit is initialized as 0. This function will try to get
> > + * the first ZONE_MOVABLE pfn of each node from movablecore_map, and
> > + * assigne them to zone_movable_limit.
> > + * zone_movable_limit[nid] == 0 means no limit for the node.
> > + *
> > + * Note: Each range is represented as [start_pfn, end_pfn)
> > + */
> > +static void __meminit sanitize_zone_movable_limit(void)
> > +{
> > +   int map_pos = 0, i, nid;
> > +   unsigned long start_pfn, end_pfn;
> > +
> > +   if (!movablecore_map.nr_map)
> > +   return;
> > +
> > +   /* Iterate all ranges from minimum to maximum */
> > +   for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> > +   /*
> > +* If we have found lowest pfn of ZONE_MOVABLE of the node
> > +* specified by user, just go on to check next range.
> > +*/
> > +   if (zone_movable_limit[nid])
> > +   continue;
> > +
> > +#ifdef CONFIG_ZONE_DMA
> > +   /* Skip DMA memory. */
> > +   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA])
> > +   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA];
> > +#endif
> > +
> > +#ifdef CONFIG_ZONE_DMA32
> > +   /* Skip DMA32 memory. */
> > +   if (start_pfn < arch_zone_highest_possible_pfn[ZONE_DMA32])
> > +   start_pfn = arch_zone_highest_possible_pfn[ZONE_DMA32];
> > +#endif
> > +
> > +#ifdef CONFIG_HIGHMEM
> > +   /* Skip lowmem if ZONE_MOVABLE is highmem. */
> > +   if (zone_movable_is_highmem() &&
> 
> Hi Tang,
> 
> I think zone_movable_is_highmem() is not work correctly here.
>   sanitize_zone_movable_limit
>   zone_movable_is_highmem  <--using movable_zone here
>   find_zone_movable_pfns_for_nodes
>   find_usable_zone_for_movable <--movable_zone is specified here
> 

Hi Jianguo and Chen,

- What's the meaning of zone_movable_is_highmem(), does it mean all zone
highmem pages are zone movable pages, or something else?
- dmesg 

> 0.00] Zone ranges:
> [0.00]   DMA  [mem 0x0001-0x00ff]
> [0.00]   Normal   [mem 0x0100-0x373fdfff]
> [0.00]   HighMem  [mem 0x373fe000-0xb6cf]
> [0.00] Movable zone start for each node
> [0.00]   Node 0: 0x9780

Why is the start of zone movable in the range of zone highmem, if all
the pages of zone movable are from zone highmem? If the answer is yes,
are zone movable and zone highmem of equal status or not?

> I think Jiang Liu's patch works fine for highmem, please refer to:
> http://marc.info/?l=linux-mm&m=135476085816087&w=2
> 
> Thanks,
> Jianguo Wu
> 
> > +   start_pfn < arch_zone_lowest_possible_pfn[ZONE_HIGHMEM])
> > +   start_pfn = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM];
> > +#endif
> > +
> > +   if (start_pfn >= end_pfn)
> > +   continue;
> > +
> > +   while (map_pos < movablecore_map.nr_map) {
> > +   if (end_pfn <= movablecore_map.map[map_pos].start_pfn)
> > +   break;
> > +
> > +   if (start_pfn >= movablecore_map.map[map_pos].end_pfn) {
> > +   map_pos++;
> > +   continue;
> > +   }
> > +
> > +   /*
> > +

[PATCH V3 2/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Xishi Qiu
1) Adjust the function structure; there are too many return points
   randomly intermingled with "goto done" return points.
2) Use atomic_long_inc instead of atomic_long_add.

Signed-off-by: Xishi Qiu 
Signed-off-by: Jiang Liu 
---
 mm/memory-failure.c |   34 --
 1 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9b74983..81f942d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1421,12 +1421,13 @@ static int soft_offline_huge_page(struct page *page, int flags)

if (PageHWPoison(hpage)) {
pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
}

ret = get_any_page(page, pfn, flags);
if (ret < 0)
-   return ret;
+   goto out;
if (ret == 0)
goto done;

@@ -1437,7 +1438,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
if (ret) {
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
-   return ret;
+   goto out;
}
 done:
/* keep elevated page count for bad page */
@@ -1447,7 +1448,7 @@ done:
unlock_page(hpage);

dequeue_hwpoisoned_huge_page(hpage);
-
+out:
return ret;
 }

@@ -1479,24 +1480,28 @@ int soft_offline_page(struct page *page, int flags)
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_trans_head(page);

-   if (PageHuge(page))
-   return soft_offline_huge_page(page, flags);
+   if (PageHuge(page)) {
+   ret = soft_offline_huge_page(page, flags);
+   goto out;
+   }
if (PageTransHuge(hpage)) {
if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
pr_info("soft offline: %#lx: failed to split THP\n",
pfn);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
}
}

if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
}

ret = get_any_page(page, pfn, flags);
if (ret < 0)
-   return ret;
+   goto out;
if (ret == 0)
goto done;

@@ -1515,14 +1520,15 @@ int soft_offline_page(struct page *page, int flags)
 */
ret = get_any_page(page, pfn, 0);
if (ret < 0)
-   return ret;
+   goto out;
if (ret == 0)
goto done;
}
if (!PageLRU(page)) {
pr_info("soft_offline: %#lx: unknown non LRU page type %lx\n",
pfn, page->flags);
-   return -EIO;
+   ret = -EIO;
+   goto out;
}

/*
@@ -1577,14 +1583,14 @@ int soft_offline_page(struct page *page, int flags)
pfn, ret, page_count(page), page->flags);
}
if (ret)
-   return ret;
+   goto out;

 done:
/* keep elevated page count for bad page */
lock_page(page);
-   atomic_long_add(1, &mce_bad_pages);
+   atomic_long_inc(&mce_bad_pages);
SetPageHWPoison(page);
unlock_page(page);
-
+out:
return ret;
 }
-- 
1.7.1




[PATCH V3 1/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Xishi Qiu
1) Move the poisoned-page check to the beginning of the function.
2) Take the page lock so that a concurrent unpoison cannot clear the flag
   (see the sketch below).
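A self-contained user-space model of what the lock in point 2 prevents
(a pthread mutex stands in for lock_page(); every name here is a
stand-in, not mm code). Because unpoison takes the same lock, it can no
longer clear the flag in the middle of soft-offline's counter update and
flag set:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static int hwpoison;		/* models the PageHWPoison flag */
static long mce_bad_pages;	/* models the global counter */

static void soft_offline(void)
{
	pthread_mutex_lock(&page_lock);		/* lock_page(page) */
	mce_bad_pages++;			/* atomic_long_add(...) */
	hwpoison = 1;				/* SetPageHWPoison(page) */
	pthread_mutex_unlock(&page_lock);	/* unlock_page(page) */
}

static void unpoison(void)
{
	pthread_mutex_lock(&page_lock);
	if (hwpoison) {			/* cannot see a half-done update */
		hwpoison = 0;
		mce_bad_pages--;
	}
	pthread_mutex_unlock(&page_lock);
}

int main(void)
{
	soft_offline();
	unpoison();
	printf("bad pages: %ld, poisoned: %d\n", mce_bad_pages, hwpoison);
	return 0;
}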

Signed-off-by: Xishi Qiu 
Signed-off-by: Jiang Liu 
---
 mm/memory-failure.c |   43 ++++++++++++++++++++++---------------------
 1 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 8b20278..9b74983 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1419,18 +1419,17 @@ static int soft_offline_huge_page(struct page *page, int flags)
unsigned long pfn = page_to_pfn(page);
struct page *hpage = compound_head(page);

+   if (PageHWPoison(hpage)) {
+   pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
+   return -EBUSY;
+   }
+
ret = get_any_page(page, pfn, flags);
if (ret < 0)
return ret;
if (ret == 0)
goto done;

-   if (PageHWPoison(hpage)) {
-   put_page(hpage);
-   pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-   return -EBUSY;
-   }
-
/* Keep page count to indicate a given hugepage is isolated. */
ret = migrate_huge_page(hpage, new_page, MPOL_MF_MOVE_ALL, false,
MIGRATE_SYNC);
@@ -1441,12 +1440,14 @@ static int soft_offline_huge_page(struct page *page, int flags)
return ret;
}
 done:
-   if (!PageHWPoison(hpage))
-   atomic_long_add(1 << compound_trans_order(hpage),
-   &mce_bad_pages);
+   /* keep elevated page count for bad page */
+   lock_page(hpage);
+   atomic_long_add(1 << compound_trans_order(hpage), &mce_bad_pages);
set_page_hwpoison_huge_page(hpage);
+   unlock_page(hpage);
+
dequeue_hwpoisoned_huge_page(hpage);
-   /* keep elevated page count for bad page */
+
return ret;
 }

@@ -1488,6 +1489,11 @@ int soft_offline_page(struct page *page, int flags)
}
}

+   if (PageHWPoison(page)) {
+   pr_info("soft offline: %#lx page already poisoned\n", pfn);
+   return -EBUSY;
+   }
+
ret = get_any_page(page, pfn, flags);
if (ret < 0)
return ret;
@@ -1519,19 +1525,11 @@ int soft_offline_page(struct page *page, int flags)
return -EIO;
}

-   lock_page(page);
-   wait_on_page_writeback(page);
-
/*
 * Synchronized using the page lock with memory_failure()
 */
-   if (PageHWPoison(page)) {
-   unlock_page(page);
-   put_page(page);
-   pr_info("soft offline: %#lx page already poisoned\n", pfn);
-   return -EBUSY;
-   }
-
+   lock_page(page);
+   wait_on_page_writeback(page);
/*
 * Try to invalidate first. This should work for
 * non dirty unmapped page cache pages.
@@ -1582,8 +1580,11 @@ int soft_offline_page(struct page *page, int flags)
return ret;

 done:
+   /* keep elevated page count for bad page */
+   lock_page(page);
atomic_long_add(1, &mce_bad_pages);
SetPageHWPoison(page);
-   /* keep elevated page count for bad page */
+   unlock_page(page);
+
return ret;
 }
-- 
1.7.1




[PATCH V3 0/2] MCE: fix an error of mce_bad_pages statistics

2012-12-11 Thread Xishi Qiu
When we use "/sys/devices/system/memory/soft_offline_page" to offline a
*free* page, mce_bad_pages is incremented and the HWPoison flag is set on
the page, but the page is still managed by the buddy page allocator.

$ cat /proc/meminfo | grep HardwareCorrupted shows the value.

If we then offline the same page, mce_bad_pages is incremented *again*,
which means the value is now incorrect (assuming the page stays free
during this short window).

soft_offline_page()
get_any_page()
"else if (is_free_buddy_page(p))" branch return 0
"goto done";
"atomic_long_add(1, &mce_bad_pages);"

Changelog:
V3:
-add page lock when set HWPoison flag
-adjust the function structure
V2 and V1:
-fix the error

Xishi Qiu (2):
  move poisoned page check at the beginning of the function
  fix the function structure

 mm/memory-failure.c |   69 ++++++++++++++++++++++++++++++++++++++-------------------------------
 1 files changed, 38 insertions(+), 31 deletions(-)



Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI

2012-12-11 Thread Steven Whitehouse
Hi,

On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote:
> A sentence or two got chopped out during an editing pass.  Let me try
> that again so it's a bit clearer what I was trying to say.
> 
> Sure, but if the block device supports WRITE_SAME or persistent
> discard, then presumably fallocate() should do this automatically all
> the time, and not require a flag to request this behavior.  The only
> reason why you might not is if the WRITE_SAME is more costly.  That is
> when a seek plus writing 1MB does take more time than the amount of
> disk time fraction that it consumes if you compare it to a seek plus
> writing 4k or 32k.
> 
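To put rough numbers on that trade-off (the figures below are assumptions
for illustration, not measurements): with a 10 ms seek and 100 MB/s of
sequential bandwidth, a seek plus a 1 MB write costs about 10 + 10 = 20 ms,
while a seek plus a 4k write costs about 10 + 0.04 ≈ 10 ms. Unconditionally
zeroing a 1 MB run therefore roughly doubles the disk time of a small
write, which is exactly the regime where a threshold (or flag) matters.
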
Well, there are two cases here, I think:

One is the GFS2 type case where the metadata doesn't support "these
blocks are allocated but zero" so that we must, for all fallocate
requests, zero out the blocks at fallocate time to avoid exposing stale
data to userspace.

The advantage over dd from userspace in this case is, firstly, that
avoiding the copy from userspace should make it faster. Also, the use of
sb_issue_zeroout means that block devices which don't need an explicit
block of zeros written should be able to do this faster; however, that
is implemented at the block layer, and the fs shouldn't need to care
about how it is implemented. In the case of GFS2, we implemented fallocate
because it was useful to have the feature of being able to allocate
beyond the end of file without changing the file size. This helped us
fix a bug in our fs grow code, so performance was a secondary (but
welcome!) consideration. 
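
For reference, that "allocate beyond the end of file without changing
the file size" behaviour is what the FALLOC_FL_KEEP_SIZE flag exposes to
user space; a minimal caller looks something like this (Linux-specific;
the file name and sizes are arbitrary choices for the sketch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	int fd = open("testfile", O_CREAT | O_RDWR, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Preallocate 1 MiB past EOF without changing i_size; on GFS2
	 * the blocks are zeroed at allocation time, so no stale data can
	 * be exposed through them later. */
	if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1 << 20) < 0)
		perror("fallocate");
	if (fstat(fd, &st) == 0)
		printf("size=%lld blocks=%lld\n",
		       (long long)st.st_size, (long long)st.st_blocks);
	close(fd);
	return 0;
}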

The other is the ext4/XFS type case, where the metadata does support
"these blocks are allocated but zero", which means that the metadata
needs to be changed twice: once to "these blocks are allocated but zero"
at fallocate time, and again to "these blocks have valid content" at
write time. As I understand it, this second metadata change is what is
causing the performance problem.
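
A toy state machine of those two transitions (purely illustrative; this
is not ext4/XFS code) makes the doubled metadata cost visible:

#include <stdio.h>

enum extent_state { HOLE, UNWRITTEN, WRITTEN };

static enum extent_state st = HOLE;
static int metadata_updates;

static void do_fallocate(void)
{
	if (st == HOLE) {
		st = UNWRITTEN;		/* "allocated but zero" */
		metadata_updates++;	/* first metadata change */
	}
}

static void do_write(void)
{
	if (st == UNWRITTEN) {
		st = WRITTEN;		/* "has valid content" */
		metadata_updates++;	/* second change, at write time */
	}
}

int main(void)
{
	do_fallocate();
	do_write();
	printf("metadata updates: %d\n", metadata_updates);	/* prints 2 */
	return 0;
}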

> Ext4 currently uses a threshold of 32k for this break point (below
> that, we will use sb_issue_zeroout; above that, we will break apart an
> uninitialized extent when writing into a preallocated region).  It may
> be that 32k is too low, especially for certain types of devices (i.e.,
> SSDs versus RAID 5, where it should be aligned on a RAID stripe,
> etc.).  More of an issue might be that there will be some disagreement
> about whether people want the system to automatically tune for
> average throughput vs 99.9 percentile latency.
> 
> Regardless, this is actually something which I think the file system
> should try to do automatically if at all possible, via some kind of
> auto-tuning hueristic, instead of using an explicit fallocate(2) flag.
> (See, I don't propose using a new fallocate flag for everything.  :-)
> 
> - Ted
> 

It sounds like it might well be worth experimenting with the thresholds
as you suggest; 32k is really pretty small. I guess that the real
question here is what is the cost of the metadata change (to say what is
written and what remains unwritten) vs. simply zeroing out the unwritten
blocks in the extent when the write occurs.

There are likely to be a number of factors affecting that, and the
answer doesn't appear straightforward,

Steve.




Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

2012-12-11 Thread Alex Shi
On 12/11/2012 08:51 AM, Alex Shi wrote:
> On Mon, Dec 10, 2012 at 4:22 PM, Alex Shi  wrote:
>> This patchset is based on the tip/sched/core tree for now, since it is
>> more stable than tip/master, and it's easy to rebase onto tip/master.
>>
>> It includes 3 parts changes.
>>
>> 1, simplified fork, patches 1~4: simplifies the fork/exec/wake logic in
>> find_idlest_group and select_task_rq_fair. It increases hackbench process
>> and thread performance by 10+% on our 4-socket SNB EP machine.
>>
>> 2, enable load average in load balancing, patches 5~9: uses the load
>> average in load balancing, with a runnable load value initialization bug
>> fix and a new fork task load contribution enhancement.
>>
>> 3, power-aware scheduling, patches 10~18: defines 2 new power-aware
>> policies, balance and powersaving, and then tries to spread or shrink
>> tasks over CPU units according to the chosen scheduler policy. That can
>> save much power when the number of tasks in the system is no more than
>> the number of CPUs.
> 
> Tried the sysbench fileio test in rndrw mode, with the thread count at half
> the number of logical CPUs; performance is similar, and power drops by about
> 5~10 Watts on 2-socket SNB EP and NHM EP boxes.

Another test: parallel compression with pigz on Linus' git tree. The
results show we get much better performance/power with the powersaving
and balance policies:

testing command:
#pigz -k -c  -p$x -r linux* &> /dev/null

On a NHM EP box:
         powersaving        balance            performance
x = 4    166.516 /88  68    170.515 /82  71    165.283 /103 58
x = 8    173.654 /61  94    177.693 /60  93    172.31  /76  76

On a 2-socket SNB EP box:
         powersaving        balance            performance
x = 4    190.995 /149 35    200.6   /129 38    208.561 /135 35
x = 8    197.969 /108 46    208.885 /103 46    213.96  /108 43
x = 16   205.163 /76  64    212.144 /91  51    229.287 /97  44

data format is: 166.516 /88 68
166.516: average Watts
88: seconds (compress time)
68: scaled performance/power = 100 / time / power
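
As a quick arithmetic check of that formula on the first powersaving
cell: 100 / 88 / 166.516 ≈ 0.0068, so the reported 68 evidently carries
an implicit 10^4 scaling factor (my assumption; the relative ranking of
the columns is unaffected either way).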




<    1   2   3   4   5   6   7   >