Re: [f2fs-dev] [PATCH 1/2] f2fs: check only data or node for summary

2016-07-05 Thread He YunLei

On 2016/6/11 5:01, Jaegeuk Kim wrote:

For GC, we only need to check whether the type is data or node, since we
occasionally allocate data/node blocks in a different type of log.

Signed-off-by: Jaegeuk Kim 
---
  fs/f2fs/gc.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index e1d274c..c2c4ac3 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -806,7 +806,8 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi,
f2fs_put_page(sum_page, 0);

sum = page_address(sum_page);
-   f2fs_bug_on(sbi, type != GET_SUM_TYPE((&sum->footer)));
+   f2fs_bug_on(sbi, IS_DATASEG(type) !=
+   IS_DATASEG(GET_SUM_TYPE((&sum->footer))));


Hi, Kim
type has already been transformed into SUM_TYPE_DATA or SUM_TYPE_NODE, so
there is no need to do this here.
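(For illustration, roughly what the caller already does before reaching
do_garbage_collect(); this is a sketch from memory, not a quote of the exact
tree:

	/* type is already normalized to a summary type by the caller */
	unsigned char type = IS_DATASEG(get_seg_entry(sbi, segno)->type) ?
					SUM_TYPE_DATA : SUM_TYPE_NODE;

so the value can be compared with GET_SUM_TYPE() directly.)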

Thanks


/*
 * this is to avoid deadlock:





Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Neo Jia
On Wed, Jul 06, 2016 at 10:22:59AM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/05/2016 11:07 PM, Neo Jia wrote:
> >This is kept there in case the validate_map_request() is not provided by 
> >vendor
> >driver then by default assume 1:1 mapping. So if validate_map_request() is 
> >not
> >provided, fault handler should not fail.
> 
> THESE are the parameters you passed to validate_map_request(), and these info 
> is
> available in mmap(), it really does not matter if you move 
> validate_map_request()
> to mmap(). That's what i want to say.

Let me answer this at the end of my response.

> 
> >
> >>
> >>>None of such information is available at VFIO mmap() time. For example, 
> >>>several VMs
> >>>are sharing the same physical device to provide mediated access. All VMs 
> >>>will
> >>>call the VFIO mmap() on their virtual BAR as part of QEMU vfio/pci 
> >>>initialization
> >>>process, at that moment, we definitely can't mmap the entire physical MMIO
> >>>into both VM blindly for obvious reason.
> >>>
> >>
> >>mmap() carries @length information, so you only need to allocate the 
> >>specified size
> >>(corresponding to @length) of memory for them.
> >
> >Again, you still look at this as a static partition at QEMU configuration 
> >time
> >where the guest mmio will be mapped as a whole at some offset of the physical
> >mmio region. (You still can do that like I said above by not providing
> >validate_map_request() in your vendor driver.)
> >
> 
> Then you can move validate_map_request() to here to achieve custom 
> allocation-policy.
> 
> >But this is not the framework we are defining here.
> >
> >The framework we have here is to provide the driver vendor flexibility to 
> >decide
> >the guest mmio and physical mmio mapping on page basis, and such information 
> >is
> >available during runtime.
> >
> >How such information gets communicated between guest and host driver is up to
> >driver vendor.
> 
> The problems is the sequence of the way "provide the driver vendor
> flexibility to decide the guest mmio and physical mmio mapping on page basis"
> and mmap().
> 
> We should provide such allocation info first then do mmap(). You current 
> design,
> do mmap() -> communication telling such info -> use such info when fault 
> happens,
> is really BAD, because you can not control the time when memory fault will 
> happen.
> The guest may access this memory before the communication you mentioned above,
> and another reason is that KVM MMU can prefetch memory at any time.

Like I have said before, if your implementation doesn't need such flexibility,
you can still do a static mapping at VFIO mmap() time; then your mediated driver
doesn't have to provide validate_map_request() and the fault handler will not be
called.

Let me address your questions below.

1. Information available at VFIO mmap() time?

So you are saying that @req_size and @pgoff are both available at the time
people call VFIO mmap(), when the guest OS is not even running, right?

The answer is no. The only things available at VFIO mmap() time are the following:

1) guest MMIO size

2) host physical MMIO size

3) guest MMIO starting address

4) host MMIO starting address

But none of the above are the @req_size and @pgoff that we are talking about at
validate_map_request() time.

Our host MMIO represents the means to access GPU HW resources. Those GPU HW
resources are allocated dynamically at runtime, so we have no visibility of the
@pgoff and @req_size covering a specific type of GPU HW resource at VFIO mmap()
time. We don't even know whether such a resource will be required for a
particular VM at all.

For example, VM1 may need to run a much heavier graphics workload than VM2. The
end result is that VM1 will get much more of resource A allocated than VM2 to
support its graphics workload. To access resource A, a host mmio region will be
allocated as well, say [pfn_a, size_a] for VM1 and [pfn_b, size_b] for VM2.

Clearly, such a region can be destroyed and reallocated over the mediated
driver's lifetime. This is why we need a fault handler there to map the proper
pages into the guest, after validation, at runtime.
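To illustrate the flow being described, here is a minimal sketch of such a
fault handler. The validate_map_request() name and its signature are
placeholders taken from this discussion, not an existing VFIO API, and the
details are assumptions:

	static int mdev_bar_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		unsigned long virtaddr = (unsigned long)vmf->virtual_address;
		unsigned long req_size = PAGE_SIZE;
		unsigned long pgoff = vma->vm_pgoff +
				      ((virtaddr - vma->vm_start) >> PAGE_SHIFT);
		pgprot_t pg_prot = vma->vm_page_prot;

		/* Ask the vendor driver which host pfn range backs this part
		 * of the virtual BAR right now; without the hook, fall back
		 * to the default 1:1 mapping mentioned earlier. */
		if (validate_map_request(vma, virtaddr, &pgoff, &req_size, &pg_prot))
			return VM_FAULT_SIGBUS;

		/* Insert the validated pfn range into the VM_PFNMAP vma. */
		if (remap_pfn_range(vma, virtaddr & PAGE_MASK, pgoff,
				    req_size, pg_prot))
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;
	}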

I hope the above response addresses your question of why we can't provide such
allocation info at VFIO mmap() time.

2. Guest might access mmio region at any time ...

A guest with a mediated GPU inside can definitely access its BARs at any time.
If the guest accesses a BAR region that was not previously allocated, such
access will be denied, and with the current scheme the VM will crash to prevent
malicious access from the guest. This is another reason we choose to keep the
guest MMIO mediated.

3. KVM MMU can prefetch memory at any time.

You are talking about the KVM MMU prefetching the guest mmio region which is
marked as prefetchable, right?

On bare metal, the prefetch is basically a cache line fill, where the range
needs to be marked as cacheable for the CPU, which then issues a read to
anywhere in the cache line.

Is the KVM MMU prefetch the same?

Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/06/2016 10:57 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:35:18AM +0800, Xiao Guangrong wrote:



On 07/06/2016 10:18 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmmaped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


The code is okay, but I still doubt whether this implementation, fetching mmio
pages in the fault handler, is needed. We'd better include these patches after
the design of the vfio framework is decided.


Hi Guangrong,

I disagree. The design of the VFIO framework has been actively discussed on the
KVM and QEMU mailing lists for a while, and the fault handler is agreed upon to
provide the flexibility for different driver vendors' implementations. With that
said, I am still open to discussing this framework with you and anybody else, as
the goal is to allow multiple vendors to plug into this framework to support
their mediated device virtualization schemes, such as Intel, IBM and us.


The discussion is still going on, and the current vfio patchset we reviewed is
still problematic.


My point is that the fault handler part has been discussed already; with that
said, I am always open to any constructive suggestions to make things better and
maintainable. (Appreciate your code review on the VFIO thread, I think we still
owe you another response, will do that.)



It can always be changed, especially as the vfio patchset is not in good shape.





May I ask what exact issue you have with this interface for Intel to support
your own GPU virtualization?


Intel's vGPU can work with this framework. We really appreciate your / nvidia's
contribution.


Then, I don't think we should embargo Paolo's patch.


This patchset is specific to the framework design, i.e., mapping memory when a
fault happens rather than at mmap() time, and this design is exactly what we
have been discussing for nearly two days.



Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

2016-07-05 Thread Oleg Drokin

On Jul 5, 2016, at 11:20 PM, Al Viro wrote:

> On Tue, Jul 05, 2016 at 08:29:37PM -0400, Oleg Drokin wrote:
>>> +   /* Otherwise we just unhash it to be rehashed afresh via
>>> +* lookup if necessary
>>> +*/
>>> +   d_drop(dentry);
>> 
>> So we can even drop this part and retain the top condition as it was.
>> d_add does not care if the dentry we are feeding it was hashed or not,
>> so do you see any downsides to doing that I wonder?
> 
> d_add() on hashed dentry will end up reaching this:
> static void __d_rehash(struct dentry * entry, struct hlist_bl_head *b)
> {
>BUG_ON(!d_unhashed(entry));

Ah, ok. Yes, I remember about it now from the older issue with nfs.

It's still puzzling why I did not hit it yet, but oh well.

I wonder if handling of negative dentries broke… Time for more investigations.



Re: [PATCH 2/2] s390/mm: use ipte range to invalidate multiple page table entries

2016-07-05 Thread Hillf Danton
> 
> +void ptep_invalidate_range(struct mm_struct *mm, unsigned long start,
> +unsigned long end, pte_t *ptep)
> +{
> + unsigned long nr;
> +
> + if (!MACHINE_HAS_IPTE_RANGE || mm_has_pgste(mm))
> + return;
> + preempt_disable();
> + nr = (end - start) >> PAGE_SHIFT;
> + /* If the flush is likely to be local skip the ipte range */
> + if (nr && !cpumask_equal(mm_cpumask(mm),
> +  cpumask_of(smp_processor_id())))

s/smp/raw_smp/ to avoid adding schedule entry with page table
lock held?
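For reference, a minimal sketch of the s/smp/raw_smp/ substitution being
suggested (illustrative only, not an applied patch):

	if (nr && !cpumask_equal(mm_cpumask(mm),
				 cpumask_of(raw_smp_processor_id())))
		__ptep_ipte_range(start, nr - 1, ptep);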

> + __ptep_ipte_range(start, nr - 1, ptep);
> + preempt_enable();
> +}
> +EXPORT_SYMBOL(ptep_invalidate_range);
> +

thanks
Hillf



Re: [GIT PULL 3/3] bcm2835-arm64-next-2016-07-03

2016-07-05 Thread Florian Fainelli
Le 05/07/2016 11:03, Jason Cooper a écrit :
> Eric,
> 
> On Sun, Jul 03, 2016 at 05:45:36PM -0700, Eric Anholt wrote:
>>   Linux 4.7-rc1 (2016-05-29 09:29:24 -0700)
>>
>> are available in the git repository at:
>>
>>   https://github.com/anholt/linux tags/bcm2835-arm64-next-2016-07-03
>>
>> for you to fetch changes up to 628d30d1ccb897fee54a6f7312561cf2f6f72f09:
>>
>>   arm64: Add platform selection for BCM2835. (2016-06-15 14:10:30 -0700)
>>
>> 
>> This pull request brings in the build support for the Raspberry Pi
>> arm64 port.
>>
>> 
>> Alexander Graf (1):
>>   arm64: Allow for different DMA and CPU bus offsets
>>
>> Eric Anholt (3):
>>   irqchip: bcm2835: Avoid arch/arm-specific handle_IRQ
>>   Merge remote-tracking branch 'irqchip/irqchip/bcm' into 
>> bcm2835-arm64-next
> 
> You may want to draw attention to the fact that this has an external
> dependency on my irqchip/bcm branch.  Also, that it's stable, been in
> -next for a couple of weeks, and based on v4.7-rc1, etc.

Thanks for the heads-up, merged and updated the merge commit to make
this clear, thanks everyone.
-- 
Florian


Re: [GIT PULL 2/3] bcm2835-dt-64-next-2016-07-03

2016-07-05 Thread Florian Fainelli
Le 03/07/2016 17:45, Eric Anholt a écrit :
>   Linux 4.7-rc1 (2016-05-29 09:29:24 -0700)
> 
> are available in the git repository at:
> 
>   https://github.com/anholt/linux tags/bcm2835-dt-64-next-2016-07-03
> 
> for you to fetch changes up to 02d08603649816a941246c18252e5c41fd07625a:
> 
>   ARM: bcm2837: dt: Add the ethernet to the device trees (2016-06-07 15:23:08 
> -0700)
> 
> 
> This pull request brings in the Raspberry Pi 3 DT for its arm64
> support.  Note that it also merges in the ethernet DT changes so that
> the Pi3's ethernet can also get the MAC address.
> 
> 
> Eric Anholt (3):
>   dt-bindings: Add root properties for Raspberry Pi 3
>   ARM: bcm2835: Add devicetree for the Raspberry Pi 3.
>   Merge tag 'bcm2835-dt-ethernet' into HEAD
> 
> Gerd Hoffmann (1):
>   ARM: bcm2837: dt: Add the ethernet to the device trees
> 
> Lubomir Rintel (1):
>   ARM: bcm2835: dt: Add the ethernet to the device trees

Merged, thanks Eric!
-- 
Florian


Re: [GIT PULL 1/3] bcm2835-dt-next-2016-07-03

2016-07-05 Thread Florian Fainelli
Le 03/07/2016 17:45, Eric Anholt a écrit :
>   Linux 4.7-rc1 (2016-05-29 09:29:24 -0700)
> 
> are available in the git repository at:
> 
>   https://github.com/anholt/linux tags/bcm2835-dt-next-2016-07-03
> 
> for you to fetch changes up to 6a93792774fc72861b7e8efaa3545a88272b4413:
> 
>   ARM: bcm2835: dt: Add the ethernet to the device trees (2016-05-31 10:32:34 
> -0700)
> 
> 
> This pull request brings in the change to describe the ethernet in the
> DT so that the firmware can tell us its MAC address.
> 
> 
> Lubomir Rintel (1):
>   ARM: bcm2835: dt: Add the ethernet to the device trees

Merged, thanks!
-- 
Florian


[PATCH v3] fs/dcache.c: avoid soft-lockup in dput()

2016-07-05 Thread Wei Fang
We triggered a soft lockup under a stress test which opens/accesses/writes/closes
one file concurrently on more than
five different CPUs:

WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
...
[] dput+0x100/0x298
[] terminate_walk+0x4c/0x60
[] path_lookupat+0x5cc/0x7a8
[] filename_lookup+0x38/0xf0
[] user_path_at_empty+0x78/0xd0
[] user_path_at+0x1c/0x28
[] SyS_faccessat+0xb4/0x230

The ->d_lock trylock may fail many times because of concurrent
operations, and dput() may execute for a long time.

Fix this by replacing cpu_relax() with cond_resched().
dput() used to be sleepable, so making it sleepable again
should be safe.

Cc: 
Signed-off-by: Wei Fang 
---
Changes v1->v2:
- add might_sleep() to annotate that dput() can sleep
Changes v2->v3:
- put cond_resched() in dput()

 fs/dcache.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ad4a542..77f6687 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -589,7 +589,6 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 
 failed:
spin_unlock(&dentry->d_lock);
-   cpu_relax();
return dentry; /* try again with same dentry */
 }
 
@@ -763,6 +762,8 @@ void dput(struct dentry *dentry)
return;
 
 repeat:
+   might_sleep();
+
rcu_read_lock();
if (likely(fast_dput(dentry))) {
rcu_read_unlock();
@@ -796,8 +797,10 @@ repeat:
 
 kill_it:
dentry = dentry_kill(dentry);
-   if (dentry)
+   if (dentry) {
+   cond_resched();
goto repeat;
+   }
 }
 EXPORT_SYMBOL(dput);
 
-- 
1.7.1



Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

2016-07-05 Thread Al Viro
On Tue, Jul 05, 2016 at 08:29:37PM -0400, Oleg Drokin wrote:
> > +   /* Otherwise we just unhash it to be rehashed afresh via
> > +* lookup if necessary
> > +*/
> > +   d_drop(dentry);
> 
> So we can even drop this part and retain the top condition as it was.
> d_add does not care if the dentry we are feeding it was hashed or not,
> so do you see any downsides to doing that I wonder?

d_add() on hashed dentry will end up reaching this:
static void __d_rehash(struct dentry * entry, struct hlist_bl_head *b)
{
BUG_ON(!d_unhashed(entry));



Re: [RESEND PATCH v5 1/5] mfd: RK808: Add RK818 support

2016-07-05 Thread Andy Yan

Hi Wadim:

On 2016-06-09 16:23, Wadim Egorov wrote:

Hi,

On 08.06.2016 16:17, Lee Jones wrote:

On Thu, 02 Jun 2016, Wadim Egorov wrote:


The RK818 chip is a power management IC for multimedia and handheld

"Power Management IC (PMIC)"


devices. It contains the following components:

- Regulators
- RTC
- Clkout

Clocking


- battery support

Battery support


Both chips RK808 and RK818 are using a similar register map.

"Both RK808 ad RK818 chips"


So we can reuse the RTC and Clkout functionality.

Swap '.' for ','.


Signed-off-by: Wadim Egorov 
---
  drivers/mfd/Kconfig   |   4 +-
  drivers/mfd/rk808.c   | 231 ++
  include/linux/mfd/rk808.h | 162 ++--
  3 files changed, 350 insertions(+), 47 deletions(-)

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index 1bcf601..7ba464b 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -839,13 +839,13 @@ config MFD_RC5T583
  different functionality of the device.
  
  config MFD_RK808

-   tristate "Rockchip RK808 Power Management chip"
+   tristate "Rockchip RK808/RK818 Power Management chip"

"Chip"


depends on I2C && OF
select MFD_CORE
select REGMAP_I2C
select REGMAP_IRQ
help
- If you say yes here you get support for the RK808
+ If you say yes here you get support for the RK808 and RK818
  Power Management chips.
  This driver provides common support for accessing the device
  through I2C interface. The device supports multiple sub-devices
diff --git a/drivers/mfd/rk808.c b/drivers/mfd/rk808.c
index 49d7f62..3cf9724 100644
--- a/drivers/mfd/rk808.c
+++ b/drivers/mfd/rk808.c
@@ -1,11 +1,15 @@
  /*
- * MFD core driver for Rockchip RK808
+ * MFD core driver for Rockchip RK808/RK818
   *
   * Copyright (c) 2014, Fuzhou Rockchip Electronics Co., Ltd
   *
   * Author: Chris Zhong 
   * Author: Zhang Qing 
   *
+ * Copyright (C) 2016 PHYTEC Messtechnik GmbH
+ *
+ * Author: Wadim Egorov 
+ *
   * This program is free software; you can redistribute it and/or modify it
   * under the terms and conditions of the GNU General Public License,
   * version 2, as published by the Free Software Foundation.
@@ -22,12 +26,7 @@
  #include 
  #include 
  #include 
-
-struct rk808_reg_data {
-   int addr;
-   int mask;
-   int value;
-};

Why are you moving this to the header?

It is now part of the rk808 struct.


+#include 
  
  static bool rk808_is_volatile_reg(struct device *dev, unsigned int reg)

  {
@@ -57,6 +56,14 @@ static bool rk808_is_volatile_reg(struct device *dev, 
unsigned int reg)
return false;
  }
  
+static const struct regmap_config rk818_regmap_config = {

+   .reg_bits = 8,
+   .val_bits = 8,
+   .max_register = RK818_USB_CTRL_REG,
+   .cache_type = REGCACHE_RBTREE,
+   .volatile_reg = rk808_is_volatile_reg,
+};
+
  static const struct regmap_config rk808_regmap_config = {
.reg_bits = 8,
.val_bits = 8,
@@ -83,7 +90,17 @@ static const struct mfd_cell rk808s[] = {
},
  };
  
-static const struct rk808_reg_data pre_init_reg[] = {

+static const struct mfd_cell rk818s[] = {
+   { .name = "rk808-clkout", },

How does this differ to a normal -clock driver?

I don't know. It is a normal clock driver.


+   { .name = "rk808-regulator", },
+   {
+   .name = "rk808-rtc",
+   .num_resources = ARRAY_SIZE(rtc_resources),
+   .resources = &rtc_resources[0],

.resources = rtc_resources,  ?


+   },
+};
+
+static const struct rk8xx_reg_data rk808_pre_init_reg[] = {
{ RK808_BUCK3_CONFIG_REG, BUCK_ILMIN_MASK,  BUCK_ILMIN_150MA },
{ RK808_BUCK4_CONFIG_REG, BUCK_ILMIN_MASK,  BUCK_ILMIN_200MA },
{ RK808_BOOST_CONFIG_REG, BOOST_ILMIN_MASK, BOOST_ILMIN_100MA },
@@ -94,6 +111,22 @@ static const struct rk808_reg_data pre_init_reg[] = {
VB_LO_SEL_3500MV },
  };
  
+static const struct rk8xx_reg_data rk818_pre_init_reg[] = {

+   { RK818_USB_CTRL_REG,   RK818_USB_ILIM_SEL_MASK,
+   RK818_USB_ILMIN_2000MA },
+   /* close charger when usb lower then 3.4V */
+   { RK818_USB_CTRL_REG,   RK818_USB_CHG_SD_VSEL_MASK, (0x7 << 4) },
+   /* no action when vref */
+   { RK818_H5V_EN_REG, BIT(1), RK818_REF_RDY_CTRL },
+   /* enable HDMI 5V */
+   { RK818_H5V_EN_REG, BIT(0), RK818_H5V_EN },
+   /* improve efficiency */
+   { RK818_BUCK2_CONFIG_REG, BUCK2_RATE_MASK,  BUCK_ILMIN_250MA },
+   { RK818_BUCK4_CONFIG_REG, BUCK_ILMIN_MASK,  BUCK_ILMIN_250MA },
+   { RK818_BOOST_CONFIG_REG, BOOST_ILMIN_MASK, BOOST_ILMIN_100MA },
+   { RK808_VB_MON_REG, MASK_ALL,   VB_LO_ACT | VB_LO_SEL_3500MV },
+};

The alignment here looks odd.

I will fix it in the next version.


  static const struct reg

Re: [PATCH] drm/radeon: Remove deprecated create_singlethread_workqueue

2016-07-05 Thread Michel Dänzer
On 06.07.2016 06:06, Tejun Heo wrote:
> On Mon, Jul 04, 2016 at 12:58:32PM +0900, Michel Dänzer wrote:
>> On 02.07.2016 22:46, Tejun Heo wrote:
>>> On Sat, Jul 02, 2016 at 04:33:50PM +0530, Bhaktipriya Shridhar wrote:
 alloc_workqueue replaces deprecated create_singlethread_workqueue().

 A dedicated workqueue has been used since work items need to be flushed
 as a group rather than individually.
> 
> There seem to be two work items involved, the flip one and unpin one.
> Provided that there's no ordering requirement between the two, can't
> we just flush them individually?

There is an ordering requirement between the two queues, but it's
enforced by the driver (by only queuing the unpin work once a flip has
completed, which only happens after the corresponding flip work has run).


Not being very familiar with the workqueue APIs, I'll describe how it's
supposed to work from a driver POV, which will hopefully help you guys
decide on the most appropriate alloc_workqueue parameters.

There is one flip work queue for each hardware CRTC. At most one
radeon_flip_work_func item can be queued for any of them at any time.
When a radeon_flip_work_func item is queued, it should be executed ASAP
(so WQ_HIGHPRI might be appropriate?).

In contrast, the radeon_unpin_work_func items aren't particularly
urgent, and it's okay for several of them to be queued up. So I guess it
would actually make sense to use a different workqueue for them, maybe
even the default one?
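To make that concrete, a rough sketch of what this could look like; the flag
and max_active choices here are my assumptions, not the actual radeon patch:

	/* one queue per CRTC; flip work is latency-sensitive, and at most one
	 * item is ever queued, so a high-priority queue with max_active = 1 */
	radeon_crtc->flip_queue = alloc_workqueue("radeon-crtc", WQ_HIGHPRI, 1);

	/* unpin work is not urgent and several may be queued;
	 * the system workqueue would do ('work' being the flip work struct) */
	schedule_work(&work->unpin_work);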


 Since there are only a fixed number of work items, explicit concurrency
 limit is unnecessary here.
>>>
>>> What are the involved work items?
>>
>> drivers/gpu/drm/radeon/radeon_display.c:radeon_flip_work_func()
>>
>>> Is it safe to run them concurrently?
>>
>> Concurrently with what exactly?
> 
> The flip and unpin work items.

Yes, it's safe to run those concurrently.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Neo Jia
On Wed, Jul 06, 2016 at 10:35:18AM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/06/2016 10:18 AM, Neo Jia wrote:
> >On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:
> >>
> >>
> >>On 07/05/2016 08:18 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>>On 05/07/2016 07:41, Neo Jia wrote:
> On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:
> >The vGPU folks would like to trap the first access to a BAR by setting
> >vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
> >then can use remap_pfn_range to place some non-reserved pages in the VMA.
> >
> >KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and 
> >these
> >patches should fix this.
> 
> Hi Paolo,
> 
> I have tested your patches with the mediated passthru patchset that is 
> being
> reviewed in KVM and QEMU mailing list.
> 
> The fault handler gets called successfully and the previously mapped 
> memory gets
> unmmaped correctly via unmap_mapping_range.
> >>>
> >>>Great, then I'll include them in 4.8.
> >>
> >>Code is okay, but i still suspect if this implementation, fetch mmio pages 
> >>in fault
> >>handler, is needed. We'd better include these patches after the design of 
> >>vfio
> >>framework is decided.
> >
> >Hi Guangrong,
> >
> >I disagree. The design of VFIO framework has been actively discussed in the 
> >KVM
> >and QEMU mailing for a while and the fault handler is agreed upon to provide 
> >the
> >flexibility for different driver vendors' implementation. With that said, I 
> >am
> >still open to discuss with you and anybody else about this framework as the 
> >goal
> >is to allow multiple vendor to plugin into this framework to support their
> >mediated device virtualization scheme, such as Intel, IBM and us.
> 
> The discussion is still going on. And current vfio patchset we reviewed is 
> still
> problematic.

My point is that the fault handler part has been discussed already; with that
said, I am always open to any constructive suggestions to make things better and
maintainable. (Appreciate your code review on the VFIO thread, I think we still
owe you another response, will do that.)

> 
> >
> >May I ask you what the exact issue you have with this interface for Intel to 
> >support
> >your own GPU virtualization?
> 
> Intel's vGPU can work with this framework. We really appreciate your / 
> nvidia's
> contribution.

Then, I don't think we should embargo Paolo's patch.

> 
> i didn’t mean to offend you, i just want to make sure if this complexity is 
> really
> needed and inspect if this framework is safe enough and think it over if we 
> have
> a better implementation.

Not at all. :-)

Suggestions are always welcome, I just want to know the exact issues you have
with the code so I can have a better response to address that with proper 
information.

Thanks,
Neo



[PATCH 0/2] i2c-dev: Don't let userspace block adapter

2016-07-05 Thread Viresh Kumar
Hi Wolfram/Jean,

I am part of the kernel team for Google's Project Ara [1], where we are
building a modular smart phone.

This series tries to fix one of the problems we hit on our system: we are
required to hotplug pretty much everything on the phone, and this series
fixes the resulting hotplug issues with i2c-dev.

As described in the second patch, the current implementation of i2c-dev
file operations doesn't let the modules (hardware attached to the phone)
eject from the phone as the cleanup path for the module hasn't finished
yet (i2c adapter not removed).

We can't let the userspace block the kernel devices forever in such
cases.

I was only able to test this on the Ara phone with kernel 3.10, not on
mainline.

--
viresh

[1] https://atap.google.com/ara/

Viresh Kumar (2):
  i2c-dev: don't get i2c adapter via i2c_dev
  i2c-dev: Don't block the adapter from unregistering

 drivers/i2c/i2c-dev.c | 79 ++-
 include/linux/i2c.h   |  1 +
 2 files changed, 67 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH 2/2] i2c-dev: Don't block the adapter from unregistering

2016-07-05 Thread Viresh Kumar
The i2c-dev calls i2c_get_adapter() from the .open() callback, which
doesn't let the adapter device unregister unless the .close() callback
is called.

On some platforms (like Google ARA), this doesn't let the modules
(hardware attached to the phone) eject from the phone as the cleanup
path for the module hasn't finished yet (i2c adapter not removed).

We can't let the userspace block the kernel forever in such cases.

Fix this by calling i2c_get_adapter() from all other file operations,
i.e. read/write/ioctl, to make sure the adapter doesn't go away while
we are in the middle of an operation, but not otherwise. In .open() we
release the adapter device before returning, and so if there is no
data transfer in progress, i2c-dev doesn't block the adapter
from unregistering.

Signed-off-by: Viresh Kumar 
---
 drivers/i2c/i2c-dev.c | 72 ++-
 include/linux/i2c.h   |  1 +
 2 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/drivers/i2c/i2c-dev.c b/drivers/i2c/i2c-dev.c
index 66f323fd3982..b2562603daa9 100644
--- a/drivers/i2c/i2c-dev.c
+++ b/drivers/i2c/i2c-dev.c
@@ -142,13 +142,25 @@ static ssize_t i2cdev_read(struct file *file, char __user 
*buf, size_t count,
int ret;
 
struct i2c_client *client = file->private_data;
+   struct i2c_adapter *adap;
+
+   adap = i2c_get_adapter(client->adapter_nr);
+   if (!adap)
+   return -ENODEV;
+
+   if (adap != client->adapter) {
+   ret = -EINVAL;
+   goto put_adapter;
+   }
 
if (count > 8192)
count = 8192;
 
tmp = kmalloc(count, GFP_KERNEL);
-   if (tmp == NULL)
-   return -ENOMEM;
+   if (tmp == NULL) {
+   ret = -ENOMEM;
+   goto put_adapter;
+   }
 
pr_debug("i2c-dev: i2c-%d reading %zu bytes.\n",
iminor(file_inode(file)), count);
@@ -157,6 +169,9 @@ static ssize_t i2cdev_read(struct file *file, char __user 
*buf, size_t count,
if (ret >= 0)
ret = copy_to_user(buf, tmp, count) ? -EFAULT : ret;
kfree(tmp);
+
+put_adapter:
+   i2c_put_adapter(adap);
return ret;
 }
 
@@ -166,19 +181,34 @@ static ssize_t i2cdev_write(struct file *file, const char 
__user *buf,
int ret;
char *tmp;
struct i2c_client *client = file->private_data;
+   struct i2c_adapter *adap;
+
+   adap = i2c_get_adapter(client->adapter_nr);
+   if (!adap)
+   return -ENODEV;
+
+   if (adap != client->adapter) {
+   ret = -EINVAL;
+   goto put_adapter;
+   }
 
if (count > 8192)
count = 8192;
 
tmp = memdup_user(buf, count);
-   if (IS_ERR(tmp))
-   return PTR_ERR(tmp);
+   if (IS_ERR(tmp)) {
+   ret = PTR_ERR(tmp);
+   goto put_adapter;
+   }
 
pr_debug("i2c-dev: i2c-%d writing %zu bytes.\n",
iminor(file_inode(file)), count);
 
ret = i2c_master_send(client, tmp, count);
kfree(tmp);
+
+put_adapter:
+   i2c_put_adapter(adap);
return ret;
 }
 
@@ -412,9 +442,9 @@ static noinline int i2cdev_ioctl_smbus(struct i2c_client 
*client,
return res;
 }
 
-static long i2cdev_ioctl(struct file *file, unsigned int cmd, unsigned long 
arg)
+static long __i2cdev_ioctl(struct i2c_client *client, unsigned int cmd,
+  unsigned long arg)
 {
-   struct i2c_client *client = file->private_data;
unsigned long funcs;
 
dev_dbg(&client->adapter->dev, "ioctl, cmd=0x%02x, arg=0x%02lx\n",
@@ -480,6 +510,28 @@ static long i2cdev_ioctl(struct file *file, unsigned int 
cmd, unsigned long arg)
return 0;
 }
 
+static long i2cdev_ioctl(struct file *file, unsigned int cmd, unsigned long 
arg)
+{
+   struct i2c_client *client = file->private_data;
+   struct i2c_adapter *adap;
+   unsigned long ret;
+
+   adap = i2c_get_adapter(client->adapter_nr);
+   if (!adap)
+   return -ENODEV;
+
+   if (adap != client->adapter) {
+   ret = -EINVAL;
+   goto put_adapter;
+   }
+
+   ret = __i2cdev_ioctl(client, cmd, arg);
+
+put_adapter:
+   i2c_put_adapter(adap);
+   return ret;
+}
+
 static int i2cdev_open(struct inode *inode, struct file *file)
 {
unsigned int minor = iminor(inode);
@@ -504,9 +556,16 @@ static int i2cdev_open(struct inode *inode, struct file 
*file)
}
snprintf(client->name, I2C_NAME_SIZE, "i2c-dev %d", adap->nr);
 
+   client->adapter_nr = minor;
client->adapter = adap;
file->private_data = client;
 
+   /*
+* Allow the adapter to unregister while userspace has opened the i2c
+* device.
+*/
+   i2c_put_adapter(client->adapter);
+
return 0;
 }
 
@@ -514,7 +573,6 @@ static int i2cdev_release(struct inode *inode, struct file 
*f

[PATCH 1/2] i2c-dev: don't get i2c adapter via i2c_dev

2016-07-05 Thread Viresh Kumar
There is no code preventing i2c_dev from being freed after it is returned
from i2c_dev_get_by_minor(), and using it to access a value we
already have (minor) isn't really safe.

Avoid using it and get the adapter directly from 'minor'.

Signed-off-by: Viresh Kumar 
---
 drivers/i2c/i2c-dev.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/i2c/i2c-dev.c b/drivers/i2c/i2c-dev.c
index 6ecfd76270f2..66f323fd3982 100644
--- a/drivers/i2c/i2c-dev.c
+++ b/drivers/i2c/i2c-dev.c
@@ -485,13 +485,8 @@ static int i2cdev_open(struct inode *inode, struct file 
*file)
unsigned int minor = iminor(inode);
struct i2c_client *client;
struct i2c_adapter *adap;
-   struct i2c_dev *i2c_dev;
-
-   i2c_dev = i2c_dev_get_by_minor(minor);
-   if (!i2c_dev)
-   return -ENODEV;
 
-   adap = i2c_get_adapter(i2c_dev->adap->nr);
+   adap = i2c_get_adapter(minor);
if (!adap)
return -ENODEV;
 
-- 
2.7.4



net/batman-adv/bridge_loop_avoidance.c:1105:9-10: WARNING: return of 0/1 in function 'batadv_bla_process_claim' with return type bool

2016-07-05 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   a99cde438de0c4c0cecc1d1af1a55a75b10bfdef
commit: 4b426b108ac82b27f5af40df7da05a2501fd2aca batman-adv: Use bool as return 
type for boolean functions
date:   8 weeks ago


coccinelle warnings: (new ones prefixed by >>)

>> net/batman-adv/bridge_loop_avoidance.c:1105:9-10: WARNING: return of 0/1 in 
>> function 'batadv_bla_process_claim' with return type bool

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH] batman-adv: fix boolreturn.cocci warnings

2016-07-05 Thread kbuild test robot
net/batman-adv/bridge_loop_avoidance.c:1105:9-10: WARNING: return of 0/1 in 
function 'batadv_bla_process_claim' with return type bool

 Return statements in functions returning bool should use
 true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci

CC: Sven Eckelmann 
Signed-off-by: Fengguang Wu 
---

 bridge_loop_avoidance.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/batman-adv/bridge_loop_avoidance.c
+++ b/net/batman-adv/bridge_loop_avoidance.c
@@ -1102,7 +1102,7 @@ static bool batadv_bla_process_claim(str
 
/* Let the loopdetect frames on the mesh in any case. */
if (bla_dst->type == BATADV_CLAIM_TYPE_LOOPDETECT)
-   return 0;
+   return false;
 
/* check if it is a claim frame. */
ret = batadv_check_claim_group(bat_priv, primary_if, hw_src, hw_dst,


RE: [PATCH 7/7] devfreq: exynos-bus: add missing of_node_put after calling of_parse_phandle

2016-07-05 Thread MyungJoo Ham
> 
> - Original Message -
> Sender : Peter Chen 
> Date   : 2016-07-01 18:49 (GMT+9)
> Title  : [PATCH 7/7] devfreq: exynos-bus: add missing of_node_put after 
> calling of_parse_phandle
>  
> of_node_put needs to be called when the device node which is got
> from of_parse_phandle has finished using.
>  
> Cc: Chanwoo Choi 
> Cc: MyungJoo Ham 
> Cc: Kyungmin Park 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Signed-off-by: Peter Chen 

With the following change (build error fix) added (I'll update it):
Signed-off-by: MyungJoo Ham 


Cheers,
MyungJoo

> ---
>  drivers/devfreq/exynos-bus.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>  
> diff --git a/drivers/devfreq/exynos-bus.c b/drivers/devfreq/exynos-bus.c
> index 2363d0a..a38b5ec 100644
> --- a/drivers/devfreq/exynos-bus.c
> +++ b/drivers/devfreq/exynos-bus.c
> @@ -383,7 +383,7 @@ err_clk:
>  static int exynos_bus_probe(struct platform_device *pdev)
>  {
>  struct device *dev = &pdev->dev;
> -struct device_node *np = dev->of_node;
> +struct device_node *np = dev->of_node, node;

+   struct device_node *np = dev->of_node, *node;

>  struct devfreq_dev_profile *profile;
>  struct devfreq_simple_ondemand_data *ondemand_data;
>  struct devfreq_passive_data *passive_data;
...


Re: [PATCH v2 7/8] mm/zsmalloc: add __init,__exit attribute

2016-07-05 Thread Minchan Kim
On Tue, Jul 05, 2016 at 10:00:28AM +0900, Sergey Senozhatsky wrote:
> Hello Ganesh,
> 
> On (07/04/16 17:21), Ganesh Mahendran wrote:
> > > On (07/04/16 14:49), Ganesh Mahendran wrote:
> > > [..]
> > >> -static void zs_unregister_cpu_notifier(void)
> > >> +static void __exit zs_unregister_cpu_notifier(void)
> > >>  {
> > >
> > > this __exit symbol is called from `__init zs_init()' and thus is
> > > free to crash.
> > 
> > I change code to force the code goto notifier_fail where the
> > zs_unregister_cpu_notifier will be called.
> > I tested with zsmalloc module buildin and built as a module.
> 
> sorry, not sure I understand what do you mean by this.

It seems he tested it both built-in and as a module, simulating a failure of
zs_register_cpu_notifier so that zs_unregister_cpu_notifier is finally called.
With that, he could not find any problem.
> 

> 
> > Please correct me, if I miss something.
> 
> you have an __exit section function being called from
> __init section:
> 
> static void __exit zs_unregister_cpu_notifier(void)
> {
> }
> 
> static int __init zs_init(void)
> {
>   zs_unregister_cpu_notifier();
> }
> 
> it's no good.

Agree.

I didn't look at how the linker script handles it. Although it works well,
it is not desirable to mark a function __exit when we already
know it will be called from non-exit functions.


Re: [PATCH v2 6/8] mm/zsmalloc: keep comments consistent with code

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 02:49:57PM +0800, Ganesh Mahendran wrote:
> some minor change of comments:
> 1). update zs_malloc(),zs_create_pool() function header
> 2). update "Usage of struct page fields"
> 
> Signed-off-by: Ganesh Mahendran 
Acked-by: Minchan Kim 



Re: [PATCH v2 5/8] mm/zsmalloc: avoid calculate max objects of zspage twice

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 02:49:56PM +0800, Ganesh Mahendran wrote:
> Currently, if a class can not be merged, the max objects of zspage
> in that class may be calculated twice.
> 
> This patch calculate max objects of zspage at the begin, and pass
> the value to can_merge() to decide whether the class can be merged.
> 
> Also this patch remove function get_maxobj_per_zspage(), as there
> is no other place to call this funtion.
> 
> Signed-off-by: Ganesh Mahendran 
Acked-by: Minchan Kim 



Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/06/2016 10:18 AM, Neo Jia wrote:

On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmmaped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


Code is okay, but i still suspect if this implementation, fetch mmio pages in 
fault
handler, is needed. We'd better include these patches after the design of vfio
framework is decided.


Hi Guangrong,

I disagree. The design of VFIO framework has been actively discussed in the KVM
and QEMU mailing for a while and the fault handler is agreed upon to provide the
flexibility for different driver vendors' implementation. With that said, I am
still open to discuss with you and anybody else about this framework as the goal
is to allow multiple vendor to plugin into this framework to support their
mediated device virtualization scheme, such as Intel, IBM and us.


The discussion is still going on. And current vfio patchset we reviewed is still
problematic.



May I ask you what the exact issue you have with this interface for Intel to 
support
your own GPU virtualization?


Intel's vGPU can work with this framework. We really appreciate your / nvidia's
contribution.

I didn't mean to offend you; I just want to make sure this complexity is
really
needed, inspect whether this framework is safe enough, and think over whether we
have a better implementation.





Re: [PATCH v2 3/8] mm/zsmalloc: take obj index back from find_alloced_obj

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 02:49:54PM +0800, Ganesh Mahendran wrote:
> the obj index value should be updated after return from
> find_alloced_obj() to avoid CPU burning caused by unnecessary
> object scanning.
> 
> Signed-off-by: Ganesh Mahendran 
Acked-by: Minchan Kim 



RE: [PATCH 6/7] devfreq: add missing of_node_put after calling of_parse_phandle

2016-07-05 Thread MyungJoo Ham
> - Original Message -
> Sender : Peter Chen 
> Date   : 2016-07-01 18:49 (GMT+9)
> Title  : [PATCH 6/7] devfreq: add missing of_node_put after calling 
> of_parse_phandle
>  
> of_node_put needs to be called when the device node which is got
> from of_parse_phandle has finished using.
>  
> Cc: MyungJoo Ham 
> Cc: Kyungmin Park 
> Signed-off-by: Peter Chen 

Thank you!


Signed-off-by: MyungJoo Ham 



Re: [PATCH v2] fs/dcache.c: avoid soft-lockup in dput()

2016-07-05 Thread Wei Fang
Hi, Boqun,

>> diff --git a/fs/dcache.c b/fs/dcache.c
>> index d5ecc6e..074fc1c 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -578,7 +578,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
>>  
>>  failed:
>>  spin_unlock(&dentry->d_lock);
>> -cpu_relax();
>> +cond_resched();
> 
> Is it better to put the cond_resched() in the caller(i.e. dput()), right
> before "goto repeat"? Because it's obviously a loop there, which makes
> the purpose of cond_resched() more straightforward.

Agreed, that's more reasonable. I'll send v3 soon.

Thanks,
Wei



[PATCH 2/2] livepatch/x86: apply alternatives and paravirt patches after relocations

2016-07-05 Thread Jessica Yu
Implement arch_klp_init_object_loaded() for x86, which applies
alternatives/paravirt patches. This fixes the order in which relocations
and alternatives/paravirt patches are applied.

Previously, if a patch module had alternatives or paravirt patches,
these were applied first by the module loader before livepatch can apply
per-object relocations. The (buggy) sequence of events was:

(1) Load patch module
(2) Apply alternatives and paravirt patches to patch module
* Note that these are applied to the new functions in the patch module
(3) Apply per-object relocations to patch module when target module loads.
* This clobbers what was written in step 2

This led to crashes and corruption in general, since livepatch would
overwrite or step on previously applied alternative/paravirt patches.
The correct sequence of events should be:

(1) Load patch module
(2) Apply per-object relocations to patch module
(3) Apply per-object alternatives and paravirt patches to patch module

This is fixed by delaying paravirt/alternatives patching until after
relocations are applied. Any .altinstructions or .parainstructions sections in
a patch module are prefixed with ".klp.arch.${objname}" and applied in
arch_klp_init_object_loaded().

Signed-off-by: Jessica Yu 
---
 arch/x86/kernel/Makefile|  1 +
 arch/x86/kernel/livepatch.c | 66 +
 2 files changed, 67 insertions(+)
 create mode 100644 arch/x86/kernel/livepatch.c

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0503f5b..4f656fe 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o
 obj-y  += apic/
 obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
 obj-$(CONFIG_DYNAMIC_FTRACE)   += ftrace.o
+obj-$(CONFIG_LIVEPATCH)+= livepatch.o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)  += ftrace.o
 obj-$(CONFIG_X86_TSC)  += trace_clock.o
diff --git a/arch/x86/kernel/livepatch.c b/arch/x86/kernel/livepatch.c
new file mode 100644
index 000..f0845b2
--- /dev/null
+++ b/arch/x86/kernel/livepatch.c
@@ -0,0 +1,66 @@
+/*
+ * livepatch.c - x86-specific Kernel Live Patching Core
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+/* Apply per-object alternatives. Based on x86 module_finalize() */
+void arch_klp_init_object_loaded(struct klp_patch *patch,
+struct klp_object *obj)
+{
+   int cnt;
+   struct klp_modinfo *info;
+   Elf_Shdr *s, *alt = NULL, *para = NULL;
+   void *aseg, *pseg;
+   const char *objname;
+   char sec_objname[MODULE_NAME_LEN];
+   char secname[KSYM_NAME_LEN];
+
+   info = patch->mod->klp_info;
+   objname = obj->name ? obj->name : "vmlinux";
+
+   for (s = info->sechdrs; s < info->sechdrs + info->hdr.e_shnum; s++) {
+   /*
+* Search for .klp.arch sections that apply to this object.
+* See BUILD_BUG_ON() in livepatch core code for field width
+* explanations.
+*/
+   cnt = sscanf(info->secstrings + s->sh_name,
+".klp.arch.%55[^.].%127s",
+sec_objname, secname);
+   if (cnt != 2)
+   continue;
+   if (strcmp(sec_objname, objname))
+   continue;
+   if (!strcmp("altinstructions", secname))
+   alt = s;
+   if (!strcmp("parainstructions", secname))
+   para = s;
+   }
+
+   if (alt) {
+   aseg = (void *) alt->sh_addr;
+   apply_alternatives(aseg, aseg + alt->sh_size);
+   }
+
+   if (para) {
+   pseg = (void *) para->sh_addr;
+   apply_paravirt(pseg, pseg + para->sh_size);
+   }
+}
-- 
2.4.3



Re: [PATCH v2 2/8] mm/zsmalloc: use obj_index to keep consistent with others

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 02:49:53PM +0800, Ganesh Mahendran wrote:
> This is a cleanup patch. Change "index" to "obj_index" to keep
> consistent with others in zsmalloc.
> 
> Signed-off-by: Ganesh Mahendran 
Acked-by: Minchan Kim 



[PATCH 1/2] livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks

2016-07-05 Thread Jessica Yu
Introduce arch_klp_init_object_loaded() to complete any additional
arch-specific tasks during patching. Architecture code may override this
function.

Signed-off-by: Jessica Yu 
---
 include/linux/livepatch.h |  3 +++
 kernel/livepatch/core.c   | 12 ++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index a93a0b2..9072f04 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -116,6 +116,9 @@ int klp_unregister_patch(struct klp_patch *);
 int klp_enable_patch(struct klp_patch *);
 int klp_disable_patch(struct klp_patch *);
 
+void arch_klp_init_object_loaded(struct klp_patch *patch,
+struct klp_object *obj);
+
 /* Called from the module loader during module coming/going states */
 int klp_module_coming(struct module *mod);
 void klp_module_going(struct module *mod);
diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index 5c2bc10..164eff6 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -274,7 +274,6 @@ static int klp_write_object_relocations(struct module *pmod,
 
objname = klp_is_module(obj) ? obj->name : "vmlinux";
 
-   module_disable_ro(pmod);
/* For each klp relocation section */
for (i = 1; i < pmod->klp_info->hdr.e_shnum; i++) {
sec = pmod->klp_info->sechdrs + i;
@@ -309,7 +308,6 @@ static int klp_write_object_relocations(struct module *pmod,
break;
}
 
-   module_enable_ro(pmod);
return ret;
 }
 
@@ -763,6 +761,12 @@ static int klp_init_func(struct klp_object *obj, struct 
klp_func *func)
func->old_sympos ? func->old_sympos : 1);
 }
 
+/* Arches may override this to finish any remaining arch-specific tasks */
+void __weak arch_klp_init_object_loaded(struct klp_patch *patch,
+   struct klp_object *obj)
+{
+}
+
 /* parts of the initialization that is done only when the object is loaded */
 static int klp_init_object_loaded(struct klp_patch *patch,
  struct klp_object *obj)
@@ -770,10 +774,14 @@ static int klp_init_object_loaded(struct klp_patch *patch,
struct klp_func *func;
int ret;
 
+   module_disable_ro(patch->mod);
ret = klp_write_object_relocations(patch->mod, obj);
if (ret)
return ret;
 
+   arch_klp_init_object_loaded(patch, obj);
+   module_enable_ro(patch->mod);
+
klp_for_each_func(obj, func) {
ret = klp_find_object_symbol(obj->name, func->old_name,
 func->old_sympos,
-- 
2.4.3



[PATCH 0/2] Fix issue with alternatives/paravirt patches

2016-07-05 Thread Jessica Yu
Hi,

A few months ago, Chris Arges reported a bug involving alternatives/paravirt
patching that was discussed here [1] and here [2]. To briefly summarize the
bug, patch modules that contained .altinstructions or .parainstructions
sections would break because these alternative/paravirt patches would be
applied first by the module loader (see x86 module_finalize()), then
livepatch would later clobber these patches when applying per-object
relocations. This led to crashes and unpredictable behavior.

One conclusion we reached from our last discussion was that we will
need to introduce some arch-specific code to address this problem.
This patchset presents a possible fix for the bug by adding a new
arch-specific arch_klp_init_object_loaded() function that by default
does nothing but can be overridden by different arches.

To fix this issue for x86, since we can access a patch module's Elf
sections through mod->klp_info, we can simply delay the calls to
apply_paravirt() and apply_alternatives() to arch_klp_init_object_loaded(),
which is called after relocations have been written for an object.
In addition, for patch modules, .parainstructions and .altinstructions are
prefixed by ".klp.arch.${objname}" so that the module loader ignores them
and livepatch can apply them manually.

Currently for kpatch, we don't support including jump table sections in
the patch module, and supporting .smp_locks is currently broken, so we
don't consider those sections (for now).

I did some light testing with some patches to kvm and verified that the
original issue reported in [2] was fixed.

Based on linux-next.

[1] http://thread.gmane.org/gmane.linux.kernel/2185604/
[2] https://github.com/dynup/kpatch/issues/580

Jessica Yu (2):
  livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks
  livepatch/x86: apply alternatives and paravirt patches after relocations

 arch/x86/kernel/Makefile|  1 +
 arch/x86/kernel/livepatch.c | 66 +
 include/linux/livepatch.h   |  3 +++
 kernel/livepatch/core.c | 12 +++--
 4 files changed, 80 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/kernel/livepatch.c

-- 
2.4.3



Re: [PATCH v2 1/8] mm/zsmalloc: modify zs compact trace interface

2016-07-05 Thread Minchan Kim
Hi Ganesh,

On Mon, Jul 04, 2016 at 02:49:52PM +0800, Ganesh Mahendran wrote:
> This patch changes trace_zsmalloc_compact_start[end] to
> trace_zs_compact_start[end] to keep function naming consistent
> with others in zsmalloc
> 
> Also this patch remove pages_total_compacted information which
> may not really needed.
> 
> Signed-off-by: Ganesh Mahendran 

Once we decide to add an event trace, I prefer getting more detailed
information which is hard to get via /sys/block/zram/.
So we can add a trace in __zs_compact as well as zs_compact, with
some changes.

IOW,

zs_compact
    trace_zs_compact_start(pool->name)
    __zs_compact
        trace_zs_compact(class, scanned_obj, freed_pages)
    trace_zs_compact_end(pool->name)


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 11:07 PM, Neo Jia wrote:

On Tue, Jul 05, 2016 at 05:02:46PM +0800, Xiao Guangrong wrote:




It is physically contiguous but it is done during the runtime, physically 
contiguous doesn't mean
static partition at boot time. And only during runtime, the proper HW resource 
will be requested therefore
the right portion of MMIO region will be granted by the mediated device driver 
on the host.


Okay. This is your implementation design rather than a hardware limitation,
right?


I don't think it matters here. We are talking about framework so it should
provide the flexibility for different driver vendor.


It really matters. It is the reason why we designed the framework like this,
and we need to figure out whether we have a better design that fulfills the
requirements.





For example, if the instance require 512M memory (the size can be specified by 
QEMU
command line), it can tell its requirement to the mediated device driver via 
create()
interface, then the driver can allocate then memory for this instance before it 
is running.


BAR != your device memory

We don't set the BAR size via QEMU command line, BAR size is extracted by QEMU
from config space provided by vendor driver.



Anyway, there is a way to configure the BAR size, e.g., specify the size as a
parameter when
you create a mdev via sysfs.



Theoretically, the hardware is able to do memory management in this style, but
for some reason you chose to allocate memory at runtime, right? If my
understanding is right,
could you please tell us what benefit you want to get from this
runtime-allocation style?


Your understanding is incorrect.


Then WHY?







Then the req_size and pgoff will both come from the mediated device driver 
based on his internal book
keeping of the hw resource allocation, which is only available during runtime. 
And such book keeping
can be built part of para-virtualization scheme between guest and host device 
driver.



I am talking about the parameters you passed to validate_map_request(). req_size
is calculated like this:

+   offset   = virtaddr - vma->vm_start;
+   phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+   pgoff= phyaddr >> PAGE_SHIFT;

All this info is from the vma, which is available in mmap().

pg_prot comes from:
+   pg_prot  = vma->vm_page_prot;
that is also available in mmap().


This is kept there in case the validate_map_request() is not provided by vendor
driver then by default assume 1:1 mapping. So if validate_map_request() is not
provided, fault handler should not fail.


THESE are the parameters you passed to validate_map_request(), and this info is
available in mmap(); it really does not matter if you move
validate_map_request()
to mmap(). That's what I want to say.






None of such information is available at VFIO mmap() time. For example, several 
VMs
are sharing the same physical device to provide mediated access. All VMs will
call the VFIO mmap() on their virtual BAR as part of QEMU vfio/pci 
initialization
process, at that moment, we definitely can't mmap the entire physical MMIO
into both VM blindly for obvious reason.



mmap() carries @length information, so you only need to allocate the specified 
size
(corresponding to @length) of memory for them.


Again, you still look at this as a static partition at QEMU configuration time
where the guest mmio will be mapped as a whole at some offset of the physical
mmio region. (You still can do that like I said above by not providing
validate_map_request() in your vendor driver.)



Then you can move validate_map_request() to here to achieve custom 
allocation-policy.


But this is not the framework we are defining here.

The framework we have here is to provide the driver vendor flexibility to decide
the guest mmio and physical mmio mapping on page basis, and such information is
available during runtime.

How such information gets communicated between guest and host driver is up to
driver vendor.


The problem is the ordering between the way you "provide the driver vendor
flexibility to decide the guest mmio and physical mmio mapping on page basis"
and mmap().

We should provide such allocation info first and then do mmap(). Your current
design,
do mmap() -> communication telling such info -> use such info when a fault
happens,
is really BAD, because you cannot control the time when the memory fault will
happen.
The guest may access this memory before the communication you mentioned above,
and another reason is that KVM MMU can prefetch memory at any time.



Re: [PATCH] lockdep: Add a document describing crossrelease feature

2016-07-05 Thread Byungchul Park
On Wed, Jul 06, 2016 at 08:49:43AM +0800, Boqun Feng wrote:
> On Mon, Jul 04, 2016 at 03:42:59PM +0900, Byungchul Park wrote:
> [snip]
> > > > +2. A lock has dependency with all locks in the releasing context, 
> > > > having
> > > > +   been held since the lock was held.
> > > 
> > > But you cannot tell this. The 'since the lock was held' thing fully
> > > depends on timing and is not fundamentally correct.
> > > 
> > >   lock(A)
> > >   unlock(A)
> > >   lock(A)
> > >   wait_for(B)
> > >   unlock(A)
> > >   wake(B)
> > > 
> > > Between the wait_for(B) and wake(B), _nothing_ has been held, yet still
> > > there's the deadlock potential.

I misunderstood this sentence. Anyway, this does not cause a deadlock by
itself; of course it is a deadlock if the example below actually happens.

We should not presume that the sequence below can happen once the one above
has happened, because some dependencies between these contexts may prevent it.
Therefore we should decide to check only when the actual problematic
sequence happens, like below.

lock(A)
wait_for(B)
 <- serialized by atomic operation
lock(A)
unlock(A)
wake(B)
unlock(A)

So I meant that crossrelease can detect this deadlock if the deadlock-causable
sequence actually happens at least once. I tried to avoid false positive
detection; in other words, I made lockdep's detection stronger only with
true positives.

Of course I want this crosslock to be stronger than the current
implementation. However, I have no idea how to make even peterz's example
detectable with respect to all dependencies while avoiding false positive
detection. It should be future work, but it will not be simple, if it is
possible at all.

> > 
> > Crossrelease feature can detect this situation as a deadlock. wait_for()
> > is not an actual lock, but we can make it detectable by using acquiring and
> > releasing semantics on wait_for() and wake().
> > 
> > > And note that if the timing was 'right', you would never get to wake(B)
> > > because deadlock, so you'd never establish that there would be a
> > > deadlock.
> > 
> > If a deadlock actually happens, then we cannot establish it, as you said.
> > Keep in mind that current lockdep does nothing for this situation. But at least
> > the crossrelease feature can detect this deadlock possibility at the time the
> > dependency tree (graph) is built, which is better than doing nothing.
> > 
> 
> Confused, how?

And I am sorry for making you confused.

> 
> Say the sequence of events is as follow:
> 
> (two tasks are initially with no lock held)
> 
>   Task 1                      Task 2
>   ======                      ======
>   lock(A)
>   unlock(A)
>                               lock(A)
>                               wait_for(B) // acquire
>   wake(B) // commit + release
>                               unlock(A)
> 

As you know, this is not an actual deadlock.

> by the time the commit is called, the dependency tree will be built,
> and we will find there is _no_ lock held before wake(B). Therefore at
> the release stage, you will end up only adding the dependency chain A->B in
> lockdep, right? And it looks like neither Task 1 nor Task 2 will break
> the dependency chain A->B. So how can crossrelease detect the potential
> deadlock?

It's similar to how crossrelease works, but a little bit different.
I will reinforce and resend the document later.

> 
> It would be better if you could provide some samples that crossrelease
> can detect, after your confirmation.

Yes I will.

Thank you,
Byungchul

> 
> Regards,
> Boqun




Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Neo Jia
On Wed, Jul 06, 2016 at 10:00:46AM +0800, Xiao Guangrong wrote:
> 
> 
> On 07/05/2016 08:18 PM, Paolo Bonzini wrote:
> >
> >
> >On 05/07/2016 07:41, Neo Jia wrote:
> >>On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:
> >>>The vGPU folks would like to trap the first access to a BAR by setting
> >>>vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
> >>>then can use remap_pfn_range to place some non-reserved pages in the VMA.
> >>>
> >>>KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
> >>>patches should fix this.
> >>
> >>Hi Paolo,
> >>
> >>I have tested your patches with the mediated passthru patchset that is being
> >>reviewed in KVM and QEMU mailing list.
> >>
> >>The fault handler gets called successfully and the previously mapped memory
> >>gets unmapped correctly via unmap_mapping_range.
> >
> >Great, then I'll include them in 4.8.
> 
> Code is okay, but I still doubt whether this implementation, fetching mmio pages
> in the fault handler, is needed. We'd better include these patches after the
> design of the vfio framework is decided.

Hi Guangrong,

I disagree. The design of the VFIO framework has been actively discussed on the
KVM and QEMU mailing lists for a while, and the fault handler was agreed upon to
provide the flexibility for different driver vendors' implementations. With that
said, I am still open to discussing this framework with you and anybody else, as
the goal is to allow multiple vendors to plug into this framework to support
their mediated device virtualization schemes, such as Intel, IBM and us.
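For concreteness, a minimal sketch of the kind of vm_ops fault handler being
discussed (the validate_map_request() callback, its signature and the
vgpu_device structure are assumptions made only for illustration, not the
actual patchset):

#include <linux/mm.h>

static int vgpu_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vgpu_device *vdev = vma->vm_private_data;	/* assumed driver data */
	unsigned long addr = (unsigned long)vmf->virtual_address;
	unsigned long pfn = vdev->base_pfn + vmf->pgoff;	/* default 1:1 mapping */

	/* the vendor driver may redirect this page to another physical MMIO page */
	if (vdev->ops->validate_map_request &&
	    vdev->ops->validate_map_request(vdev, vmf->pgoff, &pfn))
		return VM_FAULT_SIGBUS;

	/* place the chosen page into the VMA; KVM then follows the VM_PFNMAP */
	if (remap_pfn_range(vma, addr & PAGE_MASK, pfn, PAGE_SIZE,
			    vma->vm_page_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}

Whether such a check belongs here or at mmap() time is exactly the policy
question being debated above.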

May I ask what exact issue you have with this interface for Intel to support
your own GPU virtualization?

Thanks,
Neo

> 


[PATCH net] net: mvneta: set real interrupt per packet for tx_done

2016-07-05 Thread Marcin Wojtas
From: Dmitri Epshtein 

Commit aebea2ba0f74 ("net: mvneta: fix Tx interrupt delay") intended to
set coalescing threshold to a value guaranteeing interrupt generation
per each sent packet, so that buffers can be released with no delay.

In fact setting the threshold to '1' was wrong, because it causes an interrupt
every two packets. According to the documentation, the reason behind it is the
following: the interrupt occurs once the sent-buffers counter reaches a value
higher than the one specified in MVNETA_TXQ_SIZE_REG(q). This behavior was
confirmed during tests. Also, when testing the SoC working as a NAS device,
better performance was observed with interrupt-per-packet, as it strongly
depends on the fact that all transmitted packets are released immediately.

This commit makes the NETA controller work in interrupt-per-sent-packet mode
by setting the coalescing threshold to 0.

Signed-off-by: Dmitri Epshtein 
Signed-off-by: Marcin Wojtas 
Cc:  # v3.10+
Fixes: aebea2ba0f74 ("net: mvneta: fix Tx interrupt delay")
---
 drivers/net/ethernet/marvell/mvneta.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c 
b/drivers/net/ethernet/marvell/mvneta.c
index cf04c97..0ad2fa3 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -244,7 +244,7 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		1
+#define MVNETA_TXDONE_COAL_PKTS		0	/* interrupt per packet */
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
-- 
1.8.3.1



Re: [PATCH 2/2] f2fs: fix to avoid data update racing between GC and DIO

2016-07-05 Thread Chao Yu
On 2016/7/6 8:24, Jaegeuk Kim wrote:
> On Fri, Jul 01, 2016 at 02:03:17PM +0800, Chao Yu wrote:
>> Hi Jaegeuk,
>>
>> On 2016/7/1 8:03, Jaegeuk Kim wrote:
>>> Hi Chao,
>>>
>>> On Thu, Jun 30, 2016 at 04:42:48PM +0800, Chao Yu wrote:
 Data in a file can be operated on by GC and DIO simultaneously, so we will
 face the race cases below:

 For write case:
 Thread A   Thread B
 - generic_file_direct_write
  - invalidate_inode_pages2_range
  - f2fs_direct_IO
   - do_blockdev_direct_IO
- do_direct_IO
 - get_more_blocks
- f2fs_gc
 - do_garbage_collect
  - gc_data_segment
   - move_data_page
- do_write_data_page
migrate data block to new block address
- dio_bio_submit
update user data to old block address

 For read case:
 Thread AThread B
 - generic_file_direct_write
  - invalidate_inode_pages2_range
  - f2fs_direct_IO
   - do_blockdev_direct_IO
- do_direct_IO
 - get_more_blocks
- f2fs_balance_fs
 - f2fs_gc
  - do_garbage_collect
   - gc_data_segment
- move_data_page
 - do_write_data_page
 migrate data block to new block address
  - write_checkpoint
   - do_checkpoint
- clear_prefree_segments
 - f2fs_issue_discard
  discard old block address
- dio_bio_submit
update user buffer from obsolete block address

 In order to fix this, for a given file, we should make DIO and GC mutually
 exclusive.

 Signed-off-by: Chao Yu 
 ---
  fs/f2fs/data.c  |  2 ++
  fs/f2fs/f2fs.h  |  1 +
  fs/f2fs/gc.c    | 14 +++++++++++++-
  fs/f2fs/super.c |  1 +
  4 files changed, 17 insertions(+), 1 deletion(-)

 diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
 index ba4963f..08dc060 100644
 --- a/fs/f2fs/data.c
 +++ b/fs/f2fs/data.c
 @@ -1716,7 +1716,9 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, 
 struct iov_iter *iter)
  
trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
  
 +  mutex_lock(&F2FS_I(inode)->dio_mutex);
err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
 +  mutex_unlock(&F2FS_I(inode)->dio_mutex);
>>>
>>> This means we need to sacrifice the entire parallelism even in normal cases?
>>> Can we find another way?
>>
>> 1. For dio write vs dio write, writer will grab i_mutex before dio_mutex, so
>> anyway, concurrent dio writes will be exclusive.
>>
>> 2. For dio write vs gc, keep using dio_mutex for making them exclusive.
>>
>> 3. For dio read vs dio read, and dio read vs gc, what about adding dio_rwsem 
>> to
>> control the parallelism?
>>
>> 4. For dio write vs dio read, we grab different lock (write grabs dio_mutex,
>> read grabs dio_rwsem), so there is no race condition.
> 
> How about adding a flag in a dio inode and avoiding GCs for there-in blocks?

Hmm.. IMO, without a lock, it's hard to guarantee the ordering that lets GC
check the flag only after it has been set, right?

Thanks,

> 
> Thanks,
> 
>>
>> Thanks,
>>
>>>
>>> Thanks,
>>>
if (iov_iter_rw(iter) == WRITE) {
if (err > 0)
set_inode_flag(inode, FI_UPDATE_WRITE);
 diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
 index bd82b6d..a241576 100644
 --- a/fs/f2fs/f2fs.h
 +++ b/fs/f2fs/f2fs.h
 @@ -474,6 +474,7 @@ struct f2fs_inode_info {
struct list_head inmem_pages;   /* inmemory pages managed by f2fs */
struct mutex inmem_lock;/* lock for inmemory pages */
struct extent_tree *extent_tree;/* cached extent_tree entry */
 +  struct mutex dio_mutex; /* avoid racing between dio and gc */
  };
  
  static inline void get_extent_info(struct extent_info *ext,
 diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
 index c2c4ac3..98e3763 100644
 --- a/fs/f2fs/gc.c
 +++ b/fs/f2fs/gc.c
 @@ -744,12 +744,24 @@ next_step:
/* phase 3 */
inode = find_gc_inode(gc_list, dni.ino);
if (inode) {
 +  bool locked = false;
 +
 +  if (S_ISREG(inode->i_mode)) {
 +  if (!mut
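For reference, a rough sketch of the rwsem-based scheme proposed in points 1-4
above (this assumes a new dio_rwsem field in struct f2fs_inode_info and is only
an illustration, not the posted patch):

	/* f2fs_direct_IO(): DIOs may run concurrently with each other */
	down_read(&F2FS_I(inode)->dio_rwsem);
	err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
	up_read(&F2FS_I(inode)->dio_rwsem);

	/* GC path, e.g. around move_data_page(): exclude in-flight DIO */
	down_write(&F2FS_I(inode)->dio_rwsem);
	/* ... migrate the data block to its new address ... */
	up_write(&F2FS_I(inode)->dio_rwsem);

DIO writers still serialize against each other via i_mutex (point 1), so reads
keep their parallelism while GC gets the exclusion it needs.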

Re: [RFC PATCH 3/4] drm: bridge: Generic GPIO mux driver

2016-07-05 Thread Nicolas Boichat
(whoops, apologies for the CHROMIUM tag in the subject line...)

On Tue, Jun 28, 2016 at 4:52 PM, Archit Taneja  wrote:
>
>
> On 06/27/2016 01:18 PM, Nicolas Boichat wrote:
>>
>> This driver supports single input, 2 output display mux (e.g.
>> HDMI mux), that provides its status via a GPIO.
>
>
> This might not work if we had 2 or more bridges connected
> one after the other at the output ports. It would be nicer
> if the interrupt handler directly modified the drm_bridge's
> next pointer rather than managing things on its own.

I don't think we can directly change ->next, as different functions
need different handling. For example, pre_enable should be called on
all possible downstream bridges, as we cannot know in advance which
bridge should be ready. The same goes for attach, but I think that's
easier to solve.
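A rough sketch of what that forwarding could look like, based on the
gpio_display_mux structure from this patch (an illustration only, not code from
the posted driver):

static void gpio_display_mux_pre_enable(struct drm_bridge *bridge)
{
	struct gpio_display_mux *mux = bridge_to_gpio_display_mux(bridge);
	int i;

	/* we cannot know yet which output will be active, so prepare both */
	for (i = 0; i < ARRAY_SIZE(mux->next); i++) {
		if (mux->next[i])
			drm_bridge_pre_enable(mux->next[i]);
	}
}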

> That being said, the bridge chains (by setting the next
> pointers) aren't expected to change after they are set up.
> In order to make this a truly generic driver, we'd
> probably need to add some sort of locking for the entire
> bridge chain to make sure we don't change things in the
> middle of a modeset. We don't really need to add this
> functionality unless there are many more platforms like
> these have non-static muxes in the display chain.

Agree with that. Also, since our downstream bridges (HDMI connector,
so no bridge, and anx7688) don't have enable/disable callbacks, we are
actually using a simpler version of this driver that just leaves all
downstream bridges enabled.

> It would be better if this driver was considered to be
> used only for the mux hardware that's used on the board.
>
> Archit
>
>
>>
>> Signed-off-by: Nicolas Boichat 
>> ---
>>   drivers/gpu/drm/bridge/Kconfig|  11 +
>>   drivers/gpu/drm/bridge/Makefile   |   1 +
>>   drivers/gpu/drm/bridge/generic-gpio-mux.c | 347
>> ++
>>   3 files changed, 359 insertions(+)
>>   create mode 100644 drivers/gpu/drm/bridge/generic-gpio-mux.c
>>
>> diff --git a/drivers/gpu/drm/bridge/Kconfig
>> b/drivers/gpu/drm/bridge/Kconfig
>> index da489f0..f1f6fc6 100644
>> --- a/drivers/gpu/drm/bridge/Kconfig
>> +++ b/drivers/gpu/drm/bridge/Kconfig
>> @@ -41,6 +41,17 @@ config DRM_DW_HDMI_AHB_AUDIO
>>   Designware HDMI block.  This is used in conjunction with
>>   the i.MX6 HDMI driver.
>>
>> +config DRM_GENERIC_GPIO_MUX
>> +   tristate "Generic GPIO-controlled mux"
>> +   depends on DRM
>> +   depends on OF
>> +   select DRM_KMS_HELPER
>> +   ---help---
>> + This bridge driver models a GPIO-controlled display mux with one
>> + input, 2 outputs (e.g. an HDMI mux). The hardware decides which
>> output
>> + is active, reports it as a GPIO, and the driver redirects calls
>> to the
>> + appropriate downstream bridge (if any).
>> +
>>   config DRM_NXP_PTN3460
>> tristate "NXP PTN3460 DP/LVDS bridge"
>> depends on OF
>> diff --git a/drivers/gpu/drm/bridge/Makefile
>> b/drivers/gpu/drm/bridge/Makefile
>> index 4846465..cb97274fd 100644
>> --- a/drivers/gpu/drm/bridge/Makefile
>> +++ b/drivers/gpu/drm/bridge/Makefile
>> @@ -4,6 +4,7 @@ obj-$(CONFIG_DRM_ANALOGIX_ANX7688) += analogix-anx7688.o
>>   obj-$(CONFIG_DRM_ANALOGIX_ANX78XX) += analogix-anx78xx.o
>>   obj-$(CONFIG_DRM_DW_HDMI) += dw-hdmi.o
>>   obj-$(CONFIG_DRM_DW_HDMI_AHB_AUDIO) += dw-hdmi-ahb-audio.o
>> +obj-$(CONFIG_DRM_GENERIC_GPIO_MUX) += generic-gpio-mux.o
>>   obj-$(CONFIG_DRM_NXP_PTN3460) += nxp-ptn3460.o
>>   obj-$(CONFIG_DRM_PARADE_PS8622) += parade-ps8622.o
>>   obj-$(CONFIG_DRM_SII902X) += sii902x.o
>> diff --git a/drivers/gpu/drm/bridge/generic-gpio-mux.c
>> b/drivers/gpu/drm/bridge/generic-gpio-mux.c
>> new file mode 100644
>> index 000..d3367e2
>> --- /dev/null
>> +++ b/drivers/gpu/drm/bridge/generic-gpio-mux.c
>> @@ -0,0 +1,347 @@
>> +/*
>> + * ANX7688 HDMI->DP bridge driver
>> + *
>> + * Copyright (C) 2016 Google, Inc.
>> + *
>> + * This software is licensed under the terms of the GNU General Public
>> + * License version 2, as published by the Free Software Foundation, and
>> + * may be copied, distributed, and modified under those terms.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +struct gpio_display_mux {
>> +   struct device *dev;
>> +
>> +   struct gpio_desc *gpiod_detect;
>> +   int detect_irq;
>> +
>> +   struct drm_bridge bridge;
>> +
>> +   struct drm_bridge *next[2];
>> +
>> +   struct mutex lock;
>> +   int active;
>> +   bool enabled;
>> +};
>> +
>> +static inline struct gpio_display_mux *bridge_to_gpio_display_mux(
>> +   stru

RE: [PATCH 2/5] ACPI / debugger: Fix regressions that AML debugger stops working

2016-07-05 Thread Zheng, Lv
Hi, Rafael

> From: rjwyso...@gmail.com [mailto:rjwyso...@gmail.com] On Behalf Of
> Rafael J. Wysocki
> Subject: Re: [PATCH 2/5] ACPI / debugger: Fix regressions that AML
> debugger stops working
> 
> On Tue, Jul 5, 2016 at 1:18 PM, Lv Zheng  wrote:
> > The FIFO unlocking mechanism in acpi_dbg has been messed up by the
> > following commit:
> >   Commit: 287980e49ffc0f6d911601e7e352a812ed27768e
> >   Subject: remove lots of IS_ERR_VALUE abuses
> > It converts !IS_ERR_VALUE(ret) into !ret. This patch fixes the
> > regression.
> >
> > Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses")
> > Signed-off-by: Lv Zheng 
> > Cc: Arnd Bergmann 
> 
> OK, but ->
> 
> > ---
> >  drivers/acpi/acpi_dbg.c |    6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/acpi/acpi_dbg.c b/drivers/acpi/acpi_dbg.c
> > index 1f41284..ebc8d18 100644
> > --- a/drivers/acpi/acpi_dbg.c
> > +++ b/drivers/acpi/acpi_dbg.c
> > @@ -602,7 +602,8 @@ static int acpi_aml_read_user(char __user *buf,
> int len)
> > crc->tail = (crc->tail + n) & (ACPI_AML_BUF_SIZE - 1);
> > ret = n;
> >  out:
> > -   acpi_aml_unlock_fifo(ACPI_AML_OUT_USER, !ret);
> > +   acpi_aml_unlock_fifo(ACPI_AML_OUT_USER,
> > +ret < 0 ? false : true);
> 
> -> The ternary operation is not necessary here, because the result of
> (ret < 0) is already boolean.  So this can be written as
> 
> acpi_aml_unlock_fifo(ACPI_AML_OUT_USER, ret >= 0);
> 
> and analogously below.
> 
> I've made that change, please check the result in bleeding-edge.

[Lv Zheng] 
Confirmed to be working.
Thanks for the simplification.

Thanks and best regards
-Lv


Re: [PATCH 0/2] KVM: MMU: support VMAs that got remap_pfn_range-ed

2016-07-05 Thread Xiao Guangrong



On 07/05/2016 08:18 PM, Paolo Bonzini wrote:



On 05/07/2016 07:41, Neo Jia wrote:

On Thu, Jun 30, 2016 at 03:01:49PM +0200, Paolo Bonzini wrote:

The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

KVM lacks support for this kind of non-linear VM_PFNMAP mapping, and these
patches should fix this.


Hi Paolo,

I have tested your patches with the mediated passthru patchset that is being
reviewed in KVM and QEMU mailing list.

The fault handler gets called successfully and the previously mapped memory gets
unmapped correctly via unmap_mapping_range.


Great, then I'll include them in 4.8.


Code is okay, but I still doubt whether this implementation, fetching mmio pages
in the fault handler, is needed. We'd better include these patches after the
design of the vfio framework is decided.



Re: [PATCH] [RFC] Kbuild: avoid "make tinyconfig" warnings

2016-07-05 Thread Masahiro Yamada
Hi Arnd,


2016-07-04 23:25 GMT+09:00 Arnd Bergmann :
> The introduction of "make *.config" as a shorthand for merging configuration
> files unfortunately introduced some build warnings that we see in every
> single run of the various build bots testing tinyconfig:

I am not convinced by this statement.

Why do you think 63a91033d52e is a bad commit?


With/without 63a91033d52e, I see the same warnings.


$ git log --oneline  -1
63a9103 kbuild: add generic mergeconfig target, %.config
$ make -s mrproper
$ make tinyconfig >/dev/null
.config:871:warning: override: KERNEL_XZ changes choice state
.config:873:warning: override: SLOB changes choice state
.config:874:warning: override: NOHIGHMEM changes choice state
$ git checkout HEAD^
Previous HEAD position was 63a9103... kbuild: add generic mergeconfig
target, %.config
HEAD is now at bc8f8f5... merge_config.sh: rename MAKE to RUNMAKE
$ git log --oneline  -1
bc8f8f5 merge_config.sh: rename MAKE to RUNMAKE
$ make -s mrproper
$ make tinyconfig >/dev/null
.config:871:warning: override: KERNEL_XZ changes choice state
.config:873:warning: override: SLOB changes choice state
.config:874:warning: override: NOHIGHMEM changes choice state







> diff --git a/scripts/kconfig/Makefile b/scripts/kconfig/Makefile
> index ebced77deb9c..35d0e191fe3f 100644
> --- a/scripts/kconfig/Makefile
> +++ b/scripts/kconfig/Makefile
> @@ -125,8 +125,8 @@ xenconfig: xen.config
> @:
>
>  PHONY += tinyconfig
> -tinyconfig:
> -   $(Q)$(MAKE) -f $(srctree)/Makefile allnoconfig tiny.config
> +tinyconfig: $(obj)/conf
> +   $(Q)$(MAKE) -f $(srctree)/Makefile allnoconfig 
> KCONFIG_ALLCONFIG=$(wildcard $(srctree)/kernel/configs/tiny.config 
> $(srctree)/arch/$(SRCARCH)/configs/tiny.config)
>
>  # Help text used by make help
>  help:


The dependency "tinyconfig: $(obj)/conf" is redundant.


It is already specified by:

allnoconfig allyesconfig allmodconfig alldefconfig randconfig: $(obj)/conf
 $< --$@ $(Kconfig)







What is worse, this patch breaks "make tinyconfig" on x86.


$ uname -m
x86_64
$ git log --oneline -2
754e472 Kbuild: avoid "make tinyconfig" warnings
a99cde4 Linux 4.7-rc6
$ make -s mrproper
$ make tinyconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  SHIPPED scripts/kconfig/zconf.tab.c
  SHIPPED scripts/kconfig/zconf.lex.c
  SHIPPED scripts/kconfig/zconf.hash.c
  HOSTCC  scripts/kconfig/zconf.tab.o
  HOSTLD  scripts/kconfig/conf
scripts/kconfig/conf  --allnoconfig Kconfig
#
# configuration written to .config
#
scripts/kconfig/Makefile:115: *** No configuration exists for this
target on this architecture.  Stop.
make[3]: *** [arch/x86/configs/tiny.config] Error 2
make[2]: *** [__build_one_by_one] Error 2
make[1]: *** [tinyconfig] Error 2
make: *** [tinyconfig] Error 2


-- 
Best Regards
Masahiro Yamada


Re: [RFC PATCH 0/4] drm: bridge: anx7688 and mux drivers

2016-07-05 Thread Nicolas Boichat
Hi Archit,

Sorry got swamped by other things and forgot to reply.

On Tue, Jun 28, 2016 at 4:28 PM, Archit Taneja  wrote:
>
>
> On 06/27/2016 01:18 PM, Nicolas Boichat wrote:
>>
>> Hi all,
>>
>> This is a follow up to the 2 patches to add support for ANX7688 sent here:
>> https://patchwork.kernel.org/patch/9187809/, thanks Archit and Philipp for
>> the comments.
>>
>> I also added 2 patches to add support for a simple display MUX, as I'm
>> facing
>> similar issues while trying to implement it, i.e. the current DRM core
>> does not
>> seem to support this kind of simple pass-thru bridge very well: it is not
>> very
>> clear where connectors should be defined and attached. In this case, not
>> defining any connectors in the 2 bridges (and reusing the connector in MTK
>> HDMI driver) seem to work best, but makes little logical sense as the
>> physical
>> connectors are actually attached to the bridges.
>
>
> Bridges aren't really drm objects in themselves, they can just be
> thought of as entities that attach to an encoder. From a drm
> perspective, the connector is only linked to an encoder. It
> doesn't see any bridges. Therefore, it doesn't matter much
> if the bridge driver doesn't create connectors. The DT bindings,
> however, should be close to the physical connections.
>
>>
>> In any case, the board has the following layout:
>>   - MT8173 HDMI bridge
>> - HDMI mux with 2 ports
>>   1. ANX7688 for HDMI->DP over USB-C conversion
>>   2. Native HDMI
>>
>
> So, the MTK SoC's HDMI output (TMDS lines) can be routed to the
> connector on the board directly (native mode), or via the ANX7688
> bridge using the gpio mux. Did I get this part right?

Yes.

> Is there only one connector at the end of both the output paths?

Err, there are:
 - 2 physical connectors (HDMI, USB-C)
 - 1 DT connector in the device tree (native HDMI), I don't bother
adding a connector to anx7688, but in theory I could.
 - 1 DRM connector object (defined in the mtk hdmi driver)

>> The mux is controlled by hardware, looking at the HPD signals from both
>> ANX7688
>> and native HDMI, with a priority on the native HDMI output.
>
>
> I didn't understand this. I can see that ANX7688 could generate a HPD
> signal on behalf of the connected monitor, but why would the native MTK
> HDMI controller generate a HPD signal? I would expect it to receive HPD
> and trigger a CPU interrupt.
>
> Could you also give an idea about why the hardware switches between the
> two paths? It it based on what kind of device plugs into the connector?

The mux is controlled in hardware, by looking at both HPD lines:
 - USB-C (HPD controlled by ANX7688, depending if there is a DP over
USB-C device connected)
 - native HDMI (HPD controlled by monitor on remote side)

Note that, like the other HDMI lines, HPD is "muxed" (depending on the
logic above), and wired up to MTK SoC HDMI/CEC module, which generates
the interrupts.

Priority is set to native HDMI port, so if both HPD signals are
active, the output will stay on the HDMI port.

Best,

Nicolas

> Thanks,
> Archit
>
>
>>
>> The whole setup works fairly well without any Linux kernel drivers, except
>> the
>> 2 following cases:
>>   1. When ANX7688 is active, DP bandwidth may be limited, so we need to
>> filter
>>  resolutions that would exceed the available bandwidth.
>>   2. When both outputs HPD signals are active, the kernel does not receive
>> an
>>  HPD pulse when the HDMI input is unplugged.
>>
>> ANX7688 driver fixes issue 1. The mux driver fixes 2 by forcing the kernel
>> to
>> re-read the EDID on mux output change, and also issue 1 by filtering only
>> when
>> ANX7688 is active.
>>
>> I understand this patch series might not be acceptable as-is, but I hope
>> this
>> sort of setup can be taken into account when better support for connector
>> drivers is introduced.
>>
>> Thanks!
>>
>> Best,
>>
>> Nicolas
>>
>> Nicolas Boichat (4):
>>drm: bridge: anx7688: Add anx7688 bridge driver support.
>>devicetree: Add ANX7688 transmitter binding
>>drm: bridge: Generic GPIO mux driver
>>devicetree: Add GPIO display mux binding
>>
>>   .../devicetree/bindings/drm/bridge/anx7688.txt |  32 ++
>>   .../devicetree/bindings/drm/bridge/gpio-mux.txt|  59 
>>   drivers/gpu/drm/bridge/Kconfig |  20 ++
>>   drivers/gpu/drm/bridge/Makefile|   2 +
>>   drivers/gpu/drm/bridge/analogix-anx7688.c  | 233 ++
>>   drivers/gpu/drm/bridge/generic-gpio-mux.c  | 347
>> +
>>   6 files changed, 693 insertions(+)
>>   create mode 100644
>> Documentation/devicetree/bindings/drm/bridge/anx7688.txt
>>   create mode 100644
>> Documentation/devicetree/bindings/drm/bridge/gpio-mux.txt
>>   create mode 100644 drivers/gpu/drm/bridge/analogix-anx7688.c
>>   create mode 100644 drivers/gpu/drm/bridge/generic-gpio-mux.c
>>
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Colla

Re: [patch for-4.7] mm, compaction: prevent VM_BUG_ON when terminating freeing scanner

2016-07-05 Thread Minchan Kim
On Wed, Jul 06, 2016 at 10:41:09AM +0900, Joonsoo Kim wrote:

< snip >
> > Do you have any objection to this fix for 4.7?
> > 
> > Joonsoo and/or Minchan, does this address the issue that you reported?
> 
> Unfortunately, I have no test case to trigger it. But, I think that
> this patch will address it. Anyway, I commented one problem on this

I just queued this patch on my testing machine, which triggered the
problem, so let's wait. It usually triggered within 6 hours.

Thanks.


Re: [PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-05 Thread Minchan Kim
On Mon, Jul 04, 2016 at 10:55:09AM +0100, Mel Gorman wrote:
> On Mon, Jul 04, 2016 at 05:04:12PM +0900, Minchan Kim wrote:
> > > > How big ratio between highmem:lowmem do you think a problem?
> > > > 
> > > 
> > > That's a "how long is a piece of string" type question.  The ratio does
> > > not matter as much as whether the workload is both under memory pressure
> > > and requires large amounts of lowmem pages. Even on systems with very high
> > > ratios, it may not be a problem if HIGHPTE is enabled.
> > 
> > As well as page tables, pgd/kernel stack/zbud/slab and so on, every kernel
> > allocation that wants to mask __GFP_HIGHMEM off would be a problem on a
> > 32-bit system.
> > 
> 
> The same point applies -- it depends on the rate of these allocations,
> not the ratio of highmem:lowmem per se.
> 
> > It also depends on how many lowmem-only drivers we have
> > in the system.
> > 
> > I don't know how many such drivers exist in the world. When I simply do a
> > grep, I find several cases which mask __GFP_HIGHMEM off, and among them
> > I guess DRM might be a popular one for us. However, it might be a really
> > rare use case among the various i915 use cases.
> > 
> 
> It's also perfectly possible that such allocations are long-lived in which
> case they are not going to cause many skips. Hence why I cannot make a
> general prediction.
> 
> > > > > Conceptually, moving to node LRUs should be easier to understand. The
> > > > > page allocator plays fewer tricks to game reclaim and reclaim behaves
> > > > > similarly on all nodes. 
> > > > > 
> > > > > The series has been tested on a 16 core UMA machine and a 2-socket 48
> > > > > core NUMA machine. The UMA results are presented in most cases as the 
> > > > > NUMA
> > > > > machine behaved similarly.
> > > > 
> > > > I guess you would already test below with various highmem system(e.g.,
> > > > 2:1, 3:1, 4:1 and so on). If you have, could you mind sharing it?
> > > > 
> > > 
> > > I haven't that data, the baseline distribution used doesn't even have
> > > 32-bit support. Even if it was, the results may not be that interesting.
> > > The workloads used were not necessarily going to trigger lowmem pressure
> > > as HIGHPTE was set on the 32-bit configs.
> > 
> > That means we didn't test this on 32-bit with highmem.
> > 
> 
> No. I tested the skip logic and noticed that when forced on purpose that
> system CPU usage was higher but it functionally worked.

Yeb, it would work well functionally. I meant not functionality but the
performance point of view: system CPU usage, major fault rate,
and so on.

> 
> > I'm not sure it's really too rare a case to spend time testing.
> > In fact, I really want to test the whole series on our production system,
> > which is 32-bit with highmem, but as we know well, most embedded
> > system kernels are rather old, so backporting needs lots of time and
> > care. However, if we miss testing on those systems at the moment,
> > we will be surprised after 1~2 years.
> > 
> 
> It would be appreciated if it could be tested on such platforms if at all
> possible. Even if I did set up a 32-bit x86 system, it won't have the same
> allocation/reclaim profile as the platforms you are considering.

Yeb. I just finished reviewing all the patches and found no *big* problem
with my brain, so my remaining homework is just testing, which would find
what my brain has missed.

I will give backporting to the old 32-bit production kernel a shot and
report if something strange happens.

Thanks for great work, Mel!


> 
> > I don't know what kind of benchmark we can check it with, so I cannot
> > insist on it, but you might know one.
> > 
> 
> One method would be to use fsmark with very large numbers of small files
> to force slab to require low memory. It's not representative of many real
> workloads unfortunately. Usually such a configuration is for checking the
> slab shrinker is working as expected.

Thanks for the suggestion.

> 
> > Okay, do you have any idea how to fix it if we see such a regression report
> > on 32-bit systems in the future?
> 
> Two options, neither whose complexity is justified without a "real"
> workload to use as a reference.
> 
> 1. Long-term isolation of highmem pages when reclaim is lowmem
> 
>When pages are skipped, they are immediately added back onto the LRU
>list. If lowmem reclaim persisted for long periods of time, the same
>highmem pages get continually scanned. The idea would be that lowmem
>keeps those pages on a separate list until a reclaim for highmem pages
>arrives that splices the highmem pages back onto the LRU.
> 
>That would reduce the skip rate, the potential corner case is that
>highmem pages have to be scanned and reclaimed to free lowmem slab pages.
> 
> 2. Linear scan lowmem pages if the initial LRU shrink fails
> 
>This will break LRU ordering but may be preferable and faster during
>memory pressure than skipping LRU pages.

Okay. I guess it would be better to include this in the description of [4/31].

> 
> -- 

RE: [PATCH 5/5] ACPI: Add configuration item to configure ACPICA error logs out

2016-07-05 Thread Zheng, Lv
Hi, Rafael

> From: linux-acpi-ow...@vger.kernel.org [mailto:linux-acpi-
> ow...@vger.kernel.org] On Behalf Of Rafael J. Wysocki
> Subject: Re: [PATCH 5/5] ACPI: Add configuration item to configure ACPICA
> error logs out
> 
> On Tue, Jul 5, 2016 at 1:18 PM, Lv Zheng  wrote:
> > Sometimes, we need to disable ACPICA error logs to leave only ACPICA
> > debug logs enabled for debugging purpose. This is useful when ACPICA
> error
> > logs become a flood.
> >
> > Reference: https://bugzilla.kernel.org/show_bug.cgi?id=114201
> > Signed-off-by: Lv Zheng 
> 
> I seem to remember seeing this change once before and ISTR I said I
> wouldn't apply it then.
> 
> Why would I say something different this time?
[Lv Zheng] 
I missed that.
Now that I deleted this patch from my local queue.

Thanks and best regards
-Lv

> 
> > ---
> >  drivers/acpi/Kconfig            |    7 +++++++
> >  include/acpi/platform/aclinux.h |    4 ++++
> >  2 files changed, 11 insertions(+)
> >
> > diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> > index 04af18f..939d235 100644
> > --- a/drivers/acpi/Kconfig
> > +++ b/drivers/acpi/Kconfig
> > @@ -324,6 +324,13 @@ config ACPI_TABLE_UPGRADE
> >   initrd, therefore it's safe to say Y.
> >   See Documentation/acpi/initrd_table_override.txt for details
> >
> > +config ACPI_NO_ERROR_MESSAGES
> > +   bool "Disable error messages"
> > +   default n
> > +   help
> > + The ACPI subsystem can produce error messages. Saying Y disables
> > + this output.
> > +
> >  config ACPI_DEBUG
> > bool "Debug Statements"
> > default n
> > diff --git a/include/acpi/platform/aclinux.h
> b/include/acpi/platform/aclinux.h
> > index 93b61b1..ed27b52 100644
> > --- a/include/acpi/platform/aclinux.h
> > +++ b/include/acpi/platform/aclinux.h
> > @@ -77,6 +77,10 @@
> >  #define ACPI_MUTEX_DEBUG
> >  #endif
> >
> > +#ifdef CONFIG_ACPI_NO_ERROR_MESSAGES
> > +#define ACPI_NO_ERROR_MESSAGES
> > +#endif
> > +
> >  #include 
> >  #include 
> >  #include 
> > --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch for-4.7] mm, compaction: prevent VM_BUG_ON when terminating freeing scanner

2016-07-05 Thread Joonsoo Kim
On Tue, Jul 05, 2016 at 02:01:29PM -0700, David Rientjes wrote:
> On Thu, 30 Jun 2016, Vlastimil Babka wrote:
> 
> > >  Note: I really dislike the low watermark check in split_free_page() and
> > >  consider it poor software engineering.  The function should split a free
> > >  page, nothing more.  Terminating memory compaction because of a low
> > >  watermark check when we're simply trying to migrate memory seems like an
> > >  arbitrary heuristic.  There was an objection to removing it in the first
> > >  proposed patch, but I think we should really consider removing that
> > >  check so this is simpler.
> > 
> > There's a patch changing it to min watermark (you were CC'd on the series). 
> > We
> > could argue whether it belongs to split_free_page() or some wrapper of it, 
> > but
> > I don't think removing it completely should be done. If zone is struggling
> > with order-0 pages, a functionality for making higher-order pages shouldn't
> > make it even worse. It's also not that arbitrary, even if we succeeded the
> > migration and created a high-order page, the higher-order allocation would
> > still fail due to watermark checks. Worse, __compact_finished() would keep
> > telling the compaction to continue, creating an even longer lag, which is 
> > also
> > against your recent patches.
> > 
> 
> I'm suggesting we shouldn't check any zone watermark in split_free_page(): 
> that function should just split the free page.
> 
> I don't find our current watermark checks to determine if compaction is 
> worthwhile to be invalid, but I do think that we should avoid checking or 
> acting on any watermark in isolate_freepages() itself.  We could do more 
> effective checking in __compact_finished() to determine if we should 
> terminate compaction, but the freeing scanner feels like the wrong place 
> to do it -- it's also expensive to check while gathering free pages for 
> memory that we have already successfully isolated as part of the 
> iteration.
> 
> Do you have any objection to this fix for 4.7?
> 
> Joonsoo and/or Minchan, does this address the issue that you reported?

Unfortunately, I have no test case to trigger it. But I think that
this patch will address it. Anyway, I commented on one problem with this
patch in another e-mail, so please fix it.

Thanks.
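For reference, David's suggestion of doing the check in __compact_finished()
instead could look roughly like this (a sketch that simply reuses
split_free_page()'s current low-watermark test; the exact placement and return
policy are precisely what is being debated):

	/* in __compact_finished() */
	unsigned long watermark = low_wmark_pages(zone) + (1UL << cc->order);

	if (!zone_watermark_ok(zone, 0, watermark, cc->classzone_idx,
			       cc->alloc_flags))
		return COMPACT_SKIPPED;	/* stop compacting rather than skipping in the free scanner */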


Re: [patch for-4.7] mm, compaction: prevent VM_BUG_ON when terminating freeing scanner

2016-07-05 Thread Joonsoo Kim
On Wed, Jun 29, 2016 at 02:47:20PM -0700, David Rientjes wrote:
> It's possible to isolate some freepages in a pageblock and then fail 
> split_free_page() due to the low watermark check.  In this case, we hit 
> VM_BUG_ON() because the freeing scanner terminated early without a 
> contended lock or enough freepages.
> 
> This should never have been a VM_BUG_ON() since it's not a fatal 
> condition.  It should have been a VM_WARN_ON() at best, or even handled 
> gracefully.
> 
> Regardless, we need to terminate anytime the full pageblock scan was not 
> done.  The logic belongs in isolate_freepages_block(), so handle its state
> gracefully by terminating the pageblock loop and making a note to restart 
> at the same pageblock next time since it was not possible to complete the 
> scan this time.
> 
> Reported-by: Minchan Kim 
> Signed-off-by: David Rientjes 
> ---
>  Note: I really dislike the low watermark check in split_free_page() and
>  consider it poor software engineering.  The function should split a free
>  page, nothing more.  Terminating memory compaction because of a low
>  watermark check when we're simply trying to migrate memory seems like an
>  arbitrary heuristic.  There was an objection to removing it in the first
>  proposed patch, but I think we should really consider removing that
>  check so this is simpler.
> 
>  mm/compaction.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1009,8 +1009,6 @@ static void isolate_freepages(struct compact_control 
> *cc)
>   block_end_pfn = block_start_pfn,
>   block_start_pfn -= pageblock_nr_pages,
>   isolate_start_pfn = block_start_pfn) {
> - unsigned long isolated;
> -
>   /*
>* This can iterate a massively long zone without finding any
>* suitable migration targets, so periodically check if we need
> @@ -1034,36 +1032,31 @@ static void isolate_freepages(struct compact_control 
> *cc)
>   continue;
>  
>   /* Found a block suitable for isolating free pages from. */
> - isolated = isolate_freepages_block(cc, &isolate_start_pfn,
> - block_end_pfn, freelist, false);
> - /* If isolation failed early, do not continue needlessly */
> - if (!isolated && isolate_start_pfn < block_end_pfn &&
> - cc->nr_migratepages > cc->nr_freepages)
> - break;
> + isolate_freepages_block(cc, &isolate_start_pfn, block_end_pfn,
> + freelist, false);
>  
>   /*
> -  * If we isolated enough freepages, or aborted due to async
> -  * compaction being contended, terminate the loop.
> -  * Remember where the free scanner should restart next time,
> -  * which is where isolate_freepages_block() left off.
> -  * But if it scanned the whole pageblock, isolate_start_pfn
> -  * now points at block_end_pfn, which is the start of the next
> -  * pageblock.
> -  * In that case we will however want to restart at the start
> -  * of the previous pageblock.
> +  * If we isolated enough freepages, or aborted due to lock
> +  * contention, terminate.
>*/
>   if ((cc->nr_freepages >= cc->nr_migratepages)
>   || cc->contended) {
> - if (isolate_start_pfn >= block_end_pfn)
> + if (isolate_start_pfn >= block_end_pfn) {
> + /*
> +  * Restart at previous pageblock if more
> +  * freepages can be isolated next time.
> +  */
>   isolate_start_pfn =
>   block_start_pfn - pageblock_nr_pages;
> + }
>   break;
> - } else {
> + } else if (isolate_start_pfn < block_end_pfn) {
>   /*
> -  * isolate_freepages_block() should not terminate
> -  * prematurely unless contended, or isolated enough
> +  * If isolation failed early, do not continue
> +  * needlessly.
>*/
> - VM_BUG_ON(isolate_start_pfn < block_end_pfn);
> + isolate_start_pfn = block_start_pfn;
> + break;

I don't think this line is correct. It would make cc->free_pfn go
backward though it would not be a big problem. Just leaving
isolate_start_pfn as isolate_freepages_block returns would be a p

linux-next: manual merge of the net-next tree with the net tree

2016-07-05 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/usb/r8152.c

between commit:

  2609af19362d ("r8152: fix runtime function for RTL8152")

from the net tree and commit:

  a028a9e003f2 ("r8152: move the settings of PHY to a work queue")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/usb/r8152.c
index 0da72d39b4f9,24d367280ecf..
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@@ -624,7 -624,7 +624,8 @@@ struct r8152 
int (*eee_get)(struct r8152 *, struct ethtool_eee *);
int (*eee_set)(struct r8152 *, struct ethtool_eee *);
bool (*in_nway)(struct r8152 *);
 +  void (*autosuspend_en)(struct r8152 *tp, bool enable);
+   void (*hw_phy_cfg)(struct r8152 *);
} rtl_ops;
  
int intr_interval;
@@@ -4156,7 -4157,7 +4176,8 @@@ static int rtl_ops_init(struct r8152 *t
ops->eee_get= r8152_get_eee;
ops->eee_set= r8152_set_eee;
ops->in_nway= rtl8152_in_nway;
 +  ops->autosuspend_en = rtl_runtime_suspend_enable;
+   ops->hw_phy_cfg = r8152b_hw_phy_cfg;
break;
  
case RTL_VER_03:
@@@ -4172,7 -4173,7 +4193,8 @@@
ops->eee_get= r8153_get_eee;
ops->eee_set= r8153_set_eee;
ops->in_nway= rtl8153_in_nway;
 +  ops->autosuspend_en = rtl8153_runtime_enable;
+   ops->hw_phy_cfg = r8153_hw_phy_cfg;
break;
  
default:


RE: [PATCH 1/2] qe/ic: move qe_ic_init from platforms to irqchip

2016-07-05 Thread Qiang Zhao
On 07/05/2016 11:19 AM, Jason Cooper  wrote:

> -Original Message-
> From: Jason Cooper [mailto:ja...@lakedaemon.net]
> Sent: Tuesday, July 05, 2016 10:22 PM
> To: Qiang Zhao 
> Cc: o...@buserror.net; t...@linutronix.de; marc.zyng...@arm.com; linuxppc-
> d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; Xiaobo Xie
> 
> Subject: Re: [PATCH 1/2] qe/ic: move qe_ic_init from platforms to irqchip
> 
> On Tue, Jul 05, 2016 at 07:27:21AM +, Qiang Zhao wrote:
> > On 07/05/2016 11:19 AM, Jason Cooper  wrote:
> > > -Original Message-
> > > From: Jason Cooper [mailto:ja...@lakedaemon.net]
> > > Sent: Tuesday, July 05, 2016 11:19 AM
> > > To: Qiang Zhao 
> > > Cc: o...@buserror.net; t...@linutronix.de; marc.zyng...@arm.com;
> > > linuxppc- d...@lists.ozlabs.org; linux-kernel@vger.kernel.org; Xiaobo
> > > Xie 
> > > Subject: Re: [PATCH 1/2] qe/ic: move qe_ic_init from platforms to
> > > > diff --git a/arch/powerpc/platforms/83xx/misc.c
> > > > b/arch/powerpc/platforms/83xx/misc.c
> > > > index 7e923ca..9431fc7 100644
> > > > --- a/arch/powerpc/platforms/83xx/misc.c
> > > > +++ b/arch/powerpc/platforms/83xx/misc.c
> > > > @@ -93,24 +93,9 @@ void __init mpc83xx_ipic_init_IRQ(void)  }
> > > >
> > > >  #ifdef CONFIG_QUICC_ENGINE
> > > > -void __init mpc83xx_qe_init_IRQ(void) -{
> > > > -   struct device_node *np;
> > > > -
> > > > -   np = of_find_compatible_node(NULL, NULL, "fsl,qe-ic");
> > > > -   if (!np) {
> > > > -   np = of_find_node_by_type(NULL, "qeic");
> > > > -   if (!np)
> > > > -   return;
> > > > -   }
> > >
> > > This block isn't preserved in the irqchip driver.  Why not?
> >
> > I grepped for qeic in arch/powerpc/boot/dts/* and didn't find any board that
> > uses qeic as the type.
> 
> Unfortunately, checking powerpc/boot/dts/* isn't sufficient for confirming we
> aren't going to break backwards compatibility with boards *in the field*.
> 
> Please take a look at:
> 
>   d4fb5ebd83c70 powerpc/83xx: consolidate init_IRQ functions
>   8159df72d43e2 83xx: add support for the kmeter1 board.
> 
> Perhaps one or two of the authors is still around and can say why that check 
> is
> there and if it's ok to remove it.
> 
> Or, we could just play it safe and keep the check.
> 

Ok, I will add this check in next version.

Thanks
-Zhao Qiang


Re: [PATCH 11/31] mm: vmscan: do not reclaim from kswapd if there is any eligible zone

2016-07-05 Thread Minchan Kim
On Tue, Jul 05, 2016 at 11:38:06AM +0100, Mel Gorman wrote:
> On Tue, Jul 05, 2016 at 03:11:17PM +0900, Minchan Kim wrote:
> > > - if (i < 0)
> > > - goto out;
> > > + /*
> > > +  * Only reclaim if there are no eligible zones. Check from
> > > +  * high to low zone to avoid prematurely clearing pgdat
> > > +  * congested state.
> > 
> > I cannot understand "prematurely clearing pgdat congested state".
> > Could you add more words to clear it out?
> > 
> 
> It's surprisingly difficult to concisely explain. Is this any better?
> 
> /*
>  * Only reclaim if there are no eligible zones. Check from
>  * high to low zone as allocations prefer higher zones.
>  * Scanning from low to high zone would allow congestion to be
>  * cleared during a very small window when a small low
>  * zone was balanced even under extreme pressure when the
>  * overall node may be congested.
>  */

Surely, it's better. Thanks for the explanation.

I doubt we need such corner-case logic at this moment, and I wonder how it works
well without a consistent scan direction, given that the other callers of
zone_balanced() scan from low to high.

> > > +  */
> > > + for (i = classzone_idx; i >= 0; i--) {
> > > + zone = pgdat->node_zones + i;
> > > + if (!populated_zone(zone))
> > > + continue;
> > > +
> > > + if (zone_balanced(zone, sc.order, classzone_idx))
> > 
> > If buffer_head is over limit, old logic force to reclaim highmem but
> > this zone_balanced logic will prevent it.
> > 
> 
> The old logic was always busted on 64-bit because is_highmem would always
> be 0. The original intent appears to be that buffer_heads_over_limit
> would release the buffers when pages went inactive. There are a number

Yes, but the difference is that in the old code this was handled in both direct
and background reclaim once buffer_heads is over the limit, while your change
slightly changes it so kswapd couldn't reclaim the high zone if any eligible
zone is balanced. I don't know how big a difference it can make, but we saw
highmem buffer_head problems several times, IIRC. So I just wanted to point it
out to you; whether it's handled or not is up to you.

> of things we treated inconsistently that get fixed up in the series and
> buffer_heads_over_limit is one of them.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 


Re: cgroup: Fix split bio been throttled more than once

2016-07-05 Thread Ming Lei
On Wed, Jul 6, 2016 at 6:26 AM, Tejun Heo  wrote:
> Hello,
>
> On Mon, Jul 04, 2016 at 10:46:54PM +0800, Jiale Li wrote:
>> These days, we have tested the cgroup blkio throttle function use fio,
>> and we found a problem that when we use buffered IO or set the big block
>> size like 1M, then the IO performance cannot reach the value we set.
>> For example we set blkio.throttle.read_bps_device as 10M, in kernel
>> version 4.3 IO performance can only reach 6M, and in kernel version
>> 4.4 the actual IO bps is only 3.1M.
>>
>> Then we did some research and found that kernel version 4.3 brought in the
>> blk_queue_split() function to split a big bio into several parts, some of
>> which call generic_make_request() again. This results in the bio being
>> throttled more than once, so the actual bio traffic sent to the device is
>> less than we expected.
>>
>> We have checked the newest kernel, 4.7-rc5, and this problem still exists.
>>
>> Based on this situation, we propose a fix: add a flag bit in the bio to let
>> the split bio bypass the blk_queue_split(). Below is the patch we used to fix
>> this problem.
>
> Thanks a lot for the report.  Hmmm... there was another brekage around
> split bios, I wonder whether the two can be handled together somehow.
>
>  
> http://lkml.kernel.org/g/1466583730-28595-1-git-send-email-lars.ellenb...@linbit.com
>

That is another issue, which is a deadlock caused by queuing the remainder
bio from splitting before BIOs generated in .make_request_fn().

>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 2613531..7b17a65 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -190,6 +190,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
>> **bio,
>>
>>   bio_chain(split, *bio);
>>   trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>> + bio_set_flag(*bio, BIO_SPLITED);
>
> Split's past participle form is split, so BIO_SPLIT would be right
> here.
>
>>   generic_make_request(*bio);
>>   *bio = split;
>>   }
>> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
>> index 47a3e54..4ffde95 100644
>> --- a/block/blk-throttle.c
>> +++ b/block/blk-throttle.c
>> @@ -1403,6 +1403,10 @@ bool blk_throtl_bio(struct request_queue *q, struct 
>> blkcg_gq *blkg,
>>
>>   WARN_ON_ONCE(!rcu_read_lock_held());
>>
>> + /* if the bio has been splited, should not throttle again */
>> + if (bio_flagged(bio, BIO_SPLITED))
>> + goto out;
>> +
>
> But that said, can't we just copy BIO_THROTLED while splitting?
>
> Thanks.
>
> --
> tejun


Thanks,
Ming Lei
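For readers following the thread, the double-accounting path being discussed
looks roughly like this on 4.3+ kernels (a sketch of the call chain; details
vary between versions):

generic_make_request(bio)                original, full-size bio
  generic_make_request_checks(bio)
    blkcg_bio_issue_check()
      blk_throtl_bio(bio)                charged once for the full size
  q->make_request_fn(q, bio)             e.g. blk_queue_bio()
    blk_queue_split(q, &bio, ...)
      generic_make_request(*bio)         remainder resubmitted
        blk_throtl_bio(remainder)        charged a second time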


[lkp] [netfilter] d4978a76b6: INFO: suspicious RCU usage

2016-07-05 Thread kernel test robot

FYI, we noticed the following commit:

https://github.com/0day-ci/linux 
Aaron-Conole/Compact-netfilter-hooks-list/20160623-053808
commit d4978a76b6762f524932da854f285075811c6cf6 ("netfilter: replace list_head 
with single linked list")

in testcase: boot

on test machine: 1 threads qemu-system-x86_64 -enable-kvm -cpu Westmere with 
320M memory

caused below changes:


+--+++
|  | eed934b33b | d4978a76b6 |
+--+++
| boot_successes   | 0  | 0  |
| boot_failures| 16 | 14 |
| WARNING:at_kernel/rcu/rcuperf.c:#rcu_perf_writer | 16 | 14 |
| backtrace:rcu_perf_writer| 16 | 14 |
| INFO:suspicious_RCU_usage| 0  | 14 |
| backtrace:nf_register_hooks  | 0  | 14 |
| backtrace:smack_nf_ip_init   | 0  | 14 |
| backtrace:kernel_init_freeable   | 0  | 14 |
+--+++

[4.532413] ===
[4.532965] [ INFO: suspicious RCU usage. ]
[4.533609] 4.7.0-rc2-00775-gd4978a7 #1 Tainted: GW  
[4.534698] ---
[4.535361] net/netfilter/core.c:75 suspicious rcu_dereference_check() usage!
[4.536748] 
[4.536748] other info that might help us debug this:
[4.536748] 
[4.538181] 
[4.538181] rcu_scheduler_active = 1, debug_locks = 1
[4.539149] 2 locks held by swapper/1:
[4.539694]  #0:  (rtnl_mutex){+.+.+.}, at: [] 
rtnl_lock+0x17/0x20
[4.540793]  #1:  (nf_hook_mutex){+.+...}, at: [] 
nf_register_net_hook+0xcb/0x240
[4.542295] 
[4.542295] stack backtrace:
[4.542926] CPU: 0 PID: 1 Comm: swapper Tainted: GW   
4.7.0-rc2-00775-gd4978a7 #1
[4.544048] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[4.545274]  0001 88053d68 812f0ed6 
88053d98
[4.546370]  810f42e9  0003 
82b1cfa0
[4.547577]   88053dc8 81c58deb 
82b1cfa0
[4.548871] Call Trace:
[4.549248]  [] dump_stack+0x1e/0x28
[4.550019]  [] lockdep_rcu_suspicious+0xd9/0xf0
[4.551003]  [] nf_find_hook_list+0x8b/0x120
[4.552179]  [] nf_register_net_hook+0xe3/0x240
[4.553152]  [] nf_register_hook+0x85/0x110
[4.554058]  [] nf_register_hooks+0x34/0xa0
[4.555032]  [] ? init_smk_fs+0x17a/0x17a
[4.555836]  [] ? do_early_param+0xc0/0xc0
[4.556653]  [] smack_nf_ip_init+0x35/0x58
[4.557532]  [] ? init_smk_fs+0x17a/0x17a
[4.558545]  [] do_one_initcall+0xa0/0x159
[4.559422]  [] ? do_early_param+0xc0/0xc0
[4.560309]  [] kernel_init_freeable+0x11e/0x1df
[4.561414]  [] kernel_init+0x13/0x160
[4.562246]  [] ? schedule_tail+0xa/0x50
[4.563059]  [] ret_from_fork+0x1f/0x40
[4.563840]  [] ? rest_init+0x170/0x170


FYI, raw QEMU command line is:

qemu-system-x86_64 -enable-kvm -cpu Westmere -kernel 
/pkg/linux/x86_64-randconfig-a0-06301127/gcc-6/d4978a76b6762f524932da854f285075811c6cf6/vmlinuz-4.7.0-rc2-00775-gd4978a7
 -append 'root=/dev/ram0 user=lkp 
job=/lkp/scheduled/vm-kbuild-yocto-ia32-2/bisect_boot-1-yocto-minimal-i386.cgz-x86_64-randconfig-a0-06301127-d4978a76b6762f524932da854f285075811c6cf6-20160702-22167-mdemd3-0.yaml
 ARCH=x86_64 kconfig=x86_64-randconfig-a0-06301127 
branch=linux-devel/devel-hourly-2016063005 
commit=d4978a76b6762f524932da854f285075811c6cf6 
BOOT_IMAGE=/pkg/linux/x86_64-randconfig-a0-06301127/gcc-6/d4978a76b6762f524932da854f285075811c6cf6/vmlinuz-4.7.0-rc2-00775-gd4978a7
 max_uptime=600 
RESULT_ROOT=/result/boot/1/vm-kbuild-yocto-ia32/yocto-minimal-i386.cgz/x86_64-randconfig-a0-06301127/gcc-6/d4978a76b6762f524932da854f285075811c6cf6/0
 LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug 
sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 
softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 
prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw 
ip=vm-kbuild-yocto-ia32-2::dhcp drbd.minor_count=8'  -initrd 
/fs/sda1/initrd-vm-kbuild-yocto-ia32-2 -m 320 -smp 1 -device e1000,netdev=net0 
-netdev user,id=net0 -boot order=nc -no-reboot -watchdog i6300esb -rtc 
base=localtime -drive 
file=/fs/sda1/disk0-vm-kbuild-yocto-ia32-2,media=disk,if=virtio -pidfile 
/dev/shm/kboot/pid-vm-kbuild-yocto-ia32-2 -serial 
file:/dev/shm/kboot/serial-vm-kbuild-yocto-ia32-2 -daemonize -display none 
-monitor null 





Thanks,
Xiaolong

Re: cgroup: Fix split bio been throttled more than once

2016-07-05 Thread Ming Lei
Cc block list

On Mon, Jul 4, 2016 at 10:46 PM, Jiale Li  wrote:
> Hi Tejun,
>
> These days, we have tested the cgroup blkio throttle function use fio,
> and we found a problem that when we use buffered IO or set the big block
> size like 1M, then the IO performance cannot reach the value we set.
> For example we set blkio.throttle.read_bps_device as 10M, in kernel
> version 4.3 IO performance can only reach 6M, and in kernel version
> 4.4 the actual IO bps is only 3.1M.
>
> Then we did some research and found that kernel version 4.3 brought in the
> blk_queue_split() function to split a big bio into several parts, some of
> which call generic_make_request() again. This results in the bio being
> throttled more than once, so the actual bio traffic sent to the device is
> less than we expected.

Except for blk_queue_split(), there are other(stacked) drivers which call
generic_make_request() too, such as drbd, dm, md and bcache.

>
> We have checked the newest kernel, 4.7-rc5, and this problem still exists.
>
> Based on this situation, we propose a fix: add a flag bit in the bio to let
> the split bio bypass the blk_queue_split(). Below is the patch we used to fix
> this problem.

The split bio is just a fast-cloned bio (except for discard bios) and not very
special compared with other fast-cloned bios, which are quite commonly used.

So I guess what you need is to bypass BIO_CLONED bios for this purpose,
since all fast-cloned bios share the same bvec table as the source bio.

Thanks,
Ming

>
> Thanks
>
> From b5ea98c9fc5612f9390b65bd9cf4ff344b6cfe92 Mon Sep 17 00:00:00 2001
> From: Jiale Li 
> Date: Mon, 4 Jul 2016 09:23:32 -0400
> Subject: [PATCH] cgroup: Fix split bio been throttled more than once
>
> Since kernel version 4.3, blk_queue_split() has been part of the kernel;
> it calls the generic_make_request() function, which means a part of
> the bio will be counted more than once in blk_throtl_bio(). As a result,
> the throttle function of cgroup cannot work as we expect.
>
> This patch adds a new flag bit to the bio, to let the split bio bypass
> blk_throtl_bio().
>
> Signed-off-by: Jiale Li 
> ---
>  block/blk-merge.c | 1 +
>  block/blk-throttle.c  | 4 ++++
>  include/linux/blk_types.h | 1 +
>  3 files changed, 6 insertions(+)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..7b17a65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -190,6 +190,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
> **bio,
>
> bio_chain(split, *bio);
> trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> +   bio_set_flag(*bio, BIO_SPLITED);
> generic_make_request(*bio);
> *bio = split;
> }
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 47a3e54..4ffde95 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1403,6 +1403,10 @@ bool blk_throtl_bio(struct request_queue *q, struct 
> blkcg_gq *blkg,
>
> WARN_ON_ONCE(!rcu_read_lock_held());
>
> +   /* if the bio has been splited, should not throttle again */
> +   if (bio_flagged(bio, BIO_SPLITED))
> +   goto out;
> +
> /* see throtl_charge_bio() */
> if ((bio->bi_rw & REQ_THROTTLED) || !tg->has_rules[rw])
> goto out;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 77e5d81..b294780 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -120,6 +120,7 @@ struct bio {
>  #define BIO_QUIET  6   /* Make BIO Quiet */
>  #define BIO_CHAIN  7   /* chained bio, ->bi_remaining in effect */
>  #define BIO_REFFED 8   /* bio has elevated ->bi_cnt */
> +#define BIO_SPLITED 9   /* bio has been splited */
>
>  /*
>   * Flags starting here get preserved by bio_reset() - this includes
> --
> 1.9.1
>



-- 
Ming Lei


Re: [PATCH] gpio: of: Allow overriding the device node

2016-07-05 Thread Masahiro Yamada
Hi.

2016-07-06 0:25 GMT+09:00 Linus Walleij :
> On Tue, Jul 5, 2016 at 2:11 PM, Thierry Reding  
> wrote:
>
>> From: Thierry Reding 
>>
>> When registering a GPIO chip, drivers can override the device tree node
>> associated with the chip by setting the chip's ->of_node field. If set,
>> this field is supposed to take precedence over the ->parent->of_node
>> field, but the code doesn't actually do that.
>>
>> Commit 762c2e46c059 ("gpio: of: remove of_gpiochip_and_xlate() and
>> struct gg_data") exposes this because it now no longer matches on the
>> GPIO chip's ->of_node field, but the GPIO device's ->of_node field that
>> is set using the procedure described above.
>>
>> Signed-off-by: Thierry Reding 
>
> Thanks for catching this, patch applied with Alexandre's ACK.
>
> Masahiro: does this look all right to you?

Yes.
Now the code matches the comment.  Nice!

Reviewed-by: Masahiro Yamada 


Question:

When we reference the node of gpiochip,
we should use
  chip->gpiodev->dev->of_node

instead of
  chip->of_node

because we can make chip->of_node optional
as long as chip->parent is set in the probe method.

Correct?
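
For illustration, the precedence being discussed could be sketched like this
in the registration path (here gdev stands for the chip's gpio_device; the
exact code in gpiolib may differ):

        /* sketch: an explicit chip->of_node overrides the parent's node */
        if (chip->of_node)
                gdev->dev.of_node = chip->of_node;
        else if (chip->parent)
                gdev->dev.of_node = chip->parent->of_node;

With such a fallback in place, a driver only needs to set chip->of_node when
it wants a node different from its parent device's.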






-- 
Best Regards
Masahiro Yamada


Re: [PATCH v4] kasan/quarantine: fix bugs on qlist_move_cache()

2016-07-05 Thread Joonsoo Kim
On Mon, Jul 04, 2016 at 12:49:08PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 07/04/2016 07:31 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> > 
> > There are two bugs in qlist_move_cache(). One is that the qlist's tail
> > isn't set properly: curr->next can be NULL since it is a singly linked
> > list, and a NULL tail is invalid if there is one item on the qlist.
> > The other is that if the cache matches, qlist_put() is called and
> > sets curr->next to NULL, which stops the loop prematurely.
> > 
> > These problems come from the complicated implementation, so I'd like to
> > re-implement it completely. The implementation in this patch is really
> > simple: iterate over all qlist_nodes and put them on the appropriate list.
> > 
> > Unfortunately, I hit this bug some time ago and lost the oops message.
> > But the bug looks trivial, so there is no need to attach an oops.
> > 
> > v4: fix cache size bug s/cache->size/obj_cache->size/
> > v3: fix build warning
> > 
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/kasan/quarantine.c | 21 +++--
> >  1 file changed, 7 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
> > index 4973505..b2e1827 100644
> > --- a/mm/kasan/quarantine.c
> > +++ b/mm/kasan/quarantine.c
> > @@ -238,30 +238,23 @@ static void qlist_move_cache(struct qlist_head *from,
> >struct qlist_head *to,
> >struct kmem_cache *cache)
> >  {
> > -   struct qlist_node *prev = NULL, *curr;
> > +   struct qlist_node *curr;
> >  
> > if (unlikely(qlist_empty(from)))
> > return;
> >  
> > curr = from->head;
> > +   qlist_init(from);
> > while (curr) {
> > struct qlist_node *qlink = curr;
> > struct kmem_cache *obj_cache = qlink_to_cache(qlink);
> >  
> > -   if (obj_cache == cache) {
> > -   if (unlikely(from->head == qlink)) {
> > -   from->head = curr->next;
> > -   prev = curr;
> > -   } else
> > -   prev->next = curr->next;
> > -   if (unlikely(from->tail == qlink))
> > -   from->tail = curr->next;
> > -   from->bytes -= cache->size;
> > -   qlist_put(to, qlink, cache->size);
> > -   } else {
> > -   prev = curr;
> > -   }
> > curr = curr->next;
> 
> Nit: Wouldn't it be more appropriate to swap the 'curr' and 'qlink' variable
> names? Because now qlink acts as the "current" pointer.

Okay. I sent a fixed version.

Thanks.


[PATCH v5] kasan/quarantine: fix bugs on qlist_move_cache()

2016-07-05 Thread js1304
From: Joonsoo Kim 

There are two bugs in qlist_move_cache(). One is that the qlist's tail
isn't set properly: curr->next can be NULL since it is a singly linked
list, and a NULL tail is invalid if there is one item on the qlist.
The other is that if the cache matches, qlist_put() is called and
sets curr->next to NULL, which stops the loop prematurely.

These problems come from the complicated implementation, so I'd like to
re-implement it completely. The implementation in this patch is really
simple: iterate over all qlist_nodes and put them on the appropriate list.

Unfortunately, I hit this bug some time ago and lost the oops message.
But the bug looks trivial, so there is no need to attach an oops.

v5: rename some variables for better readability
v4: fix cache size bug s/cache->size/obj_cache->size/
v3: fix build warning

Reviewed-by: Dmitry Vyukov 
Signed-off-by: Joonsoo Kim 
---
 mm/kasan/quarantine.c | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c
index 4973505..65793f1 100644
--- a/mm/kasan/quarantine.c
+++ b/mm/kasan/quarantine.c
@@ -238,30 +238,23 @@ static void qlist_move_cache(struct qlist_head *from,
   struct qlist_head *to,
   struct kmem_cache *cache)
 {
-   struct qlist_node *prev = NULL, *curr;
+   struct qlist_node *curr;
 
if (unlikely(qlist_empty(from)))
return;
 
curr = from->head;
+   qlist_init(from);
while (curr) {
-   struct qlist_node *qlink = curr;
-   struct kmem_cache *obj_cache = qlink_to_cache(qlink);
-
-   if (obj_cache == cache) {
-   if (unlikely(from->head == qlink)) {
-   from->head = curr->next;
-   prev = curr;
-   } else
-   prev->next = curr->next;
-   if (unlikely(from->tail == qlink))
-   from->tail = curr->next;
-   from->bytes -= cache->size;
-   qlist_put(to, qlink, cache->size);
-   } else {
-   prev = curr;
-   }
-   curr = curr->next;
+   struct qlist_node *next = curr->next;
+   struct kmem_cache *obj_cache = qlink_to_cache(curr);
+
+   if (obj_cache == cache)
+   qlist_put(to, curr, obj_cache->size);
+   else
+   qlist_put(from, curr, obj_cache->size);
+
+   curr = next;
}
 }
 
-- 
1.9.1



Re: [PATCH] lockdep: Add a document describing crossrelease feature

2016-07-05 Thread Boqun Feng
On Mon, Jul 04, 2016 at 03:42:59PM +0900, Byungchul Park wrote:
[snip]
> > > +2. A lock has dependency with all locks in the releasing context, having
> > > +   been held since the lock was held.
> > 
> > But you cannot tell this. The 'since the lock was held' thing fully
> > depends on timing and is not fundamentally correct.
> > 
> > lock(A)
> > unlock(A)
> > lock(A)
> > wait_for(B)
> > unlock(A)
> > wake(B)
> > 
> > Between the wait_for(B) and wake(B), _nothing_ has been held, yet still
> > there's the deadlock potential.
> 
> The crossrelease feature can detect this situation as a deadlock. wait_for()
> is not an actual lock, but we can make it detectable by using acquiring and
> releasing semantics on wait_for() and wake().
> 
> > And note that if the timing was 'right', you would never get to wake(B)
> > because deadlock, so you'd never establish that there would be a
> > deadlock.
> 
> If a deadlock actually happens, then we cannot establish it, as you said.
> Note that current lockdep does nothing for this situation. But at least the
> crossrelease feature can detect this deadlock possibility at the time the
> dependency tree (graph) is built, which is better than doing nothing.
> 

Confused, how?

Say the sequence of events is as follow:

(two tasks are initially with no lock held)

Task 1  Task 2
=   
lock(A)
unlock(A)
lock(A)
wait_for(B) // acquire
wake(B) // commit + release
unlock(A)

By the time the commit is called, the dependency tree will be built,
and we will find there is _no_ lock held before wake(B). Therefore at
the release stage, you will end up only adding the dependency chain A->B
to lockdep, right? And it looks like neither Task1 nor Task2 will break
the dependency chain A->B. So how can crossrelease detect the potential
deadlock?

It would be better if you could provide, after your confirmation, some
samples of deadlocks that crossrelease can detect.
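
For reference, a very rough sketch of the acquire/release annotation idea
mentioned in the quoted text could look like the following. This is only an
illustration (all names are made up) and not the actual crossrelease
implementation:

        /* pseudo-lock standing in for the wait_for(B)/wake(B) pair */
        static struct lock_class_key b_key;
        static struct lockdep_map b_dep_map =
                STATIC_LOCKDEP_MAP_INIT("pseudo-lock-B", &b_key);

        static void wait_for_b(struct completion *b)
        {
                /* record "locks held here -> B" dependencies */
                lock_map_acquire(&b_dep_map);
                lock_map_release(&b_dep_map);
                wait_for_completion(b);
        }

        static void wake_b(struct completion *b)
        {
                /* record "locks held by the waker -> B" dependencies */
                lock_map_acquire(&b_dep_map);
                lock_map_release(&b_dep_map);
                complete(b);
        }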

Regards,
Boqun


signature.asc
Description: PGP signature


RE: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating & fast live migration

2016-07-05 Thread Li, Liang Z
Ping ...

Liang

> -Original Message-
> From: Li, Liang Z
> Sent: Wednesday, June 29, 2016 6:32 PM
> To: m...@redhat.com
> Cc: linux-kernel@vger.kernel.org; virtualizat...@lists.linux-foundation.org;
> k...@vger.kernel.org; qemu-de...@nongnu.org; virtio-dev@lists.oasis-
> open.org; dgilb...@redhat.com; quint...@redhat.com; Li, Liang Z
> Subject: [PATCH v2 kernel 0/7] Extend virtio-balloon for fast (de)inflating &
> fast live migration
> 
> This patch set contains two parts of changes to the virtio-balloon.
> 
> One change speeds up the inflating & deflating process. The main idea of
> this optimization is to use a bitmap to send the page information to the
> host instead of the PFNs, to reduce the overhead of virtio data transmission,
> address translation and madvise(). This can help to improve performance by
> about 85%.
> 
> The other change speeds up live migration. By skipping the guest's free
> pages in the first round of data copy, we reduce needless data processing,
> which saves quite a lot of CPU cycles and network bandwidth. We put the
> guest's free page information in a bitmap and send it to the host through
> the virt queue of virtio-balloon. For an idle 8GB guest, this can help to
> shorten the total live migration time from 2 sec to about 500 ms in a
> 10Gbps network environment.
> 
> 
> Changes from v1 to v2:
> * Abandon the patch for dropping page cache.
> * Put some structures to uapi head file.
> * Use a new way to determine the page bitmap size.
> * Use a unified way to send the free page information with the bitmap
> * Address the issues referred in MST's comments
> 
> Liang Li (7):
>   virtio-balloon: rework deflate to add page to a list
>   virtio-balloon: define new feature bit and page bitmap head
>   mm: add a function to get the max pfn
>   virtio-balloon: speed up inflate/deflate process
>   virtio-balloon: define feature bit and head for misc virt queue
>   mm: add the related functions to get free page info
>   virtio-balloon: tell host vm's free page info
> 
>  drivers/virtio/virtio_balloon.c | 306
> +++-
>  include/uapi/linux/virtio_balloon.h |  41 +
>  mm/page_alloc.c |  52 ++
>  3 files changed, 359 insertions(+), 40 deletions(-)
> 
> --
> 1.8.3.1



[PATCH] PM / hibernate: Add missing braces in hibernate_setup()

2016-07-05 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

Make hibernate_setup() follow the coding style more closely by adding
some missing braces to the if () statement in it.

Signed-off-by: Rafael J. Wysocki 
---

This is on top of https://patchwork.kernel.org/patch/9214179/

---
 kernel/power/hibernate.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-pm/kernel/power/hibernate.c
===
--- linux-pm.orig/kernel/power/hibernate.c
+++ linux-pm/kernel/power/hibernate.c
@@ -1119,11 +1119,11 @@ static int __init resume_offset_setup(ch
 
 static int __init hibernate_setup(char *str)
 {
-   if (!strncmp(str, "noresume", 8))
+   if (!strncmp(str, "noresume", 8)) {
noresume = 1;
-   else if (!strncmp(str, "nocompress", 10))
+   } else if (!strncmp(str, "nocompress", 10)) {
nocompress = 1;
-   else if (!strncmp(str, "no", 2)) {
+   } else if (!strncmp(str, "no", 2)) {
noresume = 1;
nohibernate = 1;
} else if (!strncmp(str, "protect_image", 13)) {



Re: More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

2016-07-05 Thread Oleg Drokin

On Jul 5, 2016, at 4:21 PM, Oleg Drokin wrote:
> 
>>> So with Lustre the dentry can be in three states, really:
>>> 
>>> 1. hashed dentry that's all A-ok to reuse.
>>> 2. hashed dentry that's NOT valid (dlm lock went away) - this is 
>>> distinguished in d_compare by looking at a bit in the fs_data
>>> 3. unhashed dentry ( I guess could be both valid and invalid lustre-wise).
>>> 
>>> So the logic in ll_lookup_it_finish() (part of regular lookup) is this:
>>> 
>>> If the dentry we have is not hashed - this is a new lookup, so we need to
>>> call into ll_splice_alias() to see if there's a better dentry we need to
>>> reuse that was already rejected by VFS before since we did not have 
>>> necessary locks,
>>> but we do have them now.
>>> The comment at the top of ll_dcompare() explains why we don't just unhash 
>>> the
>>> dentry on lock-loss - that apparently leads to a loop around real_lookup for
>>> real-contended dentries.
>>> This is also why we cannot use d_splice_alias here - such cases are possible
>>> for regular files and directories.
>>> 
>>> Anyway, I guess additional point of confusion here is then why does
>>> ll_lookup_it_finish() need to check for hashedness of the dentry since it's 
>>> in
>>> lookup, so we should be unhashed here.
>>> I checked the commit history and this check was added along with atomic_open
>>> support, so I imagine we can just move it up into ll_atomic_open and then 
>>> your
>>> change starts to make sense along with a few other things.
>> 
>> So basically this
>>   } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
>>  !it_disposition(it, DISP_OPEN_CREATE)) {
>>   /* With DISP_OPEN_CREATE dentry will be
>>* instantiated in ll_create_it.
>>*/
>>   LASSERT(!d_inode(*de));
>>   d_instantiate(*de, inode);
>>   }
>> is something we should do only for negative hashed fed to it by
>> ->atomic_open(), and that - only if we have no O_CREAT in flags?
> 
> Yes, and in fact we can totally avoid it I think.
> 
>> Then, since 3/3 eliminates that case completely, we could just rip that
>> else-if, along with the check for d_unhashed itself, making the call of
>> ll_splice_alias() unconditional there.  Or am I misreading what you said?
>> Do you see any problems with what's in #for-linus now (head at 11f0083)?
> 
> Yes, we can make it unconditional
> I think we can simplify it even more and since unlike NFS our negative 
> dentries
> are a lot less speculative, we can just return ENOENT if this is not a create
> request and unhash otherwise. Still needs to go through the whole test suite
> to make sure it does not break anything unexpected.
> 
> At least this is the patch I am playing with right now:
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -391,6 +391,7 @@ static int ll_lookup_it_finish(struct ptlrpc_request 
> *request,
>   struct inode *inode = NULL;
>   __u64 bits = 0;
>   int rc = 0;
> + struct dentry *alias;
> 
>   /* NB 1 request reference will be taken away by ll_intent_lock()
>* when I return
> @@ -415,26 +416,12 @@ static int ll_lookup_it_finish(struct ptlrpc_request 
> *request,
>*/
>   }
> 
> - /* Only hash *de if it is unhashed (new dentry).
> -  * Atoimc_open may passing hashed dentries for open.
> -  */
> - if (d_unhashed(*de)) {
> - struct dentry *alias;
> -
> - alias = ll_splice_alias(inode, *de);
> - if (IS_ERR(alias)) {
> - rc = PTR_ERR(alias);
> - goto out;
> - }
> - *de = alias;
> - } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
> -!it_disposition(it, DISP_OPEN_CREATE)) {
> - /* With DISP_OPEN_CREATE dentry will be
> -  * instantiated in ll_create_it.
> -  */
> - LASSERT(!d_inode(*de));
> - d_instantiate(*de, inode);
> + alias = ll_splice_alias(inode, *de);
> + if (IS_ERR(alias)) {
> + rc = PTR_ERR(alias);
> + goto out;
>   }
> + *de = alias;
> 
>   if (!it_disposition(it, DISP_LOOKUP_NEG)) {
>   /* we have lookup look - unhide dentry */
> @@ -590,6 +577,24 @@ static int ll_atomic_open(struct inode *dir, struct 
> dentry *dentry,
>  dentry, PFID(ll_inode2fid(dir)), dir, file, open_flags, mode,
>  *opened);
> 
> + /* Only negative dentries enter here */
> + LASSERT(!d_inode(dentry));
> +
> + if (!(open_flags & O_CREAT) && !d_in_lookup(dentry)) {

Duh, this obviously was supposed to be
 if (!d_in_lookup(dentry)) {

But even in the form above nothing bad happened in the full testing
because we cannot find any aliases without an inode.

> + /* A valid negative dentry that just passed revalidation,
> +  * there's little point to try and open it serve

Re: [PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-05 Thread Minchan Kim
On Tue, Jul 05, 2016 at 11:26:39AM +0100, Mel Gorman wrote:



> > > @@ -3418,10 +3426,10 @@ void wakeup_kswapd(struct zone *zone, int order, 
> > > enum zone_type classzone_idx)
> > >   if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > >   return;
> > >   pgdat = zone->zone_pgdat;
> > > - if (pgdat->kswapd_max_order < order) {
> > > - pgdat->kswapd_max_order = order;
> > > - pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);
> > > - }
> > > + if (pgdat->kswapd_classzone_idx == -1)
> > > + pgdat->kswapd_classzone_idx = classzone_idx;
> > 
> > It's tricky. Couldn't we change kswapd_classzone_idx to an integer type
> > and remove the above if condition?
> > 
> 
> It's tricky and not necessarily better overall. It's perfectly possible
> to be woken up for zone index 0 so it's changing -1 to another magic
> value.

I don't get it. What is the problem with this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c538a8c..6eb23f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3413,9 +3413,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum 
zone_type classzone_idx)
if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
return;
pgdat = zone->zone_pgdat;
-   if (pgdat->kswapd_classzone_idx == -1)
-   pgdat->kswapd_classzone_idx = classzone_idx;
-   pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, 
classzone_idx);
+   pgdat->kswapd_classzone_idx = max_t(int, pgdat->kswapd_classzone_idx, 
classzone_idx);
pgdat->kswapd_order = max(pgdat->kswapd_order, order);
if (!waitqueue_active(&pgdat->kswapd_wait))
return;


Re: [PATCH v5 0/8] ACPI overlays

2016-07-05 Thread Rafael J. Wysocki
On Friday, July 01, 2016 11:19:04 PM Octavian Purdila wrote:
> This patch set enables custom ACPI board configuration by adding
> mechanisms in the Linux kernel for loading user defined SSDTs.
> 
> In order to support ACPI open-ended hardware configurations we need a
> way to augment the ACPI configuration provided by the firmware
> image. A common example is connecting sensors on I2C / SPI buses on
> development boards.
> 
> Although this can be accomplished by creating a kernel platform driver
> or recompiling the firmware image with updated ACPI tables, neither is
> practical: the former proliferates board specific kernel code while
> the latter requires access to firmware tools which are often not
> publicly available.
> 
> Because ACPI supports external references in AML code a more practical
> way to augment firmware ACPI configuration is by dynamically loading
> user defined SSDT tables that contain the board specific information.
> 
> Currently it is possible to load SSDT overlays using the upgrade
> initrd mechanism introduced in 4.7. This patch series adds support for
> two more methods:
> 
> * From an EFI variable
> 
>  This is the preferred method, when EFI is supported on the platform,
>  because it allows a persistent, OS independent way of storing and
>  updating the user defined SSDTs. There is also work underway to
>  implement EFI support for loading user defined SSDTs and using this
>  method will make it easier to convert to the EFI loading mechanism
>  when that will arrive.
> 
> * From userspace via configfs
> 
>  This is useful when we want to defer the operation to userspace for
>  platform detection, loading the SSDTs from a custom partition, etc.
> 
> 
> Changes from v4:
> 
> * EFI: use ucs2_as_utf8 and memcmp to check if the variable name
>   matches the kernel command line parameter; fold the EFI ACPI table
>   load code in the iterator function
> 
> * I2C/SPI: add more information in the commit logs about how the
>   enumeration status is set and checked; also add a new check of the
>   enumerated status for the ACPI device remove reconfiguration
>   notification
> 
> Changes from v3:
> 
> * fix a bisectability issue reported by kbuild
> 
> * rework the enumeration fix to support PRP0001 enumeration; also,
>   introduce acpi_device_set/clear_enumerated() for clarity
> 
> * clear the enumerated status for acpi_device when the I2C/SPI clients
>   are removed (for example as a result of an adapter removal) to allow
>   the devices to be re-enumerated later
> 
> Changes from v2:
> 
> * fix a few issues caught by the kbuild test robot
> 
> * add more configfs table attributes
> 
> * removed the initrd based loading functionality from this patch set
>   as this can already be accomplished in 4.7 using the ACPI table
>   upgrade mechanism
> 
> * rebased to 4.7-rc3
> 
> Changes from v1:
> 
> * rebased on top of the ACPI install from initrd table functionality;
>   there is significant overlap between the 1st patch in this series
>   and these patch [1] from Lv - I kept it in this series until the
>   discussions around the taint and config option are resolved
> 
> * make sure EFI_RUNTIME_SERVICES are available before trying to use
>   EFI variables to load tables
> 
> * rework the ACPI reconfiguration notifications to work on device
>   granularity (device added or removed) instead of table granularity
>   (table loaded or unloaded)
> 
> * add support for table unloading / device removal
> 
> * note that the last patch is just a hack to be able to test the table
>   unload / device remove functionality, if someone wants to try out
>   this patch set
> 
> [1] https://patchwork.kernel.org/patch/8795931/
> 
> Octavian Purdila (8):
>   Documentation: acpi: add SSDT overlays documentation
>   acpi: fix enumeration (visited) flags for bus rescans
>   acpi: add support for ACPI reconfiguration notifiers
>   i2c: add support for ACPI reconfigure notifications
>   spi: add support for ACPI reconfigure notifications
>   efi: load SSTDs from EFI variables
>   acpi: add support for configfs
>   acpi: add support for loading SSDTs via configfs
> 
>  Documentation/ABI/testing/configfs-acpi |  36 +
>  Documentation/acpi/ssdt-overlays.txt| 172 
>  Documentation/kernel-parameters.txt |   7 +
>  MAINTAINERS |   1 +
>  drivers/acpi/Kconfig|   9 ++
>  drivers/acpi/Makefile   |   1 +
>  drivers/acpi/bus.c  |   9 ++
>  drivers/acpi/configfs.c | 267 
> 
>  drivers/acpi/internal.h |   3 +
>  drivers/acpi/scan.c |  81 +-
>  drivers/acpi/sysfs.c|   6 +-
>  drivers/firmware/efi/efi.c  |  85 ++
>  drivers/i2c/i2c-core.c  | 175 -
>  drivers/spi/spi.c   | 100 +++-
>  include/linux/acpi.h| 

Re: [PATCH 2/2] f2fs: fix to avoid data update racing between GC and DIO

2016-07-05 Thread Jaegeuk Kim
On Fri, Jul 01, 2016 at 02:03:17PM +0800, Chao Yu wrote:
> Hi Jaegeuk,
> 
> On 2016/7/1 8:03, Jaegeuk Kim wrote:
> > Hi Chao,
> > 
> > On Thu, Jun 30, 2016 at 04:42:48PM +0800, Chao Yu wrote:
> >> Data in a file can be operated on by GC and DIO simultaneously, so we can
> >> face a race as below:
> >>
> >> For write case:
> >> Thread A   Thread B
> >> - generic_file_direct_write
> >>  - invalidate_inode_pages2_range
> >>  - f2fs_direct_IO
> >>   - do_blockdev_direct_IO
> >>- do_direct_IO
> >> - get_more_blocks
> >>- f2fs_gc
> >> - do_garbage_collect
> >>  - gc_data_segment
> >>   - move_data_page
> >>- do_write_data_page
> >>migrate data block to new block 
> >> address
> >>- dio_bio_submit
> >>update user data to old block address
> >>
> >> For read case:
> >> Thread AThread B
> >> - generic_file_direct_write
> >>  - invalidate_inode_pages2_range
> >>  - f2fs_direct_IO
> >>   - do_blockdev_direct_IO
> >>- do_direct_IO
> >> - get_more_blocks
> >>- f2fs_balance_fs
> >> - f2fs_gc
> >>  - do_garbage_collect
> >>   - gc_data_segment
> >>- move_data_page
> >> - do_write_data_page
> >> migrate data block to new block 
> >> address
> >>  - write_checkpoint
> >>   - do_checkpoint
> >>- clear_prefree_segments
> >> - f2fs_issue_discard
> >>  discard old block adress
> >>- dio_bio_submit
> >>update user buffer from obsolete block address
> >>
> >> In order to fix this, for a given file, we should make DIO and GC mutually
> >> exclusive.
> >>
> >> Signed-off-by: Chao Yu 
> >> ---
> >>  fs/f2fs/data.c  |  2 ++
> >>  fs/f2fs/f2fs.h  |  1 +
> >>  fs/f2fs/gc.c| 14 +-
> >>  fs/f2fs/super.c |  1 +
> >>  4 files changed, 17 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> >> index ba4963f..08dc060 100644
> >> --- a/fs/f2fs/data.c
> >> +++ b/fs/f2fs/data.c
> >> @@ -1716,7 +1716,9 @@ static ssize_t f2fs_direct_IO(struct kiocb *iocb, 
> >> struct iov_iter *iter)
> >>  
> >>trace_f2fs_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> >>  
> >> +  mutex_lock(&F2FS_I(inode)->dio_mutex);
> >>err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
> >> +  mutex_unlock(&F2FS_I(inode)->dio_mutex);
> > 
> > This means we need to sacrifice all parallelism even in the normal case?
> > Can we find another way?
> 
> 1. For dio write vs dio write, the writer will grab i_mutex before dio_mutex,
> so concurrent dio writes will be exclusive anyway.
> 
> 2. For dio write vs gc, keep using dio_mutex to make them exclusive.
> 
> 3. For dio read vs dio read, and dio read vs gc, what about adding a
> dio_rwsem to control the parallelism?
> 
> 4. For dio write vs dio read, we grab different locks (write grabs dio_mutex,
> read grabs dio_rwsem), so there is no race condition.

How about adding a flag to an inode with DIO in flight and skipping GC for its blocks?
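
A rough sketch of that idea, purely for illustration (the flag name
FI_DIO_IN_FLIGHT and the hook points are made up, not an actual f2fs patch):

        /* in f2fs_direct_IO(): mark the inode while DIO is in flight */
        set_inode_flag(inode, FI_DIO_IN_FLIGHT);
        err = blockdev_direct_IO(iocb, inode, iter, get_data_block_dio);
        clear_inode_flag(inode, FI_DIO_IN_FLIGHT);

        /* in gc_data_segment(): skip blocks of inodes with DIO in flight */
        inode = find_gc_inode(gc_list, dni.ino);
        if (inode && is_inode_flag_set(inode, FI_DIO_IN_FLIGHT))
                continue;

This would keep the common DIO path free of a global mutex while still
letting GC stay away from blocks that DIO may be touching.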

Thanks,

> 
> Thanks,
> 
> > 
> > Thanks,
> > 
> >>if (iov_iter_rw(iter) == WRITE) {
> >>if (err > 0)
> >>set_inode_flag(inode, FI_UPDATE_WRITE);
> >> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> >> index bd82b6d..a241576 100644
> >> --- a/fs/f2fs/f2fs.h
> >> +++ b/fs/f2fs/f2fs.h
> >> @@ -474,6 +474,7 @@ struct f2fs_inode_info {
> >>struct list_head inmem_pages;   /* inmemory pages managed by f2fs */
> >>struct mutex inmem_lock;/* lock for inmemory pages */
> >>struct extent_tree *extent_tree;/* cached extent_tree entry */
> >> +  struct mutex dio_mutex; /* avoid racing between dio and gc */
> >>  };
> >>  
> >>  static inline void get_extent_info(struct extent_info *ext,
> >> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
> >> index c2c4ac3..98e3763 100644
> >> --- a/fs/f2fs/gc.c
> >> +++ b/fs/f2fs/gc.c
> >> @@ -744,12 +744,24 @@ next_step:
> >>/* phase 3 */
> >>inode = find_gc_inode(gc_list, dni.ino);
> >>if (inode) {
> >> +  bool locked = false;
> >> +
> >> +  if (S_ISREG(inode->i_mode)) {
> >> +  if (!mutex_trylock(&F2FS_I(inode)->dio_mutex))
> >> +  continue;
> >> +  locked = true;
> >> +  }
> >> +
> >> 

Re: [PATCH 01/31] mm, vmstat: add infrastructure for per-node vmstats

2016-07-05 Thread Minchan Kim
On Tue, Jul 05, 2016 at 09:14:05AM +0100, Mel Gorman wrote:
> On Tue, Jul 05, 2016 at 08:50:18AM +0900, Minchan Kim wrote:
> > > @@ -172,13 +174,17 @@ void refresh_zone_stat_thresholds(void)
> > >   int threshold;
> > >  
> > >   for_each_populated_zone(zone) {
> > > + struct pglist_data *pgdat = zone->zone_pgdat;
> > >   unsigned long max_drift, tolerate_drift;
> > >  
> > >   threshold = calculate_normal_threshold(zone);
> > >  
> > > - for_each_online_cpu(cpu)
> > > + for_each_online_cpu(cpu) {
> > >   per_cpu_ptr(zone->pageset, cpu)->stat_threshold
> > >   = threshold;
> > > + per_cpu_ptr(pgdat->per_cpu_nodestats, 
> > > cpu)->stat_threshold
> > > + = threshold;
> > > + }
> > 
> > I haven't seen the other patches yet, so maybe they fix it.
> > 
> > per_cpu_nodestats is per node, not per zone, but it uses a per-zone
> > threshold which is even overwritten by subsequent zones. I don't think
> > that is intended.
> 
> It was intended that the threshold from one zone would be used but now
> that you point it out, it would use the threshold for the smallest zone
> in the node which is sub-optimal. I applied the patch below on top to
> use the threshold from the largest zone. I considered using the sum of
> all thresholds but feared it might allow too much per-cpu drift. It can
> be switched to the sum if we find a case where vmstat updates are too
> high.
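
As a sketch of that idea (field names follow the diff quoted above; the
struct name and the reset step are assumptions, not the actual follow-up
patch):

        /* keep the maximum zone threshold as the node threshold; assumes
         * the node's stat_threshold was reset before iterating its zones,
         * since calculate_normal_threshold() scales with zone size
         */
        for_each_online_cpu(cpu) {
                struct per_cpu_nodestat *pn =
                        per_cpu_ptr(pgdat->per_cpu_nodestats, cpu);

                per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;
                if (pn->stat_threshold < threshold)
                        pn->stat_threshold = threshold;
        }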

Fair enough.


Re: [PATCH] memory: samsung: exynos-srom: fix wrong count of registers

2016-07-05 Thread Chanwoo Choi
On 2016년 07월 05일 20:40, Seung-Woo Kim wrote:
> This patch fixes the wrong count of array entries for the srom registers in
> the probe function.
> 
> Signed-off-by: Seung-Woo Kim 
> ---
>  drivers/memory/samsung/exynos-srom.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/memory/samsung/exynos-srom.c 
> b/drivers/memory/samsung/exynos-srom.c
> index 96756fb..c66d2bd 100644
> --- a/drivers/memory/samsung/exynos-srom.c
> +++ b/drivers/memory/samsung/exynos-srom.c
> @@ -134,7 +134,7 @@ static int exynos_srom_probe(struct platform_device *pdev)
>   platform_set_drvdata(pdev, srom);
>  
>   srom->reg_offset = exynos_srom_alloc_reg_dump(exynos_srom_offsets,
> - sizeof(exynos_srom_offsets));
> + ARRAY_SIZE(exynos_srom_offsets));
>   if (!srom->reg_offset) {
>   iounmap(srom->reg_base);
>   return -ENOMEM;
> 

In exynos-srom.c, ARRAY_SIZE is the right way to get the number of array entries.
Looks good to me.
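
To illustrate the difference (array contents hypothetical):

        static const u32 exynos_srom_offsets[] = { 0x00, 0x04, 0x08, 0x0c };

        /* sizeof(exynos_srom_offsets)     == 16 -- size in bytes     */
        /* ARRAY_SIZE(exynos_srom_offsets) ==  4 -- number of entries */

exynos_srom_alloc_reg_dump() expects the number of entries, so ARRAY_SIZE is
the correct argument here.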

Reviewed-by: Chanwoo Choi 

Thanks,
Chanwoo Choi


Re: [PATCH] extcon: Move extcon_get_edev_by_phandle() errors to dbg level

2016-07-05 Thread Chanwoo Choi
Hi Stephen,

On 2016년 07월 06일 03:57, Stephen Boyd wrote:
> Sometimes drivers may call this API and expect it to fail because
> the extcon they're looking for is optional. Let's move these
> prints to debug level so it doesn't look like there's a problem
> when there isn't one.
> 
> Signed-off-by: Stephen Boyd 
> ---
>  drivers/extcon/extcon.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/extcon/extcon.c b/drivers/extcon/extcon.c
> index 161acb826334..984f50d5297c 100644
> --- a/drivers/extcon/extcon.c
> +++ b/drivers/extcon/extcon.c
> @@ -1096,13 +1096,13 @@ struct extcon_dev *extcon_get_edev_by_phandle(struct 
> device *dev, int index)
>   return ERR_PTR(-EINVAL);
>  
>   if (!dev->of_node) {
> - dev_err(dev, "device does not have a device node entry\n");
> + dev_dbg(dev, "device does not have a device node entry\n");
>   return ERR_PTR(-EINVAL);
>   }
>  
>   node = of_parse_phandle(dev->of_node, "extcon", index);
>   if (!node) {
> - dev_err(dev, "failed to get phandle in %s node\n",
> + dev_dbg(dev, "failed to get phandle in %s node\n",
>   dev->of_node->full_name);
>   return ERR_PTR(-ENODEV);
>   }
> 

Applied it.

Thanks,
Chanwoo Choi


Re: [PATCH] extcon: adc-jack: update cable state during boot

2016-07-05 Thread Chanwoo Choi
Hi Venkat,

On 2016년 07월 05일 22:56, Venkat Reddy Talla wrote:
> Update the cable state during boot to avoid missing any external cable
> events that occurred before driver initialisation.
> 
> Signed-off-by: Venkat Reddy Talla 
> ---
>  drivers/extcon/extcon-adc-jack.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/extcon/extcon-adc-jack.c 
> b/drivers/extcon/extcon-adc-jack.c
> index 44e48aa..48dec94 100644
> --- a/drivers/extcon/extcon-adc-jack.c
> +++ b/drivers/extcon/extcon-adc-jack.c
> @@ -158,6 +158,7 @@ static int adc_jack_probe(struct platform_device *pdev)
>   if (data->wakeup_source)
>   device_init_wakeup(&pdev->dev, 1);
>  
> + adc_jack_handler(&data->handler.work);
>   return 0;
>  }
>  
> 

Applied it.

Thanks,
Chanwoo Choi


Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries

2016-07-05 Thread Minchan Kim
On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote:
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
> 
> The retry logic is only critical for allocations that can use any zones.

Sorry, I cannot follow this assertion.
Could you explain?

> Hence this patch will not retry reclaim or compaction for such allocations.

What are such allocations?

> This should not be a problem for reclaim as zone-constrained allocations
> are immune from OOM kill. For retries, a very rough approximation is made

zone-constrained allocations are immune from OOM kill?
Please explain it, too.

Sorry for the many questions, but I cannot review the code without a clear
understanding of the assumptions/background, which I couldn't work out myself.

> whether to retry or not. While it is possible this will make the wrong
> decision on occasion, it will not infinite loop as the number of reclaim
> attempts is capped by MAX_RECLAIM_RETRIES.
> 
> The highmem calculations only care about the global count of file pages
> in highmem. Hence, a global counter is used instead of per-zone stats.
> With this, the per-zone double accounting disappears.
> 
> Suggested by: Michal Hocko 
> Signed-off-by: Mel Gorman 
> ---
>  include/linux/mm_inline.h | 20 +++--
>  include/linux/mmzone.h|  4 ---
>  include/linux/swap.h  |  1 -
>  mm/compaction.c   | 22 ++-
>  mm/migrate.c  |  2 --
>  mm/page-writeback.c   | 13 -
>  mm/page_alloc.c   | 71 
> ---
>  mm/vmscan.c   | 16 ---
>  mm/vmstat.c   |  3 --
>  9 files changed, 92 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 9aadcc781857..c68680aac044 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -4,6 +4,22 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_HIGHMEM
> +extern unsigned long highmem_file_pages;
> +
> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> + int nr_pages)
> +{
> + if (is_highmem_idx(zid) && is_file_lru(lru))
> + highmem_file_pages += nr_pages;
> +}
> +#else
> +static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
> + int nr_pages)
> +{
> +}
> +#endif
> +
>  /**
>   * page_is_file_cache - should the page be on a file LRU or anon LRU?
>   * @page: the page to test
> @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec 
> *lruvec,
>   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  
>   __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
> - __mod_zone_page_state(&pgdat->node_zones[zid],
> - NR_ZONE_LRU_BASE + !!is_file_lru(lru),
> - nr_pages);
> + acct_highmem_file_pages(zid, lru, nr_pages);
>  }
>  
>  static __always_inline void update_lru_size(struct lruvec *lruvec,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index facee6b83440..9268528c20c0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -110,10 +110,6 @@ struct zone_padding {
>  enum zone_stat_item {
>   /* First 128 byte cacheline (assuming 64 bit words) */
>   NR_FREE_PAGES,
> - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
> - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
> - NR_ZONE_LRU_FILE,
> - NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages 
> */
>   NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
>   NR_SLAB_RECLAIMABLE,
>   NR_SLAB_UNRECLAIMABLE,
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b17cc4830fa6..cc753c639e3d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct 
> page *page,
>   struct vm_area_struct *vma);
>  
>  /* linux/mm/vmscan.c */
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>   gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index a0bd85712516..dfe7dafe8e8b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context 
> *ac, int order,
>  {
>   struct zone *zone;
>   struct zoneref *z;
> + pg_data_t *last_pgdat = NULL;
> +
> +#ifdef CONFIG_HIGHMEM
> + /* Do not retry compaction for zone-constrained allocations */
> + if (!is_highmem_idx(ac->high_zone

Re: [PATCH v6 00/10] acpi, clocksource: add GTDT driver and GTDT support in arm_arch_timer

2016-07-05 Thread Rafael J. Wysocki
On Tue, Jul 5, 2016 at 4:18 PM, Graeme Gregory  wrote:
> On Mon, Jul 04, 2016 at 02:53:20PM +0200, Rafael J. Wysocki wrote:
>> On Fri, Jul 1, 2016 at 11:04 PM, Rafael J. Wysocki  
>> wrote:
>> > On Friday, July 01, 2016 04:23:40 PM Will Deacon wrote:
>> >> On Thu, Jun 30, 2016 at 09:48:02PM +0800, Hanjun Guo wrote:
>> >> > On 2016/6/30 21:27, Rafael J. Wysocki wrote:
>> >> > >On Thursday, June 30, 2016 10:10:02 AM Hanjun Guo wrote:
>> >> > >>GTDT is part of ACPI spec, drivers/acpi/ is for driver code of
>> >> > >>ACPI spec, I think it can stay in drivers/acpi/ from this point
>> >> > >>of view, am I right?
>> >> > >
>> >> > >The question is not "Can it?", but "Does it need to?".
>> >> > >
>> >> > >It is in the spec, but still there's only one architecture needing it.
>> >> > >
>> >> > >There is no way to test it on any other architecture and no reason to 
>> >> > >build it
>> >> > >for any other architecture, so why does it need to be located in 
>> >> > >drivers/acpi/ ?
>> >> >
>> >> > I'm fine to move it to other places such as arch/arm64/kernel/, but I
>> >> > would like to ask ARM64 maintainer's suggestion for this.
>> >> >
>> >> > Will, Catalin, what's your opinion on this?
>> >>
>> >> We don't have any device-tree code for the architected timer under
>> >> arch/arm64, so I don't see why we should need anything for ACPI either.
>> >
>> > And I don't see a reason for the GTDT code to be there in drivers/acpi/.
>> >
>> > What gives?
>>
>> Well, since there are things like acpi_lpss in there, my position here
>> is kind of weak. :-)
>>
>> That said I'm not particularly happy with having them in
>> drivers/acpi/, so I definitely won't object against attempts to moving
>> them somewhere else.
>>
>> > Maybe it should go to the same place as the analogus DT code, then?
>>
>> I'm mostly concerned about how (and by whom) that code is going to be
>> maintained going forward, though.  I also think it should be made
>> clear that it is ARM64-only.
>>
>
> So is this a documentation issue, in which case Fu Wei can add a note to
> the file to explain that it's limited to ARM64? Or we could even rename the
> file acpi_arm64_gtdt.c.
>
> It seems a pity to block things on a filename/location, as the comments on
> this series were minor.

Let me repeat what I said above:

I'm mostly concerned about how (and by whom) that code is going to be
maintained going forward.

This is not about documentation, it is about responsibility.

Honestly, I don't think I'm the right maintainer to apply the patch
introducing this code and then handle bug reports regarding it and so
on.  That has to be done by somebody else.

That's one thing.

The other is the question I asked a few messages ago: why would having the
GTDT code in drivers/acpi/ actually be useful to anyone?  It
definitely would not be useful to me as the maintainer of
drivers/acpi/, but maybe it would be useful to somebody for a specific
practical reason.  Or is it just "let's put this into drivers/acpi/
for the lack of a better place"?

I have not received a good answer to this one yet.

Thanks,
Rafael


Re: [PATCH 5/5] ACPI: Add configuration item to configure ACPICA error logs out

2016-07-05 Thread Rafael J. Wysocki
On Tue, Jul 5, 2016 at 1:18 PM, Lv Zheng  wrote:
> Sometimes we need to disable ACPICA error logs and leave only ACPICA
> debug logs enabled, for debugging purposes. This is useful when ACPICA error
> logs become a flood.
>
> Reference: https://bugzilla.kernel.org/show_bug.cgi?id=114201
> Signed-off-by: Lv Zheng 

I seem to remember seeing this change once before and ISTR I said I
wouldn't apply it then.

Why would I say something different this time?

> ---
>  drivers/acpi/Kconfig|7 +++
>  include/acpi/platform/aclinux.h |4 
>  2 files changed, 11 insertions(+)
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 04af18f..939d235 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -324,6 +324,13 @@ config ACPI_TABLE_UPGRADE
>   initrd, therefore it's safe to say Y.
>   See Documentation/acpi/initrd_table_override.txt for details
>
> +config ACPI_NO_ERROR_MESSAGES
> +   bool "Disable error messages"
> +   default n
> +   help
> + The ACPI subsystem can produce error messages. Saying Y disables
> + this output.
> +
>  config ACPI_DEBUG
> bool "Debug Statements"
> default n
> diff --git a/include/acpi/platform/aclinux.h b/include/acpi/platform/aclinux.h
> index 93b61b1..ed27b52 100644
> --- a/include/acpi/platform/aclinux.h
> +++ b/include/acpi/platform/aclinux.h
> @@ -77,6 +77,10 @@
>  #define ACPI_MUTEX_DEBUG
>  #endif
>
> +#ifdef CONFIG_ACPI_NO_ERROR_MESSAGES
> +#define ACPI_NO_ERROR_MESSAGES
> +#endif
> +
>  #include 
>  #include 
>  #include 
> --


Re: [PATCH 2/5] ACPI / debugger: Fix regressions that AML debugger stops working

2016-07-05 Thread Rafael J. Wysocki
On Tue, Jul 5, 2016 at 1:18 PM, Lv Zheng  wrote:
> The FIFO unlocking mechanism in acpi_dbg has been messed up by the
> following commit:
>   Commit: 287980e49ffc0f6d911601e7e352a812ed27768e
>   Subject: remove lots of IS_ERR_VALUE abuses
> It converts !IS_ERR_VALUE(ret) into !ret. This patch fixes the
> regression.
>
> Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses")
> Signed-off-by: Lv Zheng 
> Cc: Arnd Bergmann 

OK, but ->

> ---
>  drivers/acpi/acpi_dbg.c |6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/acpi_dbg.c b/drivers/acpi/acpi_dbg.c
> index 1f41284..ebc8d18 100644
> --- a/drivers/acpi/acpi_dbg.c
> +++ b/drivers/acpi/acpi_dbg.c
> @@ -602,7 +602,8 @@ static int acpi_aml_read_user(char __user *buf, int len)
> crc->tail = (crc->tail + n) & (ACPI_AML_BUF_SIZE - 1);
> ret = n;
>  out:
> -   acpi_aml_unlock_fifo(ACPI_AML_OUT_USER, !ret);
> +   acpi_aml_unlock_fifo(ACPI_AML_OUT_USER,
> +ret < 0 ? false : true);

-> The ternary operation is not necessary here, because the result of
(ret < 0) is already boolean.  So this can be written as

acpi_aml_unlock_fifo(ACPI_AML_OUT_USER, ret >= 0);

and analogously below.

I've made that change, please check the result in bleeding-edge.

Thanks,
Rafael


Re: [PATCH 2/2] ACPICA: Namespace: Fix lock order issue for namespace/interpreter

2016-07-05 Thread Rafael J. Wysocki
On Tue, Jul 5, 2016 at 7:53 AM, Lv Zheng  wrote:
> There is a lock order issue in acpi_load_tables(). The namespace lock is held
> before holding the interpreter lock.
> With ACPI_MUTEX_DEBUG enabled in kernel, we can reproduce this during boot:
>   [0.885699] ACPI Error: Invalid acquire order: Thread 405884224 owns 
> [ACPI_MTX_Namespace], wants [ACPI_MTX_Interpreter] (20160422/utmutex-263)
>   [0.885881] ACPI Error: Could not acquire AML Interpreter mutex 
> (20160422/exutils-95)
>   [0.893846] ACPI Error: Mutex [0x0] is not acquired, cannot release 
> (20160422/utmutex-326)
>   [0.894019] ACPI Error: Could not release AML Interpreter mutex 
> (20160422/exutils-133)
>
> The issue is introduced by the following commit:
>   Commit: 2f38b1b16d9280689e5cfa47a4c50956bf437f0d
>   ACPICA Commit: bfe03ffcde8ed56a7eae38ea0b188aeb12f9c52e
>   Subject: ACPICA: Namespace: Fix a regression that MLC support triggers
>dead lock in dynamic table loading
> Which fixes a dead lock issue for acpi_ns_load_table() in
> acpi_ex_add_table() but doesn't take care of the lock order in
> acpi_ns_load_table().
>
> Originally (before applying the wrong fix), ACPICA uses the
> namespace/interpreter locks in the following 2 key code paths:
> 1. Table loading (before applying 2f38b1b):
> acpi_ns_load_table
> L(Namespace)
> acpi_ns_parse_table
> acpi_ns_one_complete_parse
> U(Namespace)
> 2. Object evaluation:
> acpi_ns_evaluate
> L(Interpreter)
> acpi_ps_execute_method
> U(Interpreter)
> acpi_ns_load_table
> L(Namespace)
> U(Namespace)
> acpi_ev_initialize_region
> L(Namespace)
> U(Namespace)
> address_space.setup
> L(Namespace)
> U(Namespace)
> address_space.handler
> L(Namespace)
> U(Namespace)
> acpi_os_wait_semaphore
> acpi_os_acquire_mutex
> acpi_os_sleep
> L(Interpreter)
> U(Interpreter)
> During runtime, while acpi_ns_evaluate is called, lock order is always
> Interpreter -> Namespace.
>
> While the wrong fix holds the lock in the following order:
> 3. Table loading (after applying 2f38b1b):
> acpi_ns_load_table
> L(Namespace)
> acpi_ns_parse_table
> L(Interpreter)
> acpi_ns_one_complete_parse
> U(Interpreter)
> U(Namespace)
>
> This patch further fixes the lock order issue by moving interpreter lock
> to acpi_ns_load_table() to ensure the lock order correctness:
> 4. Table loading:
> acpi_ns_load_table
> L(Interpreter)
> L(Namespace)
> acpi_ns_parse_table
> acpi_ns_one_complete_parse
> U(Namespace)
> U(Interpreter)
>
> However, this fix is just a workaround which is compliant with the current
> design but doesn't fix the current design issues related to the namespace
> lock. For example, we can notice that in acpi_ns_evaluate(), outside of
> acpi_ns_load_table(), the namespace objects may be created by the named
> object creation control methods. And the creations of the method owned
> namespace objects are not locked by the namespace lock. This patch doesn't
> try to fix such existing issues. Lv Zheng.
>
> Fixes: 2f38b1b16d92 ("ACPICA: Namespace: Fix a regression that MLC support 
> triggers dead lock in dynamic table loading")
> Signed-off-by: Lv Zheng 

So this one is a fix and it does not really depend on the [1/2] if I'm
not mistaken.

Accordingly, I've queued up the [2/2] as a fix for 4.7 and the [1/2] for 4.8.

Thanks,
Rafael


Re: [PATCH 0/4] MAINTAINERS: update BCM SoC entries

2016-07-05 Thread Florian Fainelli
On 06/20/2016 02:14 PM, Scott Branden wrote:
> Patch series looks good.
> 
> Acked-by: Scott Branden 

Series applied, thanks!
-- 
Florian


[GIT] Networking

2016-07-05 Thread David Miller

1) All users of AF_PACKET's fanout feature want a symmetric packet
   header hash for load balancing purposes, so give it to them.

2) Fix vlan state synchronization in e1000e, from Jarod Wilson.

3) Use correct socket pointer in ip_skb_dst_mtu(), from Shmulik
   Ladkani.

4) mlx5 bug fixes from Mohamad Haj Yahia, Daniel Jurgens, Matthew
   Finlay, Rana Shahout, and Shaker Daibes.  Mostly to do with
   operation timeouts and PCI error handling.

5) Fix checksum handling in mirred packet action, from WANG Cong.

6) Set skb->dev correctly when transmitting in !protect_frames case
   of macsec driver, from Daniel Borkmann.

7) Fix MTU calculation in geneve driver, from Haishuang Yan.

8) Missing netif_napi_del() in unregister path of qeth driver, from
   Ursula Braun.

9) Handle malformed route netlink messages in decnet properly, from
   Vergard Nossum.

10) Memory leak of percpu data in ipv6 routing code, from Martin
KaFai Lau.

Please pull, thanks a lot!

The following changes since commit e7bdea7750eb2a64aea4a08fa5c0a31719c8155d:

  Merge tag 'nfs-for-4.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs 
(2016-06-29 15:30:26 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 

for you to fetch changes up to 903ce4abdf374e3365d93bcb3df56c62008835ba:

  ipv6: Fix mem leak in rt6i_pcpu (2016-07-05 14:09:23 -0700)


Aviv Heller (1):
  bonding: fix enslavement slave link notifications

Bjørn Mork (1):
  cdc_ncm: workaround for EM7455 "silent" data interface

Christophe Jaillet (1):
  fsl/fman: fix error handling

Daniel Borkmann (1):
  macsec: set actual real device for xmit when !protect_frames

Daniel Jurgens (5):
  net/mlx5: Fix incorrect page count when in internal error
  net/mlx5: Fix wait_vital for VFs and remove fixed sleep
  net/mlx5e: Timeout if SQ doesn't flush during close
  net/mlx5e: Implement ndo_tx_timeout callback
  net/mlx5e: Handle RQ flush in error cases

David S. Miller (4):
  Merge branch 'master' of git://git.kernel.org/.../jkirsher/net-queue
  Merge branch 'mlx5-fixes'
  packet: Use symmetric hash for PACKET_FANOUT_HASH.
  Revert "fsl/fman: fix error handling"

Eric Dumazet (1):
  bonding: prevent out of bound accesses

Florian Fainelli (1):
  net: bcmsysport: Device stats are unsigned long

Ganesh Goudar (1):
  cxgb4: update latest firmware version supported

Haishuang Yan (1):
  geneve: fix max_mtu setting

Jarod Wilson (1):
  e1000e: keep Rx/Tx HW_VLAN_CTAG in sync

Martin KaFai Lau (1):
  ipv6: Fix mem leak in rt6i_pcpu

Matt Corallo (1):
  net: stmmac: Fix null-function call in ISR on stmmac1000

Matthew Finlay (1):
  net/mlx5e: Copy all L2 headers into inline segment

Mohamad Haj Yahia (4):
  net/mlx5: Fix teardown errors that happen in pci error handler
  net/mlx5: Avoid calling sleeping function by the health poll thread
  net/mlx5: Fix potential deadlock in command mode change
  net/mlx5: Add timeout handle to commands with callback

Or Gerlitz (1):
  net/mlx5: Avoid setting unused var when modifying vport node GUID

Rana Shahout (2):
  net/mlx5e: Fix select queue callback
  net/mlx5e: Validate BW weight values of ETS

Richard Alpe (1):
  tipc: fix nl compat regression for link statistics

Russell King - ARM Linux (1):
  net: mvneta: fix open() error cleanup

Sergio Valverde (1):
  enc28j60: Fix race condition in enc28j60 driver

Shaker Daibes (1):
  net/mlx5e: Log link state changes

Shmulik Ladkani (1):
  ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output

Sony Chacko (1):
  qlcnic: add wmb() call in transmit data path.

Soohoon Lee (1):
  usbnet: Stop RX Q on MTU change

Stefan Hauser (1):
  net: phy: dp83867: Fix initialization of PHYCR register

Ursula Braun (1):
  qeth: delete napi struct when removing a qeth device

Vegard Nossum (2):
  RDS: fix rds_tcp_init() error path
  net: fix decnet rtnexthop parsing

WANG Cong (1):
  net_sched: fix mirrored packets checksum

Xin Long (1):
  ixgbevf: ixgbevf_write/read_posted_mbx should use IXGBE_ERR_MBX to 
initialize ret_val

hayeswang (2):
  r8152: clear LINK_OFF_WAKE_EN after autoresume
  r8152: fix runtime function for RTL8152

 drivers/net/bonding/bond_3ad.c  |  11 ---
 drivers/net/bonding/bond_alb.c  |   7 ++---
 drivers/net/bonding/bond_main.c |   1 +
 drivers/net/ethernet/broadcom/bcmsysport.c  |   2 +-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_version.h   |  12 +++
 drivers/net/ethernet/intel/e1000e/netdev.c  |  21 ++---
 drivers/net/ethernet/intel/ixgbevf/mbx.c|   4 +--
 drivers/net/ethernet/marvell/mvneta.c   |   2 ++
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c   | 129 
+++

Re: [PATCH 6/7] dt-bindings: net: bgmac: add bindings documentation for bgmac

2016-07-05 Thread Jon Mason
On Tue, Jul 5, 2016 at 9:37 AM, Arnd Bergmann  wrote:
> On Monday, July 4, 2016 9:34:35 AM CEST Ray Jui wrote:
>> On 7/1/2016 8:42 AM, Arnd Bergmann wrote:
>> > On Friday, July 1, 2016 11:17:25 AM CEST Jon Mason wrote:
>> >> On Fri, Jul 1, 2016 at 5:46 AM, Arnd Bergmann  wrote:
>> >>> On Thursday, June 30, 2016 6:59:13 PM CEST Jon Mason wrote:
>>  +
>>  +Required properties:
>>  + - compatible: "brcm,bgmac-nsp"
>>  + - reg:Address and length of the GMAC registers,
>>  +   Address and length of the GMAC IDM registers
>>  + - reg-names:  Names of the registers.  Must have both "gmac_base" and
>>  +   "idm_base"
>>  + - interrupts: Interrupt number
>>  +
>> >>>
>> >>>
>> >>> "brcm,bgmac-nsp" sounds a bit too general. As I understand, this is a 
>> >>> family
>> >>> of SoCs that might not all have the exact same implementation of this
>> >>> ethernet device, as we can see from the long lookup table in 
>> >>> bgmac_probe().
>> >>
>> >> The Broadcom iProc family of SoCs contains:
>> >> Northstar
>> >> Northstar Plus
>> >> Cygnus
>> >> Northstar 2
>> >> a few SoCs that are under development
>> >> and a number of ethernet switches (which might never be officially 
>> >> supported)
>> >>
>> >> Each one of these SoCs could have a different revision of the gmac IP
>> >> block, but they should be uniform within each SoC (though there might
>> >> be a A0/B0 change necessary).  The Northstar Plus product family has a
>> >> number of different implementations, but the SoC is unchanged.  So, I
>> >> think this might be too specific, when we really need a general compat
>> >> string.
>> >
>> > Ok, thanks for the clarification, that sounds good enough.
>> >
>> >> Broadcom has a history of sharing IP blocks amongst the different
>> >> divisions.  So, this driver might be used on other SoC families (as it
>> >> apparently has been done in the past, based on the code you
>> >> reference).  I do not know of any way to know what legacy, non-iProc
>> >> chips have used this IP block.  I can make this "brcm,iproc-bgmac",
>> >> and add "brcm,iproc-nsp-bgmac" as an alternative compatible string in
>> >> this file (which I believe you are suggesting), but there might be
>> >> non-iProc SoCs that use this driver.  Is this acceptable?
>> >
>> > If it is also used outside of iProc, then I see no need for the
>> > extra compatible string, although it would not do any harm either.
>> >
>> > Ideally we should name it whatever the name for this IP block is
>> > inside of the company, with "nsp" as the designation for the variant
>> > in Northstar Plus. A lot of Broadcom IP blocks themselves seem to have
>> > some four-digit or five-digit number, maybe this one does too?
>> >
>> >   Arnd
>> >
>>
>> Note this IP block has an official IP controller name of "amac" from the
>> ASIC team.
>
> Ok, then I'd suggest making the compatible string here
>
> compatible = "brcm,nsp-amac", "brcm,amac";

It is called GMAC in the NS and NSP documentation, but AMAC is fine
with me (as it is called this in the NS2 documentation).  I'll make
the necessary change and repush.
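
For illustration, a node using the agreed compatible strings might look like
the following (the unit address, register values and interrupt specifier are
placeholders rather than values from a datasheet; reg-names follow the
binding text quoted above):

        amac0: ethernet@18022000 {
                compatible = "brcm,nsp-amac", "brcm,amac";
                reg = <0x18022000 0x1000>,
                      <0x18110000 0x1000>;
                reg-names = "gmac_base", "idm_base";
                interrupts = <0 147 4>;
        };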

Thanks for all of the input.

> or even better if you have a version number associated with it, make that
>
> compatible = "brcm,nsp-amac", "brcm,amac-1.234", "brcm,amac";
>
> replacing 1.234 with the actual version of course.
>
> Arnd
>


Re: [PATCH v2 0/7] lib: string: add functions to case-convert strings

2016-07-05 Thread Joe Perches
On Tue, 2016-07-05 at 15:36 -0700, Markus Mayer wrote:
> On 5 July 2016 at 15:14, Joe Perches  wrote:
> > On Tue, 2016-07-05 at 13:47 -0700, Markus Mayer wrote:
> > > This series introduces a family of generic string case conversion
> > > functions. This kind of functionality is needed in several places in
> > > the kernel. Right now, everybody seems to be implementing their own
> > > copy of this functionality.
> > > 
> > > Based on the discussion of the previous version of this series[1] and
> > > the use cases found in the kernel, it does look like having several
> > > flavours of case conversion functions is beneficial. The use cases fall
> > > into three categories:
> > > - copying a string and converting the case while specifying a
> > >   maximum length to mimic strncpy()
> > > - copying a string and converting the case without specifying a
> > >   length to mimic strcpy()
> > > - converting the case of a string in-place (i.e. modifying the
> > >   string that was passed in)
> > > 
> > > Consequently, I am proposing these new functions:
> > > char *strncpytoupper(char *dst, const char *src, size_t len);
> > > char *strncpytolower(char *dst, const char *src, size_t len);
> > > char *strcpytoupper(char *dst, const char *src);
> > > char *strcpytolower(char *dst, const char *src);
> > > char *strtoupper(char *s);
> > > char *strtolower(char *s);
> > I think there isn't much value in anything other
> > than strto.
> > 
> > Using str[n]cpy followed by strto is
> > pretty obvious and rarely used anyway.
> First time around, folks were proposing the "copy" variants when I
> submitted just strtolower() by itself[1]. They just asked for source
> and destination parameters to strtolower(), but looking at the use
> cases that wouldn't have worked so well. Hence it evolved into these 6
> functions.
> 
> Here's a breakdown of how the functions are being used (patches 2-7),
> see also [2]:
> 
> Patch 2: strncpytolower()
> Patch 3: strtolower()
> Patch 4: strncpytolower() and strtolower()
> Patch 5: strtolower()
> Patch 6: strcpytoupper()
> Patch 7: strcpytoupper()
> 
> So it does look like the copy + change case variant is more frequently
> used than just strto.

Are these functions useful?   Not to me, not so much.

None of the functions would have the strcpy performance of the arch/asm
versions of strcpy, and the savings in overall code aren't significant
(or measured?).

Of course none of the uses are runtime performance critical.

This patch also adds always-compiled functions that aren't used in many
.configs.
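
For reference, a minimal sketch of what one of the proposed helpers could
look like (illustrative only; the series may implement it differently,
especially for the length-limited variants):

        #include <linux/ctype.h>

        char *strcpytolower(char *dst, const char *src)
        {
                char *p = dst;

                while ((*p++ = tolower(*src++)) != '\0')
                        ;
                return dst;
        }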



Re: [PATCH 04/14] PCI: generic: make it explicitly non-modular

2016-07-05 Thread David Daney

On 07/04/2016 10:37 AM, Will Deacon wrote:

On Sat, Jul 02, 2016 at 07:13:24PM -0400, Paul Gortmaker wrote:

The Kconfig currently controlling compilation of this code is:

drivers/pci/host/Kconfig:config PCI_HOST_GENERIC
drivers/pci/host/Kconfig:   bool "Generic PCI host controller"

...meaning that it currently is not being built as a module by anyone.

Let's remove the few traces of modular code and macros, so that
when reading the driver there is no doubt it is builtin-only.

Since module_platform_driver() uses the same init level priority as
builtin_platform_driver() the init ordering remains unchanged with
this commit.

Also note that MODULE_DEVICE_TABLE is a no-op for non-modular code.

We also delete the MODULE_LICENSE tag etc. since all that information
is already contained at the top of the file in the comments.


Ideally, we'd simply fix this to build as a module, but it's not clear
how to do that now that the ecam accessors have been split out into
their own file. A liberal sprinkling of EXPORT_SYMBOL might work, but
it's a bit grotty.

David, Jayachandran -- do you have any desire to build your PCI host
controller drivers as modules?


I can only speak to the Cavium case.

The system is not usable without PCI, so there is no advantage to making 
the PCI host drivers modular.  At this point I don't see any reason to 
expend effort making it work as a module.


David.
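
For readers less familiar with the macros involved, the conversion described
in the commit message above amounts to roughly the following. This is an
illustrative sketch that only approximates the pci-host-generic driver;
gen_pci_probe and gen_pci_of_match are assumed names and this is not a
verbatim diff:

static struct platform_driver gen_pci_driver = {
	.driver = {
		.name = "pci-host-generic",
		.of_match_table = gen_pci_of_match,
	},
	.probe = gen_pci_probe,
};

/* Modular form being removed:
 *	module_platform_driver(gen_pci_driver);
 *	MODULE_DEVICE_TABLE(of, gen_pci_of_match);
 *	MODULE_LICENSE("GPL v2");
 *
 * Builtin-only form, same init level priority, no module boilerplate:
 */
builtin_platform_driver(gen_pci_driver);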




Will





Re: [PATCH] [linux-next] workqueue: Fix a typo in workqueue.txt

2016-07-05 Thread Tejun Heo
On Tue, Jul 05, 2016 at 08:55:30PM +0900, Masanari Iida wrote:
> This patch fixes a spelling typo in workqueue.txt
> 
> Signed-off-by: Masanari Iida 

Acked-by: Tejun Heo 

Jonathan, can you please route this one?

Thanks!

-- 
tejun


Re: cgroup: Fix split bio been throttled more than once

2016-07-05 Thread Tejun Heo
Hello,

On Mon, Jul 04, 2016 at 10:46:54PM +0800, Jiale Li wrote:
> We have recently been testing the cgroup blkio throttle function using fio,
> and we found a problem: when we use buffered IO or set a big block size
> like 1M, the IO performance cannot reach the value we set.
> For example, with blkio.throttle.read_bps_device set to 10M, in kernel
> version 4.3 the IO performance only reaches 6M, and in kernel version
> 4.4 the actual IO bps is only 3.1M.
> 
> Then we did some research and found that kernel version 4.3 introduced the
> blk_queue_split() function to split a big bio into several parts, some of
> which call generic_make_request() again. This results in the bio being
> throttled more than once, so the actual IO sent to the device is
> less than we expected.
> 
> We have checked the newest kernel, 4.7-rc5, and this problem still exists.
> 
> Based on this situation, we propose a fix that adds a flag bit to the bio
> to let the split bio bypass the throttling. Below is the patch
> we used to fix this problem.

Thanks a lot for the report.  Hmmm... there was another breakage around
split bios; I wonder whether the two can be handled together somehow.

 
http://lkml.kernel.org/g/1466583730-28595-1-git-send-email-lars.ellenb...@linbit.com

> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 2613531..7b17a65 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -190,6 +190,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
> **bio,
>  
>   bio_chain(split, *bio);
>   trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> + bio_set_flag(*bio, BIO_SPLITED);

Split's past participle form is split, so BIO_SPLIT would be right
here.

>   generic_make_request(*bio);
>   *bio = split;
>   }
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 47a3e54..4ffde95 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -1403,6 +1403,10 @@ bool blk_throtl_bio(struct request_queue *q, struct 
> blkcg_gq *blkg,
>  
>   WARN_ON_ONCE(!rcu_read_lock_held());
>  
> + /* if the bio has been splited, should not throttle again */
> + if (bio_flagged(bio, BIO_SPLITED))
> + goto out;
> +

But that said, can't we just copy BIO_THROTLED while splitting?

Thanks.

-- 
tejun
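
As a rough illustration of the suggestion above, the idea is to charge a bio
against the throttling limits only once, even when the remainder re-enters
generic_make_request() after blk_queue_split(). This is only a sketch;
BIO_ALREADY_THROTTLED is a hypothetical flag name standing in for whatever
marker the real blk-throttle code would reuse:

/* Sketch: skip bios that were already charged on an earlier pass. */
static bool bio_needs_throttling(struct bio *bio)
{
	if (bio_flagged(bio, BIO_ALREADY_THROTTLED))
		return false;			/* resubmitted after a split */

	bio_set_flag(bio, BIO_ALREADY_THROTTLED);
	return true;				/* charge it on this pass */
}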


Re: [PATCH v2 0/7] lib: string: add functions to case-convert strings

2016-07-05 Thread Markus Mayer
On 5 July 2016 at 15:14, Joe Perches  wrote:
> On Tue, 2016-07-05 at 13:47 -0700, Markus Mayer wrote:
>> This series introduces a family of generic string case conversion
>> functions. This kind of functionality is needed in several places in
>> the kernel. Right now, everybody seems to be implementing their own
>> copy of this functionality.
>>
>> Based on the discussion of the previous version of this series[1] and
>> the use cases found in the kernel, it does look like having several
>> flavours of case conversion functions is beneficial. The use cases fall
>> into three categories:
>> - copying a string and converting the case while specifying a
>>   maximum length to mimic strncpy()
>> - copying a string and converting the case without specifying a
>>   length to mimic strcpy()
>> - converting the case of a string in-place (i.e. modifying the
>>   string that was passed in)
>>
>> Consequently, I am proposing these new functions:
>> char *strncpytoupper(char *dst, const char *src, size_t len);
>> char *strncpytolower(char *dst, const char *src, size_t len);
>> char *strcpytoupper(char *dst, const char *src);
>> char *strcpytolower(char *dst, const char *src);
>> char *strtoupper(char *s);
>> char *strtolower(char *s);
>
> I think there isn't much value in anything other
> than strto.
>
> Using str[n]cpy followed by strto is
> pretty obvious and rarely used anyway.

First time around, folks were proposing the "copy" variants when I
submitted just strtolower() by itself[1]. They just asked for source
and destination parameters to strtolower(), but looking at the use
cases that wouldn't have worked so well. Hence it evolved into these 6
functions.

Here's a breakdown of how the functions are being used (patches 2-7),
see also [2]:

Patch 2: strncpytolower()
Patch 3: strtolower()
Patch 4: strncpytolower() and strtolower()
Patch 5: strtolower()
Patch 6: strcpytoupper()
Patch 7: strcpytoupper()

So it does look like the copy + change case variant is more frequently
used than just strto.

Regards,
-Markus

[1] https://lkml.org/lkml/2016/7/1/652
[2] https://lkml.org/lkml/2016/7/5/542
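
For context, a minimal sketch of what the proposed helpers could look like,
assuming tolower()/toupper() from linux/ctype.h and strncpy()-like padding
semantics; the actual patches may differ in corner cases such as
NUL-termination and return value:

#include <linux/ctype.h>
#include <linux/string.h>

char *strncpytolower(char *dst, const char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len && src[i]; i++)
		dst[i] = tolower(src[i]);
	while (i < len)
		dst[i++] = '\0';	/* pad with NULs, as strncpy() does */
	return dst;
}

char *strtolower(char *s)
{
	char *p;

	for (p = s; *p; p++)
		*p = tolower(*p);
	return s;
}

The toupper() variants and the unbounded strcpytolower()/strcpytoupper()
would follow the same pattern.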


Re: [PART2 RFC v2 00/10] iommu/AMD: Introduce IOMMU AVIC support

2016-07-05 Thread Suravee Suthikulpanit

Hi Paolo,

On 6/21/16 10:15, Paolo Bonzini wrote:



On 21/06/2016 15:50, Joerg Roedel wrote:

The code has a few style issues (things I'd have implemented differently), but
it looks functional. Anyway, before merging this, the last 3 patches need
to be acked by the KVM maintainers.

Paolo?


I think patches 9 and 10 should be squashed because the code after patch
9 is only partly functional.

Likewise, I think this:


+
+   if (AMD_IOMMU_GUEST_IR_VAPIC(amd_iommu_guest_ir))
+   amd_iommu_irq_ops.capability |= (1 << IRQ_POSTING_CAP);

from patch 1 should be moved to patch 6.


I think you might have meant patch 7 instead of 6 since that is when we 
enable vAPIC mode in the IOMMU.


Thanks,
Suravee


Re: [PATCH v3 2/2] sound: lpass-platform: Move dma channel allocation to pcmops

2016-07-05 Thread Kenneth Westfield
On Mon, Jul 04, 2016 at 10:21:49AM +0100, Srinivas Kandagatla wrote:
> Move the dma channel allocations to the pcmops open and close functions. The
> reason for this is that lpass_platform_pcm_free() accesses snd_soc_pcm_runtime
> via substream->private_data; however, by this time the runtimes are already
> freed as part of the soc_cleanup_card_resources() sequence.
> 
> This patch moves the channel allocations/deallocations to pcmops open()
> and close() respectively, where the code has valid snd_soc_pcm_runtime.
> 
> Without this patch, unloading the lpass sound card module would result in the
> crash below:

snip...

> Signed-off-by: Srinivas Kandagatla 
> ---

LGTM.

Acked-by: Kenneth Westfield 

-- 
Kenneth Westfield
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, 
a Linux Foundation Collaborative Project


[PATCHv3] arm64: Handle el1 synchronous instruction aborts cleanly

2016-07-05 Thread Laura Abbott
Executing from a non-executable area gives an ugly message:

lkdtm: Performing direct entry EXEC_RODATA
lkdtm: attempting ok execution at 084c0e08
lkdtm: attempting bad execution at 08880700
Bad mode in Synchronous Abort handler detected on CPU2, code 0x840e -- IABT 
(current EL)
CPU: 2 PID: 998 Comm: sh Not tainted 4.7.0-rc2+ #13
Hardware name: linux,dummy-virt (DT)
task: 800077e35780 ti: 80007797 task.ti: 80007797
PC is at lkdtm_rodata_do_nothing+0x0/0x8
LR is at execute_location+0x74/0x88

The 'IABT (current EL)' indicates the error but it's a bit cryptic
without knowledge of the ARM ARM. There is also no indication of the
specific address which triggered the fault. The increase in kernel
page permissions makes hitting this case more likely as well.
Handling the case in the vectors gives a much more familiar looking
error message:

lkdtm: Performing direct entry EXEC_RODATA
lkdtm: attempting ok execution at 084c0840
lkdtm: attempting bad execution at 08880680
Unable to handle kernel paging request at virtual address 08880680
pgd = 889b2000
[08880680] *pgd=489b4003, *pud=48904003, 
*pmd=
Internal error: Oops: 840e [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 997 Comm: sh Not tainted 4.7.0-rc1+ #24
Hardware name: linux,dummy-virt (DT)
task: 800077f9f080 ti: 88a1c000 task.ti: 88a1c000
PC is at lkdtm_rodata_do_nothing+0x0/0x8
LR is at execute_location+0x74/0x88

Signed-off-by: Laura Abbott 
---
v3: Fixup permission in do_page_fault to detect the kernel iabort, don't run
fixup handlers on kernel instruction aborts.

Dropped the Acked-by since the addition of checks is pretty significant.
---
 arch/arm64/kernel/entry.S | 18 ++
 arch/arm64/mm/fault.c | 11 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 12e8d2b..54e93d12 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -336,6 +336,8 @@ el1_sync:
lsr x24, x1, #ESR_ELx_EC_SHIFT  // exception class
cmp x24, #ESR_ELx_EC_DABT_CUR   // data abort in EL1
b.eq el1_da
+   cmp x24, #ESR_ELx_EC_IABT_CUR   // instruction abort in EL1
+   b.eq el1_ia
cmp x24, #ESR_ELx_EC_SYS64  // configurable trap
b.eq el1_undef
cmp x24, #ESR_ELx_EC_SP_ALIGN   // stack alignment exception
@@ -347,6 +349,22 @@ el1_sync:
cmp x24, #ESR_ELx_EC_BREAKPT_CUR// debug exception in EL1
b.ge el1_dbg
b   el1_inv
+el1_ia:
+   /*
+* Instruction abort handling
+*/
+   mrs x0, far_el1
+   enable_dbg
+   // re-enable interrupts if they were enabled in the aborted context
+   tbnz x23, #7, 1f // PSR_I_BIT
+   enable_irq
+1:
+   mov x2, sp  // struct pt_regs
+   bl  do_mem_abort
+
+   // disable interrupts before pulling preserved data off the stack
+   disable_irq
+   kernel_exit 1
 el1_da:
/*
 * Data abort handling
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 013e2cb..e25b0891 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -131,6 +131,11 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 }
 #endif
 
+static bool is_el1_instruction_abort(unsigned int esr)
+{
+   return ESR_ELx_EC(esr) == ESR_ELx_EC_IABT_CUR;
+}
+
 /*
  * The kernel tried to access some page that wasn't present.
  */
@@ -139,8 +144,9 @@ static void __do_kernel_fault(struct mm_struct *mm, 
unsigned long addr,
 {
/*
 * Are we prepared to handle this kernel fault?
+* We are almost certainly not prepared to handle instruction faults.
 */
-   if (fixup_exception(regs))
+   if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
return;
 
/*
@@ -247,7 +253,8 @@ static inline int permission_fault(unsigned int esr)
unsigned int ec   = (esr & ESR_ELx_EC_MASK) >> ESR_ELx_EC_SHIFT;
unsigned int fsc_type = esr & ESR_ELx_FSC_TYPE;
 
-   return (ec == ESR_ELx_EC_DABT_CUR && fsc_type == ESR_ELx_FSC_PERM);
+   return (ec == ESR_ELx_EC_DABT_CUR && fsc_type == ESR_ELx_FSC_PERM) ||
+  (ec == ESR_ELx_EC_IABT_CUR && fsc_type == ESR_ELx_FSC_PERM);
 }
 
 static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
-- 
2.7.4



[PATCHv3] Better kernel instruction abort handling

2016-07-05 Thread Laura Abbott
Hi,

This is v3 of the patch to make instruction aborts print a nicer, more standard
error message (i.e. no more "bad mode").

Mark Rutland pointed out in v2 that we need to audit do_mem_abort paths. Of the
functions that do_mem_abort can call, do_bad, do_translation_fault, and
do_alignment_fault all mostly reduce to calling do_bad_area which should call
__do_kernel_fault directly. This makes do_page_fault and __do_kernel_fault the
only cases to review.

Mark raised the problem of taking an instruction abort with a fixup handler.
Any fixup handler being run would not exist in the exception table so there
should be no risk of looping. Another instruction abort would just reduce to
the case of an instruction abort without a fixup handler. The fixup handlers
are expecting data aborts, not instruction aborts though so while they could
run successfully, it wouldn't be for the precise right reason. Practically
speaking, I don't think it matters but to be on the safe side, the fixup
handlers are not run in __do_kernel_fault if the abort is an instruction abort.
This should cover __do_kernel_fault.

do_page_fault gets a little bit more complicated. A fault on a kernel address
should just end up in __do_kernel_fault. Extending is_permission_fault to
cover instruction aborts should be sufficient, mostly because addr == regs->pc
and there should never be a userspace address in the exception table.

So I think this should cover all cases. The sample LKDTM test cases all work
now.

Thanks,
Laura

Laura Abbott (1):
  arm64: Handle el1 synchronous instruction aborts cleanly

 arch/arm64/kernel/entry.S | 18 ++
 arch/arm64/mm/fault.c | 11 +--
 2 files changed, 27 insertions(+), 2 deletions(-)

-- 
2.7.4



Re: [PATCH v2 0/7] lib: string: add functions to case-convert strings

2016-07-05 Thread Joe Perches
On Tue, 2016-07-05 at 13:47 -0700, Markus Mayer wrote:
> This series introduces a family of generic string case conversion
> functions. This kind of functionality is needed in several places in
> the kernel. Right now, everybody seems to be implementing their own
> copy of this functionality.
> 
> Based on the discussion of the previous version of this series[1] and
> the use cases found in the kernel, it does look like having several
> flavours of case conversion functions is beneficial. The use cases fall
> into three categories:
> - copying a string and converting the case while specifying a
>   maximum length to mimic strncpy()
> - copying a string and converting the case without specifying a
>   length to mimic strcpy()
> - converting the case of a string in-place (i.e. modifying the
>   string that was passed in)
> 
> Consequently, I am proposing these new functions:
> char *strncpytoupper(char *dst, const char *src, size_t len);
> char *strncpytolower(char *dst, const char *src, size_t len);
> char *strcpytoupper(char *dst, const char *src);
> char *strcpytolower(char *dst, const char *src);
> char *strtoupper(char *s);
> char *strtolower(char *s);

I think there isn't much value in anything other than strto.

Using str[n]cpy followed by strto is pretty obvious and rarely used anyway.



Re: [PATCH v4 0/3] Doc/memory-barriers: Add Korean translation

2016-07-05 Thread SeongJae Park
On Wed, Jul 6, 2016 at 6:36 AM, Paul E. McKenney
 wrote:
> On Mon, Jul 04, 2016 at 08:27:05AM +0900, SeongJae Park wrote:
>> This patchset adds a Korean translation of memory-barriers.txt and fixes a few
>> nitpicks found during the translation.  The translation started in Feb
>> 2016 and uses a private git tree[1] to manage the work.  Its commit history
>> says that it is following upstream changes as well.
>>
>> Change from v3:
>>  - Polish readability for overall text:
>>`1 file changed, 533 insertions(+), 505 deletions(-)`
>>
>>  - Add disclaimer of translation
>>
>> SeongJae Park (3):
>>   memory-barriers.txt: maintain consistent blank line
>>   memory-barriers.txt: fix wrong section reference
>>   Doc/memory-barriers: Add Korean translation
>
> I applied the first two, thank you very much!

Thank you, Paul.

>
> I must defer to Ingo Molnar and Jonathan Corbet on the translation.

To defend this patch again, I think those applied patches mean that this
translation is not only following upstream changes, but also makes enhancements
to the original document.  Such enhancements were just nitpick cleanups, though.


Thanks,
SeongJae Park

>
> Thanx, Paul
>
>>  Documentation/ko_KR/memory-barriers.txt | 3123 
>> +++
>>  Documentation/memory-barriers.txt   |3 +-
>>  2 files changed, 3125 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/ko_KR/memory-barriers.txt
>>
>> --
>> 1.9.1
>>
>


[PATCH RESEND net-next] netvsc: Use the new in-place consumption APIs in the rx path

2016-07-05 Thread kys
From: K. Y. Srinivasan 

Use the new APIs for eliminating a copy on the receive path. These new APIs also
help in minimizing the number of memory barriers we end up issuing (in the
ringbuffer code) since we can better control when we want to expose the ring
state to the host.

The patch is being resent to address earlier email issues.

Signed-off-by: K. Y. Srinivasan 
---
 drivers/net/hyperv/netvsc.c |   88 +--
 1 files changed, 59 insertions(+), 29 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 719cb35..8cd4c19 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -1141,6 +1141,39 @@ static inline void netvsc_receive_inband(struct 
hv_device *hdev,
}
 }
 
+static void netvsc_process_raw_pkt(struct hv_device *device,
+  struct vmbus_channel *channel,
+  struct netvsc_device *net_device,
+  struct net_device *ndev,
+  u64 request_id,
+  struct vmpacket_descriptor *desc)
+{
+   struct nvsp_message *nvmsg;
+
+   nvmsg = (struct nvsp_message *)((unsigned long)
+   desc + (desc->offset8 << 3));
+
+   switch (desc->type) {
+   case VM_PKT_COMP:
+   netvsc_send_completion(net_device, channel, device, desc);
+   break;
+
+   case VM_PKT_DATA_USING_XFER_PAGES:
+   netvsc_receive(net_device, channel, device, desc);
+   break;
+
+   case VM_PKT_DATA_INBAND:
+   netvsc_receive_inband(device, net_device, nvmsg);
+   break;
+
+   default:
+   netdev_err(ndev, "unhandled packet type %d, tid %llx\n",
+  desc->type, request_id);
+   break;
+   }
+}
+
+
 void netvsc_channel_cb(void *context)
 {
int ret;
@@ -1153,7 +1186,7 @@ void netvsc_channel_cb(void *context)
unsigned char *buffer;
int bufferlen = NETVSC_PACKET_SIZE;
struct net_device *ndev;
-   struct nvsp_message *nvmsg;
+   bool need_to_commit = false;
 
if (channel->primary_channel != NULL)
device = channel->primary_channel->device_obj;
@@ -1167,39 +1200,36 @@ void netvsc_channel_cb(void *context)
buffer = get_per_channel_state(channel);
 
do {
+   desc = get_next_pkt_raw(channel);
+   if (desc != NULL) {
+   netvsc_process_raw_pkt(device,
+  channel,
+  net_device,
+  ndev,
+  desc->trans_id,
+  desc);
+
+   put_pkt_raw(channel, desc);
+   need_to_commit = true;
+   continue;
+   }
+   if (need_to_commit) {
+   need_to_commit = false;
+   commit_rd_index(channel);
+   }
+
ret = vmbus_recvpacket_raw(channel, buffer, bufferlen,
   &bytes_recvd, &request_id);
if (ret == 0) {
if (bytes_recvd > 0) {
desc = (struct vmpacket_descriptor *)buffer;
-   nvmsg = (struct nvsp_message *)((unsigned long)
-desc + (desc->offset8 << 3));
-   switch (desc->type) {
-   case VM_PKT_COMP:
-   netvsc_send_completion(net_device,
-   channel,
-   device, desc);
-   break;
-
-   case VM_PKT_DATA_USING_XFER_PAGES:
-   netvsc_receive(net_device, channel,
-  device, desc);
-   break;
-
-   case VM_PKT_DATA_INBAND:
-   netvsc_receive_inband(device,
- net_device,
- nvmsg);
-   break;
-
-   default:
-   netdev_err(ndev,
-  "unhandled packet type %d, "
-  "tid %llx len %d\n",
-  desc->type, request_id,
-  bytes_recvd);
- 

Re: [PATCH -v3 2/2] printk: Add kernel parameter to control writes to /dev/kmsg

2016-07-05 Thread Steven Rostedt
On Wed, 6 Jul 2016 00:02:21 +0200
Borislav Petkov  wrote:

> > If you silently fail here, then we lose all logging because systemd
> > thinks this is working when it is not. That's not what I want.  
> 
> Hmm, ok, you're making sense to me.
> 
> Do you want an error message too or only an -ENODEV returned or
> somesuch?
> 

My patch returned -EPERM and it works on my box. That is, systemd finds
something else to log to.

-- Steve
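
Concretely, the behaviour described above amounts to something like this in
devkmsg_write(), instead of the silent "return len" in the hunk quoted later
in the thread; whether the error should be -EPERM or -ENODEV is the open
question (a sketch, not the actual patch):

	/* Refuse, rather than silently ignore, writes when user logging is
	 * disabled, so a logger such as systemd notices and falls back to
	 * another transport. */
	if (devkmsg_log & DEVKMSG_LOG_MASK_OFF)
		return -EPERM;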


Re: [PATCH v2 01/11] clk: tegra: Switch to using critical clks

2016-07-05 Thread Rhyland Klein
On 6/22/2016 8:16 AM, Thierry Reding wrote:
> * PGP Signed by an unknown key
> 
> On Fri, May 27, 2016 at 04:38:04PM -0400, Rhyland Klein wrote:
>> Mark some of the required-to-be-enabled clks as critical clks. These
>> need to be kept on through the disabling of unused clks during init.
>> They may not get any reference before or after init, but are required
>> to be on, therefore let the core enable them.
>>
>> Signed-off-by: Rhyland Klein 
>> ---
>> v2:
>>  - Remove mention of HAND_OFF clks as they are not supported yet.
>>
>>  drivers/clk/tegra/clk-tegra-periph.c | 21 ++---
>>  drivers/clk/tegra/clk-tegra-super-gen4.c | 12 +++-
>>  2 files changed, 21 insertions(+), 12 deletions(-)
> 
> I have some difficulty to follow why some of these clocks are critical.
> It might help if the commit message mentioned why each of these need to
> remain enabled forever, even if never used.

Sure, I can add what I found/know for each.

> 
> Also, it's fairly unlikely that pll_p for example would ever get
> disabled because a bunch of others are derived from it. I'm also not
> quite convinced yet that it really is critical. What does it drive which
> isn't claimed by any drivers?
> 

I think Peter already responded to this comment correctly, citing
mselect and sclk as necessary CRITICAL clk children of pll_p.

> A few more comments below.
> 
>> diff --git a/drivers/clk/tegra/clk-tegra-periph.c 
>> b/drivers/clk/tegra/clk-tegra-periph.c
>> index 29d04c663abf..9365770bcaa5 100644
>> --- a/drivers/clk/tegra/clk-tegra-periph.c
>> +++ b/drivers/clk/tegra/clk-tegra-periph.c
>> @@ -162,6 +162,13 @@
>>  _clk_num, _gate_flags, _clk_id, _parents##_idx, 0,\
>>  NULL)
>>  
>> +#define MUX8_FLAGS(_name, _parents, _offset,\
>> + _clk_num, _gate_flags, _clk_id, _flags)\
>> +TEGRA_INIT_DATA_TABLE(_name, NULL, NULL, _parents, _offset,\
>> +29, MASK(8), 0, 0, 8, 1, TEGRA_DIVIDER_ROUND_UP,\
>> +_clk_num, _gate_flags, _clk_id, _parents##_idx, _flags,\
>> +NULL)
>> +
>>  #define MUX8_NOGATE_LOCK(_name, _parents, _offset, _clk_id, _lock)  \
>>  TEGRA_INIT_DATA_TABLE(_name, NULL, NULL, _parents, _offset, \
>>29, MASK(3), 0, 0, 8, 1, TEGRA_DIVIDER_ROUND_UP,\
>> @@ -651,7 +658,7 @@ static struct tegra_periph_init_data periph_clks[] = {
>>  INT8("3d", mux_pllm_pllc2_c_c3_pllp_plla, CLK_SOURCE_3D, 24, 0, 
>> tegra_clk_gr3d_8),
>>  INT8("vic03", mux_pllm_pllc_pllp_plla_pllc2_c3_clkm, CLK_SOURCE_VIC03, 
>> 178, 0, tegra_clk_vic03),
>>  INT8("vic03", mux_pllc_pllp_plla1_pllc2_c3_clkm, CLK_SOURCE_VIC03, 178, 
>> 0, tegra_clk_vic03_8),
>> -INT_FLAGS("mselect", mux_pllp_clkm, CLK_SOURCE_MSELECT, 99, 0, 
>> tegra_clk_mselect, CLK_IGNORE_UNUSED),
>> +INT_FLAGS("mselect", mux_pllp_clkm, CLK_SOURCE_MSELECT, 99, 0, 
>> tegra_clk_mselect, CLK_IGNORE_UNUSED | CLK_IS_CRITICAL),
> 
> Doesn't CLK_IS_CRITICAL always trump CLK_IGNORE_UNUSED anyway? Why do we
> need to keep both?

Yep, makes sense to remove CLK_IGNORE_UNUSED.

> 
>>  MUX("i2s0", mux_pllaout0_audio0_2x_pllp_clkm, CLK_SOURCE_I2S0, 30, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_i2s0),
>>  MUX("i2s1", mux_pllaout0_audio1_2x_pllp_clkm, CLK_SOURCE_I2S1, 11, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_i2s1),
>>  MUX("i2s2", mux_pllaout0_audio2_2x_pllp_clkm, CLK_SOURCE_I2S2, 18, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_i2s2),
>> @@ -691,8 +698,8 @@ static struct tegra_periph_init_data periph_clks[] = {
>>  MUX("dsiblp", mux_pllp_pllc_clkm, CLK_SOURCE_DSIBLP, 148, 0, 
>> tegra_clk_dsiblp),
>>  MUX("tsensor", mux_pllp_pllc_clkm_clk32, CLK_SOURCE_TSENSOR, 100, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_tsensor),
>>  MUX("actmon", mux_pllp_pllc_clk32_clkm, CLK_SOURCE_ACTMON, 119, 0, 
>> tegra_clk_actmon),
>> -MUX("dfll_ref", mux_pllp_clkm, CLK_SOURCE_DFLL_REF, 155, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_dfll_ref),
>> -MUX("dfll_soc", mux_pllp_clkm, CLK_SOURCE_DFLL_SOC, 155, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_dfll_soc),
>> +MUX_FLAGS("dfll_ref", mux_pllp_clkm, CLK_SOURCE_DFLL_REF, 155, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_dfll_ref, CLK_IS_CRITICAL),
>> +MUX_FLAGS("dfll_soc", mux_pllp_clkm, CLK_SOURCE_DFLL_SOC, 155, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_dfll_soc, CLK_IS_CRITICAL),
> 
> Aren't these used by the CPU frequency scaling code? Do they have to
> remain on if we don't enable CPU frequency scaling?

They are. They seem to be only required to remain on through boot if
cpufreq is enabled. I am only seeing problems with them on Tegra124.
Right now there is no way to only mark them CRITICAL for T124 (or more
accurately, only enable them if cpufreq is enabled). That's why I had
them marked CRITICAL.

> 
>>  MUX("i2cslow", mux_pllp_pllc_clk32_clkm, CLK_SOURCE_I2CSLOW, 81, 
>> TEGRA_PERIPH_ON_APB, tegra_clk_i2cslow),
>>  MUX("sbc1", mux_pllp_pllc_pllm_clkm, CLK_SOURCE_SBC1, 41, 
>> TEGRA_PER

Re: [PATCH 1/5] security, overlayfs: provide copy up security hook for unioned files

2016-07-05 Thread Paul Moore
On Tue, Jul 5, 2016 at 5:52 PM, Vivek Goyal  wrote:
> On Tue, Jul 05, 2016 at 05:35:22PM -0400, Paul Moore wrote:
>> On Tue, Jul 5, 2016 at 11:50 AM, Vivek Goyal  wrote:
>> > Provide a security hook to label new file correctly when a file is copied
>> > up from lower layer to upper layer of a overlay/union mount.
>> >
>> > This hook can prepare and switch to a new set of creds which are suitable
>> > for new file creation during copy up. Caller should revert to old creds
>> > after file creation.
>> >
>> > In SELinux, newly copied up file gets same label as lower file for
>> > non-context mounts. But it gets label specified in mount option context=
>> > for context mounts.
>> >
>> > Signed-off-by: Vivek Goyal 
>> > ---
>> >  fs/overlayfs/copy_up.c|  8 
>> >  include/linux/lsm_hooks.h | 13 +
>> >  include/linux/security.h  |  6 ++
>> >  security/security.c   |  8 
>> >  security/selinux/hooks.c  | 27 +++
>> >  5 files changed, 62 insertions(+)
>>
>> ..
>>
>> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
>> > index a86d537..1b1a1e5 100644
>> > --- a/security/selinux/hooks.c
>> > +++ b/security/selinux/hooks.c
>> > @@ -3270,6 +3270,32 @@ static void selinux_inode_getsecid(struct inode 
>> > *inode, u32 *secid)
>> > *secid = isec->sid;
>> >  }
>> >
>> > +static int selinux_inode_copy_up(struct dentry *src, const struct cred 
>> > **old)
>> > +{
>> > +   u32 sid;
>> > +   struct cred *new_creds;
>> > +   struct task_security_struct *tsec;
>> > +
>> > +   new_creds = prepare_creds();
>> > +   if (!new_creds)
>> > +   return -ENOMEM;
>> > +
>> > +   /* Get label from overlay inode and set it in create_sid */
>> > +   selinux_inode_getsecid(d_inode(src), &sid);
>> > +   tsec = new_creds->security;
>> > +   tsec->create_sid = sid;
>> > +   *old = override_creds(new_creds);
>> > +
>> > +   /*
>> > +* At this point of time we have 2 refs on new_creds. One by
>> > +* prepare_creds and other by override_creds. Drop one reference
>> > +* so that as soon as caller calls revert_creds(old), this cred
>> > +* will be freed.
>> > +*/
>> > +   put_cred(new_creds);
>> > +   return 0;
>> > +}

...

>> Beyond that, I'm not overly excited about reusing create_sid for this
>> purpose.  I understand why you did it, but what if the process had
>> explicitly set create_sid?
>
> When a file is copied up, either we retain the label of lower file or
> set the new label from context=. If any create_sid is set in task, that's
> ignored.
>
> And as we are setting create_sid in a new set of credentials, task will
> get to retain its create_sid for future operations.
>
> As the task does not know we are creating a new file, the task's create_sid
> should not matter at all. The task does not know whether the file is on the
> upper layer or is being copied up. For the task this file already exists, so
> it should not expect the create_sid label to be applied.
>
> Am I missing something?

I forgot that you are manufacturing a new set of credentials; I must
have lost track of that when I was walking through some of the VFS
code, my mistake.  I'm still rather uneasy about this, but at least
you aren't overwriting a previously stored create_sid value.

-- 
paul moore
security @ redhat
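
For reference, the calling convention that the hook and its refcount comment
describe looks roughly like this on the overlayfs side. This is an
illustrative sketch only; the function name and error handling are made up,
and the security_inode_copy_up() signature is inferred from the SELinux hook
quoted above:

static int example_copy_up_one(struct dentry *lower_dentry)
{
	const struct cred *old_creds;
	int err;

	/* switch to override creds whose create_sid carries the lower
	 * file's label (or the context= label on context mounts) */
	err = security_inode_copy_up(lower_dentry, &old_creds);
	if (err)
		return err;

	/* ... create the upper copy here; with SELinux enabled it is
	 * labelled from the create_sid the hook set up ... */

	/* restores the old creds and drops the last reference on the
	 * override creds, per the comment in the hook above */
	revert_creds(old_creds);
	return 0;
}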


Re: [PATCH -v3 2/2] printk: Add kernel parameter to control writes to /dev/kmsg

2016-07-05 Thread Borislav Petkov
On Tue, Jul 05, 2016 at 05:47:49PM -0400, Steven Rostedt wrote:
> I wonder if we should return some sort of error message here? ENODEV?

What for?

We're silently ignoring it. If we start returning an error here, we
might break doofus if it checks the write retval.

And it's not like we care - we're ignoring all writes whatsoever.

> If you silently fail here, then we lose all logging because systemd
> thinks this is working when it is not. That's not what I want.

Hmm, ok, you're making sense to me.

Do you want an error message too or only an -ENODEV returned or
somesuch?

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.


Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-05 Thread Florian Fainelli
On 07/05/2016 02:51 PM, Mason wrote:
>>> Therefore, loss of context cannot possibly explain the
>>> warning I am seeing.
>>
>> No, but if you go all the way down to trying to suspend and the last
>> step is the firmware failing, anything you have suspended needs to be
>> unwinded, for your ethernet driver that means that you went through a
>> successful suspend then resume cycle even if it failed down later when
>> the platform attempted to suspend.
> 
> So it is the driver's responsibility to "shut down" on resume?

It is the driver responsibility to know how to suspend and resume a
device it manages, and it does that by implementing appropriate
suspend/resume callbacks.

> (I had the vague impression that the suspend framework would
> "disable" the device through the appropriate callback.)

The suspend framework knows which drivers implement suspend/resume and
calls them appropriately (based on parenting/bus hierarchy), but it
won't automagically do anything because there is no such thing as magic
when it comes to suspending hardware; this needs to be a controlled
sequence.
-- 
Florian


Re: [PATCH 2/5] security,overlayfs: Provide security hook for copy up of xattrs for overlay file

2016-07-05 Thread Vivek Goyal
On Tue, Jul 05, 2016 at 05:45:25PM -0400, Paul Moore wrote:
> On Tue, Jul 5, 2016 at 11:50 AM, Vivek Goyal  wrote:
> > Provide a security hook which is called when xattrs of a file are being
> > copied up. This hook is called once for each xattr and one can either
> > accept or reject xattr. If 0 is returned, xattr will be copied up, if 1
> > is returned, xattr will not be copied up and if negative error code
> > is returned, copy up will be aborted.
> >
> > In SELinux, label of lower file is not copied up. File already has been
> > set with right label at the time of creation and we don't want to overwrite
> > that label.
> >
> > Signed-off-by: David Howells 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/copy_up.c|  8 
> >  include/linux/lsm_hooks.h | 13 +
> >  include/linux/security.h  | 10 ++
> >  security/security.c   |  9 +
> >  security/selinux/hooks.c  | 14 ++
> >  5 files changed, 54 insertions(+)
> 
> To continue the earlier feedback about mixing generic LSM hook
> definitions with the SELinux specific hook implementations - I prefer
> to see patchsets organized in the following manner:
> 
> [PATCH 1/X] - add new LSM hooks and the calls from the relevant
> subsystems, e.g.
> {security/security.c,include/linux/security.h,fs/overlayfs/*}
> [PATCH 2/X] - LSM specific hook implementation, e.g. security/selinux/*
> [PATCH n/X] - LSM specific hook implementation, e.g. security/smack/*

Ok, will do this way.

Vivek
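
Under the return convention described in this patch (0 = copy the xattr up,
1 = skip it, negative = abort the copy up), one plausible shape for the
SELinux side is the following sketch; the function name and single-argument
signature are guesses, not necessarily what the series implements:

static int selinux_inode_copy_up_xattr(const char *name)
{
	/* The upper file was already given the right label when it was
	 * created, so don't let the lower file's label overwrite it. */
	if (!strcmp(name, XATTR_NAME_SELINUX))
		return 1;	/* do not copy this xattr */

	return 0;		/* copy everything else up unchanged */
}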


Re: [PATCH 1/5] security, overlayfs: provide copy up security hook for unioned files

2016-07-05 Thread Vivek Goyal
On Tue, Jul 05, 2016 at 05:35:22PM -0400, Paul Moore wrote:
> On Tue, Jul 5, 2016 at 11:50 AM, Vivek Goyal  wrote:
> > Provide a security hook to label new file correctly when a file is copied
> > up from lower layer to upper layer of a overlay/union mount.
> >
> > This hook can prepare and switch to a new set of creds which are suitable
> > for new file creation during copy up. Caller should revert to old creds
> > after file creation.
> >
> > In SELinux, newly copied up file gets same label as lower file for
> > non-context mounts. But it gets label specified in mount option context=
> > for context mounts.
> >
> > Signed-off-by: Vivek Goyal 
> > ---
> >  fs/overlayfs/copy_up.c|  8 
> >  include/linux/lsm_hooks.h | 13 +
> >  include/linux/security.h  |  6 ++
> >  security/security.c   |  8 
> >  security/selinux/hooks.c  | 27 +++
> >  5 files changed, 62 insertions(+)
> 
> ..
> 
> > diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> > index a86d537..1b1a1e5 100644
> > --- a/security/selinux/hooks.c
> > +++ b/security/selinux/hooks.c
> > @@ -3270,6 +3270,32 @@ static void selinux_inode_getsecid(struct inode 
> > *inode, u32 *secid)
> > *secid = isec->sid;
> >  }
> >
> > +static int selinux_inode_copy_up(struct dentry *src, const struct cred 
> > **old)
> > +{
> > +   u32 sid;
> > +   struct cred *new_creds;
> > +   struct task_security_struct *tsec;
> > +
> > +   new_creds = prepare_creds();
> > +   if (!new_creds)
> > +   return -ENOMEM;
> > +
> > +   /* Get label from overlay inode and set it in create_sid */
> > +   selinux_inode_getsecid(d_inode(src), &sid);
> > +   tsec = new_creds->security;
> > +   tsec->create_sid = sid;
> > +   *old = override_creds(new_creds);
> > +
> > +   /*
> > +* At this point of time we have 2 refs on new_creds. One by
> > +* prepare_creds and other by override_creds. Drop one reference
> > +* so that as soon as caller calls revert_creds(old), this cred
> > +* will be freed.
> > +*/
> > +   put_cred(new_creds);
> > +   return 0;
> > +}
> 
> One quick point of clarification: in addition to the SELinux specific
> comments in lsm_hooks.h, I think it would be a good idea if the
> SELinux hook implementation, e.g. security/selinux/hooks.c, was in its
> own patch and not part of the hook definition.

Ok, maybe I will break every hook into three parts:

- lsm hook declaration.
- selinux implementation
- overlay implementation of call to hook.

> 
> Beyond that, I'm not overly excited about reusing create_sid for this
> purpose.  I understand why you did it, but what if the process had
> explicitly set create_sid?

When a file is copied up, either we retain the label of lower file or
set the new label from context=. If any create_sid is set in task, that's
ignored.

And as we are setting create_sid in a new set of credentials, task will
get to retain its create_sid for future operations.

As the task does not know we are creating a new file, the task's create_sid
should not matter at all. The task does not know whether the file is on the
upper layer or is being copied up. For the task this file already exists, so
it should not expect the create_sid label to be applied.

Am I missing something?

Vivek

> I think I would prefer the creation of a
> new field (create_up_sid?) to track this new label and then add an
> additional check to selinux_inode_init_security() to prefer the
> existing create_sid to this new field when both are set.
> 
> Sound reasonable?
> 
> -- 
> paul moore
> security @ redhat


Re: WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x1c4/0x1dc

2016-07-05 Thread Mason
On 05/07/2016 23:22, Florian Fainelli wrote:
> On 07/05/2016 01:26 PM, Mason wrote:
>> On 05/07/2016 18:20, Florian Fainelli wrote:
>>> On 07/05/2016 08:56 AM, Mason wrote:
 On 05/07/2016 17:28, Florian Fainelli wrote:

> nb8800.c does not currently show suspend/resume hooks implemented, are
> you positive that when you suspend, you properly tear down all HW, stop
> transmit queues, etc. and do the opposite upon resumption?

 I am currently testing the error path for my suspend routine.
 Firmware is, in fact, denying the suspend request, and immediately
 returns control to Linux, without having powered anything down.

 I expected not having to save any context in that situation.
 Am I mistaken?
>>>
>>> It depends what power state you are going to and resuming from, and how
>>> much of this is platform dependent, on the platforms I work with S2
>>> preserves register states for our On/Off domain, while S3 only keeps an
>>> always-on power island and shuts off the On/Off domain, you therefore
>>> need to have your drivers in the On/Off domain suspend any activity and
>>> preserve important register states, or re-initialize them from scratch
>>> whichever is the most convenient.
>>
>> Thanks for bringing these details to my attention, they will
>> definitely prove useful when I test an actual suspend/resume
>> sequence. However, I must stress that the platform did NOT
>> power down in my test case, because the firmware currently
>> denies all suspend requests.
>>
>> Therefore, loss of context cannot possibly explain the
>> warning I am seeing.
> 
> No, but if you go all the way down to trying to suspend and the last
> step is the firmware failing, anything you have suspended needs to be
> unwinded, for your ethernet driver that means that you went through a
> successful suspend then resume cycle even if it failed down later when
> the platform attempted to suspend.

So it is the driver's responsibility to "shut down" on resume?
(I had the vague impression that the suspend framework would
"disable" the device through the appropriate callback.)

>>> See drivers/net/ethernet/broadcom/genet/bcmgenet.c which is a driver
>>> that takes care of that for instance, look for bcmgenet_{suspend,resume}
>>
>> Thanks. I will look into it.
>>
>> If I understand correctly, something is missing in the
>> network interface code? (My system is using an NFS root
>> filesystem, so network is an important subsystem.)
> 
> The typical things are detaching the network device and stopping
> transmit queues, but without knowing what changes you have done to
> nb8800.c, hard to tell what else is needed.

I'm using the driver unaltered. So I guess I need to figure out
the exact steps required for suspending a network device.
(I'll look at bcmgenet.c tomorrow.)
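
For the record, the "typical things" mentioned above usually take a shape like
the following in an ethernet driver's suspend callback. This is an
illustrative sketch, not the nb8800 or bcmgenet code; foo_eth_suspend is a
made-up name:

static int foo_eth_suspend(struct device *dev)
{
	struct net_device *ndev = dev_get_drvdata(dev);

	if (!netif_running(ndev))
		return 0;	/* nothing to quiesce if the interface is down */

	netif_device_detach(ndev);	/* detach the network device */
	netif_stop_queue(ndev);		/* stop the transmit queues */

	/* ... then quiesce DMA, mask interrupts and gate clocks as the
	 * hardware requires; resume() reverses this and finishes with
	 * netif_device_attach() ... */

	return 0;
}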

> Is your system clocksource also correctly saved/restored, or if you go
> through a firmware in-between could it be changing the counter values
> and make Linux think that more time as elapsed than it really happened?

 Thanks for pointing this out, I was not aware I was supposed to save
 and restore the tick counter on suspend/resume. (This is not an issue
 in this specific situation, as the platform is NOT suspended.)
>>>
>>> You don't have to save and restore the clocksource counter, although if
>>> you want proper time accounting to be done across suspend states, you
>>> would want to use a clocksource which is persistent across these suspend
>>> states.
>>
>> The clocksource is a 27 MHz 32-bit tick counter. In other words,
>> the counter wraps around every 159 seconds. If Linux suspends
>> for several hours, how can it determine how much time went by?
> 
> Well, that's unfortunate, then you are pretty much either doomed to
> accepting to lose time in between and rely on e.g: NTP to resync your
> time upon resumption, or, if you had smarter hardware you could have a
> prescaler or something that makes this counter wrap far ahead (like
> years or days after).

Maybe the hardware devs thought of that problem, because they
"widened" the counter to 64 bits on newer platforms.

Regards.



Re: [RFC PATCH] tracing: add sched_prio_update

2016-07-05 Thread Mathieu Desnoyers
- On Jul 5, 2016, at 11:19 AM, rostedt rost...@goodmis.org wrote:

> On Mon,  4 Jul 2016 15:46:04 -0400
> Julien Desfossez  wrote:
> 
> 
>> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
>> index 9b90c57..fcb0f29 100644
>> --- a/include/trace/events/sched.h
>> +++ b/include/trace/events/sched.h
>> @@ -8,6 +8,34 @@
>>  #include 
>>  #include 
>>  
>> +#define SCHEDULING_POLICY   \
>> +EM( SCHED_NORMAL,   "SCHED_NORMAL") \
>> +EM( SCHED_FIFO, "SCHED_FIFO")   \
>> +EM( SCHED_RR,   "SCHED_RR") \
>> +EM( SCHED_BATCH,"SCHED_BATCH")  \
>> +EM( SCHED_IDLE, "SCHED_IDLE")   \
>> +EMe(SCHED_DEADLINE, "SCHED_DEADLINE")
>> +
>> +/*
>> + * First define the enums in the above macros to be exported to userspace
>> + * via TRACE_DEFINE_ENUM().
>> + */
>> +#undef EM
>> +#undef EMe
>> +#define EM(a, b)TRACE_DEFINE_ENUM(a);
>> +#define EMe(a, b)   TRACE_DEFINE_ENUM(a);
>> +
>> +SCHEDULING_POLICY
>> +
>> +/*
>> + * Now redefine the EM() and EMe() macros to map the enums to the strings
>> + * that will be printed in the output.
>> + */
>> +#undef EM
>> +#undef EMe
>> +#define EM(a, b){a, b},
>> +#define EMe(a, b)   {a, b}
>> +
>>  /*
>>   * Tracepoint for calling kthread_stop, performed to end a kthread:
>>   */
>> @@ -562,6 +590,46 @@ TRACE_EVENT(sched_wake_idle_without_ipi,
>>  
>>  TP_printk("cpu=%d", __entry->cpu)
>>  );
>> +
>> +/*
>> + * Tracepoint for showing scheduling priority changes.
>> + */
>> +TRACE_EVENT(sched_prio_update,
> 
> I'm fine with the addition of this tracepoint. You'll have to get by
> Peter Zijlstra for it.

Great!

> 
>> +
>> +TP_PROTO(struct task_struct *tsk),
>> +
>> +TP_ARGS(tsk),
>> +
>> +TP_STRUCT__entry(
>> +__array( char,  comm,   TASK_COMM_LEN   )
> 
> I could imagine this being a high frequency tracepoint, especially with
> a lot of boosting going on. Can we nuke the comm recording and let the
> userspace tools just hook to the sched_switch tracepoint for that?

We can surely do that.

Just to clarify: currently this tracepoint is *not* hooked on PI boosting,
as described in the changelog. This tracepoint is about the prio attributes
set by user-space. The PI boosting temporarily changes the task struct prio
without updating the associated policy, which seems rather
implementation-specific and odd to expose.

Thoughts ?

Thanks,

Mathieu


> 
> -- Steve
> 
> 
>> +__field( pid_t, pid )
>> +__field( unsigned int,  policy  )
>> +__field( int,   nice)
>> +__field( unsigned int,  rt_priority )
>> +__field( u64,   dl_runtime  )
>> +__field( u64,   dl_deadline )
>> +__field( u64,   dl_period   )
>> +),
>> +
>> +TP_fast_assign(
>> +memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
>> +__entry->pid= tsk->pid;
>> +__entry->policy = tsk->policy;
>> +__entry->nice   = task_nice(tsk);
>> +__entry->rt_priority= tsk->rt_priority;
>> +__entry->dl_runtime = tsk->dl.dl_runtime;
>> +__entry->dl_deadline= tsk->dl.dl_deadline;
>> +__entry->dl_period  = tsk->dl.dl_period;
>> +),
>> +
>> +TP_printk("comm=%s pid=%d, policy=%s, nice=%d, rt_priority=%u, "
>> +"dl_runtime=%Lu, dl_deadline=%Lu, dl_period=%Lu",
>> +__entry->comm, __entry->pid,
>> +__print_symbolic(__entry->policy, SCHEDULING_POLICY),
>> +__entry->nice, __entry->rt_priority,
>> +__entry->dl_runtime, __entry->dl_deadline,
>> +__entry->dl_period)
>> +);
>>  #endif /* _TRACE_SCHED_H */
>>  
>>  /* This part must be outside protection */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 7926993..ac4294a 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1773,6 +1773,7 @@ long _do_fork(unsigned long clone_flags,
>>  struct pid *pid;
>>  
>>  trace_sched_process_fork(current, p);
>> +trace_sched_prio_update(p);
>>  
>>  pid = get_task_pid(p, PIDTYPE_PID);
>>  nr = pid_vnr(pid);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index ce83e39..c729425 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3708,6 +3708,7 @@ void set_user_nice(struct task_struct *p, long nice)
>>  resched_curr(rq);
>>  }
>>  out_unlock:
>> +trace_sched_prio_update(p);
>>  task_rq_unlock(rq, p, &rf);
>>  }
>>  EXPORT_SYMBOL(set_user_nice);
>> @@ -3912,6 +3913,8 @@ static void __setscheduler(struct rq *rq, struct
>> task_struct *p,
>>  p->sched_class = &rt_sched_class;
>>  else
>>  p->sched_

Re: [PATCH -v3 2/2] printk: Add kernel parameter to control writes to /dev/kmsg

2016-07-05 Thread Steven Rostedt
On Mon,  4 Jul 2016 16:24:52 +0200
Borislav Petkov  wrote:

> @@ -614,6 +663,7 @@ struct devkmsg_user {
>   u64 seq;
>   u32 idx;
>   enum log_flags prev;
> + struct ratelimit_state rs;
>   struct mutex lock;
>   char buf[CONSOLE_EXT_LOG_MAX];
>  };
> @@ -623,11 +673,24 @@ static ssize_t devkmsg_write(struct kiocb *iocb, struct 
> iov_iter *from)
>   char *buf, *line;
>   int level = default_message_loglevel;
>   int facility = 1;   /* LOG_USER */
> + struct file *file = iocb->ki_filp;
> + struct devkmsg_user *user = file->private_data;
>   size_t len = iov_iter_count(from);
>   ssize_t ret = len;
>  
> - if (len > LOG_LINE_MAX)
> + if (!user || len > LOG_LINE_MAX)
>   return -EINVAL;
> +
> + /* Ignore when user logging is disabled. */
> + if (devkmsg_log & DEVKMSG_LOG_MASK_OFF)
> + return len;

I wonder if we should return some sort of error message here? ENODEV?

> +
> + /* Ratelimit when not explicitly enabled or when we're not booting. */
> + if ((system_state != SYSTEM_BOOTING) && !(devkmsg_log & 
> DEVKMSG_LOG_MASK_ON)) {
> + if (!___ratelimit(&user->rs, current->comm))
> + return ret;
> + }
> +
>   buf = kmalloc(len+1, GFP_KERNEL);
>   if (buf == NULL)
>   return -ENOMEM;
> @@ -801,18 +864,20 @@ static int devkmsg_open(struct inode *inode, struct 
> file *file)
>   int err;
>  
>   /* write-only does not need any file context */
> - if ((file->f_flags & O_ACCMODE) == O_WRONLY)
> - return 0;
> -
> - err = check_syslog_permissions(SYSLOG_ACTION_READ_ALL,
> -SYSLOG_FROM_READER);
> - if (err)
> - return err;
> + if ((file->f_flags & O_ACCMODE) != O_WRONLY) {
> + err = check_syslog_permissions(SYSLOG_ACTION_READ_ALL,
> +SYSLOG_FROM_READER);
> + if (err)
> + return err;
> + }

Hmm, there's no error message when it is disabled? I'm not sure that is
what we want. I specifically had the return be an error on open if it
was disabled, because (surprisingly) systemd does the right thing and
uses another utility for syslogging.

If you silently fail here, then we lose all logging because systemd
thinks this is working when it is not. That's not what I want.

-- Steve

>  
>   user = kmalloc(sizeof(struct devkmsg_user), GFP_KERNEL);
>   if (!user)
>   return -ENOMEM;
>  
> + ratelimit_default_init(&user->rs);
> + ratelimit_set_flags(&user->rs, RATELIMIT_MSG_ON_RELEASE);
> +
>   mutex_init(&user->lock);
>  
>   raw_spin_lock_irq(&logbuf_lock);
> @@ -831,6 +896,8 @@ static int devkmsg_release(struct inode *inode, struct 
> file *file)
>   if (!user)
>   return 0;
>  
> + ratelimit_state_exit(&user->rs);
> +
>   mutex_destroy(&user->lock);
>   kfree(user);
>   return 0;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 87b2fc38398b..013d5fe0636a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -814,6 +814,15 @@ static struct ctl_table kern_table[] = {
>   .extra2 = &ten_thousand,
>   },
>   {
> + .procname   = "printk_devkmsg",
> + .data   = &devkmsg_log,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = devkmsg_sysctl_set_loglvl,
> + .extra1 = &zero,
> + .extra2 = &two,
> + },
> + {
>   .procname   = "dmesg_restrict",
>   .data   = &dmesg_restrict,
>   .maxlen = sizeof(int),



Re: [PATCH 2/5] security,overlayfs: Provide security hook for copy up of xattrs for overlay file

2016-07-05 Thread Paul Moore
On Tue, Jul 5, 2016 at 11:50 AM, Vivek Goyal  wrote:
> Provide a security hook which is called when xattrs of a file are being
> copied up. This hook is called once for each xattr and one can either
> accept or reject xattr. If 0 is returned, xattr will be copied up, if 1
> is returned, xattr will not be copied up and if negative error code
> is returned, copy up will be aborted.
>
> In SELinux, label of lower file is not copied up. File already has been
> set with right label at the time of creation and we don't want to overwrite
> that label.
>
> Signed-off-by: David Howells 
> Signed-off-by: Vivek Goyal 
> ---
>  fs/overlayfs/copy_up.c|  8 
>  include/linux/lsm_hooks.h | 13 +
>  include/linux/security.h  | 10 ++
>  security/security.c   |  9 +
>  security/selinux/hooks.c  | 14 ++
>  5 files changed, 54 insertions(+)

To continue the earlier feedback about mixing generic LSM hook
definitions with the SELinux specific hook implementations - I prefer
to see patchsets organized in the following manner:

[PATCH 1/X] - add new LSM hooks and the calls from the relevant
subsystems, e.g.
{security/security.c,include/linux/security.h,fs/overlayfs/*}
[PATCH 2/X] - LSM specific hook implementation, e.g. security/selinux/*
[PATCH n/X] - LSM specific hook implementation, e.g. security/smack/*

-- 
paul moore
security @ redhat


Re: [PATCH 2/2] net: ethernet: bcmgenet: use phy_ethtool_{get|set}_link_ksettings

2016-07-05 Thread Ben Hutchings
On Tue, 2016-07-05 at 14:15 -0700, Florian Fainelli wrote:
> On 07/05/2016 02:07 PM, Philippe Reynes wrote:
> > Hi Florian,
> > 
> > On 05/07/16 06:30, Florian Fainelli wrote:
> > > Le 04/07/2016 16:03, David Miller a écrit :
> > > > From: Philippe Reynes
> > > > Date: Sun,  3 Jul 2016 17:33:57 +0200
> > > > 
> > > > > There are two generics functions phy_ethtool_{get|set}_link_ksettings,
> > > > > so we can use them instead of defining the same code in the driver.
> > > > > 
> > > > > Signed-off-by: Philippe Reynes
> > > > 
> > > > Applied.
> > > > 
> > > 
> > > The transformation is not equivalent: we lost the checks on
> > > netif_running() in the process, and those are there for a reason. If the
> > > interface is down and therefore clock-gated, MDIO accesses to the PHY
> > > will simply fail outright and cause bus errors.
> > 
> > Oh, I see, I missed this. Sorry for this mistake.
> > We should revert this patch.
> 
> Well, maybe better than that, actually put the check in the generic
> functions, because if the link is down, aka netif_running() returns
> false, link parameters cannot be reliably queried and they are invalid.

Either the hardware or the driver needs to remember:

- Is auto-negotiation enabled
- If so, which modes are advertised
- If not, which mode is forced

And it should still be possible to get or set that information when the
interface is down.

Ben.

-- 

Ben Hutchings
Life is what happens to you while you're busy making other plans.
   - John Lennon




Re: [patch for-4.7] mm, compaction: prevent VM_BUG_ON when terminating freeing scanner

2016-07-05 Thread Vlastimil Babka
On 07/05/2016 11:01 PM, David Rientjes wrote:
> On Thu, 30 Jun 2016, Vlastimil Babka wrote:
> 
>>>  Note: I really dislike the low watermark check in split_free_page() and
>>>  consider it poor software engineering.  The function should split a free
>>>  page, nothing more.  Terminating memory compaction because of a low
>>>  watermark check when we're simply trying to migrate memory seems like an
>>>  arbitrary heuristic.  There was an objection to removing it in the first
>>>  proposed patch, but I think we should really consider removing that
>>>  check so this is simpler.
>>
>> There's a patch changing it to min watermark (you were CC'd on the series). 
>> We
>> could argue whether it belongs to split_free_page() or some wrapper of it, 
>> but
>> I don't think removing it completely should be done. If zone is struggling
>> with order-0 pages, a functionality for making higher-order pages shouldn't
>> make it even worse. It's also not that arbitrary, even if we succeeded the
>> migration and created a high-order page, the higher-order allocation would
>> still fail due to watermark checks. Worse, __compact_finished() would keep
>> telling the compaction to continue, creating an even longer lag, which is 
>> also
>> against your recent patches.
>>
> 
> I'm suggesting we shouldn't check any zone watermark in split_free_page(): 
> that function should just split the free page.
> 
> I don't find our current watermark checks to determine if compaction is 
> worthwhile to be invalid, but I do think that we should avoid checking or 
> acting on any watermark in isolate_freepages() itself.  We could do more 
> effective checking in __compact_finished() to determine if we should 
> terminate compaction, but the freeing scanner feels like the wrong place 
> to do it -- it's also expensive to check while gathering free pages for 
> memory that we have already successfully isolated as part of the 
> iteration.
> 
> Do you have any objection to this fix for 4.7?

No.

Acked-by: Vlastimil Babka 

> Joonsoo and/or Minchan, does this address the issue that you reported?
> 



Re: [PATCH 1/3] namei: add LOOKUP_DFD_ROOT to use dfd as root

2016-07-05 Thread Andrey Vagin
On Fri, Jul 1, 2016 at 5:55 PM, Omar Sandoval  wrote:
> On Tue, Jun 28, 2016 at 10:38:28AM -0700, Andrey Vagin wrote:
>> The problem is that a pathname can contain absolute symlinks and now
>> they are resolved relative to the current root.
>>
>> If we want to open a file in another mount namespace and we have a file
>> descriptor to its root directory, we probably want to resolve the pathname
>> in the target mount namespace. For this, we add a new flag.
>>
>> If LOOKUP_DFD_ROOT is set, path_init() initializes nd->root and nd->path
>> to the same value.
>>
>> Signed-off-by: Andrey Vagin 
>
> Hi, Andrey,
>
> Seems like a useful feature. Make sure to cc linux-...@vger.kernel.org
> for new userspace interfaces. One comment on the implementation below.

Hi Omar,

Thank you for the patch. I have sent a new version with your changes.
If there are no other comments in a few days, I will resend the
whole series with linux-...@vger.kernel.org in CC.

Thanks,
Andrew

>
>> ---
>>  fs/namei.c| 12 +++-
>>  include/linux/namei.h |  2 ++
>>  2 files changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 70580ab..5f08b69 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -2148,7 +2148,7 @@ static const char *path_init(struct nameidata *nd, 
>> unsigned flags)
>>   nd->path.dentry = NULL;
>>
>>   nd->m_seq = read_seqbegin(&mount_lock);
>> - if (*s == '/') {
>> + if (*s == '/' && !(flags & LOOKUP_DFD_ROOT)) {
>>   if (flags & LOOKUP_RCU)
>>   rcu_read_lock();
>>   set_root(nd);
>> @@ -2174,6 +2174,11 @@ static const char *path_init(struct nameidata *nd, 
>> unsigned flags)
>>   get_fs_pwd(current->fs, &nd->path);
>>   nd->inode = nd->path.dentry->d_inode;
>>   }
>> + if (flags & LOOKUP_DFD_ROOT) {
>> + nd->root = nd->path;
>> + if (!(flags & LOOKUP_RCU))
>> + path_get(&nd->root);
>
> You're not initializing nd->root_seq here. That means that if we end up
> going through unlazy_walk(), we're going to call legitimize_path() (and
> thus read_seqcount_retry()) with stack garbage, get a spurious ECHILD,
> and do an unnecessary restart of the path lookup instead of dropping
> into ref-walk mode.
>
>> + }
>>   return s;
>>   } else {
>>   /* Caller must check execute permissions on the starting path 
>> component */
>> @@ -2202,6 +2207,11 @@ static const char *path_init(struct nameidata *nd, 
>> unsigned flags)
>>   nd->inode = nd->path.dentry->d_inode;
>>   }
>>   fdput(f);
>> + if (flags & LOOKUP_DFD_ROOT) {
>> + nd->root = nd->path;
>> + if (!(flags & LOOKUP_RCU))
>> + path_get(&nd->root);
>> + }
>
> Same here.
>
> The following should do the trick:
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 9958b605e822..101d1fb8d3cb 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2176,7 +2176,9 @@ static const char *path_init(struct nameidata *nd, 
> unsigned flags)
> }
> if (flags & LOOKUP_DFD_ROOT) {
> nd->root = nd->path;
> -   if (!(flags & LOOKUP_RCU))
> +   if (flags & LOOKUP_RCU)
> +   nd->root_seq = nd->seq;
> +   else
> path_get(&nd->root);
> }
> return s;
> @@ -2209,7 +2211,9 @@ static const char *path_init(struct nameidata *nd, 
> unsigned flags)
> fdput(f);
> if (flags & LOOKUP_DFD_ROOT) {
> nd->root = nd->path;
> -   if (!(flags & LOOKUP_RCU))
> +   if (flags & LOOKUP_RCU)
> +   nd->root_seq = nd->seq;
> +   else
> path_get(&nd->root);
> }
> return s;
>
> --
> Omar
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers


<    1   2   3   4   5   6   7   8   >