Re: [PATCH v6 7/7] tpm: pass an array of tpm_bank_list structures to tpm_pcr_extend()

2018-12-13 Thread Jarkko Sakkinen
On Thu, Dec 13, 2018 at 08:57:17AM +0100, Roberto Sassu wrote:
> > 1. The function does not fail if alg_id is not found. This will go
> > silent.
> 
> It is intentional. If alg_id is not found, the PCR is extended with the
> first digest passed by the caller of tpm_pcr_extend(). If no digest was
> provided, the PCR is extended with 0s. This is done to prevent
> PCRs in unused banks from being extended later with fake measurements.
> 
> 
> > 2. The function does not fail if there is a mismatch with the digest
> > sizes.
> 
> The data passed by the caller of tpm_pcr_extend() is copied to
> dummy_hash, which has the maximum length. Then, tpm2_pcr_extend() takes
> from dummy_hash as many bytes as needed, depending on the current
> algorithm.

I would suggest documenting these corner cases in the function's long
description to make them easy and obvious to understand.

/Jarkko
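
For reference, a minimal sketch of the padding behaviour Roberto describes
above (hypothetical names, not the actual patch code): the caller's digest
is staged in a maximum-size buffer, so a missing digest extends with zeros
and a short digest is zero-padded before tpm2_pcr_extend() consumes only as
many bytes as the bank's algorithm needs.

#include <string.h>

/* Hedged sketch, assuming a max-size staging buffer as described above. */
static void stage_digest(unsigned char *dummy_hash, size_t max_len,
                         const unsigned char *digest, size_t digest_len)
{
	memset(dummy_hash, 0, max_len);	/* no digest -> extend with 0s */
	if (digest)
		memcpy(dummy_hash, digest,
		       digest_len < max_len ? digest_len : max_len);
	/* tpm2_pcr_extend() then reads only the bank's digest size */
}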


Re: [PATCH v2] Allow hwrng to initialize crng.

2018-12-13 Thread Jarkko Sakkinen
On Thu, Dec 13, 2018 at 05:18:48PM +0800, Louis Collard wrote:
> Some systems, for example embedded systems, do not generate
> enough entropy on boot through interrupts, and boot may be blocked for
> several minutes waiting for a call to getrandom to complete.
> 
> Currently, random data is read from a hwrng when it is registered,
> and is loaded into primary_crng. This data is treated in the same
> way as data that is device-specific but otherwise unchanging, and
> so primary_crng cannot become initialized with the data from the
> hwrng.
> 
> This change causes the data initially read from the hwrng to be
> treated the same as subsequent data that is read from the hwrng if
> its quality score is non-zero.
> 
> The implications of this are:
> 
> The data read from hwrng can cause primary_crng to become
> initialized, therefore avoiding problems of getrandom blocking
> on boot.
> 
> Calls to getrandom (with GRND_RANDOM) may be using entropy
> exclusively (or in practise, almost exclusively) from the hwrng.
> 
> Regarding the latter point: this behavior is the same as if a
> user specified a quality score of 1 (one bit of entropy per 1024 bits),
> so hopefully this is not too scary a change to make.
> 
> This change is the result of the discussion here:
> https://patchwork.kernel.org/patch/10453893/

Please remove these two lines.

> Signed-off-by: Louis Collard 
> Acked-by: Jarkko Sakkinen 
> ---

The change log seems to be missing before the diffstat, after the dashes.

/Jarkko
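
For context on the quality/entropy point above: the hwrng core credits
entropy in proportion to the driver's quality setting (bits of entropy per
1024 bits of input). A hedged sketch of that calculation, with a
hypothetical helper name:

#include <stddef.h>

/* Sketch only: entropy bits credited for one hwrng read. */
static unsigned int hwrng_entropy_bits(size_t bytes_read, unsigned int quality)
{
	/* e.g. 1024 bytes at quality 1 credit (1024 * 8 * 1) / 1024 = 8 bits */
	return (bytes_read * 8 * quality) / 1024;
}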


[RESEND PATCH] kvm: svm: remove unused struct definition

2018-12-13 Thread Peng Hao
The structure svm_init_data is never used, so remove it.

Signed-off-by: Peng Hao 
---
 arch/x86/kvm/svm.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 61ccfb1..5c7dc8b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -675,11 +675,6 @@ struct svm_cpu_data {
 
 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
 
-struct svm_init_data {
-   int cpu;
-   int r;
-};
-
 static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000};
 
 #define NUM_MSR_MAPS ARRAY_SIZE(msrpm_ranges)
-- 
1.8.3.1



Re: [PATCH v2] f2fs: fix sbi->extent_list corruption issue

2018-12-13 Thread Sahitya Tummala
On Wed, Dec 12, 2018 at 11:36:08AM +0800, Chao Yu wrote:
> On 2018/12/12 11:17, Sahitya Tummala wrote:
> > On Fri, Dec 07, 2018 at 05:47:31PM +0800, Chao Yu wrote:
> >> On 2018/12/1 4:33, Jaegeuk Kim wrote:
> >>> On 11/29, Sahitya Tummala wrote:
> 
>  On Tue, Nov 27, 2018 at 09:42:39AM +0800, Chao Yu wrote:
> > On 2018/11/27 8:30, Jaegeuk Kim wrote:
> >> On 11/26, Sahitya Tummala wrote:
> >>> When there is a failure in f2fs_fill_super() after/during
> >>> the recovery of fsync'd nodes, it frees the current sbi and
> >>> retries again. This time the mount is successful, but the files
> >>> that got recovered before the retry still hold the extent tree,
> >>> whose extent nodes list is corrupted since sbi and sbi->extent_list
> >>> are freed up. The list_del corruption issue is observed when the
> >>> file system is being unmounted and the extent nodes of those recovered
> >>> files are being freed up in the below context.
> >>>
> >>> list_del corruption. prev->next should be fff1e1ef5480, but was (null)
> >>> <...>
> >>> kernel BUG at kernel/msm-4.14/lib/list_debug.c:53!
> >>> task: fff1f46f2280 task.stack: ff8008068000
> >>> lr : __list_del_entry_valid+0x94/0xb4
> >>> pc : __list_del_entry_valid+0x94/0xb4
> >>> <...>
> >>> Call trace:
> >>> __list_del_entry_valid+0x94/0xb4
> >>> __release_extent_node+0xb0/0x114
> >>> __free_extent_tree+0x58/0x7c
> >>> f2fs_shrink_extent_tree+0xdc/0x3b0
> >>> f2fs_leave_shrinker+0x28/0x7c
> >>> f2fs_put_super+0xfc/0x1e0
> >>> generic_shutdown_super+0x70/0xf4
> >>> kill_block_super+0x2c/0x5c
> >>> kill_f2fs_super+0x44/0x50
> >>> deactivate_locked_super+0x60/0x8c
> >>> deactivate_super+0x68/0x74
> >>> cleanup_mnt+0x40/0x78
> >>> __cleanup_mnt+0x1c/0x28
> >>> task_work_run+0x48/0xd0
> >>> do_notify_resume+0x678/0xe98
> >>> work_pending+0x8/0x14
> >>>
> >>> Fix this by cleaning up inodes, extent tree and nodes of those
> >>> recovered files before freeing up sbi and before next retry.
> >>>
> >>> Signed-off-by: Sahitya Tummala 
> >>> ---
> >>> v2:
> >>> -call evict_inodes() and f2fs_shrink_extent_tree() to cleanup inodes
> >>>
> >>>  fs/f2fs/f2fs.h |  1 +
> >>>  fs/f2fs/shrinker.c |  2 +-
> >>>  fs/f2fs/super.c| 13 -
> >>>  3 files changed, 14 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> >>> index 1e03197..aaee63b 100644
> >>> --- a/fs/f2fs/f2fs.h
> >>> +++ b/fs/f2fs/f2fs.h
> >>> @@ -3407,6 +3407,7 @@ struct rb_entry *f2fs_lookup_rb_tree_ret(struct 
> >>> rb_root_cached *root,
> >>>  bool f2fs_check_rb_tree_consistence(struct f2fs_sb_info *sbi,
> >>>   struct rb_root_cached 
> >>> *root);
> >>>  unsigned int f2fs_shrink_extent_tree(struct f2fs_sb_info *sbi, int 
> >>> nr_shrink);
> >>> +unsigned long __count_extent_cache(struct f2fs_sb_info *sbi);
> >>>  bool f2fs_init_extent_tree(struct inode *inode, struct f2fs_extent 
> >>> *i_ext);
> >>>  void f2fs_drop_extent_tree(struct inode *inode);
> >>>  unsigned int f2fs_destroy_extent_node(struct inode *inode);
> >>> diff --git a/fs/f2fs/shrinker.c b/fs/f2fs/shrinker.c
> >>> index 9e13db9..7e3c13b 100644
> >>> --- a/fs/f2fs/shrinker.c
> >>> +++ b/fs/f2fs/shrinker.c
> >>> @@ -30,7 +30,7 @@ static unsigned long __count_free_nids(struct 
> >>> f2fs_sb_info *sbi)
> >>>   return count > 0 ? count : 0;
> >>>  }
> >>>  
> >>> -static unsigned long __count_extent_cache(struct f2fs_sb_info *sbi)
> >>> +unsigned long __count_extent_cache(struct f2fs_sb_info *sbi)
> >>>  {
> >>>   return atomic_read(&sbi->total_zombie_tree) +
> >>>   atomic_read(&sbi->total_ext_node);
> >>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> >>> index af58b2c..769e7b1 100644
> >>> --- a/fs/f2fs/super.c
> >>> +++ b/fs/f2fs/super.c
> >>> @@ -3016,6 +3016,16 @@ static void f2fs_tuning_parameters(struct 
> >>> f2fs_sb_info *sbi)
> >>>   sbi->readdir_ra = 1;
> >>>  }
> >>>  
> >>> +static void f2fs_cleanup_inodes(struct f2fs_sb_info *sbi)
> >>> +{
> >>> + struct super_block *sb = sbi->sb;
> >>> +
> >>> + sync_filesystem(sb);
> >>
> >> This writes another checkpoint, which would not be what this retrial 
> >> intended.
> >
> > Actually, checkpoint will not be triggered due to SBI_POR_DOING flag 
> > check
> > as below:
> >
> > int f2fs_sync_fs(struct super_block *sb, int sync)
> > {
> > ...
> > if (unlikely(is_sbi_flag_set(sbi, SBI_POR_DOING)))
> > return -EAGAIN;
> > ...
> > }
> >
> > And also all dirty data/node won't be persisted due to 

[PATCH] pinctrl: xway: fix gpio-hog related boot issues

2018-12-13 Thread Martin Schiller
This patch is based on commit a86caa9ba5d7 ("pinctrl: msm: fix gpio-hog
related boot issues").

It fixes the issue that the gpio ranges need to be defined before
gpiochip_add().

Therefore, we also have to swap the order of registering the pinctrl
driver and registering the gpio chip.

You also have to add the "gpio-ranges" property to the pinctrl device
node to get it finally working.

Signed-off-by: Martin Schiller 
---
 drivers/pinctrl/pinctrl-xway.c | 39 +++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/pinctrl/pinctrl-xway.c b/drivers/pinctrl/pinctrl-xway.c
index 93f8bd04e7fe..ae74b260b014 100644
--- a/drivers/pinctrl/pinctrl-xway.c
+++ b/drivers/pinctrl/pinctrl-xway.c
@@ -1746,14 +1746,6 @@ static int pinmux_xway_probe(struct platform_device 
*pdev)
}
xway_pctrl_desc.pins = xway_info.pads;
 
-   /* register the gpio chip */
-   xway_chip.parent = &pdev->dev;
-   ret = devm_gpiochip_add_data(&pdev->dev, &xway_chip, NULL);
-   if (ret) {
-   dev_err(&pdev->dev, "Failed to register gpio chip\n");
-   return ret;
-   }
-
/* setup the data needed by pinctrl */
xway_pctrl_desc.name= dev_name(&pdev->dev);
xway_pctrl_desc.npins   = xway_chip.ngpio;
@@ -1775,10 +1767,33 @@ static int pinmux_xway_probe(struct platform_device 
*pdev)
return ret;
}
 
-   /* finish with registering the gpio range in pinctrl */
-   xway_gpio_range.npins = xway_chip.ngpio;
-   xway_gpio_range.base = xway_chip.base;
-   pinctrl_add_gpio_range(xway_info.pctrl, &xway_gpio_range);
+   /* register the gpio chip */
+   xway_chip.parent = &pdev->dev;
+   xway_chip.owner = THIS_MODULE;
+   xway_chip.of_node = pdev->dev.of_node;
+   ret = devm_gpiochip_add_data(&pdev->dev, &xway_chip, NULL);
+   if (ret) {
+   dev_err(&pdev->dev, "Failed to register gpio chip\n");
+   return ret;
+   }
+
+   /*
+* For DeviceTree-supported systems, the gpio core checks the
+* pinctrl's device node for the "gpio-ranges" property.
+* If it is present, it takes care of adding the pin ranges
+* for the driver. In this case the driver can skip ahead.
+*
+* In order to remain compatible with older, existing DeviceTree
+* files which don't set the "gpio-ranges" property or systems that
+* utilize ACPI the driver has to call gpiochip_add_pin_range().
+*/
+   if (!of_property_read_bool(pdev->dev.of_node, "gpio-ranges")) {
+   /* finish with registering the gpio range in pinctrl */
+   xway_gpio_range.npins = xway_chip.ngpio;
+   xway_gpio_range.base = xway_chip.base;
+   pinctrl_add_gpio_range(xway_info.pctrl, &xway_gpio_range);
+   }
+
dev_info(&pdev->dev, "Init done\n");
return 0;
 }
-- 
2.11.0



Re: [PATCH] vhost: return EINVAL if iovecs size does not match the message size

2018-12-13 Thread Pavel Tikhomirov
On 12/13/2018 10:55 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 05:53:50PM +0300, Pavel Tikhomirov wrote:
>> We've failed to copy and process vhost_iotlb_msg, so let userspace at
>> least know about it. For instance, before this patch the code below runs
>> without any error:
>>
>> int main()
>> {
>>struct vhost_msg msg;
>>struct iovec iov;
>>int fd;
>>
>>fd = open("/dev/vhost-net", O_RDWR);
>>if (fd == -1) {
>>  perror("open");
>>  return 1;
>>}
>>
>>iov.iov_base = &msg;
>>iov.iov_len = sizeof(msg)-4;
>>
>>if (writev(fd, &iov, 1) == -1) {
>>  perror("writev");
>>  return 1;
>>}
>>
>>return 0;
>> }
>>
>> Signed-off-by: Pavel Tikhomirov 
> 
> Thanks for the patch!
> 
>> ---
>>   drivers/vhost/vhost.c | 8 ++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 3a5f81a66d34..03014224ef13 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>>  int type, ret;
>>   
>>  ret = copy_from_iter(&type, sizeof(type), from);
>> -if (ret != sizeof(type))
>> +if (ret != sizeof(type)) {
>> +ret = -EINVAL;
>>  goto done;
>> +}
>>   
>>  switch (type) {
>>  case VHOST_IOTLB_MSG:
> 
> should this be EFAULT rather?

We already have "Invalid argument" returned when a wrong type of vhost_msg
is received; I thought it would be fine to return the same thing if we have
a wrong size of vhost_msg.

When we return "Bad address" because vhost_process_iotlb_msg() fails, it
is because our vhost_dev has no ->iotlb, so the problem is not with the
data passed from userspace but with the state of vhost_dev.

So I like EINVAL more in these two cases.

> 
>> @@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>>   
>>  iov_iter_advance(from, offset);
>>  ret = copy_from_iter(&msg, sizeof(msg), from);
>> -if (ret != sizeof(msg))
>> +if (ret != sizeof(msg)) {
>> +ret = -EINVAL;
>>  goto done;
>> +}
>>  if (vhost_process_iotlb_msg(dev, &msg)) {
>>  ret = -EFAULT;
>>  goto done;
> 
> This too?
> 
>> -- 
>> 2.17.1

-- 
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.


[PATCH] mm/page_alloc.c: Allow error injection

2018-12-13 Thread Benjamin Poirier
Model call chain after should_failslab(). Likewise, we can now use a kprobe
to override the return value of should_fail_alloc_page() and inject
allocation failures into alloc_page*().

Signed-off-by: Benjamin Poirier 
---
 include/asm-generic/error-injection.h |  1 +
 mm/page_alloc.c   | 10 --
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/error-injection.h 
b/include/asm-generic/error-injection.h
index 296c65442f00..95a159a4137f 100644
--- a/include/asm-generic/error-injection.h
+++ b/include/asm-generic/error-injection.h
@@ -8,6 +8,7 @@ enum {
EI_ETYPE_NULL,  /* Return NULL if failure */
EI_ETYPE_ERRNO, /* Return -ERRNO if failure */
EI_ETYPE_ERRNO_NULL,/* Return -ERRNO or NULL if failure */
+   EI_ETYPE_TRUE,  /* Return true if failure */
 };
 
 struct error_injection_entry {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2ec9cc407216..64861d79dc2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3053,7 +3053,7 @@ static int __init setup_fail_page_alloc(char *str)
 }
 __setup("fail_page_alloc=", setup_fail_page_alloc);
 
-static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+static bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
if (order < fail_page_alloc.min_order)
return false;
@@ -3103,13 +3103,19 @@ late_initcall(fail_page_alloc_debugfs);
 
 #else /* CONFIG_FAIL_PAGE_ALLOC */
 
-static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 {
return false;
 }
 
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
+static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
+{
+   return __should_fail_alloc_page(gfp_mask, order);
+}
+ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);
+
 /*
  * Return true if free base pages are above 'mark'. For high-order checks it
  * will return true of the order-0 watermark is reached and there is at least
-- 
2.19.2
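
A hedged usage sketch (not part of the patch): with should_fail_alloc_page()
on the error injection list, a BPF kprobe program can force the failure path,
assuming CONFIG_BPF_KPROBE_OVERRIDE and a libbpf-style build that provides
SEC() and bpf_override_return().

#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>

/* Sketch: make should_fail_alloc_page() return true, failing the allocation. */
SEC("kprobe/should_fail_alloc_page")
int force_page_alloc_failure(struct pt_regs *ctx)
{
	bpf_override_return(ctx, 1);
	return 0;
}

char _license[] SEC("license") = "GPL";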



Re: [PATCH V1] mmc: tegra: Fix for SDMMC pads autocal parsing from dt

2018-12-13 Thread Adrian Hunter
On 13/12/18 10:25 PM, Sowjanya Komatineni wrote:
> Some of the SDMMC pads auto calibration values parsed from
> devicetree are assigned incorrectly. This patch fixes it.
> 
> Signed-off-by: Sowjanya Komatineni 

Acked-by: Adrian Hunter 

> ---
>  drivers/mmc/host/sdhci-tegra.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
> index 7b95d088fdef..e6ace31e2a41 100644
> --- a/drivers/mmc/host/sdhci-tegra.c
> +++ b/drivers/mmc/host/sdhci-tegra.c
> @@ -510,25 +510,25 @@ static void tegra_sdhci_parse_pad_autocal_dt(struct 
> sdhci_host *host)
>  
>   err = device_property_read_u32(host->mmc->parent,
>   "nvidia,pad-autocal-pull-up-offset-3v3-timeout",
> - &autocal->pull_up_3v3);
> + &autocal->pull_up_3v3_timeout);
>   if (err)
>   autocal->pull_up_3v3_timeout = 0;
>  
>   err = device_property_read_u32(host->mmc->parent,
>   "nvidia,pad-autocal-pull-down-offset-3v3-timeout",
> - &autocal->pull_down_3v3);
> + &autocal->pull_down_3v3_timeout);
>   if (err)
>   autocal->pull_down_3v3_timeout = 0;
>  
>   err = device_property_read_u32(host->mmc->parent,
>   "nvidia,pad-autocal-pull-up-offset-1v8-timeout",
> - &autocal->pull_up_1v8);
> + &autocal->pull_up_1v8_timeout);
>   if (err)
>   autocal->pull_up_1v8_timeout = 0;
>  
>   err = device_property_read_u32(host->mmc->parent,
>   "nvidia,pad-autocal-pull-down-offset-1v8-timeout",
> - &autocal->pull_down_1v8);
> + &autocal->pull_down_1v8_timeout);
>   if (err)
>   autocal->pull_down_1v8_timeout = 0;
>  
> 



Re: A weird problem of Realtek r8168 after resume from S3

2018-12-13 Thread Heiner Kallweit
On 14.12.2018 04:33, Chris Chiu wrote:
> On Thu, Dec 13, 2018 at 10:20 AM Chris Chiu  wrote:
>>
>> Hi,
>> We got an acer laptop which has a problem with ethernet networking after
>> resuming from S3. The ethernet is popular realtek r8168. The lspci shows as
>> follows.
>> 02:00.1 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd.
>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 
>> 12)
>>
A "dmesg | grep r8169" would be helpful, especially the chip name + XID.

>> The problem is the ethernet is not accessible after resume. Pinging via
>> ethernet always shows the response `Destination Host Unreachable`. However,
>> the interesting part is, when I run tcpdump to monitor the problematic 
>> ethernet
>> interface, the networking is back to alive. But it's dead again after
>> I stop tcpdump.
>> One more thing, if I ping the problematic machine from others, it achieves 
>> the
>> same effect as above tcpdump. Maybe it's about the register setting for RX 
>> path?
>>
You could compare the register dumps (ethtool -d) before and after S3 sleep
to find out whether there's a difference.

>> I tried the latest 4.20 rc version but the problem is still there. I
>> also tried some
>> hw_reset or init things in the resume path but with no effect. Any
>> suggestions for this?
>> Thanks
>>
Did previous kernel versions work? If it's a regression, a bisect would be
appreciated, because with the chip versions I've got I can't reproduce the 
issue.

>> Chris
> 
> Gentle ping. Any additional information required?
> 
> Chris
> 
Heiner


Re: [PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-12-13 Thread Henrik Austad
On Fri, Dec 14, 2018 at 08:18:26AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > > From: Henrik Austad 
> > > > 
> > > > Short story:
> > > 
> > > Sorry for the spam, it looks like I was not very specific in /which/ 
> > > version I targeted this to, as well as not providing a full Cc-list for 
> > > the 
> > > cover-letter.
> > 
> > Gentle prod. I realize this was sent out just before plumbers and that 
> > people had pretty packed agendas, so a small nudge to gain a spot closer to 
> > the top of the inbox :)
> > 
> > This series has now been running on an arm64 system for 9 days without any 
> > issues and pi_stress showed a dramatic improvement from ~30 seconds and up 
> > to several hours (it finally deadlocked at 3.9e9 inversions).
> > 
> > I'd greatly appreciate if someone could give the list of patches a quick 
> > glance to verify that I got all the required patches and then if it could 
> > be added to 4.4.y.

Hi Greg,

> This is a really intrusive series of patches, and without some testing
> and verification by others, I am really reluctant to take these patches.

Yes I know, they are intrusive, and they touch core parts of the kernel in 
interesting ways.

I completely agree with the need for testing, and I do not _expect_ these 
patches to be merged. It was a "this was useful for us, it is probably 
useful for others" kind of series.

Perhaps there are not that many others out there using a pi_futex shared
between a sched_rr thread and a sched_deadline thread, which is how you back
yourself into this corner.

> Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
> this issue for your systems?

That would indeed be the best solution, but the vendor will not update the
kernel past 4.4 for this particular SoC, so we have no way of moving this to
a later kernel :(

Anyway, I'm happy to carry these in our local tree for our own use. If 
something pops up in our internal testing requiring update to the series, 
I'll send an update for others to see should they experience the same 
issue. :)

Thanks for the reply!

-- 
Henrik Austad




Re: [PATCH V1] mmc: sdhci: Fix sdhci_do_enable_v4_mode

2018-12-13 Thread Adrian Hunter
On 13/12/18 10:34 PM, Sowjanya Komatineni wrote:
> V4_MODE is bit 15 of the SDHCI_HOST_CONTROL2 register,
> so word access must be used for this register.
> 
> Signed-off-by: Sowjanya Komatineni 

Fixes: b3f80b434f726 ("mmc: sdhci: Add sd host v4 mode")

Acked-by: Adrian Hunter 

> ---
>  drivers/mmc/host/sdhci.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index 99bdae53fa2e..fde984d10619 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -127,12 +127,12 @@ static void sdhci_do_enable_v4_mode(struct sdhci_host 
> *host)
>  {
>   u16 ctrl2;
>  
> - ctrl2 = sdhci_readb(host, SDHCI_HOST_CONTROL2);
> + ctrl2 = sdhci_readw(host, SDHCI_HOST_CONTROL2);
>   if (ctrl2 & SDHCI_CTRL_V4_MODE)
>   return;
>  
>   ctrl2 |= SDHCI_CTRL_V4_MODE;
> - sdhci_writeb(host, ctrl2, SDHCI_HOST_CONTROL);
> + sdhci_writew(host, ctrl2, SDHCI_HOST_CONTROL2);
>  }
>  
>  /*
> 



Re: [PATCH] KVM: MMU: Introduce single thread to zap collapsible sptes

2018-12-13 Thread Wanpeng Li
ping,
On Thu, 6 Dec 2018 at 15:58, Wanpeng Li  wrote:
>
> From: Wanpeng Li 
>
> Last year, engineers from Huawei reported that a call to
> memory_global_dirty_log_start/stop()
> takes 13s for 4T of memory and causes the guest to freeze for too long,
> which increases the migration downtime unacceptably. [1] [2]
>
> Guangrong pointed out:
>
> | collapsible_sptes zaps 4k mappings to make memory-read happy, it is not
> | required by the semanteme of KVM_SET_USER_MEMORY_REGION and it is not
> | urgent for vCPU's running, it could be done in a separate thread and use
> | lock-break technology.
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05249.html
> [2] https://www.mail-archive.com/qemu-devel@nongnu.org/msg449994.html
>
> Guests with several TB of memory are common now that NVDIMM is deployed in
> cloud environments.
> This patch utilizes a worker thread to zap collapsible sptes in order to
> lazily collapse
> small sptes into large sptes during the roll-back after a live migration fails.
>
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/mmu.c  | 37 -
>  arch/x86/kvm/x86.c  |  4 
>  3 files changed, 39 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fbda5a9..dde32f9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -892,6 +892,8 @@ struct kvm_arch {
> u64 master_cycle_now;
> struct delayed_work kvmclock_update_work;
> struct delayed_work kvmclock_sync_work;
> +   struct delayed_work kvm_mmu_zap_collapsible_sptes_work;
> +   bool zap_in_progress;
>
> struct kvm_xen_hvm_config xen_hvm_config;
>
> @@ -1247,6 +1249,7 @@ void kvm_mmu_zap_all(struct kvm *kvm);
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots 
> *slots);
>  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
>  void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int 
> kvm_nr_mmu_pages);
> +void zap_collapsible_sptes_fn(struct work_struct *work);
>
>  int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long 
> cr3);
>  bool pdptrs_changed(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 7c03c0f..fe87dd3 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -5679,14 +5679,41 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm 
> *kvm,
> return need_tlb_flush;
>  }
>
> +void zap_collapsible_sptes_fn(struct work_struct *work)
> +{
> +   struct kvm_memory_slot *memslot;
> +   struct kvm_memslots *slots;
> +   struct delayed_work *dwork = to_delayed_work(work);
> +   struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
> +  
> kvm_mmu_zap_collapsible_sptes_work);
> +   struct kvm *kvm = container_of(ka, struct kvm, arch);
> +   int i;
> +
> +   mutex_lock(&kvm->slots_lock);
> +   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   spin_lock(&kvm->mmu_lock);
> +   slots = __kvm_memslots(kvm, i);
> +   kvm_for_each_memslot(memslot, slots) {
> +   slot_handle_leaf(kvm, (struct kvm_memory_slot 
> *)memslot,
> +   kvm_mmu_zap_collapsible_spte, true);
> +   if (need_resched() || spin_needbreak(&kvm->mmu_lock))
> +   cond_resched_lock(&kvm->mmu_lock);
> +   }
> +   spin_unlock(&kvm->mmu_lock);
> +   }
> +   kvm->arch.zap_in_progress = false;
> +   mutex_unlock(&kvm->slots_lock);
> +}
> +
> +#define KVM_MMU_ZAP_DELAYED (60 * HZ)
>  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>const struct kvm_memory_slot *memslot)
>  {
> -   /* FIXME: const-ify all uses of struct kvm_memory_slot.  */
> -   spin_lock(&kvm->mmu_lock);
> -   slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
> -kvm_mmu_zap_collapsible_spte, true);
> -   spin_unlock(&kvm->mmu_lock);
> +   if (!kvm->arch.zap_in_progress) {
> +   kvm->arch.zap_in_progress = true;
> +   
> schedule_delayed_work(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
> +   KVM_MMU_ZAP_DELAYED);
> +   }
>  }
>
>  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d029377..c2af289 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9019,6 +9019,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long 
> type)
>
> INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, 
> kvmclock_update_fn);
> INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
> +   INIT_DELAYED_WORK(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
> +   zap_collapsible_sptes_fn);
> +   kvm->arch.zap_in_progress 

Re: [PATCH v2] arm64: invalidate TLB just before turning MMU on

2018-12-13 Thread Ard Biesheuvel
On Fri, 14 Dec 2018 at 05:08, Qian Cai  wrote:
> Also tried to move the local TLB flush part around a bit inside
> __cpu_setup(); although it did complete kdump sometimes, it fairly often
> triggered a "Synchronous Exception" in EFI after a cold reboot, with
> seemingly no way to recover remotely without reinstalling the OS.

This doesn't make any sense to me. If the system gets into a weird
state out of cold reboot, how could this code be the culprit? Please
check your firmware, and try to reproduce the issue on a system that
doesn't have such defects.


linux-next: Tree for Dec 14

2018-12-13 Thread Stephen Rothwell
Hi all,

Changes since 20181213:

The dma-mapping tree gained a conflict against the kbuild tree.

The rdma tree still had its build failure so I used a supplied patch.

The net-next tree gained a conflict against the bpf tree.

The drm tree gained a conflict against the drm-fixes tree.

The block tree gained a conflict against the scsi-fixes tree.

Non-merge commits (relative to Linus' tree): 8789
 9285 files changed, 403442 insertions(+), 229008 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 288 trees (counting Linus' and 68 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (f5d582777bcb Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid)
Merging fixes/master (d8c137546ef8 powerpc: tag implicit fall throughs)
Merging kbuild-current/fixes (ccda4af0f4b9 Linux 4.20-rc2)
Merging arc-current/for-curr (4c567a448b30 ARC: perf: remove useless ifdefs)
Merging arm-current/fixes (c2a3831df6dc ARM: 8816/1: dma-mapping: fix potential 
uninitialized return)
Merging arm64-fixes/for-next/fixes (b4aecf78083d arm64: hibernate: Avoid 
sending cross-calling with interrupts disabled)
Merging m68k-current/for-linus (58c116fb7dc6 m68k/sun3: Remove is_medusa and 
m68k_pgtable_cachemode)
Merging powerpc-fixes/fixes (a225f1567405 powerpc/ptrace: replace 
ptrace_report_syscall() with a tracehook call)
Merging sparc/master (cf76c364a1e1 Merge tag 'scsi-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging fscrypt-current/for-stable (ae64f9bd1d36 Linux 4.15-rc2)
Merging net/master (9e69efd45321 Merge branch 'vhost-fixes')
Merging bpf/master (7640ead93924 bpf: verifier: make sure callees don't prune 
with caller differences)
Merging ipsec/master (4a135e538962 xfrm_user: fix freeing of xfrm states on 
acquire)
Merging netfilter/master (d4e7df16567b netfilter: nf_conncount: use 
rb_link_node_rcu() instead of rb_link_node())
Merging ipvs/master (feb9f55c33e5 netfilter: nft_dynset: allow dynamic updates 
of non-anonymous set)
Merging wireless-drivers/master (cddfb283af7e mt76: add entry in MAINTAINERS 
file)
Merging mac80211/master (312ca38ddda6 cfg80211: Fix busy loop regression in 
ieee80211_ie_split_ric())
Merging rdma-fixes/for-rc (37fbd834b4e4 IB/core: Fix oops in 
netdev_next_upper_dev_rcu())
Merging sound-current/for-linus (0bea4cc83835 ALSA: hda/realtek: Enable audio 
jacks of ASUS UX433FN/UX333FA with ALC294)
Merging sound-asoc-fixes/for-linus (d5734a94f283 Merge branch 'asoc-4.20' into 
asoc-linus)
Merging regmap-fixes/for-linus (40e020c129cf Linux 4.20-rc6)
Merging regulator-fixes/for-linus (35f33f4f8c5a Merge branch 'regulator-4.20' 
into regulator-linus)
Merging spi-fixes/for-linus (a57daf181bb1 Merge branch 'spi-4.20' into 
spi-linus)
Merging pci-current/for-linus (b07b864ee423 Revert "PCI/ASPM: Do not initialize 
link state when aspm_disabled is set")
Merging driver-core.current/driver-core-linus (2595646791c3 Linux 4.20-rc5)
Merging tty.current/tty-linus (40e020c129cf Linux 4.20-rc6)
Merging usb.current/usb-linus (40e020c129cf Linux 4.20-rc6)
Merging usb-gadget-fixes/fixes (069caf5950df USB: omap_udc: fix rejection of 
out transfers when DMA is used)
Merging usb-serial-fixes/usb-linus (28a86092b175 USB: serial: option: add Telit 
LN940 seri

Re: [PATCH 00/17] Backport rt/deadline crash and the arduous story of FUTEX_UNLOCK_PI to 4.4

2018-12-13 Thread Greg Kroah-Hartman
On Mon, Nov 19, 2018 at 12:27:21PM +0100, Henrik Austad wrote:
> On Fri, Nov 09, 2018 at 11:35:31AM +0100, Henrik Austad wrote:
> > On Fri, Nov 09, 2018 at 11:07:28AM +0100, Henrik Austad wrote:
> > > From: Henrik Austad 
> > > 
> > > Short story:
> > 
> > Sorry for the spam, it looks like I was not very specific in /which/ 
> > version I targeted this to, as well as not providing a full Cc-list for the 
> > cover-letter.
> 
> Gentle prod. I realize this was sent out just before plumbers and that 
> people had pretty packed agendas, so a small nudge to gain a spot closer to 
> the top of the inbox :)
> 
> This series has now been running on an arm64 system for 9 days without any 
> issues and pi_stress showed a dramatic improvement from ~30 seconds and up 
> > to several hours (it finally deadlocked at 3.9e9 inversions).
> 
> I'd greatly appreciate if someone could give the list of patches a quick 
> glance to verify that I got all the required patches and then if it could 
> be added to 4.4.y.

This is a really intrusive series of patches, and without some testing
and verification by others, I am really reluctant to take these patches.

Why not just move to the 4.9.y tree, or better yet, 4.19.y to resolve
this issue for your systems?

thanks,

greg k-h


Re: [PATCH v2] binder: implement binderfs

2018-12-13 Thread Dan Carpenter
On Thu, Dec 13, 2018 at 10:59:11PM +0100, Christian Brauner wrote:
> +/**
> + * binderfs_new_inode - allocate inode from super block of a binderfs mount
> + * @ref_inode: inode from which the super block will be taken
> + * @userp: buffer to copy information about new device for userspace to
> + * @device:binder device for which the new inode will be allocated
> + * @req:   struct binderfs_device as copied from userspace
> + *
> + * This function will allocate a new inode from the super block of the
> + * filesystem mount and attach a dentry to that inode.
> + * Minor numbers are limited and tracked globally in binderfs_minors.
> + * The function will stash a struct binder_device for the specific binder
> + * device in i_private of the inode.
> + *
> + * Return: 0 on success, negative errno on failure
> + */
> +static int binderfs_new_inode(struct inode *ref_inode,
> +   struct binder_device *device,
> +   struct binderfs_device __user *userp,
> +   struct binderfs_device *req)
> +{
> + int minor, ret;
> + struct dentry *dentry, *dup, *root;
> + size_t name_len = BINDERFS_MAX_NAME + 1;
> + char *name = NULL;
> + struct inode *inode = NULL;
> + struct super_block *sb = ref_inode->i_sb;
> + struct binderfs_info *info = sb->s_fs_info;
> +
> + /* Reserve new minor number for the new device. */
> + mutex_lock(&binderfs_minors_mutex);
> + minor = ida_alloc_max(&binderfs_minors, BINDERFS_MAX_MINOR, GFP_KERNEL);
> + mutex_unlock(&binderfs_minors_mutex);
> + if (minor < 0)
> + return minor;
> +
> + ret = -ENOMEM;
> + inode = new_inode(sb);
> + if (!inode)
> + goto err;
> +
> + inode->i_ino = minor + INODE_OFFSET;
> + inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> + init_special_inode(inode, S_IFCHR | 0600,
> +MKDEV(MAJOR(binderfs_dev), minor));
> + inode->i_fop = &binder_fops;
> + inode->i_uid = info->root_uid;
> + inode->i_gid = info->root_gid;
> + inode->i_private = device;
> +
> + name = kmalloc(name_len, GFP_KERNEL);
> + if (!name)
> + goto err;
> +
> + ret = snprintf(name, name_len, "%s", req->name);
> + if (ret < 0 || (size_t)ret >= name_len) {

kernel snprintf() doesn't return negatives and the cast isn't required
either.
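
A hedged sketch of the simplified check this implies (illustrative only):

	ret = snprintf(name, name_len, "%s", req->name);
	if (ret >= name_len) {
		ret = -EINVAL;
		goto err;
	}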

> + ret = -EINVAL;
> + goto err;
> + }
> +
> + device->binderfs_inode = inode;
> + device->context.binder_context_mgr_uid = INVALID_UID;
> + device->context.name = name;
> + device->miscdev.name = name;
> + device->miscdev.minor = minor;
> + mutex_init(&device->context.context_mgr_node_lock);
> +
> + req->major = MAJOR(binderfs_dev);
> + req->minor = minor;
> +
> + ret = copy_to_user(userp, req, sizeof(*req));
> + if (ret)
> + goto err;

copy_to_user() returns the number of bytes remaining.

ret = -EFAULT;
if (copy_to_user(userp, req, sizeof(*req)))
goto err;

Also, if this copy_to_user() fails, then do the kfree(name) and the
iput(inode) lead to a double free of name in binderfs_evict_inode()?

> +
> + root = sb->s_root;
> + inode_lock(d_inode(root));
> + dentry = d_alloc_name(root, name);
> + if (!dentry) {
> + inode_unlock(d_inode(root));
> + ret = -ENOMEM;
> + goto err;
> + }
> +
> + /* Verify that the name userspace gave us is not already in use. */
> + dup = d_lookup(root, &dentry->d_name);
> + if (dup) {
> + if (d_really_is_positive(dup)) {
> + dput(dup);
> + dput(dentry);
> + inode_unlock(d_inode(root));
> + /*
> +  * Prevent double free since iput() calls
> +  * binderfs_evict_inode().
> +  */
> + inode->i_private = NULL;
> + ret = -EEXIST;
> + goto err;
> + }
> + dput(dup);
> + }
> +
> + d_add(dentry, inode);
> + fsnotify_create(root->d_inode, dentry);
> + inode_unlock(d_inode(root));
> +
> + return 0;
> +
> +err:
> + kfree(name);
> + mutex_lock(&binderfs_minors_mutex);
> + ida_free(&binderfs_minors, minor);
> + mutex_unlock(&binderfs_minors_mutex);
> + iput(inode);
> +
> + return ret;
> +}
> +
> +static int binderfs_binder_device_create(struct inode *inode,
> +  struct binderfs_device __user *userp,
> +  struct binderfs_device *req)
> +{
> + struct binder_device *device;
> + int ret;
> +
> + device = kzalloc(sizeof(*device), GFP_KERNEL);
> + if (!device)
> + return -ENOMEM;

Just move this allocation into binderfs_new_inode() and get rid of this
function.

> +
> + ret = binderfs_new_inode(inode, device, userp, req);
> + if 

Re: [PATCH 4.19 051/118] flexfiles: use per-mirror specified stateid for IO

2018-12-13 Thread Greg Kroah-Hartman
On Wed, Dec 12, 2018 at 08:06:14AM +0100, Greg Kroah-Hartman wrote:
> On Tue, Dec 11, 2018 at 07:49:27PM +0100, Mkrtchyan, Tigran wrote:
> > 
> > 
> > Hi Greg,
> > 
> > Thanks for pushing this into stable as well. However, I think the patch
> > makes more sense
> > with 320f35b7bf8cccf1997ca3126843535e1b95e9c4
> 
> I need an ack from the nfs maintainer before I can do that...

Ah nevermind, now queued up, it looks sane.

thanks,

greg k-h


Re: [PATCH 0/6] microblaze: fix various problems in building boot images

2018-12-13 Thread Michal Simek
On 07. 12. 18 12:33, Masahiro Yamada wrote:
> This patch set fixes various issues in microblaze Makefiles.
> 
> V2 reflected Michal's comments, and cleaned up a little more.
> 
> I did not add Michal's Acked-by.
> If this patch set goes to the MicroBlaze tree, he will add
> Signed-off-by anyway.
> 
> This patch set is independent of Kbuild tree,
> so it should apply to MicroBlaze tree.
> 
> Resolved the conflict with:
> 
> commit 1e17ab5320a654eaf1e4ce121c61e7aa9732805a
> Author: Firoz Khan 
> Date:   Tue Nov 13 11:34:34 2018 +0530
> 
> microblaze: generate uapi header and system call table files
> 
> 
> 
> 
> Masahiro Yamada (6):
>   microblaze: adjust the help to the real behavior
>   microblaze: move "... is ready" messages to arch/microblaze/Makefile
>   microblaze: fix multiple bugs in arch/microblaze/boot/Makefile
>   microblaze: add linux.bin* and simpleImage.* to PHONY
>   microblaze: fix race condition in building boot images
>   microblaze: remove the explicit removal of system.dtb
> 
>  arch/microblaze/Makefile  | 22 ++
>  arch/microblaze/boot/Makefile | 23 +--
>  arch/microblaze/boot/dts/Makefile |  5 +
>  3 files changed, 24 insertions(+), 26 deletions(-)
> 

Next time please also add v2 to the subject.

Applied all.

Thanks,
Michal


-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Xilinx Microblaze
Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP ARM64 SoCs
U-Boot custodian - Xilinx Microblaze/Zynq/ZynqMP/Versal SoCs






Re: [PATCH] drm/xen-front: Make shmem backed display buffer coherent

2018-12-13 Thread Oleksandr Andrushchenko

On 12/13/18 5:48 PM, Daniel Vetter wrote:
> On Thu, Dec 13, 2018 at 12:17:54PM +0200, Oleksandr Andrushchenko wrote:
>> Daniel, could you please comment?
>
> Cross-reviewing someone else's stuff would scale better,

fair enough

> I don't think
> I'll get around to anything before next year.

I put you on CC explicitly because you had comments on other patch [1]
and this one tries to solve the issue raised (I tried to figure out
at [2] if this is the way to go, but it seems I have no alternative here).

While at it, [3] (I hope) addresses your comments and the series just
needs your single ack/nack to get in: all the rest ack/r-b are already
there. Do you mind looking at it?

> -Daniel

Thank you very much for your time,
Oleksandr

Thank you

On 11/27/18 12:32 PM, Oleksandr Andrushchenko wrote:

From: Oleksandr Andrushchenko 

When GEM backing storage is allocated with drm_gem_get_pages
the backing pages may be cached, thus making it possible that
the backend sees only partial content of the buffer which may
lead to screen artifacts. Make sure that the frontend's
memory is coherent and the backend always sees correct display
buffer content.

Fixes: c575b7eeb89f ("drm/xen-front: Add support for Xen PV display frontend")

Signed-off-by: Oleksandr Andrushchenko 
---
   drivers/gpu/drm/xen/xen_drm_front_gem.c | 62 +++--
   1 file changed, 48 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/xen/xen_drm_front_gem.c 
b/drivers/gpu/drm/xen/xen_drm_front_gem.c
index 47ff019d3aef..c592735e49d2 100644
--- a/drivers/gpu/drm/xen/xen_drm_front_gem.c
+++ b/drivers/gpu/drm/xen/xen_drm_front_gem.c
@@ -33,8 +33,11 @@ struct xen_gem_object {
/* set for buffers allocated by the backend */
bool be_alloc;
-   /* this is for imported PRIME buffer */
-   struct sg_table *sgt_imported;
+   /*
+* this is for imported PRIME buffer or the one allocated via
+* drm_gem_get_pages.
+*/
+   struct sg_table *sgt;
   };
   static inline struct xen_gem_object *
@@ -77,10 +80,21 @@ static struct xen_gem_object *gem_create_obj(struct 
drm_device *dev,
return xen_obj;
   }
+struct sg_table *xen_drm_front_gem_get_sg_table(struct drm_gem_object *gem_obj)
+{
+   struct xen_gem_object *xen_obj = to_xen_gem_obj(gem_obj);
+
+   if (!xen_obj->pages)
+   return ERR_PTR(-ENOMEM);
+
+   return drm_prime_pages_to_sg(xen_obj->pages, xen_obj->num_pages);
+}
+
   static struct xen_gem_object *gem_create(struct drm_device *dev, size_t size)
   {
struct xen_drm_front_drm_info *drm_info = dev->dev_private;
struct xen_gem_object *xen_obj;
+   struct address_space *mapping;
int ret;
size = round_up(size, PAGE_SIZE);
@@ -113,10 +127,14 @@ static struct xen_gem_object *gem_create(struct 
drm_device *dev, size_t size)
xen_obj->be_alloc = true;
return xen_obj;
}
+
/*
 * need to allocate backing pages now, so we can share those
 * with the backend
 */
+   mapping = xen_obj->base.filp->f_mapping;
+   mapping_set_gfp_mask(mapping, GFP_USER | __GFP_DMA32);
+
xen_obj->num_pages = DIV_ROUND_UP(size, PAGE_SIZE);
xen_obj->pages = drm_gem_get_pages(&xen_obj->base);
if (IS_ERR_OR_NULL(xen_obj->pages)) {
@@ -125,8 +143,27 @@ static struct xen_gem_object *gem_create(struct drm_device 
*dev, size_t size)
goto fail;
}
+   xen_obj->sgt = xen_drm_front_gem_get_sg_table(&xen_obj->base);
+   if (IS_ERR_OR_NULL(xen_obj->sgt)) {
+   ret = PTR_ERR(xen_obj->sgt);
+   xen_obj->sgt = NULL;
+   goto fail_put_pages;
+   }
+
+   if (!dma_map_sg(dev->dev, xen_obj->sgt->sgl, xen_obj->sgt->nents,
+   DMA_BIDIRECTIONAL)) {
+   ret = -EFAULT;
+   goto fail_free_sgt;
+   }
+
return xen_obj;
+fail_free_sgt:
+   sg_free_table(xen_obj->sgt);
+   xen_obj->sgt = NULL;
+fail_put_pages:
+   drm_gem_put_pages(&xen_obj->base, xen_obj->pages, true, false);
+   xen_obj->pages = NULL;
   fail:
DRM_ERROR("Failed to allocate buffer with size %zu\n", size);
return ERR_PTR(ret);
@@ -149,7 +186,7 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
struct xen_gem_object *xen_obj = to_xen_gem_obj(gem_obj);
if (xen_obj->base.import_attach) {
-   drm_prime_gem_destroy(&xen_obj->base, xen_obj->sgt_imported);
+   drm_prime_gem_destroy(&xen_obj->base, xen_obj->sgt);
gem_free_pages_array(xen_obj);
} else {
if (xen_obj->pages) {
@@ -158,6 +195,13 @@ void xen_drm_front_gem_free_object_unlocked(struct 
drm_gem_object *gem_obj)
xen_obj->pages);
gem_free_pages_array(xen_obj);
} else {
+   

[PATCH] partitions: fix coding style

2018-12-13 Thread jotun9935
From: Sungkyung Kim 

Fix coding style of osf.c

Signed-off-by: Sungkyung Kim 
---
 block/partitions/osf.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/partitions/osf.c b/block/partitions/osf.c
index 4b873973d6c0..96921a1e31ce 100644
--- a/block/partitions/osf.c
+++ b/block/partitions/osf.c
@@ -22,7 +22,7 @@ int osf_partition(struct parsed_partitions *state)
unsigned char *data;
struct disklabel {
__le32 d_magic;
-   __le16 d_type,d_subtype;
+   __le16 d_type, d_subtype;
u8 d_typename[16];
u8 d_packname[16];
__le32 d_secsize;
@@ -50,8 +50,8 @@ int osf_partition(struct parsed_partitions *state)
u8  p_frag;
__le16 p_cpg;
} d_partitions[MAX_OSF_PARTITIONS];
-   } * label;
-   struct d_partition * partition;
+   } *label;
+   struct d_partition *partition;
 
data = read_part_sector(state, 0, &sect);
if (!data)
@@ -74,7 +74,7 @@ int osf_partition(struct parsed_partitions *state)
}
for (i = 0 ; i < npartitions; i++, partition++) {
if (slot == state->limit)
-   break;
+   break;
if (le32_to_cpu(partition->p_size))
put_partition(state, slot,
le32_to_cpu(partition->p_offset),
-- 
2.18.0



Re: [PATCH] ARM: dts: exynos: Specify I2S assigned clocks in proper node

2018-12-13 Thread Greg KH
On Thu, Dec 13, 2018 at 09:56:56PM +0100, Krzysztof Kozlowski wrote:
> On Wed, Dec 12, 2018 at 06:57:44PM +0100, Sylwester Nawrocki wrote:
> > The assigned parent clocks should be normally specified in the consumer
> > device's DT node, this ensures respective driver always sees correct clock
> > settings when required.
> > 
> > This patch fixes regression in audio subsystem on Odroid XU3/XU4 boards
> > that appeared after commits:
> > 
> > 'commit 647d04f8e07a ("ASoC: samsung: i2s: Ensure the RCLK rate is properly 
> > determined")'
> > 'commit 995e73e55f46 ("ASoC: samsung: i2s: Fix rclk_srcrate handling")'
> > 'commit 48279c53fd1d ("ASoC: samsung: i2s: Prevent external abort on 
> > exynos5433 I2S1 access")'
> > 
> > Without this patch the driver gets wrong clock as the I2S function (op_clk)
> > clock in probe() and effectively the clock which is finally assigned from DT
> > is not being enabled/disabled in the runtime resume/suspend ops.
> > 
> > Without the above listed commits the EXYNOS_I2S_BUS clock was always set
> > as parent of CLK_I2S_RCLK_SRC regardless of DT settings so there was no 
> > issue
> > with not enabled EXYNOS_SCLK_I2S.
> > 
> > Cc: sta...@vger.kernel.org # v4.17+
> 
> I guess your format would work (got recognized by stable scripts) but
> strictly speaking format is different:
> 
>   Cc:  # 4.17.x
> 
> https://elixir.bootlin.com/linux/latest/source/Documentation/process/stable-kernel-rules.rst#L127

Either works just fine, my scripts have to be a bit flexible due to all
of the odd ways people like to tag things here...

thanks,

greg k-h


Re: [PATCH 2/3] ASoC: xlnx: Add i2s driver

2018-12-13 Thread Michal Simek
Hi Mark,

On 13. 12. 18 16:31, Mark Brown wrote:
> On Sat, Dec 08, 2018 at 12:02:37AM +0530, Maruthi Srinivas Bayyavarapu wrote:
> 
>> @@ -0,0 +1,185 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Xilinx ASoC I2S audio support
>> + *
> 
> This looks otherwise good so I've applied it but please send a followup
> patch converting the entire comment block to C++ style so this looks
> more consistent.

Is this the rule for your subsystems, or did it come from some generic
agreement on how this should be handled in .c files?

Thanks,
Michal




RE: [RFC PATCH v2 08/15] usb:cdns3: Implements device operations part of the API

2018-12-13 Thread Pawel Laszczak
Hi,
>
>On Wed, Dec 12, 2018 at 3:49 AM Pawel Laszczak  wrote:
>>
>> Hi,
>>
>> >On 10/12/18 7:42 AM, Peter Chen wrote:
>>  +static struct usb_ep *cdns3_gadget_match_ep(struct usb_gadget *gadget,
>>  + struct 
>>  usb_endpoint_descriptor *desc,
>>  + struct 
>>  usb_ss_ep_comp_descriptor *comp_desc)
>>  +{
>>  + struct cdns3_device *priv_dev = gadget_to_cdns3_device(gadget);
>>  + struct cdns3_endpoint *priv_ep;
>>  + unsigned long flags;
>>  +
>>  + priv_ep = cdns3_find_available_ss_ep(priv_dev, desc);
>>  + if (IS_ERR(priv_ep)) {
>>  + dev_err(&priv_dev->dev, "no available ep\n");
>>  + return NULL;
>>  + }
>>  +
>>  + dev_dbg(&priv_dev->dev, "match endpoint: %s\n", priv_ep->name);
>>  +
>>  + spin_lock_irqsave(&priv_dev->lock, flags);
>>  + priv_ep->endpoint.desc = desc;
>>  + priv_ep->dir  = usb_endpoint_dir_in(desc) ? USB_DIR_IN : 
>>  USB_DIR_OUT;
>>  + priv_ep->type = usb_endpoint_type(desc);
>>  +
>>  + list_add_tail(&priv_ep->ep_match_pending_list,
>>  +   &priv_dev->ep_match_list);
>>  + spin_unlock_irqrestore(&priv_dev->lock, flags);
>>  + return &priv_ep->endpoint;
>>  +}
>> >>> Why do you need a custom match_ep?
>> >>> doesn't usb_ep_autoconfig suffice?
>> >>>
>> >>> You can check if EP is claimed or not by checking the ep->claimed flag.
>> >>>
>> >> It is a special requirement for this IP, the EP type and MaxPacketSize
>> >> changing can't be done at runtime, eg, at ep->enable. See below commit
>> >> for detail.
>> >>
>> >> usb: cdns3: gadget: configure all endpoints before set configuration
>> >>
>> >> Cadence IP has one limitation that all endpoints must be configured
>> >> (Type & MaxPacketSize) before setting configuration through hardware
>> >> register, it means we can't change endpoints configuration after
>> >> set_configuration.
>> >>
>> >> In this patch, we add non-control endpoint through 
>> >> usb_ss->ep_match_list,
>> >> which is added when the gadget driver uses usb_ep_autoconfig to 
>> >> configure
>> >> specific endpoint; When the udc driver receives set_configurion 
>> >> request,
>> >> it goes through usb_ss->ep_match_list, and configure all endpoints
>> >> accordingly.
>> >>
>> >> At usb_ep_ops.enable/disable, we only enable and disable endpoint 
>> >> through
>> >> ep_cfg register which can be changed after set_configuration, and do
>> >> some software operation accordingly.
>> >
>> >All this should be part of comments in code along with information about
>> >controller versions which suffer from the errata.
>> >
>> >Is there a version of controller available which does not have the
>> >defect? Is there a future plan to fix this?
>> >
>> >If any of that is yes, you probably want to handle this with runtime
>> >detection of version (like done with DWC3_REVISION_XXX macros).
>> >Sometimes the hardware-read versions themselves are incorrect, so its
>> >better to introduce a version specific compatible too like
>> >"cdns,usb-1.0.0" (as hinted to by Rob Herring as well).
>> >
>>
>> custom match_ep is used and works with all versions of the gen1
>> controller. Future (gen2) releases of the controller won’t have such a
>> limitation, but there is no plan to change the current (gen1) functionality
>> of the controller.
>>
>> I will add a comment before the cdns3_gadget_match_ep function.
>> Also I will change cdns,usb3 to cdns,usb3-1.0.0 and add an additional
>> cdns,usb3-1.0.1 compatible.
>>
>> cdns,usb3-1.0.1 will be for the current version of the controller which I use.
>> cdns,usb3-1.0.0 will be for the older version - Peter Chen's platform.
>> I know that I have some changes in the controller, and one of them requires
>> some changes in the DRD driver. It will be safer to add two separate
>> versions in the compatibles.
>>
>
>Pawel, could we have a correct register to show the controller version? It is
>better if we could do the version judgement at runtime instead of using a static compatible.
>

OK, I will try to do it this way.

Pawel 


Re: [PATCH 0/2] of: phandle_cache, fix refcounts, remove stale entry

2018-12-13 Thread Frank Rowand
Hi Michael Bringmann,

On 12/13/18 10:42 PM, frowand.l...@gmail.com wrote:
> From: Frank Rowand 
> 
> Non-overlay dynamic devicetree node removal may leave the node in
> the phandle cache.  Subsequent calls to of_find_node_by_phandle()
> will incorrectly find the stale entry.  This bug exposed the following
> phandle cache refcount bug.
> 
> The refcount of phandle_cache entries is not incremented while in
> the cache, allowing use after free error after kfree() of the
> cached entry.
> 
> Frank Rowand (2):
>   of: of_node_get()/of_node_put() nodes held in phandle cache
>   of: __of_detach_node() - remove node from phandle cache
> 
>  drivers/of/base.c   | 99 
> -
>  drivers/of/dynamic.c|  3 ++
>  drivers/of/of_private.h |  4 ++
>  3 files changed, 81 insertions(+), 25 deletions(-)
> 

Can you please test that these patches fix the problem that you
reported in:

[PATCH v03] powerpc/mobility: Fix node detach/rename problem

Thanks,

Frank


[v4] PCI: imx: make msi work without CONFIG_PCIEPORTBUS=y

2018-12-13 Thread Richard Zhu
Assertion of the MSI Enable bit in the RC's MSI capability is required to
trigger MSI on i.MX6 PCIe.
This bit is asserted when CONFIG_PCIEPORTBUS=y.
Thus, MSI worked fine on i.MX6 PCIe before commit "f3fdfc4".

Assert it unconditionally when MSI is enabled.
Otherwise, MSIs would not be triggered even though the EP is present and
the MSIs are assigned.

Signed-off-by: Richard Zhu 
Reviewed-by: Lucas Stach 
---
Changes v1 -> v2:
* Assert the MSI_EN unconditionally when MSI is supported.
Changes v2 -> v3:
* Remove the IS_ENABLED(CONFIG_PCI_MSI) since the driver depends on
PCI_MSI_IRQ_DOMAIN
* Extended with a check for pci_msi_enabled() to see if the user
explicitly want legacy IRQs
Changes v3 -> v4:
* Refer to Bjorn's comments, refine the subject and commit log and change
the PCI_MSI_CAP to PCIE_RC_IMX6_MSI_CAP.
---
 drivers/pci/controller/dwc/pci-imx6.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/pci/controller/dwc/pci-imx6.c 
b/drivers/pci/controller/dwc/pci-imx6.c
index 26087b3..639bb27 100644
--- a/drivers/pci/controller/dwc/pci-imx6.c
+++ b/drivers/pci/controller/dwc/pci-imx6.c
@@ -74,6 +74,7 @@ struct imx6_pcie {
 #define PHY_PLL_LOCK_WAIT_USLEEP_MAX   200
 
 /* PCIe Root Complex registers (memory-mapped) */
+#define PCIE_RC_IMX6_MSI_CAP   0x50
#define PCIE_RC_LCR   0x7c
 #define PCIE_RC_LCR_MAX_LINK_SPEEDS_GEN1   0x1
 #define PCIE_RC_LCR_MAX_LINK_SPEEDS_GEN2   0x2
@@ -926,6 +927,7 @@ static int imx6_pcie_probe(struct platform_device *pdev)
struct resource *dbi_base;
struct device_node *node = dev->of_node;
int ret;
+   u16 val;
 
imx6_pcie = devm_kzalloc(dev, sizeof(*imx6_pcie), GFP_KERNEL);
if (!imx6_pcie)
@@ -1071,6 +1073,14 @@ static int imx6_pcie_probe(struct platform_device *pdev)
if (ret < 0)
return ret;
 
+   if (pci_msi_enabled()) {
+   val = dw_pcie_readw_dbi(pci, PCIE_RC_IMX6_MSI_CAP +
+   PCI_MSI_FLAGS);
+   val |= PCI_MSI_FLAGS_ENABLE;
+   dw_pcie_writew_dbi(pci, PCIE_RC_IMX6_MSI_CAP +
+   PCI_MSI_FLAGS, val);
+   }
+
return 0;
 }
 
-- 
2.7.4



[PATCH 2/2] of: __of_detach_node() - remove node from phandle cache

2018-12-13 Thread frowand . list
From: Frank Rowand 

Non-overlay dynamic devicetree node removal may leave the node in
the phandle cache.  Subsequent calls to of_find_node_by_phandle()
will incorrectly find the stale entry.  Remove the node from the
cache.

Add paranoia checks in of_find_node_by_phandle() as a second level
of defense (do not return cached node if detached, do not add node
to cache if detached).

Reported-by: Michael Bringmann 
Signed-off-by: Frank Rowand 
---
 drivers/of/base.c   | 29 -
 drivers/of/dynamic.c|  3 +++
 drivers/of/of_private.h |  4 
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index d599367cb92a..34a5125713c8 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -162,6 +162,27 @@ int of_free_phandle_cache(void)
 late_initcall_sync(of_free_phandle_cache);
 #endif
 
+/*
+ * Caller must hold devtree_lock.
+ */
+void __of_free_phandle_cache_entry(phandle handle)
+{
+   phandle masked_handle;
+
+   if (!handle)
+   return;
+
+   masked_handle = handle & phandle_cache_mask;
+
+   if (phandle_cache) {
+   if (phandle_cache[masked_handle] &&
+   handle == phandle_cache[masked_handle]->phandle) {
+   of_node_put(phandle_cache[masked_handle]);
+   phandle_cache[masked_handle] = NULL;
+   }
+   }
+}
+
 void of_populate_phandle_cache(void)
 {
unsigned long flags;
@@ -1209,11 +1230,17 @@ struct device_node *of_find_node_by_phandle(phandle 
handle)
if (phandle_cache[masked_handle] &&
handle == phandle_cache[masked_handle]->phandle)
np = phandle_cache[masked_handle];
+   if (np && of_node_check_flag(np, OF_DETACHED)) {
+   of_node_put(np);
+   phandle_cache[masked_handle] = NULL;
+   np = NULL;
+   }
}
 
if (!np) {
for_each_of_allnodes(np)
-   if (np->phandle == handle) {
+   if (np->phandle == handle &&
+   !of_node_check_flag(np, OF_DETACHED)) {
if (phandle_cache) {
/* will put when removed from cache */
of_node_get(np);
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index f4f8ed9b5454..ecea92f68c87 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -268,6 +268,9 @@ void __of_detach_node(struct device_node *np)
}
 
of_node_set_flag(np, OF_DETACHED);
+
+   /* race with of_find_node_by_phandle() prevented by devtree_lock */
+   __of_free_phandle_cache_entry(np->phandle);
 }
 
 /**
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index 5d1567025358..24786818e32e 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -84,6 +84,10 @@ static inline void __of_detach_node_sysfs(struct device_node 
*np) {}
 int of_resolve_phandles(struct device_node *tree);
 #endif
 
+#if defined(CONFIG_OF_DYNAMIC)
+void __of_free_phandle_cache_entry(phandle handle);
+#endif
+
 #if defined(CONFIG_OF_OVERLAY)
 void of_overlay_mutex_lock(void);
 void of_overlay_mutex_unlock(void);
-- 
Frank Rowand 



[PATCH 0/2] of: phandle_cache, fix refcounts, remove stale entry

2018-12-13 Thread frowand . list
From: Frank Rowand 

Non-overlay dynamic devicetree node removal may leave the node in
the phandle cache.  Subsequent calls to of_find_node_by_phandle()
will incorrectly find the stale entry.  This bug exposed the following
phandle cache refcount bug.

The refcount of phandle_cache entries is not incremented while in
the cache, allowing a use-after-free error after kfree() of the
cached entry.

Frank Rowand (2):
  of: of_node_get()/of_node_put() nodes held in phandle cache
  of: __of_detach_node() - remove node from phandle cache

 drivers/of/base.c   | 99 -
 drivers/of/dynamic.c|  3 ++
 drivers/of/of_private.h |  4 ++
 3 files changed, 81 insertions(+), 25 deletions(-)

-- 
Frank Rowand 



[PATCH 1/2] of: of_node_get()/of_node_put() nodes held in phandle cache

2018-12-13 Thread frowand . list
From: Frank Rowand 

The phandle cache contains struct device_node pointers.  The refcount
of the pointers was not incremented while in the cache, allowing a
use-after-free error after kfree() of the node.  Add the proper increment
and decrement of the use count.
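
As a rough illustration of the lifetime rule being added (a hedged
sketch, not the actual cache code; the single-slot "cache" below is an
assumption made for brevity):

  #include <linux/of.h>

  static struct device_node *cached_np;	/* stands in for one phandle_cache[] slot */

  static void cache_store(struct device_node *np)
  {
  	/* Without this of_node_get(), nothing prevents np from being
  	 * kfree()d while the cache still holds the raw pointer. */
  	cached_np = of_node_get(np);
  }

  static void cache_drop(void)
  {
  	of_node_put(cached_np);	/* balances the get taken in cache_store() */
  	cached_np = NULL;
  }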

Fixes: 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")

Signed-off-by: Frank Rowand 
---

do not "cc: stable", unless the following commits are also in stable:

  commit e54192b48da7 ("of: fix phandle cache creation for DTs with no 
phandles")
  commit b9952b5218ad ("of: overlay: update phandle cache on overlay apply and 
remove")
  commit 0b3ce78e90fc ("of: cache phandle nodes to reduce cost of 
of_find_node_by_phandle()")

 drivers/of/base.c | 70 ---
 1 file changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 09692c9b32a7..d599367cb92a 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -116,9 +116,6 @@ int __weak of_node_to_nid(struct device_node *np)
 }
 #endif
 
-static struct device_node **phandle_cache;
-static u32 phandle_cache_mask;
-
 /*
  * Assumptions behind phandle_cache implementation:
  *   - phandle property values are in a contiguous range of 1..n
@@ -127,6 +124,44 @@ int __weak of_node_to_nid(struct device_node *np)
  *   - the phandle lookup overhead reduction provided by the cache
  * will likely be less
  */
+
+static struct device_node **phandle_cache;
+static u32 phandle_cache_mask;
+
+/*
+ * Caller must hold devtree_lock.
+ */
+void __of_free_phandle_cache(void)
+{
+   u32 cache_entries = phandle_cache_mask + 1;
+   u32 k;
+
+   if (!phandle_cache)
+   return;
+
+   for (k = 0; k < cache_entries; k++)
+   of_node_put(phandle_cache[k]);
+
+   kfree(phandle_cache);
+   phandle_cache = NULL;
+}
+
+int of_free_phandle_cache(void)
+{
+   unsigned long flags;
+
+   raw_spin_lock_irqsave(&devtree_lock, flags);
+
+   __of_free_phandle_cache();
+
+   raw_spin_unlock_irqrestore(&devtree_lock, flags);
+
+   return 0;
+}
+#if !defined(CONFIG_MODULES)
+late_initcall_sync(of_free_phandle_cache);
+#endif
+
 void of_populate_phandle_cache(void)
 {
unsigned long flags;
@@ -136,8 +171,7 @@ void of_populate_phandle_cache(void)
 
raw_spin_lock_irqsave(&devtree_lock, flags);
 
-   kfree(phandle_cache);
-   phandle_cache = NULL;
+   __of_free_phandle_cache();
 
for_each_of_allnodes(np)
if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
@@ -155,30 +189,15 @@ void of_populate_phandle_cache(void)
goto out;
 
for_each_of_allnodes(np)
-   if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
+   if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
+   of_node_get(np);
phandle_cache[np->phandle & phandle_cache_mask] = np;
+   }
 
 out:
raw_spin_unlock_irqrestore(&devtree_lock, flags);
 }
 
-int of_free_phandle_cache(void)
-{
-   unsigned long flags;
-
-   raw_spin_lock_irqsave(&devtree_lock, flags);
-
-   kfree(phandle_cache);
-   phandle_cache = NULL;
-
-   raw_spin_unlock_irqrestore(&devtree_lock, flags);
-
-   return 0;
-}
-#if !defined(CONFIG_MODULES)
-late_initcall_sync(of_free_phandle_cache);
-#endif
-
 void __init of_core_init(void)
 {
struct device_node *np;
@@ -1195,8 +1214,11 @@ struct device_node *of_find_node_by_phandle(phandle 
handle)
if (!np) {
for_each_of_allnodes(np)
if (np->phandle == handle) {
-   if (phandle_cache)
+   if (phandle_cache) {
+   /* will put when removed from cache */
+   of_node_get(np);
phandle_cache[masked_handle] = np;
+   }
break;
}
}
-- 
Frank Rowand 



Re: [PATCH V3 6/6] PM / Domains: Propagate performance state updates

2018-12-13 Thread Viresh Kumar
On 13-12-18, 16:53, Ulf Hansson wrote:
> On Wed, 12 Dec 2018 at 11:58, Viresh Kumar  wrote:
> >  update_state:
> > -   return _genpd_set_performance_state(genpd, state);
> > +   return _genpd_set_performance_state(genpd, state, depth);
> 
> Instead of calling _genpd_set_performance_state() from here, I suggest
> to let the caller do it. Simply return the aggregated new state, if it
> needs to be updated - and zero if no update is needed.
> 
> Why? I think it may clarify and simplify the code, in regards to the
> actual set/propagation of state changes. Another side-effect, is that
> you should be able to avoid the forward declaration of
> _genpd_reeval_performance_state(), which I think is nice as well.

_genpd_reeval_performance_state() is currently called from 3 different
places and with the suggested change those sites will have this diff.

-   ret = _genpd_reeval_performance_state(master, master_state,
- depth + 1);
+   master_state = _genpd_reeval_performance_state(master,
+   master_state);
+   ret = _genpd_set_performance_state(genpd, master_state, depth);

To be honest, I don't like it, probably because I don't find the extra
declaration of _genpd_reeval_performance_state() that bad. If two
routines are always going to be called together, it is worth calling
the second one from the first one, in my view.

But anyway, I am fine with it if you are. Please let me know.

> >  }
> >
> >  /**
> > @@ -332,7 +407,7 @@ int dev_pm_genpd_set_performance_state(struct device 
> > *dev, unsigned int state)
> > prev = gpd_data->performance_state;
> > gpd_data->performance_state = state;
> >
> > -   ret = _genpd_reeval_performance_state(genpd, state);
> > +   ret = _genpd_reeval_performance_state(genpd, state, 0);
> > if (ret)
> > gpd_data->performance_state = prev;
> >
> > diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> > index 9ad101362aef..dd364abb649a 100644
> > --- a/include/linux/pm_domain.h
> > +++ b/include/linux/pm_domain.h
> > @@ -136,6 +136,10 @@ struct gpd_link {
> > struct list_head master_node;
> > struct generic_pm_domain *slave;
> > struct list_head slave_node;
> > +
> > +   /* Sub-domain's per-master domain performance state */
> > +   unsigned int performance_state;
> > +   unsigned int prev_performance_state;
> 
> Probably a leftover from the earlier versions, please remove.

No, these are still getting used.

-- 
viresh


Re: [v7, PATCH 1/2] net:stmmac: dwmac-mediatek: add support for mt2712

2018-12-13 Thread biao huang
Dear Florian,
Thanks for your comments.

On Thu, 2018-12-13 at 21:11 -0800, Florian Fainelli wrote:
> Le 12/13/18 à 7:01 PM, biao huang a écrit :
> > Dear Andrew,
> > Thanks for your comments.
> > 
> > On Thu, 2018-12-13 at 13:33 +0100, Andrew Lunn wrote:
> >> Hi Biao
> >>
> >>> + case PHY_INTERFACE_MODE_RGMII:
> >>> + /* the PHY is not responsible for inserting any internal
> >>> +  * delay by itself in PHY_INTERFACE_MODE_RGMII case,
> >>> +  * so Ethernet MAC will insert delays for both transmit
> >>> +  * and receive path here.
> >>> +  */
> >>
> >> What if the PCB designed has decided to do a kink in the clock to add
> >> the delays? I don't think any of these delays should depend on the PHY
> >> interface mode. It is up to the device tree writer to set both the PHY
> >> delay and the MAC delay, based on knowledge of the board, including
> >> any kicks in the tracks. The driver should then do what it is told.
> >>
> > Originally, we recommended equal trace lengths on the PCB, which means
> > that RGMII delay introduced by PCB traces is not recommended, so only
> > the PHY/MAC delay is taken into account in the transmit/receive path.
> > 
> > As you described above, maybe the equal PCB trace length assumption is
> > not reasonable, and we'll only handle the MAC delay-ps in our driver based
> > on the device tree information no matter which RGMII mode is selected.
> 
> Expecting identical PCB traces is something that is hard to enforce with
> external customers, for internal reference boards, absolutely they
> should have those traces of equal length.
> 
Yes, we'll set the corresponding register based on the
tx-delay-ps/rx-delay-ps properties in the device tree for the RGMII
interface.  The PHY_INTERFACE_MODE_RGMII/-RXID/-TXID/-ID modes share
the same flow in the Ethernet driver.

A new patch will be sent to fix this issue.
> > 
> > Since David already applied this patch, I'll send another patch to fix
> > this issue.
> >>> + if (!of_property_read_u32(plat->np, "mediatek,tx-delay-ps", 
> >>> _delay_ps)) {
> >>> + if (tx_delay_ps < plat->variant->tx_delay_max) {
> >>> + mac_delay->tx_delay = tx_delay_ps;
> >>> + } else {
> >>> + dev_err(plat->dev, "Invalid TX clock delay: %dps\n", 
> >>> tx_delay_ps);
> >>> + return -EINVAL;
> >>> + }
> >>> + }
> >>> +
> >>> + if (!of_property_read_u32(plat->np, "mediatek,rx-delay-ps", 
> >>> _delay_ps)) {
> >>> + if (rx_delay_ps < plat->variant->rx_delay_max) {
> >>> + mac_delay->rx_delay = rx_delay_ps;
> >>> + } else {
> >>> + dev_err(plat->dev, "Invalid RX clock delay: %dps\n", 
> >>> rx_delay_ps);
> >>> + return -EINVAL;
> >>> + }
> >>> + }
> >>> +
> >>> + mac_delay->tx_inv = of_property_read_bool(plat->np, 
> >>> "mediatek,txc-inverse");
> >>> + mac_delay->rx_inv = of_property_read_bool(plat->np, 
> >>> "mediatek,rxc-inverse");
> >>> + mac_delay->fine_tune = of_property_read_bool(plat->np, 
> >>> "mediatek,fine-tune");
> >>
> >> Why is fine tune needed? If the requested delay can be done using fine
> >> tune, it should use fine tune. If not, it should use rough tune. The
> >> driver can work this out itself.
> > 
> > Fine tune here represents a more accurate delay circuit than coarse
> > tune, and it's a parallel circuit of coarse tune.
> > For most delay, both fine and coarse tune can meet the requirement.
> > It's up to the user to select which one.
> > 
> > But only one of them can work at the same time, so we need a switch
> > flag(fine_tune here) to indicate which one is valid.
> > Driver can hardly work out which one is working according to delay-ps.
> > 
> > Please correct me if any misunderstanding.
> 
> You are giving a lot of options for users of this Ethernet controller to
> shoot themselves in the feet and spend a good amount of time debugging
> why their RGMII connection is not reliable or have timing violations.
Yes, since fine tune is more accurate and can meet customers'
requirements, we'll remove the 'fine-tune' property from the device
tree, enable the fine-tune circuit by default in the Ethernet driver,
and abandon the coarse delay mechanism, so customers will not be
confused by the options.

I'll send a new patch to fix this issue.




[PATCH] mm: remove unused page state adjustment macro

2018-12-13 Thread Wei Yang
These four macros are not used anymore.

Just remove them.
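
If any out-of-tree caller still used one of these wrappers, the direct
replacement is a one-line change (hypothetical caller; the zone, item
and delta below are illustrative only):

  #include <linux/mm.h>
  #include <linux/vmstat.h>

  /* add_zone_page_state(z, i, d)  ->  mod_zone_page_state(z, i, d)
   * sub_zone_page_state(z, i, d)  ->  mod_zone_page_state(z, i, -d)
   */
  static void adjust_free_pages(struct zone *zone, long nr_pages)
  {
  	mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);	/* was add_zone_page_state() */
  	mod_zone_page_state(zone, NR_FREE_PAGES, -nr_pages);	/* was sub_zone_page_state() */
  }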

Signed-off-by: Wei Yang 
---
 include/linux/vmstat.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index f25cef84b41d..2db8d60981fe 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -239,11 +239,6 @@ extern unsigned long node_page_state(struct pglist_data 
*pgdat,
 #define node_page_state(node, item) global_node_page_state(item)
 #endif /* CONFIG_NUMA */
 
-#define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
-#define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, 
-(__d))
-#define add_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, __d)
-#define sub_node_page_state(__p, __i, __d) mod_node_page_state(__p, __i, 
-(__d))
-
 #ifdef CONFIG_SMP
 void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
 void __inc_zone_page_state(struct page *, enum zone_stat_item);
-- 
2.15.1



[PATCH -V9 05/21] swap: Support PMD swap mapping in put_swap_page()

2018-12-13 Thread Huang Ying
Previously, during swapout, all PMD page mappings were split and
replaced with PTE swap mappings.  And when clearing the SWAP_HAS_CACHE
flag for the huge swap cluster in put_swap_page(), the huge swap
cluster was split.  Now, during swapout, the PMD page mappings to
the THP are changed to PMD swap mappings to the corresponding swap
cluster.  So when clearing the SWAP_HAS_CACHE flag, the huge swap
cluster will only be split if the PMD swap mapping count is 0.
Otherwise, we will keep it as a huge swap cluster, so that we can
swap in a THP in one piece later.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/swapfile.c | 31 ---
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index bd8756ac3bcc..04cf6b95cae0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1314,6 +1314,15 @@ void swap_free(swp_entry_t entry)
 
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
+ *
+ * When a THP is added into swap cache, the SWAP_HAS_CACHE flag will
+ * be set in the swap_map[] of all swap entries in the huge swap
+ * cluster backing the THP.  This huge swap cluster will not be split
+ * unless the THP is split even if its PMD swap mapping count dropped
+ * to 0.  Later, when the THP is removed from swap cache, the
+ * SWAP_HAS_CACHE flag will be cleared in the swap_map[] of all swap
+ * entries in the huge swap cluster.  And this huge swap cluster will
+ * be split if its PMD swap mapping count is 0.
  */
 void put_swap_page(struct page *page, swp_entry_t entry)
 {
@@ -1332,15 +1341,23 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 
ci = lock_cluster_or_swap_info(si, offset);
if (size == SWAPFILE_CLUSTER) {
-   VM_BUG_ON(!cluster_is_huge(ci));
+   VM_BUG_ON(!IS_ALIGNED(offset, size));
map = si->swap_map + offset;
-   for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-   val = map[i];
-   VM_BUG_ON(!(val & SWAP_HAS_CACHE));
-   if (val == SWAP_HAS_CACHE)
-   free_entries++;
+   /*
+* No PMD swap mapping, the swap cluster will be freed
+* if all swap entries becoming free, otherwise the
+* huge swap cluster will be split.
+*/
+   if (!cluster_swapcount(ci)) {
+   for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+   val = map[i];
+   VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+   if (val == SWAP_HAS_CACHE)
+   free_entries++;
+   }
+   if (free_entries != SWAPFILE_CLUSTER)
+   cluster_clear_huge(ci);
}
-   cluster_clear_huge(ci);
if (free_entries == SWAPFILE_CLUSTER) {
unlock_cluster_or_swap_info(si, ci);
spin_lock(&si->lock);
-- 
2.18.1



[PATCH -V9 18/21] swap: Support PMD swap mapping for MADV_WILLNEED

2018-12-13 Thread Huang Ying
During MADV_WILLNEED, for a PMD swap mapping, if THP swapin is enabled
for the VMA, the whole swap cluster will be swapped in.  Otherwise, the
huge swap cluster and the PMD swap mapping will be split and fall back
to PTE swap mappings.
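
From userspace, this path is reached with a plain madvise() call on a
range backed by swapped-out THPs (a hedged sketch; the mapping size and
the assumption that the range was swapped out are illustrative only):

  #include <sys/mman.h>

  int main(void)
  {
  	size_t len = 4UL << 20;		/* PMD-aligned anonymous mapping */
  	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
  			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	if (buf == MAP_FAILED)
  		return 1;

  	/* ... populate buf, let it be swapped out under memory pressure ... */

  	/* Ask the kernel to bring the range back in ahead of access; a PMD
  	 * swap mapping in this range goes through swapin_walk_pmd_entry(). */
  	madvise(buf, len, MADV_WILLNEED);
  	return 0;
  }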

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/madvise.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c1845dab2dd4..84d055c19dd4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -196,14 +196,36 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned 
long start,
pte_t *orig_pte;
struct vm_area_struct *vma = walk->private;
unsigned long index;
+   swp_entry_t entry;
+   struct page *page;
+   pmd_t pmdval;
+
+   pmdval = *pmd;
+   if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(pmdval) &&
+   !is_pmd_migration_entry(pmdval)) {
+   entry = pmd_to_swp_entry(pmdval);
+   if (!transparent_hugepage_swapin_enabled(vma)) {
+   if (!split_swap_cluster(entry, 0))
+   split_huge_swap_pmd(vma, pmd, start, pmdval);
+   } else {
+   page = read_swap_cache_async(entry,
+GFP_HIGHUSER_MOVABLE,
+vma, start, false);
+   if (page) {
+   /* The swap cluster has been split under us */
+   if (!PageTransHuge(page))
+   split_huge_swap_pmd(vma, pmd, start,
+   pmdval);
+   put_page(page);
+   }
+   }
+   }
 
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
return 0;
 
for (index = start; index != end; index += PAGE_SIZE) {
pte_t pte;
-   swp_entry_t entry;
-   struct page *page;
spinlock_t *ptl;
 
orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
-- 
2.18.1



[PATCH -V9 19/21] swap: Support PMD swap mapping in mincore()

2018-12-13 Thread Huang Ying
During mincore(), for a PMD swap mapping, the swap cache will be looked
up.  If the resulting page isn't a compound page, the PMD swap mapping
will be split and fall back to PTE swap mapping processing.
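
For reference, the userspace side that exercises this path looks like
the following (a minimal, hedged sketch; the mapping size is arbitrary):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  int main(void)
  {
  	long page_size = sysconf(_SC_PAGESIZE);
  	size_t len = 2UL << 20;			/* one PMD-sized range */
  	size_t pages = len / page_size;
  	unsigned char *vec = malloc(pages);
  	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
  			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	if (!vec || buf == MAP_FAILED)
  		return 1;

  	((unsigned char *)buf)[0] = 1;		/* fault in the first page */

  	/* One status byte per page; bit 0 set means the page is resident
  	 * (or, for a swapped range, present in the swap cache). */
  	if (mincore(buf, len, vec) == 0)
  		printf("first page %s\n", (vec[0] & 1) ? "present" : "not present");

  	free(vec);
  	munmap(buf, len);
  	return 0;
  }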

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/mincore.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index aa0e542569f9..1d861fac82ee 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -48,7 +48,8 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, 
unsigned long addr,
  * and is up to date; i.e. that no page-in operation would be required
  * at this time if an application were to map and access this page.
  */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
+static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff,
+ bool *compound)
 {
unsigned char present = 0;
struct page *page;
@@ -86,6 +87,8 @@ static unsigned char mincore_page(struct address_space 
*mapping, pgoff_t pgoff)
 #endif
if (page) {
present = PageUptodate(page);
+   if (compound)
+   *compound = PageCompound(page);
put_page(page);
}
 
@@ -103,7 +106,8 @@ static int __mincore_unmapped_range(unsigned long addr, 
unsigned long end,
 
pgoff = linear_page_index(vma, addr);
for (i = 0; i < nr; i++, pgoff++)
-   vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+   vec[i] = mincore_page(vma->vm_file->f_mapping,
+ pgoff, NULL);
} else {
for (i = 0; i < nr; i++)
vec[i] = 0;
@@ -127,14 +131,36 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long 
addr, unsigned long end,
pte_t *ptep;
unsigned char *vec = walk->private;
int nr = (end - addr) >> PAGE_SHIFT;
+   swp_entry_t entry;
 
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
-   memset(vec, 1, nr);
+   unsigned char val = 1;
+   bool compound;
+
+   if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(*pmd)) {
+   entry = pmd_to_swp_entry(*pmd);
+   if (!non_swap_entry(entry)) {
+   val = mincore_page(swap_address_space(entry),
+  swp_offset(entry),
+  &compound);
+   /*
+* The huge swap cluster has been
+* split under us
+*/
+   if (!compound) {
+   __split_huge_swap_pmd(vma, addr, pmd);
+   spin_unlock(ptl);
+   goto fallback;
+   }
+   }
+   }
+   memset(vec, val, nr);
spin_unlock(ptl);
goto out;
}
 
+fallback:
if (pmd_trans_unstable(pmd)) {
__mincore_unmapped_range(addr, end, vma, vec);
goto out;
@@ -150,8 +176,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long 
addr, unsigned long end,
else if (pte_present(pte))
*vec = 1;
else { /* pte is a swap entry */
-   swp_entry_t entry = pte_to_swp_entry(pte);
-
+   entry = pte_to_swp_entry(pte);
if (non_swap_entry(entry)) {
/*
 * migration or hwpoison entries are always
@@ -161,7 +186,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long 
addr, unsigned long end,
} else {
 #ifdef CONFIG_SWAP
*vec = mincore_page(swap_address_space(entry),
-   swp_offset(entry));
+   swp_offset(entry), NULL);
 #else
WARN_ON(1);
*vec = 1;
-- 
2.18.1



[PATCH -V9 07/21] swap: Support PMD swap mapping when splitting huge PMD

2018-12-13 Thread Huang Ying
A huge PMD needs to be split when zapping a part of the PMD mapping,
etc.  If the PMD mapping is a swap mapping, we need to split it too.
This patch implements the support for this.  This is similar to
splitting the PMD page mapping, except we need to decrease the PMD swap
mapping count for the huge swap cluster too.  If the PMD swap mapping
count becomes 0, the huge swap cluster will be split.

Notice: is_huge_zero_pmd() and pmd_page() don't work well with a swap
PMD, so the pmd_present() check is called before them.

Thanks Daniel Jordan for testing and reporting a data corruption bug
caused by misaligned address processing issue in __split_huge_swap_pmd().

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |  4 
 include/linux/swap.h|  6 +
 mm/huge_memory.c| 49 -
 mm/swapfile.c   | 32 +++
 4 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4663ee96cf59..1c0fda003d6a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,10 @@ static inline bool is_huge_zero_page(struct page *page)
return READ_ONCE(huge_zero_page) == page;
 }
 
+/*
+ * is_huge_zero_pmd() must be called after checking pmd_present(),
+ * otherwise, it may report false positive for PMD swap entry.
+ */
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
return is_huge_zero_page(pmd_page(pmd));
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 24c3014894dd..a24d101b131d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -619,11 +619,17 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster_map(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry)
 {
return 0;
 }
+
+static inline int split_swap_cluster_map(swp_entry_t entry)
+{
+   return 0;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bd2543e10938..49df3e7c96c7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1617,6 +1617,41 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, 
pmd_t pmd)
return 0;
 }
 
+/* Convert a PMD swap mapping to a set of PTE swap mappings */
+static void __split_huge_swap_pmd(struct vm_area_struct *vma,
+ unsigned long addr,
+ pmd_t *pmd)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgtable_t pgtable;
+   pmd_t _pmd;
+   swp_entry_t entry;
+   int i, soft_dirty;
+
+   addr &= HPAGE_PMD_MASK;
+   entry = pmd_to_swp_entry(*pmd);
+   soft_dirty = pmd_soft_dirty(*pmd);
+
+   split_swap_cluster_map(entry);
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+   pmd_populate(mm, &_pmd, pgtable);
+
+   for (i = 0; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, entry.val++) {
+   pte_t *pte, ptent;
+
+   pte = pte_offset_map(&_pmd, addr);
+   VM_BUG_ON(!pte_none(*pte));
+   ptent = swp_entry_to_pte(entry);
+   if (soft_dirty)
+   ptent = pte_swp_mksoft_dirty(ptent);
+   set_pte_at(mm, addr, pte, ptent);
+   pte_unmap(pte);
+   }
+   smp_wmb(); /* make pte visible before pmd */
+   pmd_populate(mm, pmd, pgtable);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2082,7 +2117,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-   VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+   VM_BUG_ON(!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd)
&& !pmd_devmap(*pmd));
 
count_vm_event(THP_SPLIT_PMD);
@@ -2106,7 +2141,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
put_page(page);
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
return;
-   } else if (is_huge_zero_pmd(*pmd)) {
+   } else if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
/*
 * FIXME: Do we want to invalidate secondary mmu by calling
 * mmu_notifier_invalidate_range() see comments below inside
@@ -2150,6 +2185,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
page = pfn_to_page(swp_offset(entry));
} else
 #endif
+   if 

[PATCH -V9 12/21] swap: Add sysfs interface to configure THP swapin

2018-12-13 Thread Huang Ying
Swapping in a THP as a whole isn't desirable in some situations.  For
example, with a completely random access pattern, swapping in a THP in
one piece will greatly inflate the amount of data read.  So a sysfs
interface: /sys/kernel/mm/transparent_hugepage/swapin_enabled is added
to configure it.  The following three options are provided,

- always: THP swapin will be enabled always

- madvise: THP swapin will be enabled only for VMA with VM_HUGEPAGE
  flag set.

- never: THP swapin will be disabled always

The default configuration is: madvise.

During page fault, if a PMD swap mapping is found and THP swapin is
disabled, the huge swap cluster and the PMD swap mapping will be split
and fall back to normal page swapin.
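
With the default "madvise" setting, an application opts a mapping into
whole-THP swapin the same way it opts into THP allocation (a hedged
sketch; address and length are assumed to be PMD-aligned):

  #include <sys/mman.h>

  /* Sets VM_HUGEPAGE on the VMA, so that with swapin_enabled=madvise a
   * PMD swap mapping in this range is swapped in as one THP instead of
   * being split. */
  static int enable_thp_swapin(void *addr, size_t len)
  {
  	return madvise(addr, len, MADV_HUGEPAGE);
  }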

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 Documentation/admin-guide/mm/transhuge.rst | 21 +
 include/linux/huge_mm.h| 31 
 mm/huge_memory.c   | 93 +-
 3 files changed, 126 insertions(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst 
b/Documentation/admin-guide/mm/transhuge.rst
index 85e33f785fd7..23aefb17101c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -160,6 +160,27 @@ Some userspace (such as a test program, or an optimized 
memory allocation
 
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+A transparent hugepage may be swapped out and swapped in in one piece
+without splitting.  This improves the utility of transparent hugepages
+but may inflate the read/write too.  So whether to swap in a
+transparent hugepage in one piece can be configured as follows.
+
+   echo always >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+   echo madvise >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+   echo never >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+
+always
+   Attempt to allocate a transparent huge page and read it from
+   swap space in one piece every time.
+
+never
+   Always split the huge swap cluster and PMD swap mapping and
+   swap in only the faulting normal page during swapin.
+
+madvise
+   Only swap in the transparent huge page in one piece for
+   MADV_HUGEPAGE madvise regions.
+
 khugepaged will be automatically started when
 transparent_hugepage/enabled is set to "always" or "madvise, and it'll
 be automatically shutdown if it's set to "never".
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index debe3760e894..06dbbcf6a6dd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -63,6 +63,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+   TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+   TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -373,11 +375,40 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 
 #ifdef CONFIG_THP_SWAP
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+
+static inline bool transparent_hugepage_swapin_enabled(
+   struct vm_area_struct *vma)
+{
+   if (vma->vm_flags & VM_NOHUGEPAGE)
+   return false;
+
+   if (is_vma_temporary_stack(vma))
+   return false;
+
+   if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+   return false;
+
+   if (transparent_hugepage_flags &
+   (1 << TRANSPARENT_HUGEPAGE_SWAPIN_FLAG))
+   return true;
+
+   if (transparent_hugepage_flags &
+   (1 << TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG))
+   return !!(vma->vm_flags & VM_HUGEPAGE);
+
+   return false;
+}
 #else /* CONFIG_THP_SWAP */
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
return 0;
 }
+
+static inline bool transparent_hugepage_swapin_enabled(
+   struct vm_area_struct *vma)
+{
+   return false;
+}
 #endif /* CONFIG_THP_SWAP */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e1e95e6c86e3..8e8952938c25 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -57,7 +57,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
(1

[PATCH -V9 03/21] swap: Add __swap_duplicate_locked()

2018-12-13 Thread Huang Ying
The part of __swap_duplicate() that runs with the lock held is
separated into a new function, __swap_duplicate_locked(), because we
will add more logic about PMD swap mappings into __swap_duplicate() and
keep most of the PTE swap mapping related logic in
__swap_duplicate_locked().

This is just mechanical code refactoring; there is no functional change
in this patch.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/swapfile.c | 63 ---
 1 file changed, 35 insertions(+), 28 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9e6da494781f..5adc0787343f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3343,32 +3343,12 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_locked(struct swap_info_struct *p,
+  unsigned long offset, unsigned char usage)
 {
-   struct swap_info_struct *p;
-   struct swap_cluster_info *ci;
-   unsigned long offset;
unsigned char count;
unsigned char has_cache;
-   int err = -EINVAL;
-
-   p = get_swap_device(entry);
-   if (!p)
-   goto out;
-
-   offset = swp_offset(entry);
-   ci = lock_cluster_or_swap_info(p, offset);
+   int err = 0;
 
count = p->swap_map[offset];
 
@@ -3378,12 +3358,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned 
char usage)
 */
if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
err = -ENOENT;
-   goto unlock_out;
+   goto out;
}
 
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
-   err = 0;
 
if (usage == SWAP_HAS_CACHE) {
 
@@ -3410,11 +3389,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned 
char usage)
 
p->swap_map[offset] = count | has_cache;
 
-unlock_out:
+out:
+   return err;
+}
+
+/*
+ * Verify that a swap entry is valid and increment its swap map count.
+ *
+ * Returns error code in following case.
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ */
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+   struct swap_info_struct *p;
+   struct swap_cluster_info *ci;
+   unsigned long offset;
+   int err = -EINVAL;
+
+   p = get_swap_device(entry);
+   if (!p)
+   goto out;
+
+   offset = swp_offset(entry);
+   ci = lock_cluster_or_swap_info(p, offset);
+   err = __swap_duplicate_locked(p, offset, usage);
unlock_cluster_or_swap_info(p, ci);
+
+   put_swap_device(p);
 out:
-   if (p)
-   put_swap_device(p);
return err;
 }
 
-- 
2.18.1



[PATCH -V9 16/21] swap: Support to copy PMD swap mapping when fork()

2018-12-13 Thread Huang Ying
During fork, the page table needs to be copied from parent to child.  A
PMD swap mapping needs to be copied too, and the swap reference count
needs to be increased.

When the huge swap cluster has been split already, we need to split
the PMD swap mapping and fall back to PTE copying.

When swap count continuation fails to allocate a page with
GFP_ATOMIC, we need to unlock the spinlock and try again with
GFP_KERNEL.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 72 ++--
 1 file changed, 57 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e460241ea761..b083c66a9d09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -974,6 +974,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
if (unlikely(!pgtable))
goto out;
 
+retry:
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -981,26 +982,67 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
ret = -EAGAIN;
pmd = *src_pmd;
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
if (unlikely(is_swap_pmd(pmd))) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-   VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   if (is_write_migration_entry(entry)) {
-   make_migration_entry_read(&entry);
-   pmd = swp_entry_to_pmd(entry);
-   if (pmd_swp_soft_dirty(*src_pmd))
-   pmd = pmd_swp_mksoft_dirty(pmd);
-   set_pmd_at(src_mm, addr, src_pmd, pmd);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+   if (is_migration_entry(entry)) {
+   if (is_write_migration_entry(entry)) {
+   make_migration_entry_read(&entry);
+   pmd = swp_entry_to_pmd(entry);
+   if (pmd_swp_soft_dirty(*src_pmd))
+   pmd = pmd_swp_mksoft_dirty(pmd);
+   set_pmd_at(src_mm, addr, src_pmd, pmd);
+   }
+   add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+   mm_inc_nr_ptes(dst_mm);
+   pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+   ret = 0;
+   goto out_unlock;
}
-   add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-   mm_inc_nr_ptes(dst_mm);
-   pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
-   ret = 0;
-   goto out_unlock;
-   }
 #endif
+   if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry)) {
+   ret = swap_duplicate(&entry, HPAGE_PMD_NR);
+   if (!ret) {
+   add_mm_counter(dst_mm, MM_SWAPENTS,
+  HPAGE_PMD_NR);
+   mm_inc_nr_ptes(dst_mm);
+   pgtable_trans_huge_deposit(dst_mm, dst_pmd,
+  pgtable);
+   set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+   /* make sure dst_mm is on swapoff's mmlist. */
+   if (unlikely(list_empty(&dst_mm->mmlist))) {
+   spin_lock(&mmlist_lock);
+   if (list_empty(&dst_mm->mmlist))
+   list_add(&dst_mm->mmlist,
+&src_mm->mmlist);
+   spin_unlock(&mmlist_lock);
+   }
+   } else if (ret == -ENOTDIR) {
+   /*
+* The huge swap cluster has been split, split
+* the PMD swap mapping and fallback to PTE
+*/
+   __split_huge_swap_pmd(vma, addr, src_pmd);
+   pte_free(dst_mm, pgtable);
+   } else if (ret == -ENOMEM) {
+   spin_unlock(src_ptl);
+   spin_unlock(dst_ptl);
+   ret = add_swap_count_continuation(entry,
+ GFP_KERNEL);
+   if (ret < 0) {
+  

[PATCH -V9 20/21] swap: Support PMD swap mapping in common path

2018-12-13 Thread Huang Ying
The original code is only for PMD migration entries; it is revised to
support PMD swap mappings.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 fs/proc/task_mmu.c | 12 +---
 mm/gup.c   | 36 
 mm/huge_memory.c   |  7 ---
 mm/mempolicy.c |  2 +-
 4 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dc36909a73c6..fa41822574e1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -986,7 +986,7 @@ static inline void clear_soft_dirty_pmd(struct 
vm_area_struct *vma,
pmd = pmd_clear_soft_dirty(pmd);
 
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-   } else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+   } else if (is_swap_pmd(pmd)) {
pmd = pmd_swp_clear_soft_dirty(pmd);
set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}
@@ -1320,9 +1320,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
if (pm->show_pfn)
frame = pmd_pfn(pmd) +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
-   }
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-   else if (is_swap_pmd(pmd)) {
+   } else if (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) &&
+  is_swap_pmd(pmd)) {
swp_entry_t entry = pmd_to_swp_entry(pmd);
unsigned long offset;
 
@@ -1335,10 +1334,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long 
addr, unsigned long end,
flags |= PM_SWAP;
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
-   VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
+   if (is_pmd_migration_entry(pmd))
+   page = migration_entry_to_page(entry);
}
-#endif
 
if (page && page_mapcount(page) == 1)
flags |= PM_MMAP_EXCLUSIVE;
diff --git a/mm/gup.c b/mm/gup.c
index 6dd33e16a806..460565825ef0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -215,6 +215,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
+   swp_entry_t entry;
 
pmd = pmd_offset(pudp, address);
/*
@@ -242,18 +243,22 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
if (!pmd_present(pmdval)) {
if (likely(!(flags & FOLL_MIGRATION)))
return no_page_table(vma, flags);
-   VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(pmdval));
-   if (is_pmd_migration_entry(pmdval))
+   entry = pmd_to_swp_entry(pmdval);
+   if (thp_migration_supported() && is_migration_entry(entry)) {
pmd_migration_entry_wait(mm, pmd);
-   pmdval = READ_ONCE(*pmd);
-   /*
-* MADV_DONTNEED may convert the pmd to null because
-* mmap_sem is held in read mode
-*/
-   if (pmd_none(pmdval))
+   pmdval = READ_ONCE(*pmd);
+   /*
+* MADV_DONTNEED may convert the pmd to null because
+* mmap_sem is held in read mode
+*/
+   if (pmd_none(pmdval))
+   return no_page_table(vma, flags);
+   goto retry;
+   }
+   if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry))
return no_page_table(vma, flags);
-   goto retry;
+   WARN_ON(1);
+   return no_page_table(vma, flags);
}
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
@@ -275,11 +280,18 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return no_page_table(vma, flags);
}
if (unlikely(!pmd_present(*pmd))) {
+   entry = pmd_to_swp_entry(*pmd);
spin_unlock(ptl);
if (likely(!(flags & FOLL_MIGRATION)))
return no_page_table(vma, flags);
-   pmd_migration_entry_wait(mm, pmd);
-   goto retry_locked;
+   if (thp_migration_supported() && is_migration_entry(entry)) {
+   pmd_migration_entry_wait(mm, pmd);
+   goto retry_locked;
+   }
+   

[PATCH -V9 21/21] swap: create PMD swap mapping when unmap the THP

2018-12-13 Thread Huang Ying
This is the final step of the THP swapin support.  When reclaiming an
anonymous THP, after allocating the huge swap cluster and adding the
THP into the swap cache, the PMD page mapping will be changed to a
mapping to the swap space.  Previously, the PMD page mapping would be
split before being changed.  In this patch, the unmap code is enhanced
not to split the PMD mapping, but to create a PMD swap mapping to
replace it instead.  So later, when clearing the SWAP_HAS_CACHE flag in
the last step of swapout, the huge swap cluster will be kept instead of
being split, and when swapping in, the huge swap cluster will be read
in one piece into a THP.  That is, the THP will not be split during
swapout/swapin.  This eliminates the overhead of splitting/collapsing,
reduces the page fault count, etc.  But more importantly, the
utilization of THP is improved greatly, that is, many more THPs will be
kept when swapping is used, so that we can take full advantage of THP,
including its high performance for swapout/swapin.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h | 11 +++
 mm/huge_memory.c| 30 ++
 mm/rmap.c   | 41 -
 mm/vmscan.c |  6 +-
 4 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c05294689c1..fef5d27c2083 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -373,12 +373,16 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+struct page_vma_mapped_walk;
+
 #ifdef CONFIG_THP_SWAP
 extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
  unsigned long addr, pmd_t *pmd);
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
   unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+extern bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+   struct page *page, unsigned long address, pmd_t pmdval);
 
 static inline bool transparent_hugepage_swapin_enabled(
struct vm_area_struct *vma)
@@ -419,6 +423,13 @@ static inline int do_huge_pmd_swap_page(struct vm_fault 
*vmf, pmd_t orig_pmd)
return 0;
 }
 
+static inline bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page, unsigned long address,
+ pmd_t pmdval)
+{
+   return false;
+}
+
 static inline bool transparent_hugepage_swapin_enabled(
struct vm_area_struct *vma)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 38904d673339..e0205fceb84c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1922,6 +1922,36 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
put_page(page);
return ret;
 }
+
+bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, struct page *page,
+   unsigned long address, pmd_t pmdval)
+{
+   struct vm_area_struct *vma = pvmw->vma;
+   struct mm_struct *mm = vma->vm_mm;
+   pmd_t swp_pmd;
+   swp_entry_t entry = { .val = page_private(page) };
+
+   if (swap_duplicate(&entry, HPAGE_PMD_NR) < 0) {
+   set_pmd_at(mm, address, pvmw->pmd, pmdval);
+   return false;
+   }
+   if (list_empty(&mm->mmlist)) {
+   spin_lock(&mmlist_lock);
+   if (list_empty(&mm->mmlist))
+   list_add(&mm->mmlist, &init_mm.mmlist);
+   spin_unlock(&mmlist_lock);
+   }
+   add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+   add_mm_counter(mm, MM_SWAPENTS, HPAGE_PMD_NR);
+   swp_pmd = swp_entry_to_pmd(entry);
+   if (pmd_soft_dirty(pmdval))
+   swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+   set_pmd_at(mm, address, pvmw->pmd, swp_pmd);
+
+   page_remove_rmap(page, true);
+   put_page(page);
+   return true;
+}
 #endif
 
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
diff --git a/mm/rmap.c b/mm/rmap.c
index e9b07016f587..a957af84ec12 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1423,11 +1423,50 @@ static bool try_to_unmap_one(struct page *page, struct 
vm_area_struct *vma,
continue;
}
 
+   address = pvmw.address;
+
+#ifdef CONFIG_THP_SWAP
+   /* PMD-mapped THP swap entry */
+   if (IS_ENABLED(CONFIG_THP_SWAP) &&
+   !pvmw.pte && PageAnon(page)) {
+   pmd_t pmdval;
+
+   VM_BUG_ON_PAGE(PageHuge(page) ||
+  

[PATCH -V9 15/21] swap: Support to move swap account for PMD swap mapping

2018-12-13 Thread Huang Ying
Previously the huge swap cluster was split after the THP was swapped
out.  Now, to support swapping in the THP in one piece, the huge swap
cluster will not be split after the THP is reclaimed.  So in memcg, we
need to move the swap account for PMD swap mappings in the process's
page table.

When the page table is scanned during moving memcg charge, the PMD
swap mapping will be identified.  And mem_cgroup_move_swap_account()
and its callee is revised to move account for the whole huge swap
cluster.  If the swap cluster mapped by PMD has been split, the PMD
swap mapping will be split and fallback to PTE processing.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |   7 ++
 include/linux/swap.h|   6 ++
 include/linux/swap_cgroup.h |   3 +-
 mm/huge_memory.c|   7 +-
 mm/memcontrol.c | 131 
 mm/swap_cgroup.c|  45 ++---
 mm/swapfile.c   |  14 
 7 files changed, 173 insertions(+), 40 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7c72e63757af..3c05294689c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -374,6 +374,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd);
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
   unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -401,6 +403,11 @@ static inline bool transparent_hugepage_swapin_enabled(
return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline void __split_huge_swap_pmd(struct vm_area_struct *vma,
+unsigned long addr, pmd_t *pmd)
+{
+}
+
 static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
  unsigned long address, pmd_t orig_pmd)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bd532c9315e..6463784fd5e8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -622,6 +622,7 @@ static inline swp_entry_t get_swap_page(struct page *page)
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
+extern int get_swap_entry_size(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
@@ -632,6 +633,11 @@ static inline int split_swap_cluster_map(swp_entry_t entry)
 {
return 0;
 }
+
+static inline int get_swap_entry_size(swp_entry_t entry)
+{
+   return 1;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..c40fb52b0563 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 #ifdef CONFIG_MEMCG_SWAP
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-   unsigned short old, unsigned short new);
+   unsigned short old, unsigned short new,
+   unsigned int nr_ents);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c895c2a2db6e..e460241ea761 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1670,10 +1670,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, 
pmd_t pmd)
return 0;
 }
 
+#ifdef CONFIG_THP_SWAP
 /* Convert a PMD swap mapping to a set of PTE swap mappings */
-static void __split_huge_swap_pmd(struct vm_area_struct *vma,
- unsigned long addr,
- pmd_t *pmd)
+void __split_huge_swap_pmd(struct vm_area_struct *vma,
+  unsigned long addr, pmd_t *pmd)
 {
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
@@ -1705,7 +1705,6 @@ static void __split_huge_swap_pmd(struct vm_area_struct 
*vma,
pmd_populate(mm, pmd, pgtable);
 }
 
-#ifdef CONFIG_THP_SWAP
 int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, pmd_t orig_pmd)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f75f2..ac1abfcfab88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2667,9 +2667,10 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 

[PATCH -V9 17/21] swap: Free PMD swap mapping when zap_huge_pmd()

2018-12-13 Thread Huang Ying
For a PMD swap mapping, zap_huge_pmd() will clear the PMD and call
free_swap_and_cache() to decrease the swap reference count and maybe
free or split the huge swap cluster and the THP in swap cache.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b083c66a9d09..6d144d687e69 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2055,7 +2055,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
spin_unlock(ptl);
if (is_huge_zero_pmd(orig_pmd))
tlb_remove_page_size(tlb, pmd_page(orig_pmd), 
HPAGE_PMD_SIZE);
-   } else if (is_huge_zero_pmd(orig_pmd)) {
+   } else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
@@ -2068,17 +2068,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
page_remove_rmap(page, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
-   } else if (thp_migration_supported()) {
-   swp_entry_t entry;
-
-   VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
-   entry = pmd_to_swp_entry(orig_pmd);
-   page = pfn_to_page(swp_offset(entry));
+   } else {
+   swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+   if (thp_migration_supported() &&
+   is_migration_entry(entry))
+   page = pfn_to_page(swp_offset(entry));
+   else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+!non_swap_entry(entry))
+   free_swap_and_cache(entry, HPAGE_PMD_NR);
+   else {
+   WARN_ONCE(1,
+"Non present huge pmd without pmd migration or swap enabled!");
+   goto unlock;
+   }
flush_needed = 0;
-   } else
-   WARN_ONCE(1, "Non present huge pmd without pmd 
migration enabled!");
+   }
 
-   if (PageAnon(page)) {
+   if (!page) {
+   zap_deposited_table(tlb->mm, pmd);
+   add_mm_counter(tlb->mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+   } else if (PageAnon(page)) {
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
} else {
@@ -2086,7 +2096,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
add_mm_counter(tlb->mm, mm_counter_file(page), 
-HPAGE_PMD_NR);
}
-
+unlock:
spin_unlock(ptl);
if (flush_needed)
tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
-- 
2.18.1



[PATCH -V9 11/21] swap: Support to count THP swapin and its fallback

2018-12-13 Thread Huang Ying
Two new /proc/vmstat fields are added, "thp_swpin" and
"thp_swpin_fallback", to count swapping in a THP from the swap device
in one piece and falling back to normal page swapin.
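
A quick way to watch the new counters from userspace (a hedged sketch;
only the field names below come from this patch):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
  	char line[128];
  	FILE *f = fopen("/proc/vmstat", "r");

  	if (!f)
  		return 1;
  	while (fgets(line, sizeof(line), f))
  		if (!strncmp(line, "thp_swpin", 9))	/* thp_swpin, thp_swpin_fallback */
  			fputs(line, stdout);
  	fclose(f);
  	return 0;
  }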

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 Documentation/admin-guide/mm/transhuge.rst |  8 
 include/linux/vm_event_item.h  |  2 ++
 mm/huge_memory.c   |  4 +++-
 mm/page_io.c   | 15 ---
 mm/vmstat.c|  2 ++
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst 
b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8404b9..85e33f785fd7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -364,6 +364,14 @@ thp_swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
 
+thp_swpin
+   is incremented every time a huge page is swapped in in one piece
+   without splitting.
+
+thp_swpin_fallback
+   is incremented if a huge page has to be split during swapin.
+   Usually because failed to allocate a huge page.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..c20b655cfdcc 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_ZERO_PAGE_ALLOC_FAILED,
THP_SWPOUT,
THP_SWPOUT_FALLBACK,
+   THP_SWPIN,
+   THP_SWPIN_FALLBACK,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 644cb5d6b056..e1e95e6c86e3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1708,8 +1708,10 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
/* swapoff occurs under us */
} else if (ret == -EINVAL)
ret = 0;
-   else
+   else {
+   count_vm_event(THP_SWPIN_FALLBACK);
goto fallback;
+   }
}
delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto out;
diff --git a/mm/page_io.c b/mm/page_io.c
index 67a7f64d6c1a..00774b453dca 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -348,6 +348,15 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
return ret;
 }
 
+static inline void count_swpin_vm_event(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if (unlikely(PageTransHuge(page)))
+   count_vm_event(THP_SWPIN);
+#endif
+   count_vm_events(PSWPIN, hpage_nr_pages(page));
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
struct bio *bio;
@@ -371,7 +380,7 @@ int swap_readpage(struct page *page, bool synchronous)
 
ret = mapping->a_ops->readpage(swap_file, page);
if (!ret)
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
return ret;
}
 
@@ -382,7 +391,7 @@ int swap_readpage(struct page *page, bool synchronous)
unlock_page(page);
}
 
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
return 0;
}
 
@@ -403,7 +412,7 @@ int swap_readpage(struct page *page, bool synchronous)
bio_set_op_attrs(bio, REQ_OP_READ, 0);
if (synchronous)
bio->bi_opf |= REQ_HIPRI;
-   count_vm_event(PSWPIN);
+   count_swpin_vm_event(page);
bio_get(bio);
qc = submit_bio(bio);
while (synchronous) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 83b30edc2f7f..80a731e9a5e5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1265,6 +1265,8 @@ const char * const vmstat_text[] = {
"thp_zero_page_alloc_failed",
"thp_swpout",
"thp_swpout_fallback",
+   "thp_swpin",
+   "thp_swpin_fallback",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
"balloon_inflate",
-- 
2.18.1



[PATCH -V9 08/21] swap: Support PMD swap mapping in split_swap_cluster()

2018-12-13 Thread Huang Ying
When splitting a THP in the swap cache or failing to allocate a THP
when swapping in a huge swap cluster, the huge swap cluster will be
split.  In addition to clearing the huge flag of the swap cluster, the
PMD swap mapping count recorded in cluster_count() will be set to 0.
But we will not touch the PMD swap mappings themselves, because it is
sometimes hard to find them all.  When the PMD swap mappings are
operated on later, it will be found that the huge swap cluster has been
split, and the PMD swap mappings will be split at that time.

Unless splitting a THP in the swap cache (specified via the "force"
parameter), split_swap_cluster() will return -EEXIST if the
SWAP_HAS_CACHE flag is set in swap_map[offset], because this indicates
there is a THP corresponding to this huge swap cluster, and it isn't
desirable to split the THP.

When splitting a THP in the swap cache, the call to
split_swap_cluster() is moved to before unlocking the sub-pages, so
that all sub-pages will be kept locked from when the THP has been split
until the huge swap cluster is split.  This makes the code much easier
to reason about.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/swap.h |  6 +++--
 mm/huge_memory.c | 18 +-
 mm/swapfile.c| 58 +++-
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a24d101b131d..441da4a832a6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -617,11 +617,13 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #endif /* CONFIG_SWAP */
 
+#define SSC_SPLIT_CACHED   0x1
+
 #ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
 #else
-static inline int split_swap_cluster(swp_entry_t entry)
+static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 49df3e7c96c7..fc31fc1ae0b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2507,6 +2507,17 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
 
remap_page(head);
 
+   /*
+* Split swap cluster before unlocking sub-pages.  So all
+* sub-pages will be kept locked from THP has been split to
+* swap cluster is split.
+*/
+   if (PageSwapCache(head)) {
+   swp_entry_t entry = { .val = page_private(head) };
+
+   split_swap_cluster(entry, SSC_SPLIT_CACHED);
+   }
+
for (i = 0; i < HPAGE_PMD_NR; i++) {
struct page *subpage = head + i;
if (subpage == page)
@@ -2741,12 +2752,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
__dec_node_page_state(page, NR_SHMEM_THPS);
spin_unlock(>split_queue_lock);
__split_huge_page(page, list, end, flags);
-   if (PageSwapCache(head)) {
-   swp_entry_t entry = { .val = page_private(head) };
-
-   ret = split_swap_cluster(entry);
-   } else
-   ret = 0;
+   ret = 0;
} else {
if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
pr_alert("total_mapcount: %u, page_count(): %u\n",
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d38760b6d495..c59cc2ca7c2c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1469,23 +1469,6 @@ void put_swap_page(struct page *page, swp_entry_t entry)
unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-   struct swap_info_struct *si;
-   struct swap_cluster_info *ci;
-   unsigned long offset = swp_offset(entry);
-
-   si = _swap_info_get(entry);
-   if (!si)
-   return -EBUSY;
-   ci = lock_cluster(si, offset);
-   cluster_clear_huge(ci);
-   unlock_cluster(ci);
-   return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -3972,6 +3955,47 @@ int split_swap_cluster_map(swp_entry_t entry)
unlock_cluster(ci);
return 0;
 }
+
+/*
+ * We will not try to split all PMD swap mappings to the swap cluster,
+ * because we haven't enough information available for that.  Later,
+ * when the PMD swap mapping is duplicated or swapin, etc, the PMD
+ * swap mapping will be split and fallback to the PTE operations.
+ */
+int split_swap_cluster(swp_entry_t entry, unsigned long flags)
+{
+   struct swap_info_struct *si;
+   struct swap_cluster_info *ci;
+   

[PATCH -V9 09/21] swap: Support to read a huge swap cluster for swapin a THP

2018-12-13 Thread Huang Ying
To swap in a THP in one piece, we need to read a huge swap cluster from
the swap device.  This patch revises __read_swap_cache_async() and its
callers and callees to support this.  If __read_swap_cache_async()
finds that the swap cluster of the specified swap entry is huge, it
will try to allocate a THP and add it into the swap cache, so that the
contents of the huge swap cluster can later be read into the THP.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |  6 +
 include/linux/swap.h|  4 +--
 mm/huge_memory.c|  4 +--
 mm/swap_state.c | 60 -
 mm/swapfile.c   |  9 ---
 5 files changed, 64 insertions(+), 19 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1c0fda003d6a..72f2617d336b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,6 +250,7 @@ static inline bool thp_migration_supported(void)
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -363,6 +364,11 @@ static inline bool thp_migration_supported(void)
 {
return false;
 }
+
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+{
+   return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441da4a832a6..4bd532c9315e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,7 +462,7 @@ extern sector_t map_swap_page(struct page *, struct 
block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int __swp_swapcount(swp_entry_t entry, int *entry_size);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -590,7 +590,7 @@ static inline int __swap_count(swp_entry_t entry)
return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int __swp_swapcount(swp_entry_t entry, int *entry_size)
 {
return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc31fc1ae0b3..1cec1eec340e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,9 +629,9 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct 
vm_fault *vmf,
  * available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-   const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+   const bool vma_madvised = vma ? !!(vma->vm_flags & VM_HUGEPAGE) : false;
 
/* Always do synchronous compaction */
if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, 
_hugepage_flags))
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 97831166994a..5e761bb6e354 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -361,7 +361,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 {
struct page *found_page = NULL, *new_page = NULL;
struct swap_info_struct *si;
-   int err;
+   int err, entry_size = 1;
+   swp_entry_t hentry;
+
*new_page_allocated = false;
 
do {
@@ -387,14 +389,41 @@ struct page *__read_swap_cache_async(swp_entry_t entry, 
gfp_t gfp_mask,
 * as SWAP_HAS_CACHE.  That's done in later part of code or
 * else swap_off will be aborted if we return NULL.
 */
-   if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
+   if (!__swp_swapcount(entry, _size) &&
+   swap_slot_cache_enabled)
break;
 
/*
 * Get a new page to read into from swap.
 */
-   if (!new_page) {
-   new_page = alloc_page_vma(gfp_mask, vma, addr);
+   if (!new_page ||
+   (IS_ENABLED(CONFIG_THP_SWAP) &&
+hpage_nr_pages(new_page) != entry_size)) {
+   if (new_page)
+   put_page(new_page);
+   if (IS_ENABLED(CONFIG_THP_SWAP) &&
+   entry_size == HPAGE_PMD_NR) {
+   gfp_t gfp;
+
+   gfp = alloc_hugepage_direct_gfpmask(vma);
+   /*
+  

[PATCH -V9 00/21] swap: Swapout/swapin THP in one piece

2018-12-13 Thread Huang Ying
Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And for all, Any comment is welcome!

This patchset is based on the 2018-12-10 head of mmotm/master.

This is the final step of THP (Transparent Huge Page) swap
optimization.  After the first and second steps, splitting of the huge
page is delayed from almost the first step of swapout to after swapout
has finished.  In this step, we avoid splitting the THP for swapout and
swap out/swap in the THP in one piece.

We tested the patchset with the vm-scalability benchmark swap-w-seq
test case, with 16 processes.  The test case forks 16 processes, and
each process allocates a large anonymous memory range and writes it
from beginning to end for 8 rounds.  The first round will swap out,
while the remaining rounds will swap in and swap out.  The test is done
on a Xeon E5 v3 system; the swap device used is a RAM-simulated
PMEM (persistent memory) device.  The test result is as follows:

                  base                     optimized
          ----------------          ----------------
         %stddev      %change        %stddev
  1417897 ±  2%      +992.8%       15494673         vm-scalability.throughput
  1020489 ±  4%     +1091.2%       12156349         vmstat.swap.si
  1255093 ±  3%      +940.3%       13056114         vmstat.swap.so
  1259769 ±  7%     +1818.3%       24166779         meminfo.AnonHugePages
 28021761             -10.7%       25018848 ±  2%   meminfo.AnonPages
 64080064 ±  4%       -95.6%        2787565 ± 33%   interrupts.CAL:Function_call_interrupts
    13.91 ±  5%       -13.8            0.10 ± 27%   perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

The benchmark score (bytes written per second) improved by 992.8%, and
the swapout/swapin throughput improved by 1008% (from about 2.17 GB/s
to 24.04 GB/s).  The performance difference is huge.  In the base
kernel, the THP is swapped out and split during the first round of
writing, so in the remaining rounds there is only normal page swapin
and swapout.  In the optimized kernel, the THP is kept after the first
swapout, so THP swapin and swapout are used in the remaining rounds.
This shows the key benefit of swapping a THP out/in in one piece: the
THP is kept instead of being split.  The meminfo data verifies this: in
the base kernel only 4.5% of anonymous pages are THP during the test,
while in the optimized kernel it is 96.6%.  The TLB flushing
IPIs (represented as interrupts.CAL:Function_call_interrupts) are
reduced by 95.6%, and the cycles spent in spinlocks drop from 13.9% to
0.1%.  These are performance benefits of THP swapout/swapin too.
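
FWIW, assuming the vmstat.swap.si/swap.so columns above are in KB/s (an
assumption, the table does not state the unit), the 2.17 GB/s and
24.04 GB/s figures quoted above can be reproduced with a few lines of
user-space C:

#include <stdio.h>

int main(void)
{
	/* vmstat.swap.si + vmstat.swap.so from the table above, in KB/s */
	double base_kbps = 1020489.0 + 1255093.0;
	double opt_kbps  = 12156349.0 + 13056114.0;

	printf("base:      %.2f GB/s\n", base_kbps / (1024 * 1024));
	printf("optimized: %.2f GB/s\n", opt_kbps / (1024 * 1024));
	printf("change:    +%.0f%%\n", (opt_kbps / base_kbps - 1) * 100);
	return 0;
}

This prints 2.17 GB/s, 24.04 GB/s and +1008%, matching the numbers
above.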

Below is the description for all steps of THP swap optimization.

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swapping, even on a high-end server machine, because the
performance of the storage device has improved faster than that of a
single logical CPU.  It seems that this trend will not change in the
near future.  On the other hand, THP is becoming more and more popular
because of increased memory sizes.  So it becomes necessary to optimize
THP swap performance.

The advantages to swapout/swapin a THP in one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing TLB, page fault, etc.  This will improve the performance of
  the THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  This is particularly helpful for swapin, which is usually
  4k random IO.  This will improve the performance of THP swap too.

- It will help reduce memory fragmentation, especially when the THP is
  heavily used by the applications.  The THP-order pages will be freed
  up after THP swapout.

- It will improve THP utilization on systems with swap turned on,
  because the speed at which khugepaged collapses normal pages into a
  THP is quite slow.  After the THP is split during swapout, it will
  take quite a long time for the normal pages to collapse back into a
  THP after being swapped in.  High THP utilization also helps the
  efficiency of page-based memory management.

There are some concerns regarding THP swapin, mainly because the
possibly enlarged read/write IO size (for swapout/swapin) may put more
overhead on the storage device.

[PATCH -V9 04/21] swap: Support PMD swap mapping in swap_duplicate()

2018-12-13 Thread Huang Ying
To support swapping in the THP in one piece, we need to create the PMD
swap mapping during swapout and maintain the PMD swap mapping count.
This patch implements the support for increasing the PMD swap mapping
count (for swapout, fork, etc.) and setting the SWAP_HAS_CACHE flag (for
swapin, etc.) for a huge swap cluster in the swap_duplicate() function
family.  Although it only implements a part of the design of the swap
reference count with PMD swap mapping, the whole design is described
below to make it easy to understand the patch and the whole picture.

A huge swap cluster is used to hold the contents of a swapped-out THP.
After swapout, a PMD page mapping to the THP will become a PMD
swap mapping to the huge swap cluster via a swap entry in PMD.  While
a PTE page mapping to a subpage of the THP will become the PTE swap
mapping to a swap slot in the huge swap cluster via a swap entry in
PTE.

If there is no PMD swap mapping and the corresponding THP is removed
from the page cache (reclaimed), the huge swap cluster will be split
and become a normal swap cluster.

The count (cluster_count()) of the huge swap cluster is
SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + PMD swap mapping count.  Because
all swap slots in the huge swap cluster are mapped by PTE or PMD, or
have the SWAP_HAS_CACHE bit set, the usage count of the swap cluster is
HPAGE_PMD_NR.  And the PMD swap mapping count is recorded too to make
it easy to determine whether there are remaining PMD swap mappings.

The count in swap_map[offset] is the sum of the PTE and PMD swap
mapping counts.  This means that when we increase the PMD swap mapping
count, we need to increase swap_map[offset] for all swap slots inside
the swap cluster.  An alternative choice is to make swap_map[offset]
record the PTE swap map count only, given that we have recorded the PMD
swap mapping count in the count of the huge swap cluster.  But then we
would need to increase swap_map[offset] when splitting the PMD swap
mapping, which may fail because of the memory allocation for swap count
continuation.  That is hard to deal with, so we chose the current
solution.

The PMD swap mapping to a huge swap cluster may be split when unmapping
a part of the PMD mapping, etc.  That is easy because only the count of
the huge swap cluster needs to be changed.  When the last PMD swap
mapping is gone and SWAP_HAS_CACHE is unset, we will split the huge
swap cluster (clear the huge flag).  This makes it easy to reason about
the cluster state.
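
To illustrate the counting rules above, here is a minimal user-space
model; the struct, the helpers and the simulated swapout/fork/free
sequence are invented for this example and are not the kernel code:

#include <stdio.h>

#define CLUSTER_SIZE 512	/* stands in for SWAPFILE_CLUSTER == HPAGE_PMD_NR */

struct cluster_model {
	unsigned int cluster_count;		/* CLUSTER_SIZE + PMD map count */
	unsigned char swap_map[CLUSTER_SIZE];	/* per-slot PTE + PMD map count */
};

/* Add one PMD swap mapping: bump every slot and the cluster count. */
static void model_pmd_dup(struct cluster_model *c)
{
	for (int i = 0; i < CLUSTER_SIZE; i++)
		c->swap_map[i]++;
	c->cluster_count++;
}

/* Drop one PMD swap mapping: the inverse operation. */
static void model_pmd_free(struct cluster_model *c)
{
	for (int i = 0; i < CLUSTER_SIZE; i++)
		c->swap_map[i]--;
	c->cluster_count--;
}

int main(void)
{
	struct cluster_model c = { .cluster_count = CLUSTER_SIZE };

	model_pmd_dup(&c);	/* swapout installs the first PMD swap mapping */
	model_pmd_dup(&c);	/* e.g. fork() duplicates it */
	printf("after fork: cluster_count=%u swap_map[0]=%u (PMD mappings=%u)\n",
	       c.cluster_count, (unsigned int)c.swap_map[0],
	       c.cluster_count - CLUSTER_SIZE);

	model_pmd_free(&c);
	model_pmd_free(&c);
	/* split the cluster here, provided SWAP_HAS_CACHE is also unset */
	printf("all PMD mappings gone: cluster_count=%u -> split the cluster\n",
	       c.cluster_count);
	return 0;
}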

A huge swap cluster will be split when splitting the THP in swap
cache, or when failing to allocate a THP during swapin, etc.  But when
splitting the huge swap cluster, we will not try to split all PMD swap
mappings, because sometimes we don't have enough information available
for that.  Later, when the PMD swap mapping is duplicated, swapped in,
etc., the PMD swap mapping will be split and fall back to the PTE
operations.

When a THP is added into swap cache, the SWAP_HAS_CACHE flag will be
set in the swap_map[offset] of all swap slots inside the huge swap
cluster backing the THP.  This huge swap cluster will not be split
unless the THP is split even if its PMD swap mapping count dropped to
0.  Later, when the THP is removed from swap cache, the SWAP_HAS_CACHE
flag will be cleared in the swap_map[offset] of all swap slots inside
the huge swap cluster.  And this huge swap cluster will be split if
its PMD swap mapping count is 0.

The first parameter of swap_duplicate() is changed to return the swap
entry to call add_swap_count_continuation() for, because we may need
to call it for a swap entry in the middle of a huge swap cluster.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/swap.h |   9 ++--
 mm/memory.c  |   2 +-
 mm/rmap.c|   2 +-
 mm/swap_state.c  |   2 +-
 mm/swapfile.c| 109 ---
 5 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 928550bd28f3..70a6ede1e7e0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -451,8 +451,8 @@ extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t);
+extern int swap_duplicate(swp_entry_t *entry, int entry_size);
+extern int swapcache_prepare(swp_entry_t entry, int entry_size);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
@@ -510,7 +510,8 @@ static inline void show_swap_cache_info(void)
 }
 
 #define free_swap_and_cache(e) ({(is_migration_entry(e) || 
is_device_private_entry(e));})

[PATCH -V9 14/21] swap: Support PMD swap mapping in madvise_free()

2018-12-13 Thread Huang Ying
When madvise_free() finds a PMD swap mapping, if only part of the huge
swap cluster is operated on, the PMD swap mapping will be split and
processing falls back to PTE swap mapping processing.  Otherwise, if
the whole huge swap cluster is operated on, free_swap_and_cache() will
be called to decrease the PMD swap mapping count and probably free the
swap space and the THP in swap cache too.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 mm/huge_memory.c | 52 ++--
 mm/madvise.c |  2 +-
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fdffa07bff98..c895c2a2db6e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1883,6 +1883,15 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
 }
 #endif
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+   pgtable_t pgtable;
+
+   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+   pte_free(mm, pgtable);
+   mm_dec_nr_ptes(mm);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -1903,15 +1912,37 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, 
struct vm_area_struct *vma,
goto out_unlocked;
 
orig_pmd = *pmd;
-   if (is_huge_zero_pmd(orig_pmd))
-   goto out;
-
if (unlikely(!pmd_present(orig_pmd))) {
-   VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(orig_pmd));
-   goto out;
+   swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+   if (is_migration_entry(entry)) {
+   VM_BUG_ON(!thp_migration_supported());
+   goto out;
+   } else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+  !non_swap_entry(entry)) {
+   /*
+* If part of THP is discarded, split the PMD
+* swap mapping and operate on the PTEs
+*/
+   if (next - addr != HPAGE_PMD_SIZE) {
+   __split_huge_swap_pmd(vma, addr, pmd);
+   goto out;
+   }
+   free_swap_and_cache(entry, HPAGE_PMD_NR);
+   pmd_clear(pmd);
+   zap_deposited_table(mm, pmd);
+   if (current->mm == mm)
+   sync_mm_rss(mm);
+   add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+   ret = true;
+   goto out;
+   } else
+   VM_BUG_ON(1);
}
 
+   if (is_huge_zero_pmd(orig_pmd))
+   goto out;
+
page = pmd_page(orig_pmd);
/*
 * If other processes are mapping this page, we couldn't discard
@@ -1957,15 +1988,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, 
struct vm_area_struct *vma,
return ret;
 }
 
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
-   pgtable_t pgtable;
-
-   pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-   pte_free(mm, pgtable);
-   mm_dec_nr_ptes(mm);
-}
-
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index fac48161b015..c1845dab2dd4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,7 +321,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long 
addr,
unsigned long next;
 
next = pmd_addr_end(addr, end);
-   if (pmd_trans_huge(*pmd))
+   if (pmd_trans_huge(*pmd) || is_swap_pmd(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
goto next;
 
-- 
2.18.1



[PATCH -V9 02/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP

2018-12-13 Thread Huang Ying
Currently, "the swap entry" in the page tables is used for a number of
things outside of actual swap, like page migration, etc.  We support
the THP/PMD "swap entry" for page migration currently and the
functions behind this are tied to page migration's config
option (CONFIG_ARCH_ENABLE_THP_MIGRATION).

But, we also need them for THP swap optimization.  So a new config
option (CONFIG_HAVE_PMD_SWAP_ENTRY) is added.  It is enabled when
either CONFIG_ARCH_ENABLE_THP_MIGRATION or CONFIG_THP_SWAP is enabled.
And PMD swap entry functions are tied to this new config option
instead.  Some functions enabled by CONFIG_ARCH_ENABLE_THP_MIGRATION
are for page migration only; they are still enabled only for that.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 arch/x86/include/asm/pgtable.h |  2 +-
 include/asm-generic/pgtable.h  |  2 +-
 include/linux/swapops.h| 44 ++
 mm/Kconfig |  8 +++
 4 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..e830ab345551 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1333,7 +1333,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index e0381a4ce7d4..2a619f378297 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -675,7 +675,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct 
*mm,
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifndef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
return pmd;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..905ddc65caa3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -254,17 +254,7 @@ static inline int is_write_migration_entry(swp_entry_t 
entry)
 
 #endif
 
-struct page_vma_mapped_walk;
-
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
-   struct page *page);
-
-extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
-   struct page *new);
-
-extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
-
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
swp_entry_t arch_entry;
@@ -282,6 +272,28 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
return __swp_entry_to_pmd(arch_entry);
 }
+#else
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+   return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+   return __pmd(0);
+}
+#endif
+
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+   struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+   struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
 
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
@@ -302,16 +314,6 @@ static inline void remove_migration_pmd(struct 
page_vma_mapped_walk *pvmw,
 
 static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
 
-static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
-{
-   return swp_entry(0, 0);
-}
-
-static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
-{
-   return __pmd(0);
-}
-
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index 25c71eb8a7db..d7c5299c5b7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -422,6 +422,14 @@ config THP_SWAP
 
  For selection by architectures with reasonable THP sizes.
 
+#
+# "PMD swap entry" in the page table is used both for migration and
+# actual swap.
+#
+config HAVE_PMD_SWAP_ENTRY
+   def_bool y
+   depends on THP_SWAP || ARCH_ENABLE_THP_MIGRATION
+
 config TRANSPARENT_HUGE_PAGECACHE
def_bool y
depends on TRANSPARENT_HUGEPAGE
-- 
2.18.1



[PATCH -V9 13/21] swap: Support PMD swap mapping in swapoff

2018-12-13 Thread Huang Ying
During swapoff, for each PMD swap mapping, we will allocate a THP, read
the contents of the huge swap cluster into the THP, and change the PMD
swap mapping into a PMD page mapping to the THP, then try to free the
huge swap cluster.  If allocating a THP fails, the huge swap cluster
will be split.

If the swap cluster mapped by a PMD swap mapping has already been
split, we will split the PMD swap mapping and unuse the PTEs.
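
The per-PMD decision described above can be summarized with a small
user-space sketch (the enum and function names are made up for
illustration, they are not from the patch):

#include <stdbool.h>
#include <stdio.h>

enum pmd_unuse_action {
	SWAPIN_THP,		/* allocate a THP, read the cluster, map it via PMD */
	SPLIT_PMD_MAPPING,	/* cluster already split: fall back to PTE unuse */
	SPLIT_CLUSTER,		/* THP allocation failed: split the huge cluster */
};

static enum pmd_unuse_action unuse_pmd_model(bool cluster_is_huge,
					     bool thp_allocated)
{
	if (!cluster_is_huge)
		return SPLIT_PMD_MAPPING;
	if (!thp_allocated)
		return SPLIT_CLUSTER;
	return SWAPIN_THP;
}

int main(void)
{
	printf("huge cluster, THP allocated -> %d (SWAPIN_THP)\n",
	       unuse_pmd_model(true, true));
	printf("huge cluster, no THP        -> %d (SPLIT_CLUSTER)\n",
	       unuse_pmd_model(true, false));
	printf("cluster already split       -> %d (SPLIT_PMD_MAPPING)\n",
	       unuse_pmd_model(false, true));
	return 0;
}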

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/asm-generic/pgtable.h |  14 +
 include/linux/huge_mm.h   |   8 +++
 mm/huge_memory.c  |   4 +-
 mm/swapfile.c | 108 +-
 4 files changed, 119 insertions(+), 15 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2a619f378297..d2d4d520e2e7 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -931,22 +931,12 @@ static inline int 
pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
barrier();
 #endif
/*
-* !pmd_present() checks for pmd migration entries
-*
-* The complete check uses is_pmd_migration_entry() in linux/swapops.h
-* But using that requires moving current function and 
pmd_trans_unstable()
-* to linux/swapops.h to resovle dependency, which is too much code 
move.
-*
-* !pmd_present() is equivalent to is_pmd_migration_entry() currently,
-* because !pmd_present() pages can only be under migration not swapped
-* out.
-*
-* pmd_none() is preseved for future condition checks on pmd migration
+* pmd_none() is preseved for future condition checks on pmd swap
 * entries and not confusing with this function name, although it is
 * redundant with !pmd_present().
 */
if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-   (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && 
!pmd_present(pmdval)))
+   (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) && !pmd_present(pmdval)))
return 1;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 06dbbcf6a6dd..7c72e63757af 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -374,6 +374,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+  unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 static inline bool transparent_hugepage_swapin_enabled(
@@ -399,6 +401,12 @@ static inline bool transparent_hugepage_swapin_enabled(
return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, pmd_t orig_pmd)
+{
+   return 0;
+}
+
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e8952938c25..fdffa07bff98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1706,8 +1706,8 @@ static void __split_huge_swap_pmd(struct vm_area_struct 
*vma,
 }
 
 #ifdef CONFIG_THP_SWAP
-static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-  unsigned long address, pmd_t orig_pmd)
+int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long address, pmd_t orig_pmd)
 {
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e27fe24a1f41..454e993bc32f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1931,6 +1931,11 @@ static inline int pte_same_as_swp(pte_t pte, pte_t 
swp_pte)
return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
 }
 
+static inline int pmd_same_as_swp(pmd_t pmd, pmd_t swp_pmd)
+{
+   return pmd_same(pmd_swp_clear_soft_dirty(pmd), swp_pmd);
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -1992,6 +1997,53 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t 
*pmd,
return ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+unsigned long addr, swp_entry_t entry, struct page *page)
+{
+   struct mem_cgroup *memcg;
+   spinlock_t *ptl;
+   int ret = 1;
+
+   if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+ 

[PATCH -V9 10/21] swap: Swapin a THP in one piece

2018-12-13 Thread Huang Ying
With this patch, when the page fault handler finds a PMD swap mapping,
it will swap in a THP in one piece.  This avoids the overhead of
splitting/collapsing before/after the THP swapping, and improves the
swap performance greatly thanks to the reduced page fault count, etc.

do_huge_pmd_swap_page() is added in the patch to implement this.  It
is similar to do_swap_page() for normal page swapin.

If allocating a THP fails, the huge swap cluster and the PMD swap
mapping will be split to fall back to normal page swapin.

If the huge swap cluster has already been split, the PMD swap mapping
will be split to fall back to normal page swapin.

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 include/linux/huge_mm.h |   9 +++
 mm/huge_memory.c| 174 
 mm/memory.c |  16 ++--
 3 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 72f2617d336b..debe3760e894 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -371,4 +371,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct 
vm_area_struct *vma)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_THP_SWAP
+extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+#else /* CONFIG_THP_SWAP */
+static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+   return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cec1eec340e..644cb5d6b056 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -1652,6 +1654,178 @@ static void __split_huge_swap_pmd(struct vm_area_struct 
*vma,
pmd_populate(mm, pmd, pgtable);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+  unsigned long address, pmd_t orig_pmd)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   spinlock_t *ptl;
+   int ret = 0;
+
+   ptl = pmd_lock(mm, pmd);
+   if (pmd_same(*pmd, orig_pmd))
+   __split_huge_swap_pmd(vma, address, pmd);
+   else
+   ret = -ENOENT;
+   spin_unlock(ptl);
+
+   return ret;
+}
+
+int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+   struct page *page;
+   struct mem_cgroup *memcg;
+   struct vm_area_struct *vma = vmf->vma;
+   unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+   swp_entry_t entry;
+   pmd_t pmd;
+   int i, locked, exclusive = 0, ret = 0;
+
+   entry = pmd_to_swp_entry(orig_pmd);
+   VM_BUG_ON(non_swap_entry(entry));
+   delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+retry:
+   page = lookup_swap_cache(entry, NULL, vmf->address);
+   if (!page) {
+   page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
+haddr, false);
+   if (!page) {
+   /*
+* Back out if somebody else faulted in this pmd
+* while we released the pmd lock.
+*/
+   if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
+   /*
+* Failed to allocate huge page, split huge swap
+* cluster, and fallback to swapin normal page
+*/
+   ret = split_swap_cluster(entry, 0);
+   /* Somebody else swapin the swap entry, retry */
+   if (ret == -EEXIST) {
+   ret = 0;
+   goto retry;
+   /* swapoff occurs under us */
+   } else if (ret == -EINVAL)
+   ret = 0;
+   else
+   goto fallback;
+   }
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   goto out;
+   }
+
+   /* Had to read the page from swap area: Major fault */
+   ret = VM_FAULT_MAJOR;
+   count_vm_event(PGMAJFAULT);
+   count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+   } else if (!PageTransCompound(page))
+   goto fallback;
+
+   locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+
+   delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+   if (!locked) {
+   ret |= VM_FAULT_RETRY;
+   goto out_release;
+   }
+
+   /*
+   

[PATCH -V9 06/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free()

2018-12-13 Thread Huang Ying
When a PMD swap mapping is removed from a huge swap cluster, for
example when unmapping a memory range mapped with a PMD swap mapping,
free_swap_and_cache() will be called to decrease the reference count
of the huge swap cluster.  free_swap_and_cache() may also free or
split the huge swap cluster, and free the corresponding THP in swap
cache if necessary.  swap_free() is similar, and shares most of its
implementation with free_swap_and_cache().  This patch revises
free_swap_and_cache() and swap_free() to implement this.

If the swap cluster has already been split, for example because of
failing to allocate a THP during swapin, we just decrease the reference
count of each swap slot by one.

Otherwise, we will decrease by one the reference count of all swap
slots and the PMD swap mapping count in cluster_count().  When the
corresponding THP isn't in swap cache: if the PMD swap mapping count
becomes 0, the huge swap cluster will be split, and if all swap counts
become 0, the huge swap cluster will be freed.  When the corresponding
THP is in swap cache: if every swap_map[offset] == SWAP_HAS_CACHE, we
will try to delete the THP from swap cache, which will cause the THP
and the huge swap cluster to be freed.
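
A compact user-space model of this decision logic (simplified and with
invented names; the real code also has to deal with locking and the
normal-page case) could look like:

#include <stdbool.h>
#include <stdio.h>

enum cluster_action {
	KEEP_CLUSTER,		/* still referenced, nothing else to do */
	SPLIT_CLUSTER,		/* no PMD mappings left, PTE mappings remain */
	FREE_CLUSTER,		/* nobody references the swap slots anymore */
	TRY_DELETE_THP,		/* only the swap cache holds the slots */
};

static enum cluster_action put_huge_cluster(bool thp_in_swap_cache,
					    int pmd_map_count,
					    int total_swap_count)
{
	/* one reference (PMD mapping plus per-slot counts) was just dropped */
	if (!thp_in_swap_cache) {
		if (total_swap_count == 0)
			return FREE_CLUSTER;
		if (pmd_map_count == 0)
			return SPLIT_CLUSTER;
		return KEEP_CLUSTER;
	}
	if (total_swap_count == 0)
		return TRY_DELETE_THP;	/* frees the THP and the cluster */
	return KEEP_CLUSTER;
}

int main(void)
{
	printf("%d %d %d %d\n",
	       put_huge_cluster(false, 0, 0),	/* FREE_CLUSTER */
	       put_huge_cluster(false, 0, 3),	/* SPLIT_CLUSTER */
	       put_huge_cluster(false, 1, 5),	/* KEEP_CLUSTER */
	       put_huge_cluster(true, 0, 0));	/* TRY_DELETE_THP */
	return 0;
}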

Signed-off-by: "Huang, Ying" 
Cc: "Kirill A. Shutemov" 
Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Cc: Johannes Weiner 
Cc: Shaohua Li 
Cc: Hugh Dickins 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Dave Hansen 
Cc: Naoya Horiguchi 
Cc: Zi Yan 
Cc: Daniel Jordan 
---
 arch/s390/mm/pgtable.c |   2 +-
 include/linux/swap.h   |   9 ++-
 kernel/power/swap.c|   4 +-
 mm/madvise.c   |   2 +-
 mm/memory.c|   4 +-
 mm/shmem.c |   4 +-
 mm/swapfile.c  | 170 -
 7 files changed, 147 insertions(+), 48 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index f2cc7da473e4..ffd4b68adbb3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -675,7 +675,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, 
swp_entry_t entry)
 
dec_mm_counter(mm, mm_counter(page));
}
-   free_swap_and_cache(entry);
+   free_swap_and_cache(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 70a6ede1e7e0..24c3014894dd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,9 +453,9 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t *entry, int entry_size);
 extern int swapcache_prepare(swp_entry_t entry, int entry_size);
-extern void swap_free(swp_entry_t);
+extern void swap_free(swp_entry_t entry, int entry_size);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_cache(swp_entry_t entry, int entry_size);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
@@ -509,7 +509,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || 
is_device_private_entry(e));})
+#define free_swap_and_cache(e, s)  \
+   ({(is_migration_entry(e) || is_device_private_entry(e)); })
 #define swapcache_prepare(e, s)
\
({(is_migration_entry(e) || is_device_private_entry(e)); })
 
@@ -527,7 +528,7 @@ static inline int swap_duplicate(swp_entry_t *swp, int 
entry_size)
return 0;
 }
 
-static inline void swap_free(swp_entry_t swp)
+static inline void swap_free(swp_entry_t swp, int entry_size)
 {
 }
 
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index d7f6c1a288d3..0275df84ed3d 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -182,7 +182,7 @@ sector_t alloc_swapdev_block(int swap)
offset = swp_offset(get_swap_page_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
-   swap_free(swp_entry(swap, offset));
+   swap_free(swp_entry(swap, offset), 1);
else
return swapdev_block(swap, offset);
}
@@ -206,7 +206,7 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, _extents);
for (offset = ext->start; offset <= ext->end; offset++)
-   swap_free(swp_entry(swap, offset));
+   swap_free(swp_entry(swap, offset), 1);
 
kfree(ext);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index d220ad7087ed..fac48161b015 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,7 +349,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long 
addr,
if 

[PATCH -V9 01/21] swap: Deal with PTE mapped THP when unuse PTE

2018-12-13 Thread Huang Ying
A PTE swap entry may map to a normal swap slot inside a huge swap
cluster.  To free the huge swap cluster and the corresponding
THP (transparent huge page), all PTE swap entry mappings need to be
unmapped.  The original implementation only checks the current PTE swap
entry mapping; this is fixed by calling try_to_free_swap() instead,
which will check all PTE swap mappings inside the huge swap cluster.

This fix could be folded into the patch "mm, swap: rid swapoff of
quadratic complexity" in the -mm patchset.

Signed-off-by: "Huang, Ying" 
Cc: Vineeth Remanan Pillai 
Cc: Kelley Nielsen 
Cc: Rik van Riel 
Cc: Matthew Wilcox 
Cc: Hugh Dickins 
---
 mm/swapfile.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7464d0a92869..9e6da494781f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1921,10 +1921,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, 
pmd_t *pmd,
goto out;
}
 
-   if (PageSwapCache(page) && (swap_count(*swap_map) == 0))
-   delete_from_swap_cache(compound_head(page));
+   try_to_free_swap(page);
 
-   SetPageDirty(page);
unlock_page(page);
put_page(page);
 
-- 
2.18.1



Re: [PATCH 04/11] staging: iio: adt7316: fix handling of dac high resolution option

2018-12-13 Thread Dan Carpenter
On Thu, Dec 13, 2018 at 03:01:46PM -0700, Jeremy Fertic wrote:
> On Wed, Dec 12, 2018 at 11:23:16AM +0300, Dan Carpenter wrote:
> > On Tue, Dec 11, 2018 at 05:54:56PM -0700, Jeremy Fertic wrote:
> > > @@ -651,10 +649,12 @@ static ssize_t 
> > > adt7316_store_da_high_resolution(struct device *dev,
> > >   u8 config3;
> > >   int ret;
> > >  
> > > + if (chip->id == ID_ADT7318 || chip->id == ID_ADT7519)
> > > + return -EPERM;
> > 
> > return -EINVAL is more appropriate than -EPERM.
> > 
> > regards,
> > dan carpenter
> > 
> 
> I chose -EPERM because the driver uses it quite a few times in similar
> circumstances.

Yeah.  I saw that when I reviewed the later patches in this series.

It's really not doing it right.  -EPERM means permission checks like
access_ok() failed so it's not appropriate.  -EINVAL is sort of general
purpose for invalid commands so it's probably the correct thing.

> At least with this driver, -EINVAL is used when the user
> attempts to write data that would never be valid. -EPERM is used when
> either the current device settings prevent some functionality from being
> used, or the device never supports that functionality. This patch is the
> latter, that these two chip ids never support this function.
> 
> I'll change to -EINVAL in a v2 series, but I wonder if I should hold off
> on a separate patch for other instances in this driver since it will be
> undergoing a substantial refactoring.

Generally, you should prefer kernel standards over driver standards and
especially for staging.  But it doesn't matter.  When I reviewed this
patch, I hadn't seen that the driver was doing it like this but now I
know so it's fine.  We can clean it all at once later if you want.

regards,
dan carpenter



Re: [PATCH] powerpc/prom: fix early DEBUG messages

2018-12-13 Thread Michael Ellerman
Christophe Leroy  writes:

> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index fe758cedb93f..d8e56e03c9c6 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -749,7 +749,11 @@ void __init early_init_devtree(void *params)
>   memblock_allow_resize();
>   memblock_dump_all();
>  
> +#ifdef CONFIG_PHYS_64BIT
>   DBG("Phys. mem: %llx\n", memblock_phys_mem_size());
> +#else
> + DBG("Phys. mem: %x\n", memblock_phys_mem_size());
> +#endif

Can we just do:

DBG("Phys. mem: %llx\n", (unsigned long long)memblock_phys_mem_size());

?

cheers


[RFC v2 1/2] pwm: sifive: Add DT documentation for SiFive PWM Controller

2018-12-13 Thread Yash Shah
DT documentation for PWM controller added with updated compatible
string.

Signed-off-by: Wesley W. Terpstra 
[Atish: Compatible string update]
Signed-off-by: Atish Patra 
Signed-off-by: Yash Shah 
---
 .../devicetree/bindings/pwm/pwm-sifive.txt | 44 ++
 1 file changed, 44 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pwm/pwm-sifive.txt

diff --git a/Documentation/devicetree/bindings/pwm/pwm-sifive.txt 
b/Documentation/devicetree/bindings/pwm/pwm-sifive.txt
new file mode 100644
index 000..250d8ee
--- /dev/null
+++ b/Documentation/devicetree/bindings/pwm/pwm-sifive.txt
@@ -0,0 +1,44 @@
+SiFive PWM controller
+
+Unlike most other PWM controllers, the SiFive PWM controller currently only
+supports one period for all channels in the PWM. This is set globally in DTS.
+The period also has significant restrictions on the values it can achieve,
+which the driver rounds to the nearest achievable frequency.
+
+Required properties:
+- compatible: should be something similar to "sifive,-pwm" for
+ the PWM as integrated on a particular chip, and
+ "sifive,pwm" for the general PWM IP block
+ programming model. Supported compatible strings are:
+ "sifive,fu540-c000-pwm" for the SiFive PWM v0 as
+ integrated onto the SiFive FU540 chip, and "sifive,pwm0"
+ for the SiFive PWM v0 IP block with no chip integration
+ tweaks.
+- reg: physical base address and length of the controller's registers
+- clocks: The frequency the controller runs at
+- #pwm-cells: Should be 2.
+  The first cell is the PWM channel number
+  The second cell is the PWM polarity
+- sifive,approx-period: the driver will get as close to this period as it can
+- interrupts: one interrupt per PWM channel
+
+PWM RTL that corresponds to the IP block version numbers can be found
+here:
+
+https://github.com/sifive/sifive-blocks/tree/master/src/main/scala/devices/pwm
+
+Further information on the format of the IP
+block-specific version numbers can be found in
+Documentation/devicetree/bindings/sifive/sifive-blocks-ip-versioning.txt
+
+Examples:
+
+pwm:  pwm@1002 {
+   compatible = "sifive,fu540-c000-pwm","sifive,pwm0";
+   reg = <0x0 0x1002 0x0 0x1000>;
+   clocks = <>;
+   interrupt-parent = <>;
+   interrupts = <42 43 44 45>;
+   #pwm-cells = <2>;
+   sifive,approx-period = <100>;
+};
-- 
1.9.1



[RFC v2 2/2] pwm: sifive: Add a driver for SiFive SoC PWM

2018-12-13 Thread Yash Shah
Adds a PWM driver for the PWM chip present in SiFive's HiFive Unleashed SoC.

Signed-off-by: Wesley W. Terpstra 
[Atish: Various fixes and code cleanup]
Signed-off-by: Atish Patra 
Signed-off-by: Yash Shah 
---
 drivers/pwm/Kconfig  |  10 +++
 drivers/pwm/Makefile |   1 +
 drivers/pwm/pwm-sifive.c | 229 +++
 3 files changed, 240 insertions(+)
 create mode 100644 drivers/pwm/pwm-sifive.c

diff --git a/drivers/pwm/Kconfig b/drivers/pwm/Kconfig
index 27e5dd4..da85557 100644
--- a/drivers/pwm/Kconfig
+++ b/drivers/pwm/Kconfig
@@ -378,6 +378,16 @@ config PWM_SAMSUNG
  To compile this driver as a module, choose M here: the module
  will be called pwm-samsung.
 
+config PWM_SIFIVE
+   tristate "SiFive PWM support"
+   depends on OF
+   depends on COMMON_CLK
+   help
+ Generic PWM framework driver for SiFive SoCs.
+
+ To compile this driver as a module, choose M here: the module
+ will be called pwm-sifive.
+
 config PWM_SPEAR
tristate "STMicroelectronics SPEAr PWM support"
depends on PLAT_SPEAR
diff --git a/drivers/pwm/Makefile b/drivers/pwm/Makefile
index 9c676a0..30089ca 100644
--- a/drivers/pwm/Makefile
+++ b/drivers/pwm/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_PWM_RCAR)+= pwm-rcar.o
 obj-$(CONFIG_PWM_RENESAS_TPU)  += pwm-renesas-tpu.o
 obj-$(CONFIG_PWM_ROCKCHIP) += pwm-rockchip.o
 obj-$(CONFIG_PWM_SAMSUNG)  += pwm-samsung.o
+obj-$(CONFIG_PWM_SIFIVE)   += pwm-sifive.o
 obj-$(CONFIG_PWM_SPEAR)+= pwm-spear.o
 obj-$(CONFIG_PWM_STI)  += pwm-sti.o
 obj-$(CONFIG_PWM_STM32)+= pwm-stm32.o
diff --git a/drivers/pwm/pwm-sifive.c b/drivers/pwm/pwm-sifive.c
new file mode 100644
index 000..26913b6
--- /dev/null
+++ b/drivers/pwm/pwm-sifive.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017-2018 SiFive
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* Register offsets */
+#define REG_PWMCFG 0x0
+#define REG_PWMCOUNT   0x8
+#define REG_PWMS   0x10
+#define REG_PWMCMP00x20
+
+/* PWMCFG fields */
+#define BIT_PWM_SCALE  0
+#define BIT_PWM_STICKY 8
+#define BIT_PWM_ZERO_ZMP   9
+#define BIT_PWM_DEGLITCH   10
+#define BIT_PWM_EN_ALWAYS  12
+#define BIT_PWM_EN_ONCE13
+#define BIT_PWM0_CENTER16
+#define BIT_PWM0_GANG  24
+#define BIT_PWM0_IP28
+
+#define SIZE_PWMCMP4
+#define MASK_PWM_SCALE 0xf
+
+struct sifive_pwm_device {
+   struct pwm_chip chip;
+   struct notifier_block notifier;
+   struct clk *clk;
+   void __iomem *regs;
+   unsigned int approx_period;
+   unsigned int real_period;
+};
+
+static inline struct sifive_pwm_device *to_sifive_pwm_chip(struct pwm_chip *c)
+{
+   return container_of(c, struct sifive_pwm_device, chip);
+}
+
+static int sifive_pwm_apply(struct pwm_chip *chip, struct pwm_device *dev,
+   struct pwm_state *state)
+{
+   struct sifive_pwm_device *pwm = to_sifive_pwm_chip(chip);
+   unsigned int duty_cycle;
+   u32 frac;
+
+   duty_cycle = state->duty_cycle;
+   if (!state->enabled)
+   duty_cycle = 0;
+
+   frac = ((u64)duty_cycle << 16) / state->period;
+   frac = min(frac, 0xU);
+
+   writel(frac, pwm->regs + REG_PWMCMP0 + dev->hwpwm * SIZE_PWMCMP);
+
+   if (state->enabled) {
+   state->period = pwm->real_period;
+   state->duty_cycle = ((u64)frac * pwm->real_period) >> 16;
+   }
+
+   return 0;
+}
+
+static void sifive_pwm_get_state(struct pwm_chip *chip, struct pwm_device *dev,
+struct pwm_state *state)
+{
+   struct sifive_pwm_device *pwm = to_sifive_pwm_chip(chip);
+   u32 duty;
+
+   duty = readl(pwm->regs + REG_PWMCMP0 + dev->hwpwm * SIZE_PWMCMP);
+
+   state->period = pwm->real_period;
+   state->duty_cycle = ((u64)duty * pwm->real_period) >> 16;
+   state->polarity = PWM_POLARITY_INVERSED;
+   state->enabled = duty > 0;
+}
+
+static const struct pwm_ops sifive_pwm_ops = {
+   .get_state = sifive_pwm_get_state,
+   .apply = sifive_pwm_apply,
+   .owner = THIS_MODULE,
+};
+
+static struct pwm_device *sifive_pwm_xlate(struct pwm_chip *chip,
+  const struct of_phandle_args *args)
+{
+   struct sifive_pwm_device *pwm = to_sifive_pwm_chip(chip);
+   struct pwm_device *dev;
+
+   if (args->args[0] >= chip->npwm)
+   return ERR_PTR(-EINVAL);
+
+   dev = pwm_request_from_chip(chip, args->args[0], NULL);
+   if (IS_ERR(dev))
+   return dev;
+
+   /* The period cannot be changed on a per-PWM basis */
+   dev->args.period   = pwm->real_period;
+   dev->args.polarity = PWM_POLARITY_NORMAL;
+   if (args->args[1] & 

[PATCH v2 2/3] f2fs: check PageWriteback flag for ordered case

2018-12-13 Thread Chao Yu
For all ordered cases in f2fs_wait_on_page_writeback(), we need to
check the PageWriteback status, so let's clean up by relocating the
check into f2fs_wait_on_page_writeback().

Signed-off-by: Chao Yu 
---
- cover f2fs_sync_meta_pages and f2fs_write_cache_pages as well.
 fs/f2fs/checkpoint.c | 2 --
 fs/f2fs/data.c   | 1 -
 fs/f2fs/node.c   | 3 ---
 fs/f2fs/segment.c| 6 --
 4 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 4f02461f348c..f6c01487456a 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -372,7 +372,6 @@ long f2fs_sync_meta_pages(struct f2fs_sb_info *sbi, enum 
page_type type,
 
f2fs_wait_on_page_writeback(page, META, true);
 
-   BUG_ON(PageWriteback(page));
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
 
@@ -1291,7 +1290,6 @@ static void commit_checkpoint(struct f2fs_sb_info *sbi,
int err;
 
f2fs_wait_on_page_writeback(page, META, true);
-   f2fs_bug_on(sbi, PageWriteback(page));
 
memcpy(page_address(page), src, PAGE_SIZE);
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index fd3a1e5ab6d9..d4cf4e1f83f2 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2154,7 +2154,6 @@ static int f2fs_write_cache_pages(struct address_space 
*mapping,
goto continue_unlock;
}
 
-   BUG_ON(PageWriteback(page));
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
 
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index c09df777f66f..30a4427aaa94 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1599,7 +1599,6 @@ int f2fs_move_node_page(struct page *node_page, int 
gc_type)
};
 
f2fs_wait_on_page_writeback(node_page, NODE, true);
-   f2fs_bug_on(F2FS_P_SB(node_page), PageWriteback(node_page));
 
set_page_dirty(node_page);
 
@@ -1691,7 +1690,6 @@ int f2fs_fsync_node_pages(struct f2fs_sb_info *sbi, 
struct inode *inode,
}
 
f2fs_wait_on_page_writeback(page, NODE, true);
-   BUG_ON(PageWriteback(page));
 
set_fsync_mark(page, 0);
set_dentry_mark(page, 0);
@@ -1825,7 +1823,6 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
 
f2fs_wait_on_page_writeback(page, NODE, true);
 
-   BUG_ON(PageWriteback(page));
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
 
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index e2e971e89b2d..007a6f6c74c7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3281,10 +3281,12 @@ void f2fs_wait_on_page_writeback(struct page *page,
struct f2fs_sb_info *sbi = F2FS_P_SB(page);
 
f2fs_submit_merged_write_cond(sbi, NULL, page, 0, type);
-   if (ordered)
+   if (ordered) {
wait_on_page_writeback(page);
-   else
+   f2fs_bug_on(sbi, PageWriteback(page));
+   } else {
wait_for_stable_page(page);
+   }
}
 }
 
-- 
2.18.0.rc1



[RFC v2 0/2] PWM support for HiFive Unleashed

2018-12-13 Thread Yash Shah
This patch series adds a PWM driver and DT documentation for the
HiFive Unleashed board. The patches are mostly based on Wesley's patch.
V2 of this patchset incorporates the items below, which were pointed
out in v1.

V2 changed from V1:
  1. Remove inclusion of dt-bindings/pwm/pwm.h
  2. Remove artificial alignments
  3. Replace ioread32/iowrite32 with readl/writel
  4. Remove camelcase
  5. Change dev_info to dev_dbg for unnecessary log
  6. Correct typo in driver name
  7. Remove use of of_match_ptr macro
  8. Update the DT compatible strings and Add reference to a common
 versioning document

Yash Shah (2):
  pwm: sifive: Add DT documentation for SiFive PWM Controller
  pwm: sifive: Add a driver for SiFive SoC PWM

 .../devicetree/bindings/pwm/pwm-sifive.txt |  44 
 drivers/pwm/Kconfig|  10 +
 drivers/pwm/Makefile   |   1 +
 drivers/pwm/pwm-sifive.c   | 229 +
 4 files changed, 284 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pwm/pwm-sifive.txt
 create mode 100644 drivers/pwm/pwm-sifive.c

-- 
1.9.1



Re: [PATCH 02/11] staging: iio: adt7316: invert the logic of the check for an ldac pin

2018-12-13 Thread Dan Carpenter
On Thu, Dec 13, 2018 at 03:06:29PM -0700, Jeremy Fertic wrote:
> On Wed, Dec 12, 2018 at 11:19:49AM +0300, Dan Carpenter wrote:
> > On Tue, Dec 11, 2018 at 05:54:54PM -0700, Jeremy Fertic wrote:
> > > ADT7316_DA_EN_VIA_DAC_LDCA is set when the dac and ldac registers are 
> > > being
> > > used to update the dacs instead of the ldac pin. ADT7516_SEL_AIN3 is an 
> > > adc
> > > input that shares the ldac pin. Only set these bits if an ldac pin is not
> > > being used.
> > > 
> > > Signed-off-by: Jeremy Fertic 
> > 
> > Huh...  This bug has always been there...
> > 
> > Fixes: 35f6b6b86ede ("staging: iio: new ADT7316/7/8 and ADT7516/7/9 driver")
> > 
> > regards,
> > dan carpenter
> > 
> 
> Should I include this Fixes tag in v2? I wasn't sure how important this was
> in staging. I think most of the patches in this series fix bugs that date
> back to the introduction of the driver.

I was just curious to see if it was a cleanup which introduced the
inverted if statement.

I think the Fixes tag is always useful.  For example, it would be
interesting to do some data mining to see how many bugs drivers
normally have when they're first merged.

regards,
dan carpenter


Re: ubifs: fix page_count in ->ubifs_migrate_page()

2018-12-13 Thread zhangjun

On 2018/12/14 上午6:57, Dave Chinner wrote:

On Thu, Dec 13, 2018 at 03:23:37PM +0100, Richard Weinberger wrote:

Hello zhangjun,

thanks a lot for bringing this up!

On Wednesday, 12 December 2018, 15:13:57 CET, zhangjun wrote:

Because PagePrivate() has a different meaning in UBIFS, alloc_cma()
will fail when a dirty page cache page sits in a MIGRATE_CMA
pageblock.

If the 'extra_count' is not adjusted for dirty pages,
ubifs_migrate_page() -> migrate_page_move_mapping() will
always return -EAGAIN because of:
expected_count += page_has_private(page)
This causes the migration to fail until the page cache is cleaned.

In general, PagePrivate() indicates that a buffer_head is already bound
to the page, and at the same time page_count() is also increased.


That's an invalid assumption.

We should not be trying to infer what PagePrivate() means in code
that has no business using looking at it i.e. page->private is private
information for the owner of the page, and it's life cycle and
intent are unknown to anyone other than the page owner.

e.g. on XFS, a page cache page's page->private /might/ contain a
struct iomap_page, or it might be NULL. Assigning a struct
iomap_page to the page does not change the reference count on the
page.  IOWs, the page needs to be handled exactly the same
way by external code regardless of whether there is something
attached to page->private or not.

Hence it looks to me like the migration code is making invalid
assumptions about PagePrivate inferring reference counts and so the
migration code needs to be fixed. Requiring filesystems to work
around invalid assumptions in the migration code is a sure recipe
for problems with random filesystems using page->private for their
own internal purposes.

Cheers,

Dave.
I agree with your main point of view, but for buffer_head based
filesystems this assumption is not a problem, and the comment on the
expected count in migrate_page_move_mapping() spells it out:

  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.

So this assumption is documented there.  To put it more accurately, the
migration code does not currently have a generic function for this
case.  Since we call a function implemented for buffer_head, we should
follow its rules.




Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread John Hubbard
On 12/13/18 9:21 PM, Dan Williams wrote:
> On Thu, Dec 13, 2018 at 7:53 PM John Hubbard  wrote:
>>
>> On 12/12/18 4:51 PM, Dave Chinner wrote:
>>> On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
 On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
>> On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
>>> On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
>>> So this approach doesn't look like a win to me over using counter in 
>>> struct
>>> page and I'd rather try looking into squeezing HMM public page usage of
>>> struct page so that we can fit that gup counter there as well. I know 
>>> that
>>> it may be easier said than done...
>>
>>
>> Agreed. After all the discussion this week, I'm thinking that the original 
>> idea
>> of a per-struct-page counter is better. Fortunately, we can do the moral 
>> equivalent
>> of that, unless I'm overlooking something: Jerome had another proposal that 
>> he
>> described, off-list, for doing that counting, and his idea avoids the 
>> problem of
>> finding space in struct page. (And in fact, when I responded yesterday, I 
>> initially
>> thought that's where he was going with this.)
>>
>> So how about this hybrid solution:
>>
>> 1. Stay with the basic RFC approach of using a per-page counter, but actually
>> store the counter(s) in the mappings instead of the struct page. We can use
>> !PageAnon and page_mapping to look up all the mappings, stash the 
>> dma_pinned_count
>> there. So the total pinned count is scattered across mappings. Probably 
>> still need
>> a PageDmaPinned bit.
> 
> How do you safely look at page->mapping from the get_user_pages_fast()
> path? You'll be racing invalidation disconnecting the page from the
> mapping.
> 

I don't have an answer for that, so maybe the page->mapping idea is dead 
already. 

So in that case, there is still one more way to do all of this, which is to
combine ZONE_DEVICE, HMM, and gup/dma information in a per-page struct, and get
there via basically page->private, more or less like this:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..13f651bb5cc1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -67,6 +67,13 @@ struct hmm;
 #define _struct_page_alignment
 #endif
 
+struct page_aux {
+   struct dev_pagemap *pgmap;
+   unsigned long hmm_data;
+   unsigned long private;
+   atomic_t dma_pinned_count;
+};
+
 struct page {
unsigned long flags;/* Atomic flags, some possibly
 * updated asynchronously */
@@ -149,11 +156,13 @@ struct page {
spinlock_t ptl;
 #endif
};
-   struct {/* ZONE_DEVICE pages */
+   struct {/* ZONE_DEVICE, HMM or get_user_pages() pages */
/** @pgmap: Points to the hosting device page map. */
-   struct dev_pagemap *pgmap;
-   unsigned long hmm_data;
-   unsigned long _zd_pad_1;/* uses mapping */
+   unsigned long _zd_pad_1;/* LRU */
+   unsigned long _zd_pad_2;/* LRU */
+   unsigned long _zd_pad_3;/* mapping */
+   unsigned long _zd_pad_4;/* index */
+   struct page_aux *aux;   /* private */
};
 
/** @rcu_head: You can use this to free a page by RCU. */

...is there any appetite for that approach?

-- 
thanks,
John Hubbard
NVIDIA


RE: [PATCH] dt-bindings: usb: renesas_usbhs: Add r8a774c0 support

2018-12-13 Thread Yoshihiro Shimoda
Hi Fabrizio,

> From: Fabrizio Castro, Sent: Friday, December 14, 2018 5:21 AM
> 
> Document RZ/G2E (R8A774C0) SoC bindings.
> 
> Signed-off-by: Fabrizio Castro 

Thank you for the patch!

Reviewed-by: Yoshihiro Shimoda 

By the way, I'm not sure, but I'm wondering whether we need to add a
.compatible entry for renesas,usbhs-r8a774c0 with .data USBHS_TYPE_RCAR_GEN3_WITH_PLL,
like r8a77990, to drivers/usb/renesas_usbhs/common.c.
What do you think?
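
If so, the new entry would presumably mirror the existing r8a77990 line in
usbhs_of_match[] in that file, roughly like this (untested sketch, exact
placement in the table assumed):

	{
		.compatible = "renesas,usbhs-r8a774c0",
		.data = (void *)USBHS_TYPE_RCAR_GEN3_WITH_PLL,
	},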

Best regards,
Yoshihiro Shimoda

> ---
>  Documentation/devicetree/bindings/usb/renesas_usbhs.txt | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/devicetree/bindings/usb/renesas_usbhs.txt
> b/Documentation/devicetree/bindings/usb/renesas_usbhs.txt
> index 90719f5..d93b6a1 100644
> --- a/Documentation/devicetree/bindings/usb/renesas_usbhs.txt
> +++ b/Documentation/devicetree/bindings/usb/renesas_usbhs.txt
> @@ -7,6 +7,7 @@ Required properties:
>   - "renesas,usbhs-r8a7744" for r8a7744 (RZ/G1N) compatible device
>   - "renesas,usbhs-r8a7745" for r8a7745 (RZ/G1E) compatible device
>   - "renesas,usbhs-r8a774a1" for r8a774a1 (RZ/G2M) compatible device
> + - "renesas,usbhs-r8a774c0" for r8a774c0 (RZ/G2E) compatible device
>   - "renesas,usbhs-r8a7790" for r8a7790 (R-Car H2) compatible device
>   - "renesas,usbhs-r8a7791" for r8a7791 (R-Car M2-W) compatible device
>   - "renesas,usbhs-r8a7792" for r8a7792 (R-Car V2H) compatible device
> --
> 2.7.4



Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread Dave Chinner
On Wed, Dec 12, 2018 at 09:02:29PM -0500, Jerome Glisse wrote:
> On Thu, Dec 13, 2018 at 11:51:19AM +1100, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> > > On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
> > > > > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > > So this approach doesn't look like a win to me over using counter 
> > > > > > in struct
> > > > > > page and I'd rather try looking into squeezing HMM public page 
> > > > > > usage of
> > > > > > struct page so that we can fit that gup counter there as well. I 
> > > > > > know that
> > > > > > it may be easier said than done...
> > > > > 
> > > > > So i want back to the drawing board and first i would like to 
> > > > > ascertain
> > > > > that we all agree on what the objectives are:
> > > > > 
> > > > > [O1] Avoid write back from a page still being written by either a
> > > > >  device or some direct I/O or any other existing user of GUP.
> > 
> > IOWs, you need to mark pages being written to by a GUP as
> > PageWriteback, so all attempts to write the page will block on
> > wait_on_page_writeback() before trying to write the dirty page.
> 
> No you don't and you can't for the simple reasons is that the GUP
> of some device driver can last days, weeks, months, years ... so
> it is not something you want to do. Here is what happens today:
> - user space submit directio read from a file and writing to
>   virtual address and the problematic case is when that virtual
>   address is actualy a mmap of a file itself
> - kernel do GUP on the virtual address, if the page has write
>   permission in the CPU page table mapping then the page
>   refcount is incremented and the page is return to directio
>   kernel code that do memcpy
> 
>   It means that the page already went through page_mkwrite so
>   all is fine from fs point of view.
>   If page does not have write permission then a page fault is
>   triggered and page_mkwrite will happen and prep the page
>   accordingly

Yes, the short term GUP references do the right thing. They aren't
the issue - the problem is the long term GUP references that dirty
clean pages without first having called ->page_mkwrite.

> In the above scheme a page write back might happens after we looked
> up the page from the CPU page table and before directio finish with
> memcpy so that the page content during the write back might not be
> stable. This is a small window for things to go bad and i do not
> think we know if anybody ever experience a bug because of that.
> 
> For other GUP users the flow is the same except that device driver
> that keep the page around and do continuous dma to it might last
> days, weeks, months, years ... so for those the race window is big
> enough for bad things to happen. Jan have report of such bugs.

i.e. this case.

GUP faults the page, gets marked dirty, time passes, page
writeback occurs, it's now mapped clean, time passes, another RDMA
hits those pages, it calls set_page_dirty() again and things go
boom.

Basically, you are saying that the problem here is that writeback
of a dirty page occurred while there was an active GUP, and that
you want us to 

> So what i am proposing to fix the above is have page_mkclean return
> a is_pin boolean if page is pin than the fs code use a bounce page
> to do the write back giving a stable bounce page. More over fs will
> need to keep around all buffer_head, blocks, ... ie whatever is
> associated with that file offset so that any latter set_page_dirty
> would not freak out and would not need to reallocate blocks or do
> anything heavy weight.

 keep the dirty page pinned and never written back until the GUP
is released.

Which, quite frankly, is insanity.  The whole point of
->page_mkwrite() is that we can clean file backed mapped pages at
any point in time and have the next write access correctly mark it
dirty again so it can be written back.

This is *absolutely necessary* for data integrity (i.e. fsync,
sync(), etc) as well as filesystem management operations (e.g.
filesystem freeze) to work correctly and not lose data if the system
crashes or generate corrupt snapshots for backup or migration
purposes.

> We have a separate discussion on what to do about truncate and other
> fs event that inherently invalidate portion of file so i do not
> want to complexify present discussion with those but we also have
> that in mind.
> 
> Do you see any fundamental issues with that ? It abides by all
> existing fs standard AFAICT (you have a page_mkwrite and we ask
> fs to keep the result of that around).

The fundamental issue is that ->page_mkwrite must be called on every
write access to a clean file backed page, not just the first one.
How long the GUP reference lasts is 

[PATCH] fix page_count in ->iomap_migrate_page()

2018-12-13 Thread zhangjun
iomap uses PG_private a little differently than buffer_head based
filesystems:
it uses the flag only as a marker, and the page counter is not incremented
when it is set. migrate_page_move_mapping(), however, assumes that
PG_private accounts for a counter of +1.
So we have to pass an extra_count of -1 to migrate_page_move_mapping()
if the flag is set.

Signed-off-by: zhangjun 
---
 fs/iomap.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 64ce240..352e58a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -544,8 +544,17 @@ iomap_migrate_page(struct address_space *mapping, struct page *newpage,
struct page *page, enum migrate_mode mode)
 {
int ret;
+   int extra_count = 0;
 
-   ret = migrate_page_move_mapping(mapping, newpage, page, NULL, mode, 0);
+   /*
+* iomap uses PG_private as a marker and does not raise the page counter.
+* migrate_page_move_mapping() expects an incremented counter if
+* PG_private is set. Therefore pass -1 as extra_count for this case.
+*/
+   if (page_has_private(page))
+   extra_count = -1;
+   ret = migrate_page_move_mapping(mapping, newpage, page,
+  NULL, mode, extra_count);
if (ret != MIGRATEPAGE_SUCCESS)
return ret;
 
-- 
2.7.4
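
For readers following the -1: the reference check inside
migrate_page_move_mapping() can be paraphrased roughly as below (a simplified
sketch, not the verbatim mm/migrate.c source), which is why cancelling the
PG_private contribution with extra_count = -1 makes the counts match for
iomap pages that set the flag without taking a reference:

	int expected_count = 1 + extra_count;		/* caller adjustment */

	expected_count += 1;				/* the page cache's own reference */
	expected_count += page_has_private(page);	/* assumed +1 for PG_private */

	if (page_count(page) != expected_count)
		return -EAGAIN;				/* migration keeps failing */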



Re: [PATCH v4 3/7] mips: rename macros and files from '64' to 'n64'

2018-12-13 Thread Firoz Khan
Hi Paul,

On Fri, 14 Dec 2018 at 01:45, Paul Burton  wrote:
> I've applied v5 but undone the change from __NR_64_* to __NR_N64_*
> because it's part of the UAPI & a github code search showed that it's
> actually used.
>
> Could you take a look at this branch & check that you're OK with it
> before I push it to mips-next?
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git test-syscalls
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git/log/?h=test-syscalls

This looks good to me. Please push to mips-next.

Thanks
Firoz


RE: [PATCH 2/2] scsi: ufs: add inline crypto support to UFS HCD

2018-12-13 Thread Parshuram Raju Thombare
Hi Ladvine,

>From: Ladvine D Almeida 
>Sent: Friday, December 14, 2018 1:10 AM
>Subject: Re: [PATCH 2/2] scsi: ufs: add inline crypto support to UFS HCD
>Where is the Crypto target 'crypto-ufs' implementation available? Did you submit
>any other patch for the same?
>Also, it is better to provide a generic name as the target is valid for all 
>other block
>devices.

[PATCH 2/2] has the crypto-ufs implementation. [PATCH 1/2] adds a variable to
struct bio.
The device mapper name is not generic because there are UFS-specific pieces in
it, but we should be able to make it generic.

Regards,
Parshuram Thombare

>-Original Message-
>From: Ladvine D Almeida 
>Sent: Friday, December 14, 2018 1:10 AM
>To: Parshuram Raju Thombare ; Eric Biggers
>
>Cc: ax...@kernel.dk; vinholika...@gmail.com; j...@linux.vnet.ibm.com;
>martin.peter...@oracle.com; mchehab+sams...@kernel.org;
>gre...@linuxfoundation.org; da...@davemloft.net; akpm@linux-
>foundation.org; nicolas.fe...@microchip.com; a...@arndb.de; linux-
>ker...@vger.kernel.org; linux-bl...@vger.kernel.org; linux-
>s...@vger.kernel.org; Alan Douglas ; Janek Kotas
>; Rafal Ciepiela ; AnilKumar
>Chimata ; Ladvine D Almeida
>; Satya Tangirala ; Paul
>Crowley ; Manjunath M Bettegowda
>; Tejas Joglekar
>; Joao Pinto ; linux-
>cry...@vger.kernel.org
>Subject: Re: [PATCH 2/2] scsi: ufs: add inline crypto support to UFS HCD
>
>On 12/12/18 5:52 AM, Parshuram Raju Thombare wrote:
>> Hello Eric,
>>
>> Thank you for a comment.
>>
>>> -Original Message-
>>> From: Eric Biggers 
>>> Sent: Tuesday, December 11, 2018 11:47 PM
>>> To: Parshuram Raju Thombare 
>>> Cc: ax...@kernel.dk; vinholika...@gmail.com; j...@linux.vnet.ibm.com;
>>> martin.peter...@oracle.com; mchehab+sams...@kernel.org;
>>> gre...@linuxfoundation.org; da...@davemloft.net; akpm@linux-
>>> foundation.org; nicolas.fe...@microchip.com; a...@arndb.de; linux-
>>> ker...@vger.kernel.org; linux-bl...@vger.kernel.org; linux-
>>> s...@vger.kernel.org; Alan Douglas ; Janek
>>> Kotas ; Rafal Ciepiela ;
>>> AnilKumar Chimata ; Ladvine D Almeida
>>> ; Satya Tangirala ; Paul
>>> Crowley 
>>> Subject: Re: [PATCH 2/2] scsi: ufs: add inline crypto support to UFS
>>> HCD
>>>
>>>
>>> [+Cc other people who have been working on this]
>Eric, Thanks for cc-ing me to the mail chain.
>
>Parshuram,
>Glad to know that you are working on the Inline Encryption support.
>My concerns are mentioned inline below.
>
>>>
>>>
>>>
>>> Hi Parshuram,
>>>
>>>
>>>
>>> On Tue, Dec 11, 2018 at 09:50:27AM +, Parshuram Thombare wrote:
>>>
 Add real time crypto support to UFS HCD using new device
>>>
 mapper 'crypto-ufs'. dmsetup tool can be used to enable
>>>
 real time / inline crypto support using device mapper
>>>
 'crypt-ufs'.
>
>Where is the Crypto target 'crypto-ufs' implementation available? Did you submit
>any other patch for the same?
>Also, it is better to provide a generic name as the target is valid for all 
>other block
>devices.
>
>>>

>>>
 Signed-off-by: Parshuram Thombare 
>>>
 ---
>>>
  MAINTAINERS  |7 +
>>>
  block/Kconfig|5 +
>>>
  drivers/scsi/ufs/Kconfig |   12 +
>>>
  drivers/scsi/ufs/Makefile|1 +
>>>
  drivers/scsi/ufs/ufshcd-crypto.c |  453
>>> ++
>>>
  drivers/scsi/ufs/ufshcd-crypto.h |  102 +
>>>
  drivers/scsi/ufs/ufshcd.c|   27 +++-
>>>
  drivers/scsi/ufs/ufshcd.h|6 +
>>>
  drivers/scsi/ufs/ufshci.h|1 +
>>>
  9 files changed, 613 insertions(+), 1 deletions(-)
>>>
  create mode 100644 drivers/scsi/ufs/ufshcd-crypto.c
>>>
  create mode 100644 drivers/scsi/ufs/ufshcd-crypto.h
>>>

>>>
 diff --git a/MAINTAINERS b/MAINTAINERS
>>>
 index f485597..3a68126 100644
>>>
 --- a/MAINTAINERS
>>>
 +++ b/MAINTAINERS
>>>
 @@ -15340,6 +15340,13 @@ S:Supported
>>>
  F:Documentation/scsi/ufs.txt
>>>
  F:drivers/scsi/ufs/
>>>

>>>
 +UNIVERSAL FLASH STORAGE HOST CONTROLLER CRYPTO DRIVER
>>>
 +M:Parshuram Thombare 
>>>
 +L:linux-s...@vger.kernel.org
>>>
 +S:Supported
>>>
 +F:drivers/scsi/ufs/ufshcd-crypto.c
>>>
 +F:drivers/scsi/ufs/ufshcd-crypto.h
>>>
 +
>>>
  UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER DWC HOOKS
>>>
  M:Joao Pinto 
>>>
  L:linux-s...@vger.kernel.org
>>>
 diff --git a/block/Kconfig b/block/Kconfig
>>>
 index f7045aa..6afe131 100644
>>>
 --- a/block/Kconfig
>>>
 +++ b/block/Kconfig
>>>
 @@ -224,4 +224,9 @@ config BLK_MQ_RDMA
>>>
  config BLK_PM
>>>
def_bool BLOCK && PM
>>>

>>>
 +config BLK_DEV_HW_RT_ENCRYPTION
>>>
 +  bool
>>>
 +  depends on SCSI_UFSHCD_RT_ENCRYPTION
>>>
 +  default n
>>>
 +
>>>
  source block/Kconfig.iosched
>>>
 diff --git 

Re: [PATCH v2 01/12] fs-verity: add a documentation file

2018-12-13 Thread Eric Biggers
On Fri, Dec 14, 2018 at 12:17:22AM -0500, Theodore Y. Ts'o wrote:
> Furthermore, it would require extra complexity in the common fsverity code
> --- which looks for the Merkle tree at the end of file data --- for no real
> benefit.

To clarify, while this is technically true currently, as I mentioned it's been
kept flexible enough such that a filesystem *could* store the metadata elsewhere
with only some slight changes to the common fs/verity/ code which won't break
other filesystems.  Though of course, keeping all filesystems using the
"metadata after EOF" approach does allow a couple simplifications.

- Eric


Re: rcu_preempt caused oom

2018-12-13 Thread Paul E. McKenney
On Thu, Dec 13, 2018 at 09:10:12PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 14, 2018 at 02:40:50AM +, He, Bo wrote:
> > another experiment we have done with the enclosed debug patch, and also 
> > have more rcu trace event enable but without CONFIG_RCU_BOOST config, we 
> > don't reproduce the issue after 90 Hours until now on 10 boards(the issue 
> > should reproduce on one night per previous experience).
> 
> That certainly supports the hypothesis that a wakeup is either not
> being sent or is being lost.  Your patch is great for debugging (thank
> you!), but the real solution of course needs to avoid the extra wakeups,
> especially on battery-powered systems.
> 
> One suggested change below, to get rid of potential false positives.
> 
> > the purposes are to capture the more rcu event trace close to the issue 
> > happen, because I check the __wait_rcu_gp is not always in running, so we 
> > think even it trigger the panic for 3s timeout, the issue is already 
> > happened before 3s.
> 
> Agreed, it would be really good to have trace information from the cause.
> In the case you sent yesterday, it would be good to have trace information
> from 308.256 seconds prior to the sysrq-v, for example, by collecting the
> same event traces you did a few days ago.  It would also be good to know
> whether the scheduler tick is providing interrupts, and if so, why
> rcu_check_gp_start_stall() isn't being invoked.  ;-)
> 
> If collecting this information with your setup is not feasible (for
> example, you might need a large trace buffer to capture five minutes
> of traces), please let me know and I can provide additional debug
> code.  Or you could add "rcu_ftrace_dump(DUMP_ALL);" just before the
> "show_rcu_gp_kthreads();" in your patch below.
> 
> > And Actually the rsp->gp_flags = 1, but RCU_GP_WAIT_GPS(1) ->state: 0x402, 
> > it means the kthread is not schedule for 300s but the RCU_GP_FLAG_INIT is 
> > set. What's your ideas? 
> 
> The most likely possibility is that my analysis below is confused and
> there really is some way that the code can set the RCU_GP_FLAG_INIT
> bit without later doing a wakeup.  The trace data above could help
> unconfuse me.
> 
>   Thanx, Paul
> 
> > -
> > -   swait_event_idle_exclusive(rsp->gp_wq, 
> > READ_ONCE(rsp->gp_flags) &
> > -RCU_GP_FLAG_INIT);
> > +   if (current->pid != rcu_preempt_pid) {
> > +   swait_event_idle_exclusive(rsp->gp_wq, 
> > READ_ONCE(rsp->gp_flags) &
> > +   RCU_GP_FLAG_INIT);
> > +   } else {
> 
> wait_again:
> 
> > +   ret = 
> > swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> > +   RCU_GP_FLAG_INIT, 2*HZ);
> > +
> > +   if(!ret) {
> 
> This would avoid complaining if RCU was legitimately idle for a long time:

Let's try this again.  Unless I am confused (quite possible) your original
would panic if RCU was idle for more than two seconds.  What we instead
want is to panic if we time out, but end up with RCU_GP_FLAG_INIT set.

So something like this:

if (ret == 1) {
/* Timed out with RCU_GP_FLAG_INIT. */
rcu_ftrace_dump(DUMP_ALL);
show_rcu_gp_kthreads();
panic("hung_task: blocked in 
rcu_gp_kthread init");
} else if (!ret) {
/* Timed out w/out RCU_GP_FLAG_INIT. */
goto wait_again;
}

Thanx, Paul

> > +   show_rcu_gp_kthreads();
> > +   panic("hung_task: blocked in 
> > rcu_gp_kthread init");
> > +   }
> > +   }
> > --
> > -Original Message-
> > From: Paul E. McKenney  
> > Sent: Friday, December 14, 2018 10:15 AM
> > To: He, Bo 
> > Cc: Zhang, Jun ; Steven Rostedt ; 
> > linux-kernel@vger.kernel.org; j...@joshtriplett.org; 
> > mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Xiao, Jin 
> > ; Zhang, Yanmin ; Bai, Jie A 
> > ; Sun, Yi J 
> > Subject: Re: rcu_preempt caused oom
> > 
> > On Fri, Dec 14, 2018 at 01:30:04AM +, He, Bo wrote:
> > > as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I 
> > > double checked there is no FAST_NO_HZ in .config:
> > 
> > Yes, you are correct, 

Re: [PATCH v13 2/2] cpufreq: qcom-hw: Add support for QCOM cpufreq HW driver

2018-12-13 Thread Stephen Boyd
Quoting Taniya Das (2018-12-13 20:10:24)
> The CPUfreq HW present in some QCOM chipsets offloads the steps necessary
> for changing the frequency of CPUs. The driver implements the cpufreq
> driver interface for this hardware engine.
> 
> Signed-off-by: Saravana Kannan 
> Signed-off-by: Stephen Boyd 
> Signed-off-by: Taniya Das 
> ---

Reviewed-by: Stephen Boyd 
Tested-by: Stephen Boyd 



Re: [PATCH v13 1/2] dt-bindings: cpufreq: Introduce QCOM CPUFREQ Firmware bindings

2018-12-13 Thread Stephen Boyd
Quoting Taniya Das (2018-12-13 20:10:23)
> Add QCOM cpufreq firmware device bindings for Qualcomm Technology Inc's
> SoCs. This is required for managing the cpu frequency transitions which are
> controlled by the hardware engine.
> 
> Signed-off-by: Taniya Das 
> ---

Reviewed-by: Stephen Boyd 

except one question below for Rob.

> diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt 
> b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
> new file mode 100644
> index 000..33856947
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
> @@ -0,0 +1,172 @@
> +Qualcomm Technologies, Inc. CPUFREQ Bindings
> +
> +CPUFREQ HW is a hardware engine used by some Qualcomm Technologies, Inc. 
> (QTI)
> +SoCs to manage frequency in hardware. It is capable of controlling frequency
> +for multiple clusters.
> +
> +Properties:
> +- compatible
> +   Usage:  required
> +   Value type: 
> +   Definition: must be "qcom,cpufreq-hw".
> +
> +- clocks
> +   Usage:  required
> +   Value type:  From common clock binding.
> +   Definition: clock handle for XO clock and GPLL0 clock.
> +
> +- clock-names
> +   Usage:  required
> +   Value type:  From common clock binding.
> +   Definition: must be "xo", "alternate".
> +
> +- reg
> +   Usage:  required
> +   Value type: 
> +   Definition: Addresses and sizes for the memory of the HW bases in
> +   each frequency domain.
> +- reg-names
> +   Usage:  Optional
> +   Value type: 
> +   Definition: Frequency domain name i.e.
> +   "freq-domain0", "freq-domain1".
> +
> +- #freq-domain-cells:

I still wonder if this should be #qcom,freq-domain-cells, but if Rob is
OK I won't complain.

> +   Usage:  required.

Nitpick: Weird full-stop here ^

> +   Definition: Number of cells in a frequency domain specifier.
> +
> +* Property qcom,freq-domain
> +Devices supporting freq-domain must set their "qcom,freq-domain" property 
> with
> +phandle to a cpufreq_hw followed by the Domain ID(0/1) in the CPU DT node.
> +
> +


Re: [PATCH] power: reset: msm: Add support for download-mode control

2018-12-13 Thread Bjorn Andersson
On Wed 21 Nov 10:26 PST 2018, Stephen Boyd wrote:

> Quoting Stephen Boyd (2018-07-20 10:44:53)
> > Quoting Rajendra Nayak (2018-07-18 23:59:20)
> > > On 7/19/2018 11:12 AM, Bjorn Andersson wrote:
> > > > On Wed 18 Jul 22:18 PDT 2018, Rajendra Nayak wrote:
> > > >> diff --git 
> > > >> a/Documentation/devicetree/bindings/power/reset/msm-poweroff.txt 
> > > >> b/Documentation/devicetree/bindings/power/reset/msm-poweroff.txt
> > > >> index ce44ad3..9dd489f 100644
> > > >> --- a/Documentation/devicetree/bindings/power/reset/msm-poweroff.txt
> > > >> +++ b/Documentation/devicetree/bindings/power/reset/msm-poweroff.txt
> > > >> @@ -8,6 +8,9 @@ settings.
> > > >>   Required Properties:
> > > >>   -compatible: "qcom,pshold"
> > > >>   -reg: Specifies the physical address of the ps-hold register
> > > >> +Optional Properties:
> > > >> +-qcom,dload-mode: phandle to the TCSR hardware block and offset of the
> > > >> + download mode control register
> > > >>   
> > > >>   Example:
> > > >>   
> > > >> diff --git a/drivers/power/reset/Kconfig b/drivers/power/reset/Kconfig
> > > >> index df58fc8..0c97e34 100644
> > > >> --- a/drivers/power/reset/Kconfig
> > > >> +++ b/drivers/power/reset/Kconfig
> > > >> @@ -104,6 +104,17 @@ config POWER_RESET_MSM
> > > >>  help
> > > >>Power off and restart support for Qualcomm boards.
> > > >>   
> > > >> +config POWER_RESET_MSM_DOWNLOAD_MODE
> > > > 
> > > > How about moving QCOM_SCM_DOWNLOAD_MODE_DEFAULT to
> > > > drivers/soc/qcom/Kconfig (and removing "SCM") and referencing this in
> > > > both drivers?
> > > 
> > > yes, thats possible, but I am not sure how to make the command line
> > > option common for both. One other option I thought was if we could handle 
> > > it
> > > within the scm driver itself with an additional
> > > binding to specify the non-secure download mode address.
> > > something like qcom,dload-mode-ns?
> > 
> > Is the SCM device and driver always going to be present though? It may
> > be better to make a TCSR platform device driver on designs that would
> > configure the cookie with direct read/writes from Linux to break the
> > relationship with scm entirely. Then the different configurations could
> > flow from the DTS file either describing scm that has scm call, a
> > special scm_writel address for TCSR, or a specific TCSR node with the
> > address of the download mode cookie that triggers a TCSR driver to probe
> > and register a reboot handler.
> > 
> 
> Does my proposal work? I haven't seen anything new on the list since
> this email.
> 

Afaiu the SCM device is still there, even though we don't use all the
usual functionality.

I tested on qcs404 and sdm845-mtp (LA boot chain), and they both return
positive on:

  __qcom_scm_is_call_available(dev, QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE)

So how about we change qcom_scm_set_download_mode() to do:

  if (scm_call_avail(QCOM_SCM_SVC_BOOT, QCOM_SCM_SET_DLOAD_MODE))
__qcom_scm_set_dload_mode()
  else if (scm_call_avail(QCOM_SCM_SVC_IO, QCOM_SCM_IO_WRITE) && dload_mode_addr)
__qcom_scm_io_writel();
  else if (dload_mode_addr)
writel()

This would also mean that we can put the dload addr in the sdm845.dtsi
and share that between LA and ATF.
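
In code, that ordering could look roughly like the sketch below, reusing the
helper names already in drivers/firmware/qcom_scm.c. The exact signatures, the
__scm state, and the ioremap needed for the raw writel() fallback (dload_mode_va
here) are simplified assumptions, not a tested change:

	static void qcom_scm_set_download_mode(bool enable)
	{
		u32 val = enable ? QCOM_SCM_SET_DLOAD_MODE : 0;
		int ret = 0;

		if (__qcom_scm_is_call_available(__scm->dev, QCOM_SCM_SVC_BOOT,
						 QCOM_SCM_SET_DLOAD_MODE)) {
			/* Preferred: let firmware set the cookie for us. */
			ret = __qcom_scm_set_dload_mode(__scm->dev, enable);
		} else if (__qcom_scm_is_call_available(__scm->dev, QCOM_SCM_SVC_IO,
							QCOM_SCM_IO_WRITE) &&
			   __scm->dload_mode_addr) {
			/* Secure IO write of the TCSR download-mode cookie. */
			ret = __qcom_scm_io_writel(__scm->dev,
						   __scm->dload_mode_addr, val);
		} else if (__scm->dload_mode_addr) {
			/* LA/ATF case: poke the cookie directly (needs ioremap). */
			writel(val, dload_mode_va);
		} else {
			dev_err(__scm->dev,
				"no available mechanism for setting download mode\n");
		}

		if (ret)
			dev_err(__scm->dev, "failed to set download mode: %d\n", ret);
	}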

Regards,
Bjorn


Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions

2018-12-13 Thread Dan Williams
On Thu, Dec 13, 2018 at 7:53 PM John Hubbard  wrote:
>
> On 12/12/18 4:51 PM, Dave Chinner wrote:
> > On Wed, Dec 12, 2018 at 04:59:31PM -0500, Jerome Glisse wrote:
> >> On Thu, Dec 13, 2018 at 08:46:41AM +1100, Dave Chinner wrote:
> >>> On Wed, Dec 12, 2018 at 10:03:20AM -0500, Jerome Glisse wrote:
>  On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > So this approach doesn't look like a win to me over using counter in 
> > struct
> > page and I'd rather try looking into squeezing HMM public page usage of
> > struct page so that we can fit that gup counter there as well. I know 
> > that
> > it may be easier said than done...
> 
>
> Agreed. After all the discussion this week, I'm thinking that the original 
> idea
> of a per-struct-page counter is better. Fortunately, we can do the moral 
> equivalent
> of that, unless I'm overlooking something: Jerome had another proposal that he
> described, off-list, for doing that counting, and his idea avoids the problem 
> of
> finding space in struct page. (And in fact, when I responded yesterday, I 
> initially
> thought that's where he was going with this.)
>
> So how about this hybrid solution:
>
> 1. Stay with the basic RFC approach of using a per-page counter, but actually
> store the counter(s) in the mappings instead of the struct page. We can use
> !PageAnon and page_mapping to look up all the mappings, stash the 
> dma_pinned_count
> there. So the total pinned count is scattered across mappings. Probably still 
> need
> a PageDmaPinned bit.

How do you safely look at page->mapping from the get_user_pages_fast()
path? You'll be racing invalidation disconnecting the page from the
mapping.


Re: [PATCH] Linux: Implement membarrier function

2018-12-13 Thread Paul E. McKenney
On Thu, Dec 13, 2018 at 09:26:47PM -0500, Alan Stern wrote:
> On Thu, 13 Dec 2018, Paul E. McKenney wrote:
> 
> > > > A good next step would be to automatically generate random tests along
> > > > with an automatically generated prediction, like I did for RCU a few
> > > > years back.  I should be able to generalize my time-based cheat for RCU 
> > > > to
> > > > also cover SRCU, though sys_membarrier() will require a bit more 
> > > > thought.
> > > > (The time-based cheat was to have fixed duration RCU grace periods and
> > > > RCU read-side critical sections, with the grace period duration being
> > > > slightly longer than that of the critical sections.  The number of
> > > > processes is of course limited by the chosen durations, but that limit
> > > > can easily be made insanely large.)
> > > 
> > > Imagine that each sys_membarrier call takes a fixed duration and each 
> > > other instruction takes slightly less (the idea being that each 
> > > instruction is a critical section).  Instructions can be reordered 
> > > (although not across a sys_membarrier call), but no matter how the 
> > > reordering is done, the result is disallowed.
> 
> This turns out not to be right.  Instead, imagine that each 
> sys_membarrier call takes a fixed duration, T.  Other instructions can 
> take arbitrary amounts of time and can be reordered abitrarily, with 
> two restrictions:
> 
>   Instructions cannot be reordered past a sys_membarrier call;
> 
>   If instructions A and B are reordered then the time duration
>   from B to A must be less than T.
> 
> If you prefer, you can replace the second restriction with something a 
> little more liberal:
> 
>   If A and B are reordered and A ends up executing after a 
>   sys_membarrier call (on any CPU) then B cannot execute before 
>   that sys_membarrier call.
> 
> Of course, this form is a consequence of the more restrictive form.

Makes sense.  And the zero-size critical sections are why sys_membarrier()
cannot be directly used for classic deferred reclamation.

> > It gets a bit trickier with interleavings of different combinations
> > of RCU, SRCU, and sys_membarrier().  Yes, your cat code very elegantly
> > sorts this out, but my goal is to be able to explain a given example
> > to someone.
> 
> I don't think you're going to be able to fit different combinations of
> RCU, SRCU, and sys_membarrier into this picture.  How would you allow
> tests with incorrect interleaving, such as GP - memb - RSCS - nothing,
> while forbidding similar tests with correct interleaving?

Well, no, I cannot do a simple linear scan tracking time, which is what
the current scripts do.  I must instead find longest sequence with all
operations of the same type (RCU, SRCU, or memb) and work out their
worst-case timing.  If the overall effect of a given sequence is to
go backwards in time, the result is allowed.  Otherwise eliminate that
sequence from the cycle and repeat.  If everything is eliminated, the
cycle is forbidden.

Which can be thought of as an iterative process similar to something
called "rcu-fence", can't it?  ;-)

Thanx, Paul



Re: [PATCH v2 01/12] fs-verity: add a documentation file

2018-12-13 Thread Theodore Y. Ts'o
On Thu, Dec 13, 2018 at 12:22:49PM -0800, Christoph Hellwig wrote:
> On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > > As this apparently got merged despite no proper reviews from VFS
> > > level persons:
> > 
> > fs-verity has been out for review since August, and Cc'ed to all relevant
> > mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> > linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> > documentation (since v2), and a userspace tool.  It's also been presented at
> > multiple conferences, and has been covered by LWN multiple times.  If more
> > people want to review it, then they should do so; there's nothing stopping 
> > them.
> 
> But you did not got a review from someone like Al, Linus, Andrew or me,
> did you?

I don't consider fs-verity to be part of core VFS, but rather a
library that happens to be used by ext4 and f2fs.  This is much like
fscrypt, which was originally an ext4-only thing, but the code was
always set up so it could be used by other file systems, and when f2fs
was interested in using it, we moved it to fs/crypto.  As such the
fscrypto code never got a review from Al, Andrew, or you, and when I
pushed it to Linus, he accepted the pull request.

The difference this time is that ext4 and f2fs are interested in using
common code from the beginning.

> > Can you elaborate on the actual problems you think the current solution 
> > has, and
> > exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> > files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API 
> > for
> > file streams, and (3) when fs-verity is combined with fscrypt, it's 
> > important
> > that the hashes be encrypted, so as to not leak information about the 
> > plaintext.
> 
> Given that you alread use an ioctl as the interface what is the problem
> of passing this data through the ioctl?

The size of the Merkle tree is roughly size/129.  So for a 100MB file
(and there can be Android APK files that big), the Merkle tree could
be almost 800k.  That's not really a size that we would want to push
through an ioctl.
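
(For a rough sense of the arithmetic, assuming 4KB blocks and 32-byte SHA-256
digests, each tree block covers 128 hashes, so:

	level 0:  100MB / 128  ~= 800KB of hashes
	level 1:  800KB / 128  ~=   6KB
	higher levels: negligible

which lands in the same ballpark as the size/129 estimate above.)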

We could treat the ioctl as write-like interface, but using write(2)
seemed to make a lot more sense.  Also, the fscrypt common code
leveraged by f2fs and ext4 assume that the verity tree will be stored
after the data blocks.

Given that the semantics of a verity-protected file is that it is
immutable, you *could* store the Merkle tree in a separate file
stream, but it really doesn't buy you anything --- by definition, you
can't append to a fs-verity protected file.  Furthermore, it would
require extra complexity in the common fsverity code --- which looks
for the Merkle tree at the end of file data --- for no real benefit.

Cheers,

- Ted

P.S.  And if you've purchased a Pixel 3 device, it's already using the
fsverity code, so it's quite well tested (and yes, we have xfstests).


Re: [v7, PATCH 1/2] net:stmmac: dwmac-mediatek: add support for mt2712

2018-12-13 Thread Florian Fainelli
On 12/13/18 at 7:01 PM, biao huang wrote:
> Dear Andrew,
>   Thanks for your comments.
> 
> On Thu, 2018-12-13 at 13:33 +0100, Andrew Lunn wrote:
>> Hi Biao
>>
>>> +   case PHY_INTERFACE_MODE_RGMII:
>>> +   /* the PHY is not responsible for inserting any internal
>>> +* delay by itself in PHY_INTERFACE_MODE_RGMII case,
>>> +* so Ethernet MAC will insert delays for both transmit
>>> +* and receive path here.
>>> +*/
>>
>> What if the PCB designer has decided to do a kink in the clock to add
>> the delays? I don't think any of these delays should depend on the PHY
>> interface mode. It is up to the device tree writer to set both the PHY
>> delay and the MAC delay, based on knowledge of the board, including
>> any kinks in the tracks. The driver should then do what it is told.
>>
> Originally, we recommend equal trace length on PCB, which means that
> RGMII delay by PCB traces is not recommended. so only PHY/MAC delay is
> taken into account in the transmit/receive path.
> 
> as you described above, maybe the equal PCB trace length assumption is
> not reasonable, and we'll only handle MAC delay-ps in our driver based
> on the device tree information no matter which rgmii is selected.

Expecting identical PCB traces is something that is hard to enforce with
external customers, for internal reference boards, absolutely they
should have those traces of equal length.

> 
> Since David already applied this patch, I'll send another patch to fix
> this issue.
>>> +   if (!of_property_read_u32(plat->np, "mediatek,tx-delay-ps", 
>>> _delay_ps)) {
>>> +   if (tx_delay_ps < plat->variant->tx_delay_max) {
>>> +   mac_delay->tx_delay = tx_delay_ps;
>>> +   } else {
>>> +   dev_err(plat->dev, "Invalid TX clock delay: %dps\n", 
>>> tx_delay_ps);
>>> +   return -EINVAL;
>>> +   }
>>> +   }
>>> +
>>> +   if (!of_property_read_u32(plat->np, "mediatek,rx-delay-ps", 
>>> _delay_ps)) {
>>> +   if (rx_delay_ps < plat->variant->rx_delay_max) {
>>> +   mac_delay->rx_delay = rx_delay_ps;
>>> +   } else {
>>> +   dev_err(plat->dev, "Invalid RX clock delay: %dps\n", 
>>> rx_delay_ps);
>>> +   return -EINVAL;
>>> +   }
>>> +   }
>>> +
>>> +   mac_delay->tx_inv = of_property_read_bool(plat->np, 
>>> "mediatek,txc-inverse");
>>> +   mac_delay->rx_inv = of_property_read_bool(plat->np, 
>>> "mediatek,rxc-inverse");
>>> +   mac_delay->fine_tune = of_property_read_bool(plat->np, 
>>> "mediatek,fine-tune");
>>
>> Why is fine tune needed? If the requested delay can be done using fine
>> tune, it should use fine tune. If not, it should use rough tune. The
>> driver can work this out itself.
> 
> fine tune here represents a more accurate delay circuit than coarse
> tune, and it's a parallel circuit of coarse tune.
> For most delay, both fine and coarse tune can meet the requirement.
> It's up to the user to select which one.
> 
> But only one of them can work at the same time, so we need a switch
> flag(fine_tune here) to indicate which one is valid.
> Driver can hardly work out which one is working according to delay-ps.
> 
> Please correct me if any misunderstanding.

You are giving a lot of options for users of this Ethernet controller to
shoot themselves in the feet and spend a good amount of time debugging
why their RGMII connection is not reliable or have timing violations.
-- 
Florian


Re: rcu_preempt caused oom

2018-12-13 Thread Paul E. McKenney
On Fri, Dec 14, 2018 at 02:40:50AM +, He, Bo wrote:
> another experiment we have done with the enclosed debug patch, and also have 
> more rcu trace event enable but without CONFIG_RCU_BOOST config, we don't 
> reproduce the issue after 90 Hours until now on 10 boards(the issue should 
> reproduce on one night per previous experience).

That certainly supports the hypothesis that a wakeup is either not
being sent or is being lost.  Your patch is great for debugging (thank
you!), but the real solution of course needs to avoid the extra wakeups,
especially on battery-powered systems.

One suggested change below, to get rid of potential false positives.

> the purposes are to capture the more rcu event trace close to the issue 
> happen, because I check the __wait_rcu_gp is not always in running, so we 
> think even it trigger the panic for 3s timeout, the issue is already happened 
> before 3s.

Agreed, it would be really good to have trace information from the cause.
In the case you sent yesterday, it would be good to have trace information
from 308.256 seconds prior to the sysrq-v, for example, by collecting the
same event traces you did a few days ago.  It would also be good to know
whether the scheduler tick is providing interrupts, and if so, why
rcu_check_gp_start_stall() isn't being invoked.  ;-)

If collecting this information with your setup is not feasible (for
example, you might need a large trace buffer to capture five minutes
of traces), please let me know and I can provide additional debug
code.  Or you could add "rcu_ftrace_dump(DUMP_ALL);" just before the
"show_rcu_gp_kthreads();" in your patch below.

> And Actually the rsp->gp_flags = 1, but RCU_GP_WAIT_GPS(1) ->state: 0x402, it 
> means the kthread is not schedule for 300s but the RCU_GP_FLAG_INIT is set. 
> What's your ideas? 

The most likely possibility is that my analysis below is confused and
there really is some way that the code can set the RCU_GP_FLAG_INIT
bit without later doing a wakeup.  The trace data above could help
unconfuse me.

Thanx, Paul

> -
> - swait_event_idle_exclusive(rsp->gp_wq, 
> READ_ONCE(rsp->gp_flags) &
> -  RCU_GP_FLAG_INIT);
> + if (current->pid != rcu_preempt_pid) {
> + swait_event_idle_exclusive(rsp->gp_wq, 
> READ_ONCE(rsp->gp_flags) &
> + RCU_GP_FLAG_INIT);
> + } else {

wait_again:

> + ret = 
> swait_event_idle_timeout_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
> + RCU_GP_FLAG_INIT, 2*HZ);
> +
> + if(!ret) {

This would avoid complaining if RCU was legitimately idle for a long time:

if(!ret && !READ_ONCE(rsp->gp_flags)) {
rcu_ftrace_dump(DUMP_ALL);
show_rcu_gp_kthreads();
panic("hung_task: blocked in 
rcu_gp_kthread init");
} else if (!ret) {
goto wait_again;
}

> + show_rcu_gp_kthreads();
> + panic("hung_task: blocked in 
> rcu_gp_kthread init");
> + }
> + }
> --
> -Original Message-
> From: Paul E. McKenney  
> Sent: Friday, December 14, 2018 10:15 AM
> To: He, Bo 
> Cc: Zhang, Jun ; Steven Rostedt ; 
> linux-kernel@vger.kernel.org; j...@joshtriplett.org; 
> mathieu.desnoy...@efficios.com; jiangshan...@gmail.com; Xiao, Jin 
> ; Zhang, Yanmin ; Bai, Jie A 
> ; Sun, Yi J 
> Subject: Re: rcu_preempt caused oom
> 
> On Fri, Dec 14, 2018 at 01:30:04AM +, He, Bo wrote:
> > as you mentioned CONFIG_FAST_NO_HZ, do you mean CONFIG_RCU_FAST_NO_HZ? I 
> > double checked there is no FAST_NO_HZ in .config:
> 
> Yes, you are correct, CONFIG_RCU_FAST_NO_HZ.  OK, you do not have it set, 
> which means several code paths can be ignored.  Also CONFIG_HZ=1000, so
> 300 second delay.
> 
>   Thanx, Paul
> 
> > Here is the grep from .config:
> > egrep "HZ|RCU" .config
> > CONFIG_NO_HZ_COMMON=y
> > # CONFIG_HZ_PERIODIC is not set
> > CONFIG_NO_HZ_IDLE=y
> > # CONFIG_NO_HZ_FULL is not set
> > CONFIG_NO_HZ=y
> > # RCU Subsystem
> > CONFIG_PREEMPT_RCU=y
> > # CONFIG_RCU_EXPERT is not set
> > CONFIG_SRCU=y
> > CONFIG_TREE_SRCU=y
> > CONFIG_TASKS_RCU=y
> > CONFIG_RCU_STALL_COMMON=y
> > CONFIG_RCU_NEED_SEGCBLIST=y
> > # CONFIG_HZ_100 is not 

Re: [PATCH v3] mm: Create the new vm_fault_t type

2018-12-13 Thread Souptick Joarder
Hi Andrew,

On Sat, Nov 24, 2018 at 10:16 AM Souptick Joarder  wrote:
>
> On Thu, Nov 15, 2018 at 7:17 AM Mike Rapoport  wrote:
> >
> > On Tue, Nov 06, 2018 at 05:36:42PM +0530, Souptick Joarder wrote:
> > > Page fault handlers are supposed to return VM_FAULT codes,
> > > but some drivers/file systems mistakenly return error
> > > numbers. Now that all drivers/file systems have been converted
> > > to use the vm_fault_t return type, change the type definition
> > > to no longer be compatible with 'int'. By making it an unsigned
> > > int, the function prototype becomes incompatible with a function
> > > which returns int. Sparse will detect any attempts to return a
> > > value which is not a VM_FAULT code.
> > >
> > > VM_FAULT_SET_HINDEX and VM_FAULT_GET_HINDEX values are changed
> > > to avoid conflict with other VM_FAULT codes.
> > >
> > > Signed-off-by: Souptick Joarder 
> >
> > For the docs part
> > Reviewed-by: Mike Rapoport 
> >
> > > ---
> > > v2: Updated the change log and corrected the document part.
> > > name added to the enum that kernel-doc able to parse it.
> > >
> > > v3: Corrected the documentation.
>
> If no further comment, can we get this patch in queue for 4.21 ?

Do I need to make any further improvement for this patch ?
>
> > >
> > >  include/linux/mm.h   | 46 --
> > >  include/linux/mm_types.h | 73 
> > > +++-
> > >  2 files changed, 72 insertions(+), 47 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index fcf9cc9..511a3ce 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1267,52 +1267,6 @@ static inline void clear_page_pfmemalloc(struct 
> > > page *page)
> > >  }
> > >
> > >  /*
> > > - * Different kinds of faults, as returned by handle_mm_fault().
> > > - * Used to decide whether a process gets delivered SIGBUS or
> > > - * just gets major/minor fault counters bumped up.
> > > - */
> > > -
> > > -#define VM_FAULT_OOM 0x0001
> > > -#define VM_FAULT_SIGBUS  0x0002
> > > -#define VM_FAULT_MAJOR   0x0004
> > > -#define VM_FAULT_WRITE   0x0008  /* Special case for get_user_pages 
> > > */
> > > -#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */
> > > -#define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. 
> > > Index encoded in upper bits */
> > > -#define VM_FAULT_SIGSEGV 0x0040
> > > -
> > > -#define VM_FAULT_NOPAGE  0x0100  /* ->fault installed the pte, not 
> > > return page */
> > > -#define VM_FAULT_LOCKED  0x0200  /* ->fault locked the returned page 
> > > */
> > > -#define VM_FAULT_RETRY   0x0400  /* ->fault blocked, must retry */
> > > -#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall 
> > > back to small */
> > > -#define VM_FAULT_DONE_COW   0x1000   /* ->fault has fully handled COW */
> > > -#define VM_FAULT_NEEDDSYNC  0x2000   /* ->fault did not modify page 
> > > tables
> > > -  * and needs fsync() to complete 
> > > (for
> > > -  * synchronous page faults in DAX) 
> > > */
> > > -
> > > -#define VM_FAULT_ERROR   (VM_FAULT_OOM | VM_FAULT_SIGBUS | 
> > > VM_FAULT_SIGSEGV | \
> > > -  VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
> > > -  VM_FAULT_FALLBACK)
> > > -
> > > -#define VM_FAULT_RESULT_TRACE \
> > > - { VM_FAULT_OOM, "OOM" }, \
> > > - { VM_FAULT_SIGBUS,  "SIGBUS" }, \
> > > - { VM_FAULT_MAJOR,   "MAJOR" }, \
> > > - { VM_FAULT_WRITE,   "WRITE" }, \
> > > - { VM_FAULT_HWPOISON,"HWPOISON" }, \
> > > - { VM_FAULT_HWPOISON_LARGE,  "HWPOISON_LARGE" }, \
> > > - { VM_FAULT_SIGSEGV, "SIGSEGV" }, \
> > > - { VM_FAULT_NOPAGE,  "NOPAGE" }, \
> > > - { VM_FAULT_LOCKED,  "LOCKED" }, \
> > > - { VM_FAULT_RETRY,   "RETRY" }, \
> > > - { VM_FAULT_FALLBACK,"FALLBACK" }, \
> > > - { VM_FAULT_DONE_COW,"DONE_COW" }, \
> > > - { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" }
> > > -
> > > -/* Encode hstate index for a hwpoisoned large page */
> > > -#define VM_FAULT_SET_HINDEX(x) ((x) << 12)
> > > -#define VM_FAULT_GET_HINDEX(x) (((x) >> 12) & 0xf)
> > > -
> > > -/*
> > >   * Can be called by the pagefault handler when it gets a VM_FAULT_OOM.
> > >   */
> > >  extern void pagefault_out_of_memory(void);
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5ed8f62..cb25016 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -22,7 +22,6 @@
> > >  #endif
> > >  #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 
> > > 1))
> > >
> > > -typedef int vm_fault_t;
> > >
> > >  struct address_space;
> > >  struct mem_cgroup;
> > > @@ -609,6 +608,78 @@ static inline bool 

Re: [PATCH v2] arm64: invalidate TLB just before turning MMU on

2018-12-13 Thread Bhupesh Sharma
On Fri, Dec 14, 2018 at 9:39 AM Qian Cai  wrote:
>
> On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
> dump just hung. It has 4 threads on each core. Each 2-core share a same
> L1 and L2 caches, so that is 8 CPUs shares those. All CPUs share a same
> L3 cache.
>
> It turned out that this was due to the TLB containing stale entries (or
> uninitialized junk which just happened to look valid) before turning the
> MMU on in the second kernel, which caused this instruction to hang:
>
> msr sctlr_el1, x0
>
> Although there is a local TLB flush in the second kernel in
> __cpu_setup(), it is called too early. By the time the MMU is turned on
> later, the TLB is dirty again for some reason.
>
> I also tried moving the local TLB flush part around a bit inside
> __cpu_setup(); although kdump then completed some of the time, it fairly
> often triggered a "Synchronous Exception" in EFI after a cold reboot,
> with seemingly no way to recover remotely without reinstalling the OS.
> For example, in these places:
>
> ENTRY(__cpu_setup)
> +   isb
> tlbivmalle1
> dsb nsh
>
> or
>
> mov x0, #3 << 20
> msr cpacr_el1, x0
> +   tlbivmalle1
> +   dsb nsh
>
> Since it is only necessary to flush the local TLB right before turning the
> MMU on, just re-arrange the code a bit like the sequence in __primary_switch()
> within the CONFIG_RANDOMIZE_BASE path, so that it does not depend on other
> instructions in between that could pollute the TLB; with this change the
> "Synchronous Exception" no longer triggers either.
>
> Signed-off-by: Qian Cai 
> ---
>
> v2: merge the similar part from __cpu_setup() pointed out by James.
>
>  arch/arm64/kernel/head.S | 4 
>  arch/arm64/mm/proc.S | 3 ---
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 4471f570a295..7f555dd4577e 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
> msr ttbr0_el1, x2   // load TTBR0
> msr ttbr1_el1, x1   // load TTBR1
> isb
> +
> +   tlbivmalle1 // invalidate TLB
> +   dsb nsh
> +
> msr sctlr_el1, x0
> isb
> /*
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 2c75b0b903ae..14f68afdd57f 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
>   */
> .pushsection ".idmap.text", "awx"
>  ENTRY(__cpu_setup)
> -   tlbivmalle1 // Invalidate local TLB
> -   dsb nsh
> -
> mov x0, #3 << 20
> msr cpacr_el1, x0   // Enable FP/ASIMD
> mov x0, #1 << 12// Reset mdscr_el1 and disable
> --
> 2.17.2 (Apple Git-113)
>

Not sure why I can't reproduce on my HPE Apollo machine, so a couple
of questions:
1. How many CPUs do you enable in the kdump kernel - do you pass
'nr_cpus=1' to the kdump kernel to limit the maximum number of cores
to 1 in the kdump kernel?
2. Which firmware version do you use on your board?

Thanks,
Bhupesh


[PATCH 1/3] f2fs: use kvmalloc, if kmalloc is failed

2018-12-13 Thread Jaegeuk Kim
One report says memalloc failure during mount.

 (unwind_backtrace) from [] (show_stack+0x10/0x14)
 (show_stack) from [] (dump_stack+0x8c/0xa0)
 (dump_stack) from [] (warn_alloc+0xc4/0x160)
 (warn_alloc) from [] (__alloc_pages_nodemask+0x3f4/0x10d0)
 (__alloc_pages_nodemask) from [] (kmalloc_order_trace+0x2c/0x120)
 (kmalloc_order_trace) from [] (build_node_manager+0x35c/0x688)
 (build_node_manager) from [] (f2fs_fill_super+0xf0c/0x16cc)
 (f2fs_fill_super) from [] (mount_bdev+0x15c/0x188)
 (mount_bdev) from [] (f2fs_mount+0x18/0x20)
 (f2fs_mount) from [] (mount_fs+0x158/0x19c)
 (mount_fs) from [] (vfs_kern_mount+0x78/0x134)
 (vfs_kern_mount) from [] (do_mount+0x474/0xca4)
 (do_mount) from [] (SyS_mount+0x94/0xbc)
 (SyS_mount) from [] (ret_fast_syscall+0x0/0x48)

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/acl.c|  6 ++--
 fs/f2fs/checkpoint.c |  2 +-
 fs/f2fs/data.c   |  2 +-
 fs/f2fs/debug.c  |  2 +-
 fs/f2fs/f2fs.h   | 10 +--
 fs/f2fs/gc.c |  4 +--
 fs/f2fs/inline.c |  4 +--
 fs/f2fs/namei.c  |  2 +-
 fs/f2fs/node.c   | 10 +++
 fs/f2fs/segment.c| 36 +++
 fs/f2fs/super.c  | 68 ++--
 11 files changed, 76 insertions(+), 70 deletions(-)

diff --git a/fs/f2fs/acl.c b/fs/f2fs/acl.c
index 22f0d17cde43..63e599524085 100644
--- a/fs/f2fs/acl.c
+++ b/fs/f2fs/acl.c
@@ -160,7 +160,7 @@ static void *f2fs_acl_to_disk(struct f2fs_sb_info *sbi,
return (void *)f2fs_acl;
 
 fail:
-   kfree(f2fs_acl);
+   kvfree(f2fs_acl);
return ERR_PTR(-EINVAL);
 }
 
@@ -190,7 +190,7 @@ static struct posix_acl *__f2fs_get_acl(struct inode 
*inode, int type,
acl = NULL;
else
acl = ERR_PTR(retval);
-   kfree(value);
+   kvfree(value);
 
return acl;
 }
@@ -240,7 +240,7 @@ static int __f2fs_set_acl(struct inode *inode, int type,
 
error = f2fs_setxattr(inode, name_index, "", value, size, ipage, 0);
 
-   kfree(value);
+   kvfree(value);
if (!error)
set_cached_acl(inode, type, acl);
 
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 3a25cf22d732..5cba6a8ee55c 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -911,7 +911,7 @@ int f2fs_get_valid_checkpoint(struct f2fs_sb_info *sbi)
f2fs_put_page(cp1, 1);
f2fs_put_page(cp2, 1);
 fail_no_cp:
-   kfree(sbi->ckpt);
+   kvfree(sbi->ckpt);
return -EINVAL;
 }
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index fd3a1e5ab6d9..59d86f692c84 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2573,7 +2573,7 @@ static void f2fs_dio_end_io(struct bio *bio)
bio->bi_private = dio->orig_private;
bio->bi_end_io = dio->orig_end_io;
 
-   kfree(dio);
+   kvfree(dio);
 
bio_endio(bio);
 }
diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
index 1f1230f690ec..11d4448e8e09 100644
--- a/fs/f2fs/debug.c
+++ b/fs/f2fs/debug.c
@@ -503,7 +503,7 @@ void f2fs_destroy_stats(struct f2fs_sb_info *sbi)
list_del(>stat_list);
mutex_unlock(_stat_mutex);
 
-   kfree(si);
+   kvfree(si);
 }
 
 int __init f2fs_create_root_stats(void)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 7cec897146a3..81bd9a2bf22b 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1631,7 +1631,7 @@ static inline void disable_nat_bits(struct f2fs_sb_info 
*sbi, bool lock)
if (lock)
spin_lock_irqsave(>cp_lock, flags);
__clear_ckpt_flags(F2FS_CKPT(sbi), CP_NAT_BITS_FLAG);
-   kfree(NM_I(sbi)->nat_bits);
+   kvfree(NM_I(sbi)->nat_bits);
NM_I(sbi)->nat_bits = NULL;
if (lock)
spin_unlock_irqrestore(>cp_lock, flags);
@@ -2705,12 +2705,18 @@ static inline bool f2fs_may_extent_tree(struct inode 
*inode)
 static inline void *f2fs_kmalloc(struct f2fs_sb_info *sbi,
size_t size, gfp_t flags)
 {
+   void *ret;
+
if (time_to_inject(sbi, FAULT_KMALLOC)) {
f2fs_show_injection_info(FAULT_KMALLOC);
return NULL;
}
 
-   return kmalloc(size, flags);
+   ret = kmalloc(size, flags);
+   if (ret)
+   return ret;
+
+   return kvmalloc(size, flags);
 }
 
 static inline void *f2fs_kzalloc(struct f2fs_sb_info *sbi,
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 71462f2e47d4..20ca8141a720 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -142,7 +142,7 @@ int f2fs_start_gc_thread(struct f2fs_sb_info *sbi)
"f2fs_gc-%u:%u", MAJOR(dev), MINOR(dev));
if (IS_ERR(gc_th->f2fs_gc_task)) {
err = PTR_ERR(gc_th->f2fs_gc_task);
-   kfree(gc_th);
+   kvfree(gc_th);
sbi->gc_thread = NULL;
}
 out:
@@ -155,7 +155,7 @@ void f2fs_stop_gc_thread(struct f2fs_sb_info *sbi)
if (!gc_th)
return;
kthread_stop(gc_th->f2fs_gc_task);
-   kfree(gc_th);
+ 

[PATCH 3/3] f2fs: flush stale issued discard candidates

2018-12-13 Thread Jaegeuk Kim
Sometimes, I could observe the issuing_discard count stuck at 1, which blocks
background jobs because is_idle() stays false.
The only way to get out of it was to trigger gc_urgent. This patch avoids that
by waiting for any queued discard candidates to finish before the discard
thread goes back to sleep.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/segment.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 49ea9009ab5a..acbbc924e518 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -1651,6 +1651,10 @@ static int issue_discard_thread(void *data)
if (dcc->discard_wake)
dcc->discard_wake = 0;
 
+   /* clean up pending candidates before going to sleep */
+   if (atomic_read(>queued_discard))
+   __wait_all_discard_cmd(sbi, NULL);
+
if (try_to_freeze())
continue;
if (f2fs_readonly(sbi->sb))
-- 
2.19.0.605.g01d371f741-goog



[PATCH 2/3] f2fs: correct wrong spelling, issing_*

2018-12-13 Thread Jaegeuk Kim
Let's use "queued" instead of "issuing".

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/debug.c   |  4 ++--
 fs/f2fs/f2fs.h| 10 +-
 fs/f2fs/segment.c | 26 +-
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
index 11d4448e8e09..ebcc121920ba 100644
--- a/fs/f2fs/debug.c
+++ b/fs/f2fs/debug.c
@@ -64,7 +64,7 @@ static void update_general_status(struct f2fs_sb_info *sbi)
si->nr_flushed =
atomic_read(_I(sbi)->fcc_info->issued_flush);
si->nr_flushing =
-   atomic_read(_I(sbi)->fcc_info->issing_flush);
+   atomic_read(_I(sbi)->fcc_info->queued_flush);
si->flush_list_empty =
llist_empty(_I(sbi)->fcc_info->issue_list);
}
@@ -72,7 +72,7 @@ static void update_general_status(struct f2fs_sb_info *sbi)
si->nr_discarded =
atomic_read(&SM_I(sbi)->dcc_info->issued_discard);
si->nr_discarding =
-   atomic_read(&SM_I(sbi)->dcc_info->issing_discard);
+   atomic_read(&SM_I(sbi)->dcc_info->queued_discard);
si->nr_discard_cmd =
atomic_read(&SM_I(sbi)->dcc_info->discard_cmd_cnt);
si->undiscard_blks = SM_I(sbi)->dcc_info->undiscard_blks;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 81bd9a2bf22b..b634043fe14c 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -285,7 +285,7 @@ struct discard_cmd {
struct block_device *bdev;  /* bdev */
unsigned short ref; /* reference count */
unsigned char state;/* state */
-   unsigned char issuing;  /* issuing discard */
+   unsigned char queued;   /* queued discard */
int error;  /* bio error */
spinlock_t lock;/* for state/bio_ref updating */
unsigned short bio_ref; /* bio reference count */
@@ -327,7 +327,7 @@ struct discard_cmd_control {
unsigned int undiscard_blks;/* # of undiscard blocks */
unsigned int next_pos;  /* next discard position */
atomic_t issued_discard;/* # of issued discard */
-   atomic_t issing_discard;/* # of issing discard */
+   atomic_t queued_discard;/* # of queued discard */
atomic_t discard_cmd_cnt;   /* # of cached cmd count */
struct rb_root_cached root; /* root of discard rb-tree */
bool rbtree_check;  /* config for consistence check 
*/
@@ -892,7 +892,7 @@ struct flush_cmd_control {
struct task_struct *f2fs_issue_flush;   /* flush thread */
wait_queue_head_t flush_wait_queue; /* waiting queue for wake-up */
atomic_t issued_flush;  /* # of issued flushes */
-   atomic_t issing_flush;  /* # of issing flushes */
+   atomic_t queued_flush;  /* # of queued flushes */
struct llist_head issue_list;   /* list for command issue */
struct llist_node *dispatch_list;   /* list for command dispatch */
 };
@@ -2167,8 +2167,8 @@ static inline bool is_idle(struct f2fs_sb_info *sbi, int 
type)
get_pages(sbi, F2FS_WB_CP_DATA) ||
get_pages(sbi, F2FS_DIO_READ) ||
get_pages(sbi, F2FS_DIO_WRITE) ||
-   atomic_read(&SM_I(sbi)->dcc_info->issing_discard) ||
-   atomic_read(&SM_I(sbi)->fcc_info->issing_flush))
+   atomic_read(&SM_I(sbi)->dcc_info->queued_discard) ||
+   atomic_read(&SM_I(sbi)->fcc_info->queued_flush))
return false;
return f2fs_time_over(sbi, type);
 }
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 98649d304a2f..49ea9009ab5a 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -620,16 +620,16 @@ int f2fs_issue_flush(struct f2fs_sb_info *sbi, nid_t ino)
return 0;
 
if (!test_opt(sbi, FLUSH_MERGE)) {
-   atomic_inc(&fcc->issing_flush);
+   atomic_inc(&fcc->queued_flush);
ret = submit_flush_wait(sbi, ino);
-   atomic_dec(&fcc->issing_flush);
+   atomic_dec(&fcc->queued_flush);
atomic_inc(&fcc->issued_flush);
return ret;
}
 
-   if (atomic_inc_return(&fcc->issing_flush) == 1 || sbi->s_ndevs > 1) {
+   if (atomic_inc_return(&fcc->queued_flush) == 1 || sbi->s_ndevs > 1) {
ret = submit_flush_wait(sbi, ino);
-   atomic_dec(&fcc->issing_flush);
+   atomic_dec(&fcc->queued_flush);
 
atomic_inc(&fcc->issued_flush);
return ret;
@@ -648,14 +648,14 @@ int f2fs_issue_flush(struct f2fs_sb_info *sbi, nid_t ino)
 
if (fcc->f2fs_issue_flush) {
wait_for_completion(&cmd.wait);
-   

Re: [PATCH] arm64: dts: qcom: sdm845: Add Q6V5 MSS node

2018-12-13 Thread Bjorn Andersson
On Thu 13 Dec 14:17 PST 2018, Doug Anderson wrote:
> On Tue, Nov 27, 2018 at 12:58 AM Sibi Sankar  wrote:
[..]
> > diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi 
> > b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> > index 58870273dbc9..df16ee464872 100644
> > --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> > +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> > @@ -1095,6 +1095,69 @@
> > };
> > };
> >
> > +   remoteproc@408 {
> > +   compatible = "qcom,sdm845-mss-pil";
> > +   reg = <0x0408 0x408>, <0x0418 0x48>;
> 
> s/0x0408/0x408 to appease the DT folks.
> 

Andy requests this to be padded to 8 digits, and I've come to really
appreciate this as it makes sorting much easier.

But perhaps there's a verdict on this?

Regards,
Bjorn


Re: Generic kernel fails to boot on Alpha bisected to b38d08f3181c

2018-12-13 Thread Michael Cree
On Thu, Dec 13, 2018 at 08:07:24AM -0800, Tejun Heo wrote:
> Hello, Michael.
> 
> On Thu, Dec 13, 2018 at 09:26:12PM +1300, Michael Cree wrote:
> > A kernel built for generic UP Alpha had been noted to fail to boot
> > for quite some time (since the release of 3.18).  The kernel either
> > locks up before printing any messages to the console or just falls
> > back into the SRM with a HALT instruction again before any messages
> > are printed to the console.  A work around is to either use a kernel
> > built for generic SMP or to build a machine specific kernel as these
> > boot correctly.
> > 
> > Because there were other compile errors at the time it proved
> > difficult to bisect, but we are continuing to get complaints about
> > it as it renders the Debian Alpha installer somewhat useless, so I
> > returned to trying to find the problem and managed to bisect it to:
> > 
> > commit b38d08f3181c5025a7ce84646494cc4748492a3b
> > Author: Tejun Heo 
> > Date:   Tue Sep 2 14:46:02 2014 -0400
> > 
> > percpu: restructure locking
> > 
> > Any suggestions as to what might be the problem and a fix?
> 
> So, the only thing I can think of is that it's calling
> spin_unlock_irq() while irq handling isn't set up yet.  Can you please
> try the followings?
> 
> 1. Convert all spin_[un]lock_irq() to
>spin_lock_irqsave/unlock_irqrestore().

Yes, that's it.  With the attached patch the kernel boots.

Cheers
Michael.
>From e08cf3c714184d8fe168fffcd7d15732924deb1e Mon Sep 17 00:00:00 2001
From: Michael Cree 
Date: Fri, 14 Dec 2018 17:24:31 +1300
Subject: [PATCH] percpu: convert spin_lock_irq to spin_lock_irqsave.

Bisection led to commit b38d08f3181c ("percpu: restructure
locking") as being the cause of lockups at initial boot on
the kernel built for generic Alpha.

On a suggestion by Tejun Heo that:

So, the only thing I can think of is that it's calling
spin_unlock_irq() while irq handling isn't set up yet.
Can you please try the followings?

1. Convert all spin_[un]lock_irq() to
   spin_lock_irqsave/unlock_irqrestore().
---
 mm/percpu-km.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 38de70ab1a0d..0f643dc2dc65 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -50,6 +50,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT;
 	struct pcpu_chunk *chunk;
 	struct page *pages;
+	unsigned long flags;
 	int i;
 
 	chunk = pcpu_alloc_chunk(gfp);
@@ -68,9 +69,9 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	chunk->data = pages;
 	chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
 
-	spin_lock_irq(&pcpu_lock);
+	spin_lock_irqsave(&pcpu_lock, flags);
 	pcpu_chunk_populated(chunk, 0, nr_pages, false);
-	spin_unlock_irq(&pcpu_lock);
+	spin_unlock_irqrestore(&pcpu_lock, flags);
 
 	pcpu_stats_chunk_alloc();
 	trace_percpu_create_chunk(chunk->base_addr);
-- 
2.11.0



Re: [PATCH v2 01/12] fs-verity: add a documentation file

2018-12-13 Thread Eric Biggers
Hi Christoph,

On Thu, Dec 13, 2018 at 12:22:49PM -0800, Christoph Hellwig wrote:
> On Wed, Dec 12, 2018 at 12:26:10PM -0800, Eric Biggers wrote:
> > > As this apparently got merged despite no proper reviews from VFS
> > > level persons:
> > 
> > fs-verity has been out for review since August, and Cc'ed to all relevant
> > mailing lists including linux-fsdevel, linux-ext4, linux-f2fs-devel,
> > linux-fscrypt, linux-integrity, and linux-kernel.  There are tests,
> > documentation (since v2), and a userspace tool.  It's also been presented at
> > multiple conferences, and has been covered by LWN multiple times.  If more
> > people want to review it, then they should do so; there's nothing stopping 
> > them.
> 
> But you did not got a review from someone like Al, Linus, Andrew or me,
> did you?

Sure, those specific people (modulo you just now) haven't responded to the
fs-verity patches yet.  But again, the patches have been out for review for
months.  Of course, we always prefer more reviews over fewer, and we strongly
encourage anyone interested to review fs-verity!  (The Documentation/ file may
be a good place to start.)  But ultimately we cannot force reviews, and as you
know kernel reviews can be very hard to come by.  Yet, people still need
fs-verity anyway; it isn't just some toy.  And we're committed to maintaining
it, similar to fscrypt.  The ext4 and f2fs maintainers are also satisfied with
the current approach to storing the verity metadata past EOF; in fact it was
even originally Ted's idea, I think.

> 
> > Can you elaborate on the actual problems you think the current solution 
> > has, and
> > exactly what solution you'd prefer instead?  Keep in mind that (1) for large
> > files the Merkle tree can be gigabytes long, (2) Linux doesn't have an API 
> > for
> > file streams, and (3) when fs-verity is combined with fscrypt, it's 
> > important
> > that the hashes be encrypted, so as to not leak information about the 
> > plaintext.
> 
> Given that you alread use an ioctl as the interface what is the problem
> of passing this data through the ioctl?

Do you mean pass the verity metadata in a buffer?  That cannot work in general,
because it may be too large to fit into memory.

Or do you mean pass it via a second file descriptor?  That could work, but it
doesn't seem better than the current approach.  It would force every filesystem
to move the metadata around, whereas currently ext4 and f2fs can simply leave it
in place.  If you meant this, are there advantages you have in mind that would
outweigh this?

We also considered generating the Merkle tree in the kernel, in which case
FS_IOC_ENABLE_VERITY would just take a small structure similar to the current
fsverity_descriptor.  But that would add extra complexity to the kernel, and
generating a Merkle tree over a large file is the type of parallelizable, CPU
intensive work that really should be done in userspace.  Also, having userspace
provide the Merkle tree allows for it to be pre-generated and distributed with
the file, e.g. provided in a package to be installed on many systems.
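
To make the shape of that userspace work concrete, here is a rough sketch
(only an illustration, not fsverity-utils code: it assumes SHA-256 and 4 KiB
blocks, and it ignores details such as zero-padding the last block, salts,
and the on-disk level ordering):

/*
 * Userspace sketch only -- not fsverity-utils.  Assumes SHA-256 over
 * 4 KiB blocks and a non-empty input buffer; error handling is minimal.
 */
#include <openssl/sha.h>
#include <stdlib.h>
#include <string.h>

#define FV_BLOCK_SIZE	4096

/* Hash each FV_BLOCK_SIZE chunk of 'in' into 'out'; return bytes written. */
static size_t hash_one_level(const unsigned char *in, size_t len,
			     unsigned char *out)
{
	size_t off, n = 0;

	for (off = 0; off < len; off += FV_BLOCK_SIZE) {
		size_t chunk = len - off;

		if (chunk > FV_BLOCK_SIZE)
			chunk = FV_BLOCK_SIZE;
		SHA256(in + off, chunk, out + n);
		n += SHA256_DIGEST_LENGTH;
	}
	return n;
}

/* Reduce 'data' level by level until a single root digest remains. */
static int merkle_root(const unsigned char *data, size_t len,
		       unsigned char root[SHA256_DIGEST_LENGTH])
{
	size_t cap = ((len + FV_BLOCK_SIZE - 1) / FV_BLOCK_SIZE) *
		     SHA256_DIGEST_LENGTH;
	unsigned char *cur;
	size_t cur_len;

	if (!len)
		return -1;
	cur = malloc(cap);
	if (!cur)
		return -1;

	cur_len = hash_one_level(data, len, cur);
	while (cur_len > SHA256_DIGEST_LENGTH) {
		unsigned char *next = malloc(cur_len);	/* generous upper bound */

		if (!next) {
			free(cur);
			return -1;
		}
		cur_len = hash_one_level(cur, cur_len, next);
		free(cur);
		cur = next;
	}
	memcpy(root, cur, SHA256_DIGEST_LENGTH);
	free(cur);
	return 0;
}

Each level is an independent pass over the level below it, which is also why
this work parallelizes well in userspace.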

But please do let us know if you have any better ideas.

Thanks!

- Eric


Re: [PATCH v6 08/27] csky: define syscall_get_arch()

2018-12-13 Thread Guo Ren
Thx Dmitry,

Reviewed-by: Guo Ren 

On Thu, Dec 13, 2018 at 08:22:07PM +0300, Dmitry V. Levin wrote:
> syscall_get_arch() is required to be implemented on all architectures
> in order to extend the generic ptrace API with PTRACE_GET_SYSCALL_INFO
> request.
> 
> Cc: Guo Ren 
> Cc: Paul Moore 
> Cc: Eric Paris 
> Cc: Oleg Nesterov 
> Cc: Andy Lutomirski 
> Cc: Elvira Khabirova 
> Cc: Eugene Syromyatnikov 
> Cc: linux-au...@redhat.com
> Signed-off-by: Dmitry V. Levin 
> ---
> 
> Notes:
> v6: unchanged
> 
>  arch/csky/include/asm/syscall.h | 7 +++
>  include/uapi/linux/audit.h  | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
> index 926a64a8b4ee..d637445737b7 100644
> --- a/arch/csky/include/asm/syscall.h
> +++ b/arch/csky/include/asm/syscall.h
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static inline int
>  syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
> @@ -68,4 +69,10 @@ syscall_set_arguments(struct task_struct *task, struct 
> pt_regs *regs,
>   memcpy(&regs->a1 + i * sizeof(regs->a1), args, n * sizeof(regs->a0));
>  }
>  
> +static inline int
> +syscall_get_arch(void)
> +{
> + return AUDIT_ARCH_CSKY;
> +}
> +
>  #endif   /* __ASM_SYSCALL_H */
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 72aeea0a740d..55904a40d768 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -384,6 +384,7 @@ enum {
>  #define AUDIT_ARCH_C6X   (EM_TI_C6000|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_C6XBE (EM_TI_C6000)
>  #define AUDIT_ARCH_CRIS  (EM_CRIS|__AUDIT_ARCH_LE)
> +#define AUDIT_ARCH_CSKY  (EM_CSKY|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_FRV   (EM_FRV)
>  #define AUDIT_ARCH_I386  (EM_386|__AUDIT_ARCH_LE)
>  #define AUDIT_ARCH_IA64  
> (EM_IA_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> -- 
> ldv


Re: [PATCH v6 07/27] elf-em.h: add EM_CSKY

2018-12-13 Thread Guo Ren
Reviewed-by: Guo Ren 

On Thu, Dec 13, 2018 at 08:22:00PM +0300, Dmitry V. Levin wrote:
> The uapi/linux/audit.h header is going to use EM_CSKY in order
> to define AUDIT_ARCH_CSKY which is needed to implement
> syscall_get_arch() which in turn is required to extend
> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
> 
> The value for EM_CSKY has been taken from arch/csky/include/asm/elf.h
> and confirmed by binutils:include/elf/common.h
> 
> Cc: Guo Ren 
> Cc: Oleg Nesterov 
> Cc: Andy Lutomirski 
> Cc: Elvira Khabirova 
> Cc: Eugene Syromyatnikov 
> Signed-off-by: Dmitry V. Levin 
> ---
> 
> Notes:
> v6: unchanged
> 
>  include/uapi/linux/elf-em.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/elf-em.h b/include/uapi/linux/elf-em.h
> index 42b7546352a6..ee0b26ab92b0 100644
> --- a/include/uapi/linux/elf-em.h
> +++ b/include/uapi/linux/elf-em.h
> @@ -45,6 +45,7 @@
>  #define EM_ARCV2 195 /* ARCv2 Cores */
>  #define EM_RISCV 243 /* RISC-V */
>  #define EM_BPF   247 /* Linux BPF - in-kernel virtual 
> machine */
> +#define EM_CSKY  252 /* C-SKY processor family */
>  #define EM_FRV   0x5441  /* Fujitsu FR-V */
>  
>  /*
> -- 
> ldv


Re: [PATCH v13 0/2] cpufreq: qcom-hw: Add support for QCOM cpufreq HW

2018-12-13 Thread Viresh Kumar
On 14-12-18, 09:40, Taniya Das wrote:
>  [v13]
>* Update Documentation binding to #freq-domain-cells in description.
>* Replace devm_ioremap_resource() to use devm_ioremap() API.

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()

2018-12-13 Thread Jason Wang



On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:

On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:

Hi:

This series tries to access virtqueue metadata through kernel virtual
address instead of copy_user() friends since they had too much
overheads like checks, spec barriers or even hardware feature
toggling.

Test shows about 24% improvement on TX PPS. It should benefit other
cases as well.

Please review

I think the idea of speeding up userspace access is a good one.
However I think that moving all checks to start is way too aggressive.



So did packet and AF_XDP. Anyway, sharing the address space and accessing it
directly is the fastest way. Performance is the major consideration for
people choosing a backend. Compared to a userspace implementation, vhost
does not have security advantages at any level. If vhost is still slow,
people will start to develop backends based on e.g. AF_XDP.
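
To make the concept concrete, here is a sketch of the idea only (not the
patches in this series; it assumes the 2018-era
get_user_pages_fast(start, nr_pages, write, pages) signature): pin the ring
pages once, vmap() them, and then dereference the metadata directly instead
of paying for copy_from_user() on every access.

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/*
 * Concept sketch only -- not the vhost patches.  Pin a userspace ring once,
 * map it into the kernel, and return a kernel pointer to it.
 */
static void *map_user_ring(unsigned long uaddr, size_t size,
			   struct page ***pagesp, int *npagesp)
{
	int npages = PAGE_ALIGN((uaddr & ~PAGE_MASK) + size) >> PAGE_SHIFT;
	struct page **pages;
	void *vaddr;
	int pinned;

	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	/* 2018-era signature: write = 1 */
	pinned = get_user_pages_fast(uaddr & PAGE_MASK, npages, 1, pages);
	if (pinned != npages) {
		while (pinned > 0)
			put_page(pages[--pinned]);
		kfree(pages);
		return NULL;
	}

	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
	if (!vaddr) {
		while (npages > 0)
			put_page(pages[--npages]);
		kfree(pages);
		return NULL;
	}

	*pagesp = pages;
	*npagesp = npages;
	/* callers can now read e.g. the avail index directly */
	return vaddr + (uaddr & ~PAGE_MASK);
}

Teardown is the mirror image (vunmap() plus put_page() on each page); keeping
such a mapping coherent when userspace remaps or frees the memory is the hard
part that this sketch leaves out.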




Instead, let's batch things up but let's not keep them
around forever.
Here are some ideas:


1. Disable preemption, process a small number of small packets
directly in an atomic context. This should cut latency
down significantly, the tricky part is to only do it
on a light load and disable this
for the streaming case otherwise it's unfair.
This might fail, if it does just bounce things out to
a thread.



I'm not sure what context you meant here. Is this for the TX path of TUN?
A fundamental difference is that my series targets extremely heavy load,
not a light one; 100% CPU for vhost is expected.





2. Switch to unsafe_put_user/unsafe_get_user,
and batch up multiple accesses.



As I said, it won't help unless we can batch accesses to two or three
different places (avail, descriptor and used); batching accesses to a single
place like used won't help. I'm not even sure this can be done considering
the case of packed virtqueue, where we have a single descriptor
ring. Batching through unsafe helpers may not help in this case since
it's equivalent to the safe ones. And this requires non-trivial refactoring
of vhost, and such refactoring may itself have a noticeable impact
(e.g. it may lead to regressions).





3. Allow adding a fixup point manually,
such that multiple independent get_user accesses
can get a single fixup (will allow better compiler
optimizations).



So for metadata access, I don't see how what you suggest here can help in the
case of a heavy workload.


For data access, this may help, but I've played with batching the data copy to
reduce SMAP/spec barriers in vhost-net and I didn't see a performance
improvement.


Thanks







Jason Wang (3):
   vhost: generalize adding used elem
   vhost: fine grain userspace memory accessors
   vhost: access vq metadata through kernel virtual address

  drivers/vhost/vhost.c | 281 ++
  drivers/vhost/vhost.h |  11 ++
  2 files changed, 266 insertions(+), 26 deletions(-)

--
2.17.1


Re: [PATCH V14 0/3] blk-mq: refactor and fix the code of issue directly

2018-12-13 Thread Jens Axboe
On 12/13/18 6:28 PM, Jianchao Wang wrote:
> Hi Jens
> 
> After commit c616cbee ( blk-mq: punt failed direct issue to dispatch
> list ), we always insert request to hctx dispatch list whenever get a
> BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE, this is overkill and will
> harm the merging. We just need to do that for the requests that has
> been through .queue_rq.
> 
> This patch set fixes the issue above and refactors the code of issue
> request directly to unify the interface and make the code clearer and
> more readable.
> 
> Please consider this patchset for 4.21.

I'll throw this through all the testing tomorrow, looks good to me.
If it passes, I'll add it for 4.21.

-- 
Jens Axboe



Re: [PATCH RESEND v7 3/4] clk: meson: add sub MMC clock controller driver

2018-12-13 Thread Jianxin Pan
On 2018/12/13 17:01, Jerome Brunet wrote:
> On Thu, 2018-12-13 at 12:55 +0800, Jianxin Pan wrote:
>> On 2018/12/12 0:59, Jerome Brunet wrote:
>>> On Tue, 2018-12-11 at 00:04 +0800, Jianxin Pan wrote:
 From: Yixun Lan 

>> [...]
  
 +config COMMON_CLK_MMC_MESON
 +  tristate "Meson MMC Sub Clock Controller Driver"
 +  select MFD_SYSCON
 +  select COMMON_CLK_AMLOGIC
 +  select COMMON_CLK_AMLOGIC_AUDIO
>>>
>>> No it is wrong for the mmc to select AUDIO clocks.
>>> If as a result of your patch sclk is needed for things, make the necessary
>>> change in the Makefile.
>> OK, I will add COMMON_CLK_AMLOGIC_SCLKDIV for sclk-div.
> 
> No! There is no reason to create a specific configuration for this.
> please put it under COMMON_CLK_AMLOGIC
OK, I will use COMMON_CLK_AMLOGIC and clkc.h for sclk-div in the next version. 
Thank you.
> 
>> [...]>> +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
 +#include 
[...]




[PATCH v13 0/2] cpufreq: qcom-hw: Add support for QCOM cpufreq HW

2018-12-13 Thread Taniya Das
 [v13]
   * Update Documentation binding to #freq-domain-cells in description.
   * Replace devm_ioremap_resource() to use devm_ioremap() API.

 [v12]
   * Remove per-cpu domain global structure.

 [v11]
   * Updated the code logic as per Stephen.
   * Default boost enabled is removed.
   * Update the clock name to use "alternate" for GPLL0 source in code and
 Documentation binding.
   * Module description updated.
   * perf_base updated to perf_state_reg.

 [v10]
  * Update Documentation binding for cpufreq node.
  * Make the driver 'tristate' instead of 'bool' and update code.
  * Update the clock name to reflect the hardware name.
  * Remove support for varying offset.

 [v9]
  * Update the Documentation binding for freq-domain-cells & cpufreq node.
  * Address comments in Kconfig.arm & Makefile.
  * Remove include file & MODULE_DESCRIPTION not required.
  * Fix the code for 'of_node_put(cpu_np)'.

 [v8]
   * Address comments to update code to take cpufreq_hw phandle and index from
 the CPU nodes.
   * Updated the Documentation for the above change in DT.
   * Updated logic for assigning 'qcom_freq_domain_map' for related CPUs.
   * Clock input to the HW block is taken from DT which has been updated in
 code and Device tree documentation.

 [v7]
   * Updated the logic to check for related CPUs.

 [v6]
   * Renamed match table 'qcom_cpufreq_hw_match'.
   * Renamed 'qcom_read_lut' to 'qcom_cpufreq_hw_read_lut'.
   * Updated the logic to check for related CPUs at the beginning of the
 'qcom_cpu_resources_init'.
   * Use devm_ioremap_resource instead of devm_ioremap.
   * Update the use of of_node_put to handle error conditions.
   * Use policy->cached_resolved_idx in fast switch callback.
   * Keep precalculated offsets 'reg_bases'.
   * XO clock is taken from Device tree.
   * Update documentation binding for clocks/clock-names.
   * Minor comments in Kconfig.arm.
   * Comments to move dev_info to dev_dbg.

 [v5]
   * Remove mapping different register regions of perf/lut/enable,
 instead map the entire HW region.
   * Add reg_offset/cpufreq_qcom_std_offsets to be supplied as device data.
   * Check of src == 0 during lut read.
   * Add of_node_put(cpu_np) in qcom_get_related_cpus
   * Update the qcom_cpu_resources_init for register offset data,
 and cleanup the related cpus to keep a single copy of CPUfreq.
   * Replace FW with HW, update Kconfig, rename filename qcom-cpufreq-hw.c
   * Update the documentation binding to reflect the changes of mapping the
   * entire HW region.

 [v4]
   * Fixed console messages as per comments.
   * Return error from qcom_resources_init()
 in the cases where failed to get frequency domain.
   * Rename cpu_dev to cpu_np in qcom_resources_init,
 qcom_get_related_cpus(). Also use temp variable freq_np in
 qcom_get_related_cpus().
   * Update qcom_cpufreq_fw_get() to use the policy data to incorporate
 the hotplug use case.
   * Update code to use of fast_switching.
   * Check for !c->max_cores instead of cpumask_empty in
 qcom_get_related_cpus().
   * Update the logic of assigning 'c' to qcom_freq_domain_map[cpu].

 [v3]
   * Remove index check from 'qcom_cpufreq_fw_target_index'.
   * Update the Documentation binding to add the platform specific properties in
 the CPU nodes, node name "qcom,freq-domain".
   * Update return value to '0' from -ENODEV from 'qcom_cpufreq_fw_get'.
   * Update the logic for boost frequency to use local variables instead of
 cpufreq driver data in 'qcom_read_lut'.
   * Update the logic in 'qcom_get_related_cpus' to find the related cpus.
   * Update the reg-names to remove "_base" and also update the binding with the
 description of these registers.
   * Update the logic in 'qcom_resources_init' to address the new device tree
 notation of handling the frequency domain phandles.

 [v2]
   * Fixed the alignment issues in "qcom_cpufreq_fw_target_index" for dev_err 
and
 also for "qcom_cpu_resources_init".
   * Removed ret = 0 from qcom_get_related_cpus and added to check for
 cpu_mask_empty to return -ENOENT.
   * Fixes qcom_cpu_resources_init function
   * Remove initialization of 'index'
   * Check for valid 'c'
   * Removed initialization of 'prev_cc' from 'qcom_read_lut'.

Taniya Das (2):
  dt-bindings: cpufreq: Introduce QCOM CPUFREQ Firmware bindings
  cpufreq: qcom-hw: Add support for QCOM cpufreq HW driver

 .../bindings/cpufreq/cpufreq-qcom-hw.txt   | 172 
 drivers/cpufreq/Kconfig.arm|  11 +
 drivers/cpufreq/Makefile   |   1 +
 drivers/cpufreq/qcom-cpufreq-hw.c  | 308 +
 4 files changed, 492 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
 create mode 100644 drivers/cpufreq/qcom-cpufreq-hw.c

--
Qualcomm INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.

[PATCH v13 2/2] cpufreq: qcom-hw: Add support for QCOM cpufreq HW driver

2018-12-13 Thread Taniya Das
The CPUfreq HW present in some QCOM chipsets offloads the steps necessary
for changing the frequency of CPUs. The driver implements the cpufreq
driver interface for this hardware engine.

Signed-off-by: Saravana Kannan 
Signed-off-by: Stephen Boyd 
Signed-off-by: Taniya Das 
---
 drivers/cpufreq/Kconfig.arm   |  11 ++
 drivers/cpufreq/Makefile  |   1 +
 drivers/cpufreq/qcom-cpufreq-hw.c | 308 ++
 3 files changed, 320 insertions(+)
 create mode 100644 drivers/cpufreq/qcom-cpufreq-hw.c

diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index 4e1131e..688f102 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -114,6 +114,17 @@ config ARM_QCOM_CPUFREQ_KRYO

  If in doubt, say N.

+config ARM_QCOM_CPUFREQ_HW
+   tristate "QCOM CPUFreq HW driver"
+   depends on ARCH_QCOM || COMPILE_TEST
+   help
+ Support for the CPUFreq HW driver.
+ Some QCOM chipsets have a HW engine to offload the steps
+ necessary for changing the frequency of the CPUs. Firmware loaded
+ in this engine exposes a programming interface to the OS.
+ The driver implements the cpufreq interface for this HW engine.
+ Say Y if you want to support CPUFreq HW.
+
 config ARM_S3C_CPUFREQ
bool
help
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index d5ee456..08c071b 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -61,6 +61,7 @@ obj-$(CONFIG_MACH_MVEBU_V7)   += mvebu-cpufreq.o
 obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ)+= omap-cpufreq.o
 obj-$(CONFIG_ARM_PXA2xx_CPUFREQ)   += pxa2xx-cpufreq.o
 obj-$(CONFIG_PXA3xx)   += pxa3xx-cpufreq.o
+obj-$(CONFIG_ARM_QCOM_CPUFREQ_HW)  += qcom-cpufreq-hw.o
 obj-$(CONFIG_ARM_QCOM_CPUFREQ_KRYO)+= qcom-cpufreq-kryo.o
 obj-$(CONFIG_ARM_S3C2410_CPUFREQ)  += s3c2410-cpufreq.o
 obj-$(CONFIG_ARM_S3C2412_CPUFREQ)  += s3c2412-cpufreq.o
diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c 
b/drivers/cpufreq/qcom-cpufreq-hw.c
new file mode 100644
index 000..d83939a
--- /dev/null
+++ b/drivers/cpufreq/qcom-cpufreq-hw.c
@@ -0,0 +1,308 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2018, The Linux Foundation. All rights reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define LUT_MAX_ENTRIES			40U
+#define LUT_SRC				GENMASK(31, 30)
+#define LUT_L_VAL  GENMASK(7, 0)
+#define LUT_CORE_COUNT GENMASK(18, 16)
+#define LUT_ROW_SIZE   32
+#define CLK_HW_DIV 2
+
+/* Register offsets */
+#define REG_ENABLE 0x0
+#define REG_LUT_TABLE  0x110
+#define REG_PERF_STATE 0x920
+
+static unsigned long cpu_hw_rate, xo_rate;
+static struct platform_device *global_pdev;
+
+static int qcom_cpufreq_hw_target_index(struct cpufreq_policy *policy,
+   unsigned int index)
+{
+   void __iomem *perf_state_reg = policy->driver_data;
+
+   writel_relaxed(index, perf_state_reg);
+
+   return 0;
+}
+
+static unsigned int qcom_cpufreq_hw_get(unsigned int cpu)
+{
+   void __iomem *perf_state_reg;
+   struct cpufreq_policy *policy;
+   unsigned int index;
+
+   policy = cpufreq_cpu_get_raw(cpu);
+   if (!policy)
+   return 0;
+
+   perf_state_reg = policy->driver_data;
+
+   index = readl_relaxed(perf_state_reg);
+   index = min(index, LUT_MAX_ENTRIES - 1);
+
+   return policy->freq_table[index].frequency;
+}
+
+static unsigned int qcom_cpufreq_hw_fast_switch(struct cpufreq_policy *policy,
+   unsigned int target_freq)
+{
+   void __iomem *perf_state_reg = policy->driver_data;
+   int index;
+
+   index = policy->cached_resolved_idx;
+   if (index < 0)
+   return 0;
+
+   writel_relaxed(index, perf_state_reg);
+
+   return policy->freq_table[index].frequency;
+}
+
+static int qcom_cpufreq_hw_read_lut(struct device *dev,
+   struct cpufreq_policy *policy,
+   void __iomem *base)
+{
+   u32 data, src, lval, i, core_count, prev_cc = 0, prev_freq = 0, freq;
+   unsigned int max_cores = cpumask_weight(policy->cpus);
+   struct cpufreq_frequency_table  *table;
+
+   table = kcalloc(LUT_MAX_ENTRIES + 1, sizeof(*table), GFP_KERNEL);
+   if (!table)
+   return -ENOMEM;
+
+   for (i = 0; i < LUT_MAX_ENTRIES; i++) {
+   data = readl_relaxed(base + REG_LUT_TABLE + i * LUT_ROW_SIZE);
+   src = FIELD_GET(LUT_SRC, data);
+   lval = FIELD_GET(LUT_L_VAL, data);
+   core_count = FIELD_GET(LUT_CORE_COUNT, data);
+
+   if (src)
+   freq = 

Re: [PATCH v12 2/2] cpufreq: qcom-hw: Add support for QCOM cpufreq HW driver

2018-12-13 Thread Taniya Das

Hello Stephen, Viresh,

On 12/13/2018 3:28 PM, Stephen Boyd wrote:

Quoting Taniya Das (2018-12-12 23:49:54)

The CPUfreq HW present in some QCOM chipsets offloads the steps necessary
for changing the frequency of CPUs. The driver implements the cpufreq
driver interface for this hardware engine.

Signed-off-by: Saravana Kannan 
Signed-off-by: Stephen Boyd 
Signed-off-by: Taniya Das 
---


Reviewed-by: Stephen Boyd 

But I noticed that we don't release the I/O region anymore so hotplug
and replug of a whole clk domain fails. I guess devm_ioremap_resource()
was just too much magic so how about we downgrade to devm_ioremap()
instead?

BTW, Viresh, I see a lockdep splat when cpufreq_init returns an error
upon bringing the policy online the second time. I guess cpufreq_stats
aren't able to be freed from there because they take locks in different
order vs. the normal path?

-8<---
diff --git a/drivers/cpufreq/qcom-cpufreq-hw.c 
b/drivers/cpufreq/qcom-cpufreq-hw.c
index fce7a1162e87..0e1105151478 100644
--- a/drivers/cpufreq/qcom-cpufreq-hw.c
+++ b/drivers/cpufreq/qcom-cpufreq-hw.c
@@ -182,9 +182,12 @@ static int qcom_cpufreq_hw_cpu_init(struct cpufreq_policy 
*policy)
index = args.args[0];
  
  	res = platform_get_resource(global_pdev, IORESOURCE_MEM, index);

-   base = devm_ioremap_resource(dev, res);
-   if (IS_ERR(base))
-   return PTR_ERR(base);
+   if (!res)
+   return -ENODEV;
+
+   base = devm_ioremap(dev, res->start, resource_size(res));
+   if (!base)
+   return -ENOMEM;
  


Updated the above in the next series.


/* HW should be in enabled state to proceed */
if (!(readl_relaxed(base + REG_ENABLE) & 0x1)) {



--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.

--


[PATCH v13 1/2] dt-bindings: cpufreq: Introduce QCOM CPUFREQ Firmware bindings

2018-12-13 Thread Taniya Das
Add QCOM cpufreq firmware device bindings for Qualcomm Technology Inc's
SoCs. This is required for managing the cpu frequency transitions which are
controlled by the hardware engine.

Signed-off-by: Taniya Das 
---
 .../bindings/cpufreq/cpufreq-qcom-hw.txt   | 172 +
 1 file changed, 172 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt

diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt 
b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
new file mode 100644
index 000..33856947
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
@@ -0,0 +1,172 @@
+Qualcomm Technologies, Inc. CPUFREQ Bindings
+
+CPUFREQ HW is a hardware engine used by some Qualcomm Technologies, Inc. (QTI)
+SoCs to manage frequency in hardware. It is capable of controlling frequency
+for multiple clusters.
+
+Properties:
+- compatible
+   Usage:  required
+   Value type: <string>
+   Definition: must be "qcom,cpufreq-hw".
+
+- clocks
+   Usage:  required
+   Value type: <phandle> From common clock binding.
+   Definition: clock handle for XO clock and GPLL0 clock.
+
+- clock-names
+   Usage:  required
+   Value type: <string> From common clock binding.
+   Definition: must be "xo", "alternate".
+
+- reg
+   Usage:  required
+   Value type: <prop-encoded-array>
+   Definition: Addresses and sizes for the memory of the HW bases in
+   each frequency domain.
+- reg-names
+   Usage:  Optional
+   Value type: <string>
+   Definition: Frequency domain name i.e.
+   "freq-domain0", "freq-domain1".
+
+- #freq-domain-cells:
+   Usage:  required.
+   Definition: Number of cells in a frequency domain specifier.
+
+* Property qcom,freq-domain
+Devices supporting freq-domain must set their "qcom,freq-domain" property with
+phandle to a cpufreq_hw followed by the Domain ID(0/1) in the CPU DT node.
+
+
+Example:
+
+Example 1: Dual-cluster, Quad-core per cluster. CPUs within a cluster switch
+DCVS state together.
+
+/ {
+   cpus {
+   #address-cells = <2>;
+   #size-cells = <0>;
+
+   CPU0: cpu@0 {
+   device_type = "cpu";
+   compatible = "qcom,kryo385";
+   reg = <0x0 0x0>;
+   enable-method = "psci";
+   next-level-cache = <&L2_0>;
+   qcom,freq-domain = <&cpufreq_hw 0>;
+   L2_0: l2-cache {
+   compatible = "cache";
+   next-level-cache = <&L3_0>;
+   L3_0: l3-cache {
+ compatible = "cache";
+   };
+   };
+   };
+
+   CPU1: cpu@100 {
+   device_type = "cpu";
+   compatible = "qcom,kryo385";
+   reg = <0x0 0x100>;
+   enable-method = "psci";
+   next-level-cache = <&L2_100>;
+   qcom,freq-domain = <&cpufreq_hw 0>;
+   L2_100: l2-cache {
+   compatible = "cache";
+   next-level-cache = <&L3_0>;
+   };
+   };
+
+   CPU2: cpu@200 {
+   device_type = "cpu";
+   compatible = "qcom,kryo385";
+   reg = <0x0 0x200>;
+   enable-method = "psci";
+   next-level-cache = <&L2_200>;
+   qcom,freq-domain = <&cpufreq_hw 0>;
+   L2_200: l2-cache {
+   compatible = "cache";
+   next-level-cache = <&L3_0>;
+   };
+   };
+
+   CPU3: cpu@300 {
+   device_type = "cpu";
+   compatible = "qcom,kryo385";
+   reg = <0x0 0x300>;
+   enable-method = "psci";
+   next-level-cache = <&L2_300>;
+   qcom,freq-domain = <&cpufreq_hw 0>;
+   L2_300: l2-cache {
+   compatible = "cache";
+   next-level-cache = <&L3_0>;
+   };
+   };
+
+   CPU4: cpu@400 {
+   device_type = "cpu";
+   compatible = "qcom,kryo385";
+   reg = <0x0 0x400>;
+   enable-method = "psci";
+   next-level-cache = <&L2_400>;
+   qcom,freq-domain = <&cpufreq_hw 1>;
+   L2_400: l2-cache {
+   compatible = "cache";
+   next-level-cache = 

Re: [PATCH v12 1/2] dt-bindings: cpufreq: Introduce QCOM CPUFREQ Firmware bindings

2018-12-13 Thread Taniya Das

Hello Stephen,

On 12/13/2018 1:58 PM, Stephen Boyd wrote:

Quoting Taniya Das (2018-12-12 23:49:53)

diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt 
b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
new file mode 100644
index 000..2b82965
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
@@ -0,0 +1,172 @@

[...]

+- reg-names
+   Usage:  Optional
+   Value type: 
+   Definition: Frequency domain name i.e.
+   "freq-domain0", "freq-domain1".
+
+- freq-domain-cells:


Should be #freq-domain-cells? Or maybe #qcom,freq-domain-cells?



Updated in the next series.


+   Usage:  required.
+   Definition: Number of cells in a freqency domain specifier.
+
+* Property qcom,freq-domain
+Devices supporting freq-domain must set their "qcom,freq-domain" property with
+phandle to a cpufreq_hw followed by the Domain ID(0/1) in the CPU DT node.
+
+


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation.

--


[PATCH v2] arm64: invalidate TLB just before turning MMU on

2018-12-13 Thread Qian Cai
On this HPE Apollo 70 arm64 server with 256 CPUs, triggering a crash
dump just hung. It has 4 threads on each core. Each pair of cores shares the
same L1 and L2 caches, so 8 CPUs share those. All CPUs share the same
L3 cache.

It turned out that this was because the TLB contained stale entries (or
uninitialized junk which just happened to look valid) before the MMU was
turned on in the second kernel, which caused this instruction to hang,

msr sctlr_el1, x0

Although there is a local TLB flush in the second kernel in
__cpu_setup(), it is called too early. By the time the MMU is turned on
later, the TLB is dirty again for some reason.

I also tried to move the local TLB flush around a bit inside
__cpu_setup(). Although kdump then completed some of the time, it fairly
often triggered a "Synchronous Exception" in EFI after a cold reboot, with
seemingly no way to recover remotely without reinstalling the OS. For
example, in those places,

ENTRY(__cpu_setup)
+   isb
tlbi vmalle1
dsb nsh

or

mov x0, #3 << 20
msr cpacr_el1, x0
+   tlbi vmalle1
+   dsb nsh

Since it is only necessary to flush the local TLB right before turning the
MMU on, just re-arrange that part a bit, like the one in __primary_switch()
within the CONFIG_RANDOMIZE_BASE path, so it does not depend on other
instructions in between that could pollute the TLB, and it no longer
triggers the "Synchronous Exception" either.

Signed-off-by: Qian Cai 
---

v2: merge the similar part from __cpu_setup() pointed out by James.

 arch/arm64/kernel/head.S | 4 
 arch/arm64/mm/proc.S | 3 ---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 4471f570a295..7f555dd4577e 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -771,6 +771,10 @@ ENTRY(__enable_mmu)
msr ttbr0_el1, x2   // load TTBR0
msr ttbr1_el1, x1   // load TTBR1
isb
+
+   tlbi vmalle1 // invalidate TLB
+   dsb nsh
+
msr sctlr_el1, x0
isb
/*
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 2c75b0b903ae..14f68afdd57f 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -406,9 +406,6 @@ ENDPROC(idmap_kpti_install_ng_mappings)
  */
.pushsection ".idmap.text", "awx"
 ENTRY(__cpu_setup)
-   tlbi vmalle1 // Invalidate local TLB
-   dsb nsh
-
mov x0, #3 << 20
msr cpacr_el1, x0   // Enable FP/ASIMD
mov x0, #1 << 12// Reset mdscr_el1 and disable
-- 
2.17.2 (Apple Git-113)



[PATCHv2] x86/kdump: bugfix, make the behavior of crashkernel=X consistent with kaslr

2018-12-13 Thread Pingfan Liu
A customer reported a bug on a high-end server with many PCIe devices, where
the kernel boots with crashkernel=384M and KASLR enabled. Even
though we still see plenty of memory under 896 MB, finding a region still
failed intermittently, because currently we can only find a region under
896 MB if ',high' is not specified. KASLR then breaks the area under 896 MB
into several parts randomly, and the crashkernel reservation needs to be
aligned to 128 MB; that's why the failure shows up. It confuses the end user
that crashkernel=X sometimes works and sometimes fails.
If they want to make it succeed, the customer can change the kernel option to
"crashkernel=384M,high". But that gives "crashkernel=xx@yy" a very
limited space to behave in, even though its grammar looks more generic.
And we can't answer the customer's questions with much confidence:
1) why the reservation under 896 MB doesn't succeed;
2) what's wrong with the memory region under 4G;
3) why I have to add ',high' when I only require 384 MB, not 3840 MB.

This patch simplifies the method suggested in the mail [1]. It just goes
bottom-up to find a candidate region for crashkernel. Bottom-up may be
more compatible with the old reservation style, i.e. we still try to get a
memory region under 896 MB first, then [896 MB, 4G], and finally above 4G.

There is one trivial thing about compatibility with old kexec-tools:
if the reserved region is above 896M, then the old tool will fail to load
the bzImage. But without this patch the old tool also fails, since there is
no memory below 896M that can be reserved for crashkernel.

[1]: http://lists.infradead.org/pipermail/kexec/2017-October/019571.html
Signed-off-by: Pingfan Liu 
Cc: Dave Young 
Cc: Andrew Morton 
Cc: Baoquan He 
Cc: ying...@kernel.org,
Cc: vgo...@redhat.com
Cc: ke...@lists.infradead.org

---
v1->v2:
  improve commit log
 arch/x86/kernel/setup.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d494b9b..60f12c4 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -541,15 +541,18 @@ static void __init reserve_crashkernel(void)
 
/* 0 means: find the address automatically */
if (crash_base <= 0) {
+   if (!memblock_bottom_up())
+   memblock_set_bottom_up(true);
/*
 * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
 * as old kexec-tools loads bzImage below that, unless
 * "crashkernel=size[KMG],high" is specified.
 */
crash_base = memblock_find_in_range(CRASH_ALIGN,
-   high ? CRASH_ADDR_HIGH_MAX
-: CRASH_ADDR_LOW_MAX,
-   crash_size, CRASH_ALIGN);
+   (max_pfn * PAGE_SIZE), crash_size, CRASH_ALIGN);
+   if (!memblock_bottom_up())
+   memblock_set_bottom_up(false);
+
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
return;
-- 
2.7.4



Re: linux-next: manual merge of the block tree with the scsi-fixes tree

2018-12-13 Thread Jens Axboe
On 12/13/18 7:23 PM, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the block tree got a conflict in:
> 
>   drivers/scsi/sd.c
> 
> between commit:
> 
>   61cce6f6eece ("scsi: sd: use mempool for discard special page")
> 
> from the scsi-fixes tree and commit:
> 
>   159b2cbf59f4 ("scsi: return blk_status_t from scsi_init_io and 
> ->init_command")
> 
> from the block tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Martin, I can carry this one to avoid this conflict. Let me know!

-- 
Jens Axboe



[PATCH] arm64: replace arm64-obj-* in Makefile with obj-*

2018-12-13 Thread Masahiro Yamada
Use the standard obj-$(CONFIG_...) syntax. The behavior is still the
same.

Signed-off-by: Masahiro Yamada 
---

 arch/arm64/kernel/Makefile | 59 +++---
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 583334c..b4e4f7e 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -12,7 +12,7 @@ CFLAGS_REMOVE_insn.o = -pg
 CFLAGS_REMOVE_return_address.o = -pg
 
 # Object file lists.
-arm64-obj-y:= debug-monitors.o entry.o irq.o fpsimd.o  
\
+obj-y  := debug-monitors.o entry.o irq.o fpsimd.o  
\
   entry-fpsimd.o process.o ptrace.o setup.o signal.o   
\
   sys.o stacktrace.o time.o traps.o io.o vdso.o
\
   hyp-stub.o psci.o cpu_ops.o insn.o   \
@@ -27,40 +27,39 @@ OBJCOPYFLAGS := --prefix-symbols=__efistub_
 $(obj)/%.stub.o: $(obj)/%.o FORCE
$(call if_changed,objcopy)
 
-arm64-obj-$(CONFIG_COMPAT) += sys32.o kuser32.o signal32.o 
\
+obj-$(CONFIG_COMPAT)   += sys32.o kuser32.o signal32.o 
\
   sys_compat.o
-arm64-obj-$(CONFIG_FUNCTION_TRACER)+= ftrace.o entry-ftrace.o
-arm64-obj-$(CONFIG_MODULES)+= module.o
-arm64-obj-$(CONFIG_ARM64_MODULE_PLTS)  += module-plts.o
-arm64-obj-$(CONFIG_PERF_EVENTS)+= perf_regs.o perf_callchain.o
-arm64-obj-$(CONFIG_HW_PERF_EVENTS) += perf_event.o
-arm64-obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
-arm64-obj-$(CONFIG_CPU_PM) += sleep.o suspend.o
-arm64-obj-$(CONFIG_CPU_IDLE)   += cpuidle.o
-arm64-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
-arm64-obj-$(CONFIG_KGDB)   += kgdb.o
-arm64-obj-$(CONFIG_EFI)+= efi.o efi-entry.stub.o   
\
+obj-$(CONFIG_FUNCTION_TRACER)  += ftrace.o entry-ftrace.o
+obj-$(CONFIG_MODULES)  += module.o
+obj-$(CONFIG_ARM64_MODULE_PLTS)+= module-plts.o
+obj-$(CONFIG_PERF_EVENTS)  += perf_regs.o perf_callchain.o
+obj-$(CONFIG_HW_PERF_EVENTS)   += perf_event.o
+obj-$(CONFIG_HAVE_HW_BREAKPOINT)   += hw_breakpoint.o
+obj-$(CONFIG_CPU_PM)   += sleep.o suspend.o
+obj-$(CONFIG_CPU_IDLE) += cpuidle.o
+obj-$(CONFIG_JUMP_LABEL)   += jump_label.o
+obj-$(CONFIG_KGDB) += kgdb.o
+obj-$(CONFIG_EFI)  += efi.o efi-entry.stub.o   
\
   efi-rt-wrapper.o
-arm64-obj-$(CONFIG_PCI)+= pci.o
-arm64-obj-$(CONFIG_ARMV8_DEPRECATED)   += armv8_deprecated.o
-arm64-obj-$(CONFIG_ACPI)   += acpi.o
-arm64-obj-$(CONFIG_ACPI_NUMA)  += acpi_numa.o
-arm64-obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL)+= 
acpi_parking_protocol.o
-arm64-obj-$(CONFIG_PARAVIRT)   += paravirt.o
-arm64-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
-arm64-obj-$(CONFIG_HIBERNATION)+= hibernate.o hibernate-asm.o
-arm64-obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o
\
+obj-$(CONFIG_PCI)  += pci.o
+obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o
+obj-$(CONFIG_ACPI) += acpi.o
+obj-$(CONFIG_ACPI_NUMA)+= acpi_numa.o
+obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o
+obj-$(CONFIG_RANDOMIZE_BASE)   += kaslr.o
+obj-$(CONFIG_HIBERNATION)  += hibernate.o hibernate-asm.o
+obj-$(CONFIG_KEXEC_CORE)   += machine_kexec.o relocate_kernel.o
\
   cpu-reset.o
-arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o
-arm64-obj-$(CONFIG_ARM64_RELOC_TEST)   += arm64-reloc-test.o
+obj-$(CONFIG_KEXEC_FILE)   += machine_kexec_file.o kexec_image.o
+obj-$(CONFIG_ARM64_RELOC_TEST) += arm64-reloc-test.o
 arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
-arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
-arm64-obj-$(CONFIG_CRASH_CORE) += crash_core.o
-arm64-obj-$(CONFIG_ARM_SDE_INTERFACE)  += sdei.o
-arm64-obj-$(CONFIG_ARM64_SSBD) += ssbd.o
+obj-$(CONFIG_CRASH_DUMP)   += crash_dump.o
+obj-$(CONFIG_CRASH_CORE)   += crash_core.o
+obj-$(CONFIG_ARM_SDE_INTERFACE)+= sdei.o
+obj-$(CONFIG_ARM64_SSBD)   += ssbd.o
 
-obj-y  += $(arm64-obj-y) vdso/ probes/
-obj-m  += $(arm64-obj-m)
+obj-y  += vdso/ probes/
 head-y := head.o
 extra-y+= $(head-y) vmlinux.lds
 
-- 
2.7.4


