[PATCH] checkpatch: don't check c99 types like uint8_t under tools
Tools contains user space code so uintX_t types are just fine.

Signed-off-by: Tomas Winkler
---
 scripts/checkpatch.pl | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a8368d1c4348..42c3221be6eb 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -5548,8 +5548,9 @@ sub process {
 			     "Using weak declarations can have unintended link defects\n" . $herecurr);
 		}
 
-# check for c99 types like uint8_t used outside of uapi/
+# check for c99 types like uint8_t used outside of uapi/ and tools/
 		if ($realfile !~ m@\binclude/uapi/@ &&
+		    $realfile !~ m@\btools/@ &&
 		    $line =~ /\b($Declare)\s*$Ident\s*[=;,\[]/) {
 			my $type = $1;
 			if ($type =~ /\b($typeC99Typedefs)\b/) {
--
2.7.4
Re: [PATCH] mtd: nand: nandsim: fix error check
On Tuesday 15 November 2016 11:42 PM, Marek Vasut wrote:
> On 11/16/2016 12:09 AM, Sudip Mukherjee wrote:
>> debugfs_create_dir() and debugfs_create_file() return NULL on error or a
>> pointer on success. They do not return the error value with ERR_PTR, so
>> we should not check the return with IS_ERR_OR_NULL; instead we should
>> just check for NULL.
>>
>> Signed-off-by: Sudip Mukherjee
>> ---
>>  drivers/mtd/nand/nandsim.c | 9 +++------
>>  1 file changed, 3 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
>> index c76287a..9b0d79a 100644
>> --- a/drivers/mtd/nand/nandsim.c
>> +++ b/drivers/mtd/nand/nandsim.c
>> @@ -525,15 +525,13 @@ static int nandsim_debugfs_create(struct nandsim *dev)
>>  {
>>  	struct nandsim_debug_info *dbg = &dev->dbg;
>>  	struct dentry *dent;
>> -	int err;
>> +	int err = -ENODEV;
>
> Why don't you just nuke the err altogether and just return -ENODEV ?

That was the first version which I made and discarded before sending. I will
go and find it now.

Regards
Sudip
Re: [PATCHSET 0/7] perf sched: Introduce timehist command, again (v2)
* Namhyung Kim wrote:

> Hello,
>
> This patchset is a rebased version of David's sched timehist work [1].
> I plan to improve the perf sched command more and think that having the
> timehist command before that work looks good. It seems David is busy
> these days, so I'm retrying it by myself.
>
> * changes in v2:
>   - change name 'b/n time' to 'wait time' (Ingo)
>   - show arrow between functions in the callchain (Ingo)
>   - fix a bug in calculating initial run time
>
> This implements only the basic feature and a few options. I just split
> the patch to make it easier to review and did some cosmetic changes.
> More patches will come later.
>
> The below is from David's original description (w/ slight change):
>
> 8<----
> 'perf sched timehist' provides an analysis of scheduling events.
>
> Example usage:
>     perf sched record -- sleep 1
>     perf sched timehist
>
> By default it shows the individual schedule events, including the time
> between sched-in events for the task, the task scheduling delay (time
> between wakeup and actually running) and run time for the task:
>
>             time  cpu  task name[tid/pid]  wait time  sch delay  run time
>     ------------  ---  ------------------  ---------  ---------  --------
>     79371.874569  [11] gcc[31949]              0.014      0.000     1.148
>     79371.874591  [10] gcc[31951]              0.000      0.000     0.024
>     79371.874603  [10] migration/10[59]        3.350      0.004     0.011
>     79371.874604  [11] <idle>                  1.148      0.000     0.035
>     79371.874723  [05] <idle>                  0.016      0.000     1.383
>     79371.874746  [05] gcc[31949]              0.153      0.078     0.022
>     ...
>
> Times are in msec.usec.
>
> If callchains were recorded they are appended to the line with a default
> stack depth of 5:
>
>     79371.874569  [11] gcc[31949]              0.014      0.000     1.148
>                        wait_for_completion_killable <- do_fork <- sys_vfork <- stub_vfork <- __vfork
>     79371.874591  [10] gcc[31951]              0.000      0.000     0.024
>                        __cond_resched <- _cond_resched <- wait_for_completion <- stop_one_cpu <- sched_exec
>     79371.874603  [10] migration/10[59]        3.350      0.004     0.011
>                        smpboot_thread_fn <- kthread <- ret_from_fork
>     79371.874604  [11] <idle>                  1.148      0.000     0.035
>                        cpu_startup_entry <- start_secondary
>     79371.874723  [05] <idle>                  0.016      0.000     1.383
>                        cpu_startup_entry <- start_secondary
>     79371.874746  [05] gcc[31949]              0.153      0.078     0.022
>                        do_wait <- sys_wait4 <- system_call_fastpath <- __GI___waitpid
>
> --no-call-graph can be used to not show the callchains. --max-stack is
> used to control the number of frames shown (default of 5). -x/--excl
> options can be used to collapse redundant callchains to get more relevant
> data on screen.
>
> Similar to perf-trace, -s and -S can be used to dump a statistical summary
> without or with events (respectively). Statistics include min run time,
> average run time and max run time. Stats are also shown for run time by
> cpu.
>
> The cpu-visual option provides a visual aid for sched switches by cpu:
> ...
>     79371.874569  [11]        s   gcc[31949]              0.014      0.000     1.148
>     79371.874591  [10]       s    gcc[31951]              0.000      0.000     0.024
>     79371.874603  [10]       s    migration/10[59]        3.350      0.004     0.011
>     79371.874604  [11]        i   <idle>                  1.148      0.000     0.035
>     79371.874723  [05]    i       <idle>                  0.016      0.000     1.383
>     79371.874746  [05]    s       gcc[31949]              0.153      0.078     0.022
> ...

Looks great to me!

Acked-by: Ingo Molnar

Thanks,

	Ingo
[PATCH] b: re-queue tx dma request on herror
Sometimes a dma transfer to a usb endpoint fails:

[ 78.378283] musb-hdrc musb-hdrc.1.auto: Start TX10 dma
[ 78.410763] musb-hdrc musb-hdrc.1.auto: OUT/TX10 end, csr 3400, dma
[ 78.410896] musb-hdrc musb-hdrc.1.auto: complete dc01eb00 usb_api_blocking_completion+0x0/0x24 [usbcore] (0), dev4 ep1out, 10/10
[ 78.411181] musb-hdrc musb-hdrc.1.auto: qh dc01ed00 periodic slot 10
[ 78.411205] musb-hdrc musb-hdrc.1.auto: qh dc01ed00 urb dc01eb00 dev4 ep1out-intr, hw_ep 10, dd624d00/10
[ 78.411223] musb-hdrc musb-hdrc.1.auto: --> hw10 urb dc01eb00 spd2 dev4 ep1out h_addr83 h_port02 bytes 10
[ 78.411244] musb-hdrc musb-hdrc.1.auto: check whether there's still time for periodic Tx
[ 78.411256] musb-hdrc musb-hdrc.1.auto: Start TX10 dma
[ 78.443762] musb-hdrc musb-hdrc.1.auto: OUT/TX10 end, csr 3500, dma

successful transmission:

[ 78.443889] musb-hdrc musb-hdrc.1.auto: complete dc01eb00 usb_api_blocking_completion+0x0/0x24 [usbcore] (0), dev4 ep1out, 10/10
[ 78.444170] musb-hdrc musb-hdrc.1.auto: qh dc01ed00 periodic slot 10
[ 78.444195] musb-hdrc musb-hdrc.1.auto: qh dc01ed00 urb dc01eb00 dev4 ep1out-intr, hw_ep 10, dd624d00/10
[ 78.444213] musb-hdrc musb-hdrc.1.auto: --> hw10 urb dc01eb00 spd2 dev4 ep1out h_addr83 h_port02 bytes 10
[ 78.444232] musb-hdrc musb-hdrc.1.auto: check whether there's still time for periodic Tx
[ 78.444245] musb-hdrc musb-hdrc.1.auto: Start TX10 dma
[ 78.540761] musb-hdrc musb-hdrc.1.auto: OUT/TX10 end, csr 3504, dma

failed transmission:

[ 78.540780] musb-hdrc musb-hdrc.1.auto: TX 3strikes on ep=10 set ETIMEDOUT
[ 78.540897] musb-hdrc musb-hdrc.1.auto: complete dc01eb00 usb_api_blocking_completion+0x0/0x24 [usbcore] (-110), dev4 ep1out, 10/10
[ 78.540945] musb-hdrc musb-hdrc.1.auto: extra TX10 ready, csr 2500
[ 78.540989] usb 2-1.1.2: urb wait ok but retval -110

The way to reproduce this is writes to the /dev/hidraw0 device, which end up
with an early unexpected timeout and errno set to -110.
The code sets the timeout to 5 seconds:

vfs_write()->hidraw_write()->hidraw_send_report()->usbhid_output_report()

	ret = usb_interrupt_msg(dev, usbhid->urbout->pipe, buf, count,
				&actual_length, USB_CTRL_SET_TIMEOUT);

where USB_CTRL_SET_TIMEOUT is set to 5 seconds. It then goes to
usb_start_wait_urb(), which waits for completion:

	retval = usb_submit_urb(urb, GFP_NOIO);
	expire = timeout ? msecs_to_jiffies(timeout) : MAX_SCHEDULE_TIMEOUT;
	if (!wait_for_completion_timeout(&ctx.done, expire)) {

So the user space application expects that the write will be done within 5
seconds or an error will happen. But the musb driver exits this logic on the
first dma error without trying to retransmit the current urb.

This patch adds the current request to the end of the list, destroys the
current dma transfer and renews transmission. That way this urb is
transmitted in the next cycle instead of failing with an error before the
timeout.

Signed-off-by: Max Uvarov
---
 drivers/usb/musb/musb_host.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/musb/musb_host.c b/drivers/usb/musb/musb_host.c
index 53bc4ce..e44ae95 100644
--- a/drivers/usb/musb/musb_host.c
+++ b/drivers/usb/musb/musb_host.c
@@ -1293,11 +1293,11 @@ void musb_host_tx(struct musb *musb, u8 epnum)
 		status = -EPIPE;
 
 	} else if (tx_csr & MUSB_TXCSR_H_ERROR) {
-		/* (NON-ISO) dma was disabled, fifo flushed */
 		musb_dbg(musb, "TX 3strikes on ep=%d", epnum);
-
-		status = -ETIMEDOUT;
-
+		if (dma)
+			musb_bulk_nak_timeout(musb, hw_ep, 0);
+		else
+			status = -ETIMEDOUT;
 	} else if (tx_csr & MUSB_TXCSR_H_NAKTIMEOUT) {
 		if (USB_ENDPOINT_XFER_BULK == qh->type && qh->mux == 1
 				&& !list_is_singular(&musb->out_bulk)) {
--
1.9.1
[PATCH] mm, rmap: handle anon_vma_prepare() common case inline
The anon_vma_prepare() function is mostly a large "if (unlikely(...))" block,
as the expected common case is that an anon_vma already exists. We could turn
the condition around and return 0, but it also makes sense to do it inline
and avoid a call for the common case.

Bloat-o-meter naturally shows that inlining the check has some code size
costs:

add/remove: 1/1 grow/shrink: 4/0 up/down: 475/-373 (102)
function                old     new   delta
__anon_vma_prepare        -     359    +359
handle_mm_fault        2744    2796     +52
hugetlb_cow            1146    1170     +24
hugetlb_fault          2123    2145     +22
wp_page_copy           1469    1487     +18
anon_vma_prepare        373       -    -373

Checking the asm however confirms that the hot paths now avoid a call, which
is now moved away.

Signed-off-by: Vlastimil Babka
Cc: "Kirill A. Shutemov"
Cc: Johannes Weiner
Cc: Konstantin Khlebnikov
Cc: Rik van Riel
---
 include/linux/rmap.h | 10 +++++++++-
 mm/rmap.c            | 73 ++++++++++++++++++++++++++++++++++++-------------------------------------
 2 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b46bb5620a76..850da50c574e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -137,11 +137,19 @@ static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
  * anon_vma helper functions.
  */
 void anon_vma_init(void);	/* create anon_vma_cachep */
-int  anon_vma_prepare(struct vm_area_struct *);
+int  __anon_vma_prepare(struct vm_area_struct *);
 void unlink_anon_vmas(struct vm_area_struct *);
 int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
 int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 
+static inline int anon_vma_prepare(struct vm_area_struct *vma)
+{
+	if (likely(vma->anon_vma))
+		return 0;
+
+	return __anon_vma_prepare(vma);
+}
+
 static inline void anon_vma_merge(struct vm_area_struct *vma,
 				  struct vm_area_struct *next)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index 1ef36404e7b2..91619fd70939 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -141,14 +141,15 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
 }
 
 /**
- * anon_vma_prepare - attach an anon_vma to a memory region
+ * __anon_vma_prepare - attach an anon_vma to a memory region
  * @vma: the memory region in question
  *
  * This makes sure the memory mapping described by 'vma' has
  * an 'anon_vma' attached to it, so that we can associate the
  * anonymous pages mapped into it with that anon_vma.
  *
- * The common case will be that we already have one, but if
+ * The common case will be that we already have one, which
+ * is handled inline by anon_vma_prepare(). But if
  * not we either need to find an adjacent mapping that we
  * can re-use the anon_vma from (very common when the only
  * reason for splitting a vma has been mprotect()), or we
@@ -167,48 +168,46 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
  *
  * This must be called with the mmap_sem held for reading.
  */
-int anon_vma_prepare(struct vm_area_struct *vma)
+int __anon_vma_prepare(struct vm_area_struct *vma)
 {
-	struct anon_vma *anon_vma = vma->anon_vma;
+	struct mm_struct *mm = vma->vm_mm;
+	struct anon_vma *anon_vma, *allocated;
 	struct anon_vma_chain *avc;
 
 	might_sleep();
-	if (unlikely(!anon_vma)) {
-		struct mm_struct *mm = vma->vm_mm;
-		struct anon_vma *allocated;
 
-		avc = anon_vma_chain_alloc(GFP_KERNEL);
-		if (!avc)
-			goto out_enomem;
+	avc = anon_vma_chain_alloc(GFP_KERNEL);
+	if (!avc)
+		goto out_enomem;
 
-		anon_vma = find_mergeable_anon_vma(vma);
-		allocated = NULL;
-		if (!anon_vma) {
-			anon_vma = anon_vma_alloc();
-			if (unlikely(!anon_vma))
-				goto out_enomem_free_avc;
-			allocated = anon_vma;
-		}
-
-		anon_vma_lock_write(anon_vma);
-		/* page_table_lock to protect against threads */
-		spin_lock(&mm->page_table_lock);
-		if (likely(!vma->anon_vma)) {
-			vma->anon_vma = anon_vma;
-			anon_vma_chain_link(vma, avc, anon_vma);
-			/* vma reference or self-parent link for new root */
-			anon_vma->degree++;
-			allocated = NULL;
-			avc = NULL;
-		}
-		spin_unlock(&mm->page_table_lock);
-		anon_vma_unlock_write(anon_vma);
-
-		if (unlikely(allocated))
-			put_anon_vma(allocated);
-
Re: [PATCH net-next 2/5] net: ethoc: Implement ethtool::nway_reset
On 2016-11-15 at 20:19:46 +0100, Florian Fainelli wrote:
> Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are
> already using dev->phydev all over the place so this comes for free.
>
> Signed-off-by: Florian Fainelli

Reviewed-by: Tobias Klauser
i2c: undefined option I2C_ALGO_BUSCLEAR
Hi Shardar,

your commit c3ca951fe41a ("i2c: Add Tegra BPMP I2C proxy driver") popped up
in today's linux-next tree, adding Kconfig option I2C_TEGRA_BPMP, which
further selects I2C_ALGO_BUSCLEAR, which is nowhere defined in Kconfig.

Is there a patch queued somewhere to add I2C_ALGO_BUSCLEAR to Kconfig? I
could not find anything on the lkml; only some older repositories on github,
where the option is defined in drivers/i2c/busses/Kconfig.

Best regards,
Valentin
Re: [PATCH] kasan: support use-after-scope detection
On Wed, Nov 16, 2016 at 12:40 AM, Andrew Morton wrote:
> On Tue, 15 Nov 2016 17:07:25 +0100 Dmitry Vyukov wrote:
>
>> Gcc revision 241896 implements use-after-scope detection.
>> Will be available in gcc 7. Support it in KASAN.
>>
>> Gcc emits 2 new callbacks to poison/unpoison large stack
>> objects when they go in/out of scope.
>> Implement the callbacks and add a test.
>>
>> ...
>>
>> --- a/lib/test_kasan.c
>> +++ b/lib/test_kasan.c
>> @@ -411,6 +411,29 @@ static noinline void __init copy_user_test(void)
>>  	kfree(kmem);
>>  }
>>
>> +static noinline void __init use_after_scope_test(void)
>
> This reader has no idea why this code uses noinline, and I expect
> others will have the same issue.
>
> Can we please get a code comment in there to reveal the reason?

Mailed v2 with a comment re noinline. Taking the opportunity I also fixed a
typo in the new comment:

-	/* Emitted by compiler to unpoison large objects when they go into of scope. */
+	/* Emitted by compiler to unpoison large objects when they go into scope. */
Re: [PATCH] vt: fix Scroll Lock LED trigger name
On Wed, Nov 16, 2016 at 12:55:57AM +0100, Maciej S. Szmigiero wrote: > There is a disagreement between drivers/tty/vt/keyboard.c and > drivers/input/input-leds.c with regard to what is a Scroll Lock LED > trigger name: input calls it "kbd-scrolllock", but vt calls it > "kbd-scrollock" (two l's). > This prevents Scroll Lock LED trigger from binding to this LED by default. > > Since it is a scroLL Lock LED, this interface was introduced only about a > year ago and in an Internet search people seem to reference this trigger > only to set it to this LED let's simply rename it to "kbd-scrolllock". > > Also, it looks like this was supposed to be changed before this code was > merged: https://lkml.org/lkml/2015/6/9/697 but it was done only on > the input side. > > Signed-off-by: Maciej S. Szmigiero > --- > drivers/tty/vt/keyboard.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/tty/vt/keyboard.c b/drivers/tty/vt/keyboard.c > index d5d81d4d3c04..3dd6a491cdba 100644 > --- a/drivers/tty/vt/keyboard.c > +++ b/drivers/tty/vt/keyboard.c > @@ -982,7 +982,7 @@ static void kbd_led_trigger_activate(struct led_classdev > *cdev) > KBD_LED_TRIGGER((_led_bit) + 8, _name) > > static struct kbd_led_trigger kbd_led_triggers[] = { > - KBD_LED_TRIGGER(VC_SCROLLOCK, "kbd-scrollock"), > + KBD_LED_TRIGGER(VC_SCROLLOCK, "kbd-scrolllock"), > KBD_LED_TRIGGER(VC_NUMLOCK, "kbd-numlock"), > KBD_LED_TRIGGER(VC_CAPSLOCK, "kbd-capslock"), > KBD_LED_TRIGGER(VC_KANALOCK, "kbd-kanalock"), So, how far back should this change be backported to? thanks, greg k-h
Re: [PATCH] mm: don't cap request size based on read-ahead setting
On Wednesday, November 16, 2016 12:31 PM Jens Axboe wrote:
> @@ -369,10 +369,25 @@ ondemand_readahead(struct address_space *mapping,
> 		   bool hit_readahead_marker, pgoff_t offset,
> 		   unsigned long req_size)
> {
> -	unsigned long max = ra->ra_pages;
> +	unsigned long io_pages, max_pages;
> 	pgoff_t prev_offset;
> 
> 	/*
> +	 * If bdi->io_pages is set, that indicates the (soft) max IO size
> +	 * per command for that device. If we have that available, use
> +	 * that as the max suitable read-ahead size for this IO. Instead of
> +	 * capping read-ahead at ra_pages if req_size is larger, we can go
> +	 * up to io_pages. If io_pages isn't set, fall back to using
> +	 * ra_pages as a safe max.
> +	 */
> +	io_pages = inode_to_bdi(mapping->host)->io_pages;
> +	if (io_pages) {
> +		max_pages = max_t(unsigned long, ra->ra_pages, req_size);
> +		io_pages = min(io_pages, max_pages);

Doubt if you mean

	max_pages = min(io_pages, max_pages);

> +	} else
> +		max_pages = ra->ra_pages;
> +
[PATCH v2] kasan: support use-after-scope detection
Gcc revision 241896 implements use-after-scope detection. Will be available in gcc 7. Support it in KASAN. Gcc emits 2 new callbacks to poison/unpoison large stack objects when they go in/out of scope. Implement the callbacks and add a test. Signed-off-by: Dmitry Vyukov Cc: aryabi...@virtuozzo.com Cc: gli...@google.com Cc: a...@linux-foundation.org Cc: kasan-...@googlegroups.com Cc: linux...@kvack.org Cc: linux-kernel@vger.kernel.org --- Changes since v1: - added comment to test_kasan.c re noinline - fixed a typo in comment: s/go into of scope/go into scope/ FTR here are reports from the test with gcc 7: kasan test: use_after_scope_test use-after-scope on int == BUG: KASAN: use-after-scope in use_after_scope_test+0xe0/0x25b [test_kasan] at addr 8800359b72b0 Write of size 1 by task insmod/6644 page:ead66dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x1fffc00() page dumped because: kasan: bad access detected CPU: 2 PID: 6644 Comm: insmod Tainted: GB 4.9.0-rc5+ #39 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 8800359b71f0 834c2999 0002 110006b36dd1 ed0006b36dc9 41b58ab3 89575430 834c26ab 0001 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51 [< inline >] print_address_description mm/kasan/report.c:207 [< inline >] kasan_report_error mm/kasan/report.c:286 [] kasan_report+0x490/0x4c0 mm/kasan/report.c:306 [] __asan_report_store1_noabort+0x1c/0x20 mm/kasan/report.c:334 [] use_after_scope_test+0xe0/0x25b [test_kasan] lib/test_kasan.c:424 [] kmalloc_tests_init+0x72/0x79 [test_kasan] [] do_one_initcall+0xfb/0x3f0 init/main.c:778 [] do_init_module+0x219/0x59c kernel/module.c:3386 [] load_module+0x5918/0x8c40 kernel/module.c:3706 [] SYSC_init_module+0x3f9/0x470 kernel/module.c:3776 [] SyS_init_module+0xe/0x10 kernel/module.c:3759 [] entry_SYSCALL_64_fastpath+0x23/0xc6 arch/x86/entry/entry_64.S:209 Memory state around the buggy address: 8800359b7180: 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 8800359b7200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >8800359b7280: 00 00 f1 f1 f1 f1 f8 f2 f2 f2 f2 f2 f2 f2 00 f2 ^ 8800359b7300: f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 8800359b7380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 == == BUG: KASAN: use-after-scope in use_after_scope_test+0x118/0x25b [test_kasan] at addr 8800359b72b3 Write of size 1 by task insmod/6644 page:ead66dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x1fffc00() page dumped because: kasan: bad access detected CPU: 2 PID: 6644 Comm: insmod Tainted: GB 4.9.0-rc5+ #39 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 8800359b71f0 834c2999 0002 110006b36dd1 ed0006b36dc9 41b58ab3 89575430 834c26ab 0001 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51 [< inline >] print_address_description mm/kasan/report.c:207 [< inline >] kasan_report_error mm/kasan/report.c:286 [] kasan_report+0x490/0x4c0 mm/kasan/report.c:306 [] __asan_report_store1_noabort+0x1c/0x20 mm/kasan/report.c:334 [] use_after_scope_test+0x118/0x25b [test_kasan] lib/test_kasan.c:425 [] kmalloc_tests_init+0x72/0x79 [test_kasan] [] do_one_initcall+0xfb/0x3f0 init/main.c:778 [] do_init_module+0x219/0x59c kernel/module.c:3386 [] load_module+0x5918/0x8c40 kernel/module.c:3706 [] SYSC_init_module+0x3f9/0x470 kernel/module.c:3776 [] SyS_init_module+0xe/0x10 kernel/module.c:3759 [] entry_SYSCALL_64_fastpath+0x23/0xc6 arch/x86/entry/entry_64.S:209 Memory state around the buggy address: 8800359b7180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 8800359b7200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >8800359b7280: 00 00 f1 f1 f1 f1 f8 f2 f2 f2 f2 f2 f2 f2 00 f2 ^ 8800359b7300: f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 8800359b7380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 == kasan test: use_after_scope_test use-after-scope on array == BUG: KASAN: use-after-scope in use_after_scope_test+0x1ee/0x25b 
[test_kasan] at addr 8800359b7330 Write of size 1 by task insmod/6644 page:ead66dc0 count:0 mapcount:0 mapping: (nul
Re: Patch procedure
On Tue, Nov 15, 2016 at 08:55:11AM +0000, Ioana Ciornei wrote:
>
>> -----Original Message-----
>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> ow...@vger.kernel.org] On Behalf Of feas
>> Sent: Monday, November 14, 2016 7:16 PM
>> To: de...@driverdev.osuosl.org; gre...@linuxfoundation.org; linux-
>> ker...@vger.kernel.org
>> Subject: Patch procedure
>
> I know you mentioned the kernel newbies page but just in case you missed
> this tutorial, here is a link https://kernelnewbies.org/FirstKernelPatch.
> It has all the necessary info for you to submit a proper patch set.
>
> Ioana C

Ioana,

Thank you for the link! I had read that but I think this one has fixed what
I was missing. It has me sending multiple patches together in one email vs
sending them individually like I was doing. At least I hope the last one I
sent is correct.

https://burzalodowa.wordpress.com/

Walt
Re: [OpenRISC] [PATCH v2 8/9] openrisc: Updates after openrisc.net has been lost
On Mon, Nov 14, 2016 at 2:30 PM, Stafford Horne wrote:
> The openrisc.net domain expired and was taken over by squatters.
> These updates point documentation to the new domain, mailing lists
> and git repos.
>
> Also, Jonas is not the main maintainer any longer; he reviews changes
> but does not maintain a repo or send pull requests. Update this to
> add Stafford and Stefan, who are the active maintainers.
>
> Signed-off-by: Stafford Horne
> ---
>  MAINTAINERS                   | 6 ++++--
>  arch/openrisc/README.openrisc | 8 ++++----
>  arch/openrisc/kernel/setup.c  | 2 +-
>  3 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 851b89b..d84a585 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8958,9 +8958,11 @@ F:	drivers/of/resolver.c
>
>  OPENRISC ARCHITECTURE
>  M:	Jonas Bonn
> -W:	http://openrisc.net
> +M:	Stefan Kristiansson
> +M:	Stafford Horne
> +L:	openr...@lists.librecores.org
> +W:	http://openrisc.io
>  S:	Maintained
> -T:	git git://openrisc.net/~jonas/linux
>  F:	arch/openrisc/
>
>  OPENVSWITCH
> diff --git a/arch/openrisc/README.openrisc b/arch/openrisc/README.openrisc
> index c9f7edf..072069a 100644
> --- a/arch/openrisc/README.openrisc
> +++ b/arch/openrisc/README.openrisc
> @@ -6,7 +6,7 @@ target architecture, specifically, is the 32-bit OpenRISC 1000 family (or1k).
>
>  For information about OpenRISC processors and ongoing development:
>
> -	website		http://openrisc.net
> +	website		http://openrisc.io
>
>  For more information about Linux on OpenRISC, please contact South Pole AB.
>
> @@ -24,17 +24,17 @@ In order to build and run Linux for OpenRISC, you'll need at least a basic
>  toolchain and, perhaps, the architectural simulator. Steps to get these bits
>  in place are outlined here.
>
> -1) The toolchain can be obtained from openrisc.net. Instructions for building
> +1) The toolchain can be obtained from openrisc.io. Instructions for building
>  a toolchain can be found at:
>
> -http://openrisc.net/toolchain-build.html
> +https://github.com/openrisc/tutorials
>
>  2) or1ksim (optional)
>
>  or1ksim is the architectural simulator which will allow you to actually run
>  your OpenRISC Linux kernel if you don't have an OpenRISC processor at hand.
>
> -	git clone git://openrisc.net/jonas/or1ksim-svn
> +	git clone https://github.com/openrisc/or1ksim.git
>
>  	cd or1ksim
>  	./configure --prefix=$OPENRISC_PREFIX
> diff --git a/arch/openrisc/kernel/setup.c b/arch/openrisc/kernel/setup.c
> index 6329d7a..cb797a3 100644
> --- a/arch/openrisc/kernel/setup.c
> +++ b/arch/openrisc/kernel/setup.c
> @@ -295,7 +295,7 @@ void __init setup_arch(char **cmdline_p)
>
>  	*cmdline_p = boot_command_line;
>
> -	printk(KERN_INFO "OpenRISC Linux -- http://openrisc.net\n");
> +	printk(KERN_INFO "OpenRISC Linux -- http://openrisc.io\n");
>  }
>
>  static int show_cpuinfo(struct seq_file *m, void *v)
> --
> 2.7.4
>
> _______________________________________________
> OpenRISC mailing list
> openr...@lists.librecores.org
> https://lists.librecores.org/listinfo/openrisc

This looks all correct.

Acked-by: Olof Kindgren
Re: [PATCH v2] staging: slicoss: fix different address space warnings
On Wed, Nov 16, 2016 at 05:07:37AM +0100, Sergio Paracuellos wrote: > This patch fix the following sparse warnings in slicoss driver: > warning: incorrect type in assignment (different address spaces) > > Changes in v2: > * Remove IOMEM_GET_FIELDADDR macro > * Add ioread64 and iowrite64 defines > > Signed-off-by: Sergio Paracuellos > --- > drivers/staging/slicoss/slicoss.c | 111 > ++ > 1 file changed, 76 insertions(+), 35 deletions(-) > > diff --git a/drivers/staging/slicoss/slicoss.c > b/drivers/staging/slicoss/slicoss.c > index d2929b9..d68a463 100644 > --- a/drivers/staging/slicoss/slicoss.c > +++ b/drivers/staging/slicoss/slicoss.c > @@ -128,6 +128,35 @@ > > MODULE_DEVICE_TABLE(pci, slic_pci_tbl); > > +#ifndef ioread64 > +#ifdef readq > +#define ioread64 readq > +#else > +#define ioread64 _ioread64 > +static inline u64 _ioread64(void __iomem *mmio) > +{ > + u64 low, high; > + > + low = ioread32(mmio); > + high = ioread32(mmio + sizeof(u32)); > + return low | (high << 32); > +} > +#endif > +#endif eek, no! Don't write common kernel functions in a driver just because some configuration option was incorrect. That implies that you really can't do that type of read/write for that platform, so maybe you shouldn't be doing it! Split this up into one patch that does the 32bit stuff, then worry about the 64bit stuff in a separate patch please. thanks, greg k-h
Re: [PATCH 1/3] virtio: Basic implementation of virtio pstore driver
Hi, On Tue, Nov 15, 2016 at 11:38 PM, Paolo Bonzini wrote: > > > On 15/11/2016 15:36, Namhyung Kim wrote: >> Hi, >> >> On Tue, Nov 15, 2016 at 10:57:29AM +0100, Paolo Bonzini wrote: >>> >>> >>> On 15/11/2016 06:06, Michael S. Tsirkin wrote: On Tue, Nov 15, 2016 at 01:50:21PM +0900, Namhyung Kim wrote: > Hi Michael, > > On Thu, Nov 10, 2016 at 06:39:55PM +0200, Michael S. Tsirkin wrote: >> On Sat, Aug 20, 2016 at 05:07:42PM +0900, Namhyung Kim wrote: >>> The virtio pstore driver provides interface to the pstore subsystem so >>> that the guest kernel's log/dump message can be saved on the host >>> machine. Users can access the log file directly on the host, or on the >>> guest at the next boot using pstore filesystem. It currently deals with >>> kernel log (printk) buffer only, but we can extend it to have other >>> information (like ftrace dump) later. >>> >>> It supports legacy PCI device using single order-2 page buffer. >> >> Do you mean a legacy virtio device? I don't see why >> you would want to support pre-1.0 mode. >> If you drop that, you can drop all cpu_to_virtio things >> and just use __le accessors. > > I was thinking about the kvmtools which lacks 1.0 support AFAIK. Unless kvmtools wants to be left behind it has to go 1.0. >>> >>> And it also has to go ACPI. Is there any reason, apart from kvmtool, to >>> make a completely new virtio device, with no support in existing guests, >>> rather than implement ACPI ERST? >> >> Well, I know nothing about ACPI. It looks like a huge spec and I >> don't want to dig into it just for this. > > ERST (error record serialization table) is a small subset of the ACPI spec. Not sure how independent ERST is from ACPI and other specs. It looks like referencing UEFI spec at least. Btw, is the ERST used for pstore only (in Linux)? Also I need to control pstore driver like using bigger buffer, enabling specific message types and so on if ERST supports. Is it possible for ERST to provide such information? Thanks, Namhyung
[RESEND PATCH v7 3/3] clocksource: Add clockevent support to NPS400 driver
From: Noam Camus

Till now we used the clockevent from the generic ARC driver. This was enough
as long as we worked with a simple multicore SoC. When working with a
multithreaded SoC, each HW thread can be scheduled to receive a timer
interrupt using the timer mask register.

This patch provides a way to control clock events per HW thread.

The design idea is that for each core there is a dedicated register (TSI)
serving all 16 HW threads. The register is a bitmask with one bit for each
HW thread. When a HW thread wants the next expiration of the timer interrupt
to hit it, the proper bit should be set in this dedicated register. When the
timer expires, all HW threads within this core whose bit is set in the TSI
register will be interrupted.

The driver can be used from the device tree by:
	compatible = "ezchip,nps400-timer0" <-- for clockevent
	compatible = "ezchip,nps400-timer1" <-- for clocksource

Note that the name convention for timer0/timer1 was taken from the legacy
ARC design. This design is our base before adding HW threads.
For backward compatibility we keep "ezchip,nps400-timer" for clocksource Signed-off-by: Noam Camus Acked-by: Daniel Lezcano --- .../bindings/timer/ezchip,nps400-timer.txt | 15 -- .../bindings/timer/ezchip,nps400-timer0.txt| 17 ++ .../bindings/timer/ezchip,nps400-timer1.txt| 15 ++ drivers/clocksource/timer-nps.c| 170 4 files changed, 202 insertions(+), 15 deletions(-) delete mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer.txt create mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer0.txt create mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer1.txt diff --git a/Documentation/devicetree/bindings/timer/ezchip,nps400-timer.txt b/Documentation/devicetree/bindings/timer/ezchip,nps400-timer.txt deleted file mode 100644 index c8c03d7..000 --- a/Documentation/devicetree/bindings/timer/ezchip,nps400-timer.txt +++ /dev/null @@ -1,15 +0,0 @@ -NPS Network Processor - -Required properties: - -- compatible : should be "ezchip,nps400-timer" - -Clocks required for compatible = "ezchip,nps400-timer": -- clocks : Must contain a single entry describing the clock input - -Example: - -timer { - compatible = "ezchip,nps400-timer"; - clocks = <&sysclk>; -}; diff --git a/Documentation/devicetree/bindings/timer/ezchip,nps400-timer0.txt b/Documentation/devicetree/bindings/timer/ezchip,nps400-timer0.txt new file mode 100644 index 000..e3cfce8 --- /dev/null +++ b/Documentation/devicetree/bindings/timer/ezchip,nps400-timer0.txt @@ -0,0 +1,17 @@ +NPS Network Processor + +Required properties: + +- compatible : should be "ezchip,nps400-timer0" + +Clocks required for compatible = "ezchip,nps400-timer0": +- interrupts : The interrupt of the first timer +- clocks : Must contain a single entry describing the clock input + +Example: + +timer { + compatible = "ezchip,nps400-timer0"; + interrupts = <3>; + clocks = <&sysclk>; +}; diff --git a/Documentation/devicetree/bindings/timer/ezchip,nps400-timer1.txt 
b/Documentation/devicetree/bindings/timer/ezchip,nps400-timer1.txt new file mode 100644 index 000..c0ab419 --- /dev/null +++ b/Documentation/devicetree/bindings/timer/ezchip,nps400-timer1.txt @@ -0,0 +1,15 @@ +NPS Network Processor + +Required properties: + +- compatible : should be "ezchip,nps400-timer1" + +Clocks required for compatible = "ezchip,nps400-timer1": +- clocks : Must contain a single entry describing the clock input + +Example: + +timer { + compatible = "ezchip,nps400-timer1"; + clocks = <&sysclk>; +}; diff --git a/drivers/clocksource/timer-nps.c b/drivers/clocksource/timer-nps.c index 0c8e21f..b4c8a02 100644 --- a/drivers/clocksource/timer-nps.c +++ b/drivers/clocksource/timer-nps.c @@ -111,3 +111,173 @@ static int __init nps_setup_clocksource(struct device_node *node) CLOCKSOURCE_OF_DECLARE(ezchip_nps400_clksrc, "ezchip,nps400-timer", nps_setup_clocksource); +CLOCKSOURCE_OF_DECLARE(ezchip_nps400_clk_src, "ezchip,nps400-timer1", + nps_setup_clocksource); + +#ifdef CONFIG_EZNPS_MTM_EXT +#include + +/* Timer related Aux registers */ +#define NPS_REG_TIMER0_TSI 0xF850 +#define NPS_REG_TIMER0_LIMIT 0x23 +#define NPS_REG_TIMER0_CTRL0x22 +#define NPS_REG_TIMER0_CNT 0x21 + +/* + * Interrupt Enabled (IE) - re-arm the timer + * Not Halted (NH) - is cleared when working with JTAG (for debug) + */ +#define TIMER0_CTRL_IE BIT(0) +#define TIMER0_CTRL_NH BIT(1) + +static unsigned long nps_timer0_freq; +static unsigned long nps_timer0_irq; + +static void nps_clkevent_rm_thread(void) +{ + int thread; + unsigned int cflags, enabled_threads; + + hw_schd_save(&cflags); + + enabled_threads = read_aux_reg(NPS_REG_TIMER0_TSI); + + /* remove thread from TSI1 */ + thread = read_aux_reg(CTOP_AUX_THREAD_ID); + enabled_threa
Re: [PATCH] clk: qoriq: added ls1012a clock configuration
On Wed, 2016-11-16 at 13:58 +0800, yuantian.t...@nxp.com wrote: > From: Tang Yuantian > > Added ls1012a clock configuation information. Do we really need the same line in the changelog twice? > > Signed-off-by: Tang Yuantian > --- > drivers/clk/clk-qoriq.c | 19 +++ > 1 file changed, 19 insertions(+) > > diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c > index 1bece0f..563d874 100644 > --- a/drivers/clk/clk-qoriq.c > +++ b/drivers/clk/clk-qoriq.c > @@ -202,6 +202,14 @@ static const struct clockgen_muxinfo ls1021a_cmux = { > } > }; > > +static const struct clockgen_muxinfo ls1012a_cmux = { > + { > + [0] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV1 }, > + {}, > + [2] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV2 }, > + } > +}; > + Based on the "ls1021a_cmux" in the context it looks like this patch is intended to apply on top of https://patchwork.kernel.org/patch/8923541/ but I don't see any mention of that. > static const struct clockgen_muxinfo t1040_cmux = { > { > [0] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV1 }, > @@ -482,6 +490,16 @@ static const struct clockgen_chipinfo chipinfo[] = { > .pll_mask = 0x03, > }, > { > + .compat = "fsl,ls1012a-clockgen", > + .cmux_groups = { > + &ls1012a_cmux > + }, > + .cmux_to_group = { > + 0, -1 > + }, > + .pll_mask = 0x03, > + }, > + { > .compat = "fsl,ls1043a-clockgen", > .init_periph = t2080_init_periph, > .cmux_groups = { > @@ -1284,6 +1302,7 @@ CLK_OF_DECLARE(qoriq_clockgen_2, "fsl,qoriq-clockgen- > 2.0", clockgen_init); > CLK_OF_DECLARE(qoriq_clockgen_ls1021a, "fsl,ls1021a-clockgen", > clockgen_init); > CLK_OF_DECLARE(qoriq_clockgen_ls1043a, "fsl,ls1043a-clockgen", > clockgen_init); > CLK_OF_DECLARE(qoriq_clockgen_ls2080a, "fsl,ls2080a-clockgen", > clockgen_init); > +CLK_OF_DECLARE(qoriq_clockgen_ls1012a, "fsl,ls1012a-clockgen", > clockgen_init); Please keep these lists of chips sorted (or as close as you can in the case of the cmux structs which already have some sorting issues). -Scott
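The ls1012a_cmux table quoted above relies on C99 designated array initializers: entries [0] and [2] carry valid selections, while the bare `{}` leaves index 1 all-zero (invalid). A standalone sketch of the idiom — the struct and field names here are illustrative, not the kernel's:

```c
#include <assert.h>

/* Hypothetical mux-selection table mirroring the ls1012a_cmux style:
 * entries 0 and 2 are valid clock selections; the empty initializer {}
 * leaves index 1 zeroed, i.e. marked invalid. */
struct clksel {
	int valid;	/* stands in for CLKSEL_VALID */
	int pll;	/* stands in for CGA_PLL1 */
	int div;	/* stands in for PLL_DIVn */
};

static const struct clksel cmux[4] = {
	[0] = { 1, 1, 1 },	/* clksel 0: PLL1 / 1 */
	{},			/* clksel 1: all-zero, invalid */
	[2] = { 1, 1, 2 },	/* clksel 2: PLL1 / 2 */
	/* index 3 left implicitly zeroed, like any tail entries */
};

int clksel_is_valid(int i)
{
	return cmux[i].valid;
}
```

A positional element after a designator continues at the next index, so the `{}` lands at index 1 — which is why the kernel table can skip clksel 1 without padding macros.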
[RESEND PATCH v7 1/3] soc: Support for NPS HW scheduling
From: Noam Camus This new header file is for NPS400 SoC (part of ARC architecture). The header file includes macros for save/restore of HW scheduling. The control of HW scheduling is achieved by writing core registers. This code was moved from arc/plat-eznps so it can be used from drivers/clocksource/, available only for CONFIG_EZNPS_MTM_EXT. Signed-off-by: Noam Camus Acked-by: Daniel Lezcano --- arch/arc/plat-eznps/include/plat/ctop.h |2 - include/soc/nps/mtm.h | 59 +++ 2 files changed, 59 insertions(+), 2 deletions(-) create mode 100644 include/soc/nps/mtm.h diff --git a/arch/arc/plat-eznps/include/plat/ctop.h b/arch/arc/plat-eznps/include/plat/ctop.h index 9d6718c..ee2e32d 100644 --- a/arch/arc/plat-eznps/include/plat/ctop.h +++ b/arch/arc/plat-eznps/include/plat/ctop.h @@ -46,9 +46,7 @@ #define CTOP_AUX_UDMC (CTOP_AUX_BASE + 0x300) /* EZchip core instructions */ -#define CTOP_INST_HWSCHD_OFF_R30x3B6F00BF #define CTOP_INST_HWSCHD_OFF_R40x3C6F00BF -#define CTOP_INST_HWSCHD_RESTORE_R30x3E6F70C3 #define CTOP_INST_HWSCHD_RESTORE_R40x3E6F7103 #define CTOP_INST_SCHD_RW 0x3E6F7004 #define CTOP_INST_SCHD_RD 0x3E6F7084 diff --git a/include/soc/nps/mtm.h b/include/soc/nps/mtm.h new file mode 100644 index 000..d2f5e7e --- /dev/null +++ b/include/soc/nps/mtm.h @@ -0,0 +1,59 @@ +/* + * Copyright (c) 2016, Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef SOC_NPS_MTM_H +#define SOC_NPS_MTM_H + +#define CTOP_INST_HWSCHD_OFF_R3 0x3B6F00BF +#define CTOP_INST_HWSCHD_RESTORE_R3 0x3E6F70C3 + +static inline void hw_schd_save(unsigned int *flags) +{ + __asm__ __volatile__( + " .word %1\n" + " st r3,[%0]\n" + : + : "r"(flags), "i"(CTOP_INST_HWSCHD_OFF_R3) + : "r3", "memory"); +} + +static inline void hw_schd_restore(unsigned int flags) +{ + __asm__ __volatile__( + " mov r3, %0\n" + " .word %1\n" + : + : "r"(flags), "i"(CTOP_INST_HWSCHD_RESTORE_R3) + : "r3"); +} + +#endif /* SOC_NPS_MTM_H */ -- 1.7.1
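The inline helpers above are meant to bracket a critical section: hw_schd_save() switches HW scheduling off and hands back the previous state, and hw_schd_restore() puts that state back. A host-side mock of that calling discipline — the real helpers emit ARC instructions and touch a core register, so this stand-in only models the pattern, not the hardware:

```c
#include <assert.h>

/* Mock of the hw_schd_save()/hw_schd_restore() discipline: a plain
 * variable stands in for the HW-scheduling state that the real helpers
 * manipulate through ARC core instructions. */
static unsigned int fake_schd_state = 1;	/* 1 = HW scheduling enabled */

static void mock_hw_schd_save(unsigned int *flags)
{
	*flags = fake_schd_state;	/* remember the current state... */
	fake_schd_state = 0;		/* ...and switch HW scheduling off */
}

static void mock_hw_schd_restore(unsigned int flags)
{
	fake_schd_state = flags;	/* put the saved state back */
}
```

A caller brackets its register updates with the save/restore pair, which is exactly how the timer driver's nps_clkevent_rm_thread() uses the real helpers.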
Re: [PATCH v3 0/8] pstore: Improve performance of ftrace backend with ramoops
On Tue, Nov 15, 2016 at 2:14 PM, Kees Cook wrote: > On Tue, Nov 15, 2016 at 2:06 PM, Joel Fernandes wrote: >> Hi Kees, >> >> On Tue, Nov 15, 2016 at 1:36 PM, Kees Cook wrote: >>> On Tue, Nov 15, 2016 at 11:55 AM, Joel Fernandes wrote: Hi Kees, On Fri, Nov 11, 2016 at 2:21 PM, Kees Cook wrote: > Hi Joel, > > I've reorganized a bunch of the logic here. Since pstore is going to need > the init_przs() logic for multiple pmsg przs, I wanted to get this in and > make sure I was happy with how it looks. I figured this would reduce our > round-trip time on reviews. :) > > Can you test this series and verify that it works as you're expecting? > I've > validated some basic behavior already, but don't have a good test-case for > ftrace. What commands do you actually use for testing ftrace? I'd like to > add something to my local tests. I normally do the following: dd if=/dev/urandom | pv | dd of=/dev/null and in parallel, I do a: echo 1 > /sys/kernel/debug/pstore/record_ftrace and then check the throughput; and then reboot the system and do a read out of /sys/fs/pstore/ ftrace file. >>> >>> Cool. Does something normally parse these? Lots of kernel addresses is >>> all I see. ;) >>> >> >> It should print symbol names if KALLSYMS is working properly as it >> uses %pf (in pstore_ftrace_seq_show function). > > Hrm. No such luck for me, but it's clearly using pstore correctly, so > I'm satisfied things are working along that path. :) Tested your for-next/pstore branch and ftrace on pstore works fine. Thanks! Joel
Re: [PATCH v4 4/4] ARM: dts: da850: Add the usb otg device node
On Wednesday 16 November 2016 02:49 AM, Bin Liu wrote: > On Tue, Nov 15, 2016 at 04:16:02PM +0530, Sekhar Nori wrote: >> On Thursday 03 November 2016 09:29 PM, Alexandre Bailon wrote: >>> This adds the device tree node for the usb otg >>> controller present in the da850 family of SoC's. >>> This also enables the otg usb controller for the lcdk board. >>> >>> Signed-off-by: Alexandre Bailon >>> --- >>> arch/arm/boot/dts/da850-lcdk.dts | 8 >>> arch/arm/boot/dts/da850.dtsi | 15 +++ >>> 2 files changed, 23 insertions(+) >>> >>> diff --git a/arch/arm/boot/dts/da850-lcdk.dts >>> b/arch/arm/boot/dts/da850-lcdk.dts >>> index 7b8ab21..9f5040c 100644 >>> --- a/arch/arm/boot/dts/da850-lcdk.dts >>> +++ b/arch/arm/boot/dts/da850-lcdk.dts >>> @@ -158,6 +158,14 @@ >>> rx-num-evt = <32>; >>> }; >>> >>> +&usb_phy { >>> + status = "okay"; >>> + }; >> >> As mentioned by David already, this node needs to be removed. Please >> rebase this on top of latest linux-davinci/master when ready for merging >> (driver changes accepted). > > Dropped this patch due to this comment. Bin, Please do not apply dts or arch/arm/mach-davinci patches. I have a bunch queued through my tree and more in pipeline and it will cause unnecessary merge conflicts in linux-next or at Linus. For future, I have asked Alexandre to send driver and dts patches as separate series so there is no confusion on who should apply. Thanks, Sekhar
[RESEND PATCH v7 0/3] Add clockevent for timer-nps driver to NPS400 SoC
From: Noam Camus

Change log
---
V6 --> V7
Apply several comments made by Daniel Lezcano:
1) Remove CLOCK_EVT_FEAT_PERIODIC support. This way it is a pure
   oneshot driver. This simplifies the driver so that
   nps_clkevent_add_thread() and nps_clkevent_rm_thread() are clearer,
   without any vague logic about whether to change the TSI bit of the
   current HW thread or not.
2) tick_resume also calls nps_clkevent_rm_thread()
3) Few (hopefully last) typo fixes.

V5 --> V6
Apply several comments made by Daniel Lezcano:
1) nps_get_timer_clk() - use clk_put() on the error path
2) nps_get_timer_clk() - return EINVAL and not plain 1
3) Fix typos in log (double checked with spell checker)

V4 --> V5
Apply several comments made by Daniel Lezcano:
1) Add __init attribute to nps_get_timer_clk()
2) Fix return value of nps_get_timer_clk() when failing to get clk rate
3) Change clocksource rate from 301 -> 300

V3 --> V4
Main changes are [Thanks for the review]:
Fix many typos in the log [Daniel]
Add handling for bad return values [Daniel and Thomas]
Replace use of internal irqchip pointers with existing IRQ API [Thomas]
Provide interrupt handler (percpu) with dev_id equal to evt [Thomas]
Fix passing *clk by reference to nps_get_timer_clk() [Daniel]

V2 --> V3
Apply Rob Herring's comment about backward compatibility

V1 --> V2
Apply Daniel Lezcano's comments:
CLOCKSOURCE_OF_DECLARE return value update
hotplug callbacks usage
squash of the first 2 commits.
In this version I created a new commit to serve as preparation for
adding clockevents. This way the last patch is more readable with the
clockevent content.
---
In the first version of this driver we supported clocksource for the
NPS400. The support for clockevent was taken from the Synopsys ARC timer
driver. This was good for working with our simulator of the NPS400.
However, in the NPS400 ASIC the timers behave differently than in
simulation. The timers in the ASIC are shared between all threads within
a core and hence need a different driver to support this behaviour.
The idea of this design is that we have 16 HW threads per core, each
represented by a bit in a bitmask in a shared register in this core. So
when a thread wants the next clockevent expiration to raise a timer
interrupt for itself, the corresponding bit in this register should be
set. So theoretically, if all 16 bits are set, all HW threads will get a
timer interrupt on the next expiration of timer 0.

Note that we use the Synopsys ARC design naming convention for the
timers, where:
timer0 is used for clockevents
timer1 is used for clocksource.

Noam Camus (3):
  soc: Support for NPS HW scheduling
  clocksource: update "fn" at CLOCKSOURCE_OF_DECLARE() of nps400 timer
  clocksource: Add clockevent support to NPS400 driver

 .../bindings/timer/ezchip,nps400-timer.txt | 15 --
 .../bindings/timer/ezchip,nps400-timer0.txt| 17 ++
 .../bindings/timer/ezchip,nps400-timer1.txt| 15 ++
 arch/arc/plat-eznps/include/plat/ctop.h|2 -
 drivers/clocksource/timer-nps.c| 223 ++--
 include/soc/nps/mtm.h | 59 +
 6 files changed, 294 insertions(+), 37 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer.txt
 create mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer0.txt
 create mode 100644 Documentation/devicetree/bindings/timer/ezchip,nps400-timer1.txt
 create mode 100644 include/soc/nps/mtm.h
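The per-core bitmask scheme described in the cover letter can be sketched in plain C. This mock models only the bit arithmetic of adding and removing a HW thread from the timer-interrupt mask — the real mask lives in the NPS_REG_TIMER0_TSI aux register and is accessed with aux-register reads/writes, not a plain variable:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the per-core timer-interrupt mask: 16 HW threads per core,
 * one bit each. Setting a thread's bit opts it in to the interrupt
 * raised on the next expiration of timer 0. */
static uint16_t tsi_add_thread(uint16_t tsi, unsigned int thread)
{
	return tsi | (uint16_t)(1u << thread);	/* receive timer0 IRQ */
}

static uint16_t tsi_rm_thread(uint16_t tsi, unsigned int thread)
{
	return tsi & (uint16_t)~(1u << thread);	/* stop receiving it */
}
```

With all 16 bits set (tsi == 0xFFFF), every HW thread in the core gets the interrupt — the "theoretically all threads" case mentioned above.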
[RESEND PATCH v7 2/3] clocksource: update "fn" at CLOCKSOURCE_OF_DECLARE() of nps400 timer
From: Noam Camus nps_setup_clocksource() should take node as only argument as defined by typedef int (*of_init_fn_1_ret)(struct device_node *) Therefore need to replace: int __init nps_setup_clocksource(struct device_node *node, struct clk *clk) with int __init nps_setup_clocksource(struct device_node *node) This patch also serve as preparation for next patch which add support for clockevents to nps400. Specifically we add new function nps_get_timer_clk() to serve clocksource and later clockevent registration. Signed-off-by: Noam Camus Acked-by: Daniel Lezcano --- drivers/clocksource/timer-nps.c | 65 +++--- 1 files changed, 39 insertions(+), 26 deletions(-) diff --git a/drivers/clocksource/timer-nps.c b/drivers/clocksource/timer-nps.c index 70c149a..0c8e21f 100644 --- a/drivers/clocksource/timer-nps.c +++ b/drivers/clocksource/timer-nps.c @@ -46,7 +46,35 @@ /* This array is per cluster of CPUs (Each NPS400 cluster got 256 CPUs) */ static void *nps_msu_reg_low_addr[NPS_CLUSTER_NUM] __read_mostly; -static unsigned long nps_timer_rate; +static int __init nps_get_timer_clk(struct device_node *node, +unsigned long *timer_freq, +struct clk **clk) +{ + int ret; + + *clk = of_clk_get(node, 0); + if (IS_ERR(*clk)) { + pr_err("timer missing clk"); + return PTR_ERR(*clk); + } + + ret = clk_prepare_enable(*clk); + if (ret) { + pr_err("Couldn't enable parent clk\n"); + clk_put(*clk); + return ret; + } + + *timer_freq = clk_get_rate(*clk); + if (!(*timer_freq)) { + pr_err("Couldn't get clk rate\n"); + clk_disable_unprepare(*clk); + clk_put(*clk); + return -EINVAL; + } + + return 0; +} static cycle_t nps_clksrc_read(struct clocksource *clksrc) { @@ -55,26 +83,24 @@ static cycle_t nps_clksrc_read(struct clocksource *clksrc) return (cycle_t)ioread32be(nps_msu_reg_low_addr[cluster]); } -static int __init nps_setup_clocksource(struct device_node *node, - struct clk *clk) +static int __init nps_setup_clocksource(struct device_node *node) { int ret, cluster; + struct clk *clk; + 
unsigned long nps_timer1_freq; + for (cluster = 0; cluster < NPS_CLUSTER_NUM; cluster++) nps_msu_reg_low_addr[cluster] = nps_host_reg((cluster << NPS_CLUSTER_OFFSET), -NPS_MSU_BLKID, NPS_MSU_TICK_LOW); +NPS_MSU_BLKID, NPS_MSU_TICK_LOW); - ret = clk_prepare_enable(clk); - if (ret) { - pr_err("Couldn't enable parent clock\n"); + ret = nps_get_timer_clk(node, &nps_timer1_freq, &clk); + if (ret) return ret; - } - nps_timer_rate = clk_get_rate(clk); - - ret = clocksource_mmio_init(nps_msu_reg_low_addr, "EZnps-tick", - nps_timer_rate, 301, 32, nps_clksrc_read); + ret = clocksource_mmio_init(nps_msu_reg_low_addr, "nps-tick", + nps_timer1_freq, 300, 32, nps_clksrc_read); if (ret) { pr_err("Couldn't register clock source.\n"); clk_disable_unprepare(clk); @@ -83,18 +109,5 @@ static int __init nps_setup_clocksource(struct device_node *node, return ret; } -static int __init nps_timer_init(struct device_node *node) -{ - struct clk *clk; - - clk = of_clk_get(node, 0); - if (IS_ERR(clk)) { - pr_err("Can't get timer clock.\n"); - return PTR_ERR(clk); - } - - return nps_setup_clocksource(node, clk); -} - CLOCKSOURCE_OF_DECLARE(ezchip_nps400_clksrc, "ezchip,nps400-timer", - nps_timer_init); + nps_setup_clocksource); -- 1.7.1
[PATCH] edac: mpc85xx: implement "remove" for mpc85xx_pci_err_driver
From: Yanjiang Jin Tested on a T4240QDS board. If we execute the below steps without this patch: 1. modprobe mpc85xx_edac [The first insmod, everything is well.] 2. modprobe -r mpc85xx_edac 3. modprobe mpc85xx_edac [insmod again, error happens.] We would get the below error: BUG: recent printk recursion! Oops: Kernel access of bad area, sig: 11 [#48] PREEMPT SMP NR_CPUS=24 CoreNet Generic Modules linked in: mpc85xx_edac edac_core softdog [last unloaded: mpc85xx_edac] CPU: 5 PID: 14773 Comm: modprobe Tainted: G D C 4.8.3-rt2 task: c005cdc40d40 task.stack: c005c8814000 NIP: c05c5b60 LR: c05c895c CTR: c05c8940 REGS: c005c8816e20 TRAP: 0300 Tainted: G D C (4.8.3-rt2-WR9.0.0.0_preempt-rt) MSR: 80029000 CR: 28222828 XER: 2000 DEAR: 805392d8 ESR: 0100 SOFTE: 0 GPR00: c05c8844 c005c88170a0 c11db400 c1220496 GPR04: c1220838 c1220838 04ff000a 805392d8 GPR08: c05cb400 c05c8940 fffe 804c9108 GPR12: c0bdad80 c0003fff7300 fff1 c0d1c7f0 GPR16: 0001 003f c005c8817c20 c0bed4e0 GPR20: c11fdaa0 0002 804ccafe GPR24: c005c8817390 0025 c1220458 0020 GPR28: 03e0 c1220838 804ccafe c1220496 NIP [c05c5b60] .string+0x20/0xa0 LR [c05c895c] .vsnprintf+0x1ac/0x490 Call Trace: [c005c88170a0] [c05c8844] .vsnprintf+0x94/0x490 (unreliable) [c005c8817170] [c05c8c58] .vscnprintf+0x18/0x70 [c005c88171f0] [c00d5920] .vprintk_emit+0x120/0x600 [c005c88172c0] [c0bdae44] .printk+0xc4/0xe0 [c005c8817340] [804c6f5c] .edac_pci_add_device+0x2fc/0x350 [edac_core] [c005c88173e0] [80759d64] .mpc85xx_pci_err_probe+0x344/0x550 [mpc85xx_edac] [c005c88174c0] [c06952b4] .platform_drv_probe+0x84/0x120 [c005c8817550] [c0692294] .driver_probe_device+0x2f4/0x3d0 [c005c88175f0] [c069248c] .__driver_attach+0x11c/0x120 [c005c8817680] [c068f034] .bus_for_each_dev+0x94/0x100 [c005c8817720] [c0691624] .driver_attach+0x34/0x50 [c005c88177a0] [c0690e88] .bus_add_driver+0x1b8/0x310 [c005c8817840] [c0693404] .driver_register+0x94/0x170 [c005c88178c0] [c06954b0] .__platform_register_drivers+0xa0/0x150 [c005c8817980] [8075b51c] 
.mpc85xx_mc_init+0x60/0xd0 [mpc85xx_edac] [c005c8817a00] [c0001a68] .do_one_initcall+0x68/0x1e0 [c005c8817ae0] [c0bdb2e8] .do_init_module+0x88/0x24c [c005c8817b80] [c011961c] .load_module+0x1e3c/0x2840 [c005c8817d20] [c011a320] .SyS_finit_module+0x100/0x130 [c005c8817e30] [c698] system_call+0x38/0xe8 Instruction dump: 4ba71abd 6000 7214 4b20 2ba50fff 7ca72b78 7cca0734 7c852378 40dd0030 2faa 394a 41de0014 <8907> 38e70001 2fa8 40fe002c ---[ end trace 0031 ]--- Signed-off-by: Yanjiang Jin --- drivers/edac/mpc85xx_edac.c | 17 + 1 file changed, 17 insertions(+) diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c index ff05675..c626021 100644 --- a/drivers/edac/mpc85xx_edac.c +++ b/drivers/edac/mpc85xx_edac.c @@ -300,6 +300,22 @@ static int mpc85xx_pci_err_probe(struct platform_device *op) return res; } +static int mpc85xx_pci_err_remove(struct platform_device *op) +{ + struct edac_pci_ctl_info *pci = dev_get_drvdata(&op->dev); + struct mpc85xx_pci_pdata *pdata = pci->pvt_info; + + edac_dbg(0, "\n"); + + out_be32(pdata->pci_vbase + MPC85XX_PCI_ERR_ADDR, orig_pci_err_cap_dr); + out_be32(pdata->pci_vbase + MPC85XX_PCI_ERR_EN, orig_pci_err_en); + + edac_pci_del_device(&op->dev); + edac_pci_free_ctl_info(pci); + + return 0; +} + static const struct platform_device_id mpc85xx_pci_err_match[] = { { .name = "mpc85xx-pci-edac" @@ -309,6 +325,7 @@ static int mpc85xx_pci_err_probe(struct platform_device *op) static struct platform_driver mpc85xx_pci_err_driver = { .probe = mpc85xx_pci_err_probe, + .remove = mpc85xx_pci_err_remove, .id_table = mpc85xx_pci_err_match, .driver = { .name = "mpc85xx_pci_err", -- 1.9.1
[PATCH v4 2/6] usb: chipidea: use bus->sysdev for DMA configuration
From: Arnd Bergmann Set the dma for chipidea from sysdev. This is inherited from its parent node. Also, do not set dma mask for child as it is not required now. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash Acked-by: Peter Chen --- Changes in v4: - No update Changes in v3: - No update Changes in v2: - integrate chipidea driver changes together. drivers/usb/chipidea/core.c | 3 --- drivers/usb/chipidea/host.c | 3 ++- drivers/usb/chipidea/udc.c | 10 ++ 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/usb/chipidea/core.c b/drivers/usb/chipidea/core.c index 69426e6..8917a03 100644 --- a/drivers/usb/chipidea/core.c +++ b/drivers/usb/chipidea/core.c @@ -833,9 +833,6 @@ struct platform_device *ci_hdrc_add_device(struct device *dev, } pdev->dev.parent = dev; - pdev->dev.dma_mask = dev->dma_mask; - pdev->dev.dma_parms = dev->dma_parms; - dma_set_coherent_mask(&pdev->dev, dev->coherent_dma_mask); ret = platform_device_add_resources(pdev, res, nres); if (ret) diff --git a/drivers/usb/chipidea/host.c b/drivers/usb/chipidea/host.c index 111b0e0b..3218b49 100644 --- a/drivers/usb/chipidea/host.c +++ b/drivers/usb/chipidea/host.c @@ -116,7 +116,8 @@ static int host_start(struct ci_hdrc *ci) if (usb_disabled()) return -ENODEV; - hcd = usb_create_hcd(&ci_ehci_hc_driver, ci->dev, dev_name(ci->dev)); + hcd = __usb_create_hcd(&ci_ehci_hc_driver, ci->dev->parent, + ci->dev, dev_name(ci->dev), NULL); if (!hcd) return -ENOMEM; diff --git a/drivers/usb/chipidea/udc.c b/drivers/usb/chipidea/udc.c index 661f43f..bc55922 100644 --- a/drivers/usb/chipidea/udc.c +++ b/drivers/usb/chipidea/udc.c @@ -423,7 +423,8 @@ static int _hardware_enqueue(struct ci_hw_ep *hwep, struct ci_hw_req *hwreq) hwreq->req.status = -EALREADY; - ret = usb_gadget_map_request(&ci->gadget, &hwreq->req, hwep->dir); + ret = usb_gadget_map_request_by_dev(ci->dev->parent, + &hwreq->req, hwep->dir); if (ret) return ret; @@ -603,7 +604,8 @@ static int _hardware_dequeue(struct ci_hw_ep *hwep, 
struct ci_hw_req *hwreq) list_del_init(&node->td); } - usb_gadget_unmap_request(&hwep->ci->gadget, &hwreq->req, hwep->dir); + usb_gadget_unmap_request_by_dev(hwep->ci->dev->parent, + &hwreq->req, hwep->dir); hwreq->req.actual += actual; @@ -1904,13 +1906,13 @@ static int udc_start(struct ci_hdrc *ci) INIT_LIST_HEAD(&ci->gadget.ep_list); /* alloc resources */ - ci->qh_pool = dma_pool_create("ci_hw_qh", dev, + ci->qh_pool = dma_pool_create("ci_hw_qh", dev->parent, sizeof(struct ci_hw_qh), 64, CI_HDRC_PAGE_SIZE); if (ci->qh_pool == NULL) return -ENOMEM; - ci->td_pool = dma_pool_create("ci_hw_td", dev, + ci->td_pool = dma_pool_create("ci_hw_td", dev->parent, sizeof(struct ci_hw_td), 64, CI_HDRC_PAGE_SIZE); if (ci->td_pool == NULL) { -- 2.1.0
[PATCH] clk: qoriq: added ls1012a clock configuration
From: Tang Yuantian Added ls1012a clock configuation information. Signed-off-by: Tang Yuantian --- drivers/clk/clk-qoriq.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c index 1bece0f..563d874 100644 --- a/drivers/clk/clk-qoriq.c +++ b/drivers/clk/clk-qoriq.c @@ -202,6 +202,14 @@ static const struct clockgen_muxinfo ls1021a_cmux = { } }; +static const struct clockgen_muxinfo ls1012a_cmux = { + { + [0] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV1 }, + {}, + [2] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV2 }, + } +}; + static const struct clockgen_muxinfo t1040_cmux = { { [0] = { CLKSEL_VALID, CGA_PLL1, PLL_DIV1 }, @@ -482,6 +490,16 @@ static const struct clockgen_chipinfo chipinfo[] = { .pll_mask = 0x03, }, { + .compat = "fsl,ls1012a-clockgen", + .cmux_groups = { + &ls1012a_cmux + }, + .cmux_to_group = { + 0, -1 + }, + .pll_mask = 0x03, + }, + { .compat = "fsl,ls1043a-clockgen", .init_periph = t2080_init_periph, .cmux_groups = { @@ -1284,6 +1302,7 @@ CLK_OF_DECLARE(qoriq_clockgen_2, "fsl,qoriq-clockgen-2.0", clockgen_init); CLK_OF_DECLARE(qoriq_clockgen_ls1021a, "fsl,ls1021a-clockgen", clockgen_init); CLK_OF_DECLARE(qoriq_clockgen_ls1043a, "fsl,ls1043a-clockgen", clockgen_init); CLK_OF_DECLARE(qoriq_clockgen_ls2080a, "fsl,ls2080a-clockgen", clockgen_init); +CLK_OF_DECLARE(qoriq_clockgen_ls1012a, "fsl,ls1012a-clockgen", clockgen_init); /* Legacy nodes */ CLK_OF_DECLARE(qoriq_sysclk_1, "fsl,qoriq-sysclk-1.0", sysclk_init); -- 2.1.0.27.g96db324
[PATCH v4 4/6] usb: xhci: use bus->sysdev for DMA configuration
From: Arnd Bergmann For xhci-hcd platform device, all the DMA parameters are not configured properly, notably dma ops for dwc3 devices. So, set the dma for xhci from sysdev. sysdev is pointing to device that is known to the system firmware or hardware. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash Tested-by: Baolin Wang --- Changes in v4: - No update Changes in v3: - No update Changes in v2: - Separate out xhci driver changes apart drivers/usb/host/xhci-mem.c | 12 ++-- drivers/usb/host/xhci-plat.c | 33 ++--- drivers/usb/host/xhci.c | 15 +++ 3 files changed, 43 insertions(+), 17 deletions(-) diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c index 6afe323..79608df 100644 --- a/drivers/usb/host/xhci-mem.c +++ b/drivers/usb/host/xhci-mem.c @@ -586,7 +586,7 @@ static void xhci_free_stream_ctx(struct xhci_hcd *xhci, unsigned int num_stream_ctxs, struct xhci_stream_ctx *stream_ctx, dma_addr_t dma) { - struct device *dev = xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; size_t size = sizeof(struct xhci_stream_ctx) * num_stream_ctxs; if (size > MEDIUM_STREAM_ARRAY_SIZE) @@ -614,7 +614,7 @@ static struct xhci_stream_ctx *xhci_alloc_stream_ctx(struct xhci_hcd *xhci, unsigned int num_stream_ctxs, dma_addr_t *dma, gfp_t mem_flags) { - struct device *dev = xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; size_t size = sizeof(struct xhci_stream_ctx) * num_stream_ctxs; if (size > MEDIUM_STREAM_ARRAY_SIZE) @@ -1644,7 +1644,7 @@ void xhci_slot_copy(struct xhci_hcd *xhci, static int scratchpad_alloc(struct xhci_hcd *xhci, gfp_t flags) { int i; - struct device *dev = xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; int num_sp = HCS_MAX_SCRATCHPAD(xhci->hcs_params2); xhci_dbg_trace(xhci, trace_xhci_dbg_init, @@ -1716,7 +1716,7 @@ static void scratchpad_free(struct xhci_hcd *xhci) { int num_sp; int i; - struct device *dev = 
xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; if (!xhci->scratchpad) return; @@ -1792,7 +1792,7 @@ void xhci_free_command(struct xhci_hcd *xhci, void xhci_mem_cleanup(struct xhci_hcd *xhci) { - struct device *dev = xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; int size; int i, j, num_ports; @@ -2334,7 +2334,7 @@ static int xhci_setup_port_arrays(struct xhci_hcd *xhci, gfp_t flags) int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags) { dma_addr_t dma; - struct device *dev = xhci_to_hcd(xhci)->self.controller; + struct device *dev = xhci_to_hcd(xhci)->self.sysdev; unsigned intval, val2; u64 val_64; struct xhci_segment *seg; diff --git a/drivers/usb/host/xhci-plat.c b/drivers/usb/host/xhci-plat.c index ed56bf9..beb95c8 100644 --- a/drivers/usb/host/xhci-plat.c +++ b/drivers/usb/host/xhci-plat.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -139,6 +140,7 @@ static int xhci_plat_probe(struct platform_device *pdev) { const struct of_device_id *match; const struct hc_driver *driver; + struct device *sysdev; struct xhci_hcd *xhci; struct resource *res; struct usb_hcd *hcd; @@ -155,22 +157,39 @@ static int xhci_plat_probe(struct platform_device *pdev) if (irq < 0) return -ENODEV; + /* +* sysdev must point to a device that is known to the system firmware +* or PCI hardware. We handle these three cases here: +* 1. xhci_plat comes from firmware +* 2. xhci_plat is child of a device from firmware (dwc3-plat) +* 3. 
xhci_plat is grandchild of a pci device (dwc3-pci) +*/ + sysdev = &pdev->dev; + if (sysdev->parent && !sysdev->of_node && sysdev->parent->of_node) + sysdev = sysdev->parent; +#ifdef CONFIG_PCI + else if (sysdev->parent && sysdev->parent->parent && +sysdev->parent->parent->bus == &pci_bus_type) + sysdev = sysdev->parent->parent; +#endif + /* Try to set 64-bit DMA first */ - if (WARN_ON(!pdev->dev.dma_mask)) + if (WARN_ON(!sysdev->dma_mask)) /* Platform did not initialize dma_mask */ - ret = dma_coerce_mask_and_coherent(&pdev->dev, + ret = dma_coerce_mask_and_coherent(sysdev, DMA_BIT_MASK(64)); else - ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))
[PATCH v4 3/6] usb: ehci: fsl: use bus->sysdev for DMA configuration
From: Arnd Bergmann For the dual role ehci fsl driver, sysdev will handle the dma config. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash --- Changes in v4: - No update Changes in v3: - fix compile errors Changes in v2: - fix compile warnings drivers/usb/host/ehci-fsl.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/usb/host/ehci-fsl.c b/drivers/usb/host/ehci-fsl.c index 9f5ffb6..4d4ab42 100644 --- a/drivers/usb/host/ehci-fsl.c +++ b/drivers/usb/host/ehci-fsl.c @@ -96,8 +96,8 @@ static int fsl_ehci_drv_probe(struct platform_device *pdev) } irq = res->start; - hcd = usb_create_hcd(&fsl_ehci_hc_driver, &pdev->dev, - dev_name(&pdev->dev)); + hcd = __usb_create_hcd(&fsl_ehci_hc_driver, pdev->dev.parent, + &pdev->dev, dev_name(&pdev->dev), NULL); if (!hcd) { retval = -ENOMEM; goto err1; -- 2.1.0
Re: [GIT PULL] fbdev fixes for 4.9
Tomi Valkeinen writes: > [ Unknown signature status ] > On 14/11/16 18:25, Linus Torvalds wrote: >> On Mon, Nov 14, 2016 at 3:44 AM, Tomi Valkeinen >> wrote: >>> >>> Please pull two fbdev fixes for 4.9. >> >> No. >> >> This has obviously never even been test-compiled. It introduces two >> new annoying warnings. > > Sorry about that. I dropped the patch for now and here's a new pull request. > > I did test compile, and I would have noticed errors but I missed the warnings > (I blame the flu...). I need to improve my testing methods, but I couldn't > find > a way to add -Werror to kernel builds. You can turn it on per-directory, eg: diff --git a/drivers/video/fbdev/Makefile b/drivers/video/fbdev/Makefile index ee8c81405a7f..2ee96810d26d 100644 --- a/drivers/video/fbdev/Makefile +++ b/drivers/video/fbdev/Makefile @@ -4,6 +4,8 @@ # Each configuration option enables a list of files. +subdir-ccflags-y := -Werror + obj-y += core/ obj-$(CONFIG_FB_MACMODES) += macmodes.o cheers
Re: [RFC PATCH] xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
On 15/11/16 16:22, Alex Thorlton wrote: > On Tue, Nov 15, 2016 at 10:55:49AM +0100, Juergen Gross wrote: >> I'd go with the new error code. What about E2BIG or ENOSPC? >> >> I think the hypervisor should fill in the number of entries required >> in this case. >> >> In case nobody objects I can post patches for this purpose (both Xen >> and Linux). > > This sounds like a good solution to me. I think it's definitely more > appropriate than simply bumping up the size of xen_e820_map, especially > considering the fact that it's theoretically possible for the e820 map > generated by the hypercall to grow too large, even on a non-EFI machine, > where my change would have no effect. Well, it won't help with the current hypervisor, so bumping up the size of xen_e820_map will still be a good idea. I think using E820_X_MAX is okay since in the end xen_e820_map will be transferred into a struct e820map which can't hold more than E820_X_MAX entries (additional entries are ignored here, so this won't let the boot fail). Juergen
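The error-code proposal discussed above implies a simple guest-side retry idiom: call the hypercall with a guessed buffer size and, if it reports the buffer is too small (returning an error plus the required entry count), retry once with a buffer of that size. A mocked illustration — fake_memory_map_op() is a stand-in invented for this sketch, not the real XENMEM_memory_map interface:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct fake_e820_entry { unsigned long start, size; };

#define FAKE_MAP_ENTRIES 200	/* entries our mock "hypervisor" holds */

/* Mock hypercall: if the caller's buffer is too small, write back the
 * required count and fail with -E2BIG, as proposed in the thread. */
static int fake_memory_map_op(struct fake_e820_entry *buf, unsigned int *nr)
{
	(void)buf;	/* a real implementation would fill buf here */
	if (*nr < FAKE_MAP_ENTRIES) {
		*nr = FAKE_MAP_ENTRIES;
		return -E2BIG;
	}
	*nr = FAKE_MAP_ENTRIES;
	return 0;
}

/* Guest-side retry idiom: guess, then resize once if told to. */
static int fetch_map(unsigned int first_guess, unsigned int *got)
{
	unsigned int nr = first_guess;
	struct fake_e820_entry *buf = calloc(nr, sizeof(*buf));
	int rc = fake_memory_map_op(buf, &nr);

	if (rc == -E2BIG) {	/* hypervisor told us the needed size */
		free(buf);
		buf = calloc(nr, sizeof(*buf));
		rc = fake_memory_map_op(buf, &nr);
	}
	free(buf);
	*got = nr;
	return rc;
}
```

The bumped static xen_e820_map remains useful as the first guess, since current hypervisors would not report the required size at all.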
[PATCH v2 5/8] perf sched timehist: Add -w/--wakeups option
From: David Ahern The -w option is to show wakeup events with timehist. $ perf sched timehist -w timecpu task name b/n time sch delay run time [tid/pid](msec) (msec) (msec) --- -- - - - 2412598.429689 [0002] perf[7219] awakened: perf[7220] 2412598.429696 [0009] 0.000 0.000 0.000 2412598.429767 [0002] perf[7219]0.000 0.000 0.000 2412598.429780 [0009] perf[7220] awakened: migration/9[53] ... Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/builtin-sched.c | 58 ++ 1 file changed, 54 insertions(+), 4 deletions(-) diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c index 1e7d81ad5ec6..8fb7bcc2cb76 100644 --- a/tools/perf/builtin-sched.c +++ b/tools/perf/builtin-sched.c @@ -198,6 +198,7 @@ struct perf_sched { /* options for timehist command */ boolsummary; boolsummary_only; + boolshow_wakeups; u64 skipped_samples; }; @@ -1807,7 +1808,8 @@ static void timehist_header(void) printf("\n"); } -static void timehist_print_sample(struct perf_sample *sample, +static void timehist_print_sample(struct perf_sched *sched, + struct perf_sample *sample, struct thread *thread) { struct thread_runtime *tr = thread__priv(thread); @@ -1821,6 +1823,10 @@ static void timehist_print_sample(struct perf_sample *sample, print_sched_time(tr->dt_wait, 6); print_sched_time(tr->dt_delay, 6); print_sched_time(tr->dt_run, 6); + + if (sched->show_wakeups) + printf(" %-*s", comm_width, ""); + printf("\n"); } @@ -2028,12 +2034,44 @@ static bool timehist_skip_sample(struct perf_sched *sched, return rc; } -static int timehist_sched_wakeup_event(struct perf_tool *tool __maybe_unused, +static void timehist_print_wakeup_event(struct perf_sched *sched, + struct perf_sample *sample, + struct machine *machine, + struct thread *awakened) +{ + struct thread *thread; + char tstr[64]; + + thread = machine__findnew_thread(machine, sample->pid, sample->tid); + if (thread == NULL) + return; + + /* show wakeup unless both awakee and awaker are filtered */ + if 
(timehist_skip_sample(sched, thread) && + timehist_skip_sample(sched, awakened)) { + return; + } + + timestamp__scnprintf_usec(sample->time, tstr, sizeof(tstr)); + printf("%15s [%04d] ", tstr, sample->cpu); + + printf(" %-*s ", comm_width, timehist_get_commstr(thread)); + + /* dt spacer */ + printf(" %9s %9s %9s ", "", "", ""); + + printf("awakened: %s", timehist_get_commstr(awakened)); + + printf("\n"); +} + +static int timehist_sched_wakeup_event(struct perf_tool *tool, union perf_event *event __maybe_unused, struct perf_evsel *evsel, struct perf_sample *sample, struct machine *machine) { + struct perf_sched *sched = container_of(tool, struct perf_sched, tool); struct thread *thread; struct thread_runtime *tr = NULL; /* want pid of awakened task not pid in sample */ @@ -2050,6 +2088,10 @@ static int timehist_sched_wakeup_event(struct perf_tool *tool __maybe_unused, if (tr->ready_to_run == 0) tr->ready_to_run = sample->time; + /* show wakeups if requested */ + if (sched->show_wakeups) + timehist_print_wakeup_event(sched, sample, machine, thread); + return 0; } @@ -2059,12 +2101,12 @@ static int timehist_sched_change_event(struct perf_tool *tool, struct perf_sample *sample, struct machine *machine) { + struct perf_sched *sched = container_of(tool, struct perf_sched, tool); struct addr_location al; struct thread *thread; struct thread_runtime *tr = NULL; u64 tprev; int rc = 0; - struct perf_sched *sched = container_of(tool, struct perf_sched, tool); if (machine__resolve(machine, &al, sample) < 0) { pr_err("problem processing %d event. skipping it\n", @@ -2092,7 +2134,7 @@ static int timehist_sched_change_event(struct perf_tool *tool, timehist_update_runtime_stats(tr, sample->time, tprev); if (!sched->summary_only
Re: linux-next: unable to fetch the amlogic tree
Hi Stephen, On Tue, Nov 15, 2016 at 2:18 PM, Stephen Rothwell wrote: > Hi all, > > Fetching the amlogic tree > (git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-amlogic.git#for-next) > produces this error: > > fatal: Couldn't find remote ref refs/heads/for-next Hmm, must've been bad timing. I was updating branches much earlier, but maybe your mirror didn't have it in time. Seems to be fully mirrored now: https://git.kernel.org/cgit/linux/kernel/git/khilman/linux-amlogic.git/log/?h=for-next > I will continue to use the previously fetched version of this tree. Should be fine for today. Thanks, Kevin
[PATCH v4 5/6] usb: dwc3: use bus->sysdev for DMA configuration
From: Arnd Bergmann The dma ops for dwc3 devices are not set properly. So, use a physical device sysdev, which will be inherited from parent, to set the hardware / firmware parameters like dma. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash Signed-off-by: Felipe Balbi Tested-by: Baolin Wang --- Changes in v4: - removed the ifdefs for pci - made the sysdev as a device property - phy create lookup take up the correct device. Changes in v3: - No update Changes in v2: - integrate dwc3 driver changes together drivers/usb/dwc3/core.c | 27 ++- drivers/usb/dwc3/core.h | 3 +++ drivers/usb/dwc3/dwc3-pci.c | 11 +++ drivers/usb/dwc3/ep0.c | 8 drivers/usb/dwc3/gadget.c | 37 +++-- drivers/usb/dwc3/host.c | 16 ++-- 6 files changed, 57 insertions(+), 45 deletions(-) diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c index fea4469..cf37d3e 100644 --- a/drivers/usb/dwc3/core.c +++ b/drivers/usb/dwc3/core.c @@ -229,7 +229,7 @@ static void dwc3_frame_length_adjustment(struct dwc3 *dwc) static void dwc3_free_one_event_buffer(struct dwc3 *dwc, struct dwc3_event_buffer *evt) { - dma_free_coherent(dwc->dev, evt->length, evt->buf, evt->dma); + dma_free_coherent(dwc->sysdev, evt->length, evt->buf, evt->dma); } /** @@ -251,7 +251,7 @@ static struct dwc3_event_buffer *dwc3_alloc_one_event_buffer(struct dwc3 *dwc, evt->dwc= dwc; evt->length = length; - evt->buf= dma_alloc_coherent(dwc->dev, length, + evt->buf= dma_alloc_coherent(dwc->sysdev, length, &evt->dma, GFP_KERNEL); if (!evt->buf) return ERR_PTR(-ENOMEM); @@ -370,11 +370,11 @@ static int dwc3_setup_scratch_buffers(struct dwc3 *dwc) if (!WARN_ON(dwc->scratchbuf)) return 0; - scratch_addr = dma_map_single(dwc->dev, dwc->scratchbuf, + scratch_addr = dma_map_single(dwc->sysdev, dwc->scratchbuf, dwc->nr_scratch * DWC3_SCRATCHBUF_SIZE, DMA_BIDIRECTIONAL); - if (dma_mapping_error(dwc->dev, scratch_addr)) { - dev_err(dwc->dev, "failed to map scratch buffer\n"); + if (dma_mapping_error(dwc->sysdev, scratch_addr)) { + 
dev_err(dwc->sysdev, "failed to map scratch buffer\n"); ret = -EFAULT; goto err0; } @@ -398,7 +398,7 @@ static int dwc3_setup_scratch_buffers(struct dwc3 *dwc) return 0; err1: - dma_unmap_single(dwc->dev, dwc->scratch_addr, dwc->nr_scratch * + dma_unmap_single(dwc->sysdev, dwc->scratch_addr, dwc->nr_scratch * DWC3_SCRATCHBUF_SIZE, DMA_BIDIRECTIONAL); err0: @@ -417,7 +417,7 @@ static void dwc3_free_scratch_buffers(struct dwc3 *dwc) if (!WARN_ON(dwc->scratchbuf)) return; - dma_unmap_single(dwc->dev, dwc->scratch_addr, dwc->nr_scratch * + dma_unmap_single(dwc->sysdev, dwc->scratch_addr, dwc->nr_scratch * DWC3_SCRATCHBUF_SIZE, DMA_BIDIRECTIONAL); kfree(dwc->scratchbuf); } @@ -986,6 +986,13 @@ static int dwc3_probe(struct platform_device *pdev) dwc->dr_mode = usb_get_dr_mode(dev); dwc->hsphy_mode = of_usb_get_phy_mode(dev->of_node); + dwc->sysdev_is_parent = device_property_read_bool(dev, + "linux,sysdev_is_parent"); + if (dwc->sysdev_is_parent) + dwc->sysdev = dwc->dev->parent; + else + dwc->sysdev = dwc->dev; + dwc->has_lpm_erratum = device_property_read_bool(dev, "snps,has-lpm-erratum"); device_property_read_u8(dev, "snps,lpm-nyet-threshold", @@ -1050,12 +1057,6 @@ static int dwc3_probe(struct platform_device *pdev) spin_lock_init(&dwc->lock); - if (!dev->dma_mask) { - dev->dma_mask = dev->parent->dma_mask; - dev->dma_parms = dev->parent->dma_parms; - dma_set_coherent_mask(dev, dev->parent->coherent_dma_mask); - } - pm_runtime_set_active(dev); pm_runtime_use_autosuspend(dev); pm_runtime_set_autosuspend_delay(dev, DWC3_DEFAULT_AUTOSUSPEND_DELAY); diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h index 6b60e42..6999e28 100644 --- a/drivers/usb/dwc3/core.h +++ b/drivers/usb/dwc3/core.h @@ -798,6 +798,7 @@ struct dwc3_scratchpad_array { * @ep0_bounced: true when we used bounce buffer * @ep0_expect_in: true when we expect a DATA IN transfer * @has_hibernation: true when dwc3 was configured with Hibernation + * @sysdev_is_parent: true when dwc3 device has a 
parent driver * @has_lpm_erratum: true when core was configured with LPM Erratum. Note that * there's now way for software to detect this in runtime. * @is_utmi_l1_suspend: the core as
RE: [PATCH] btusb: fix zero BD address problem during stress test
Hi Marcel, > From: Amitkumar Karwar [mailto:akar...@marvell.com] > Sent: Tuesday, October 18, 2016 6:27 PM > To: linux-blueto...@vger.kernel.org > Cc: mar...@holtmann.org; linux-kernel@vger.kernel.org; Cathy Luo; > Nishant Sarmukadam; Ganapathi Bhat; Amitkumar Karwar > Subject: [PATCH] btusb: fix zero BD address problem during stress test > > From: Ganapathi Bhat > > We came across a corner case issue during reboot stress test in which > hciconfig shows BD address is all zero. Reason is we don't get response > for HCI RESET command during initialization > > The issue is tracked to a race where USB subsystem calls > btusb_intr_complete() to deliver a data(NOOP frame) received on > interrupt endpoint. HCI_RUNNING flag is not yet set by bluetooth > subsystem. So we ignore that frame and return. > > As we missed to resubmit the buffer to interrupt endpoint in this case, > we don't get response for BT reset command downloaded after this. > > This patch handles the corner case to resolve zero BD address problem. > > Signed-off-by: Ganapathi Bhat > Signed-off-by: Amitkumar Karwar > --- > drivers/bluetooth/btusb.c | 5 + > 1 file changed, 1 insertion(+), 4 deletions(-) > > diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c > index 811f9b9..b5596ac 100644 > --- a/drivers/bluetooth/btusb.c > +++ b/drivers/bluetooth/btusb.c > @@ -607,10 +607,7 @@ static void btusb_intr_complete(struct urb *urb) > BT_DBG("%s urb %p status %d count %d", hdev->name, urb, urb- > >status, > urb->actual_length); > > - if (!test_bit(HCI_RUNNING, &hdev->flags)) > - return; > - > - if (urb->status == 0) { > + if (urb->status == 0 && test_bit(HCI_RUNNING, &hdev->flags)) { > hdev->stat.byte_rx += urb->actual_length; > > if (btusb_recv_intr(data, urb->transfer_buffer, Did you get a chance to check this? Please let us know if you have any review comments. Regards, Amitkumar
[PATCH v4 6/6] usb: dwc3: Do not set dma coherent mask
From: Arnd Bergmann The dma mask is correctly set up by the DT probe function, no need to override it any more. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash --- Changes in v4: - No update Changes in v3: - No update Changes in v2: - club the cleanup for dma coherent mask for device drivers/usb/dwc3/dwc3-exynos.c | 10 -- drivers/usb/dwc3/dwc3-st.c | 1 - 2 files changed, 11 deletions(-) diff --git a/drivers/usb/dwc3/dwc3-exynos.c b/drivers/usb/dwc3/dwc3-exynos.c index 2f1fb7e..e27899b 100644 --- a/drivers/usb/dwc3/dwc3-exynos.c +++ b/drivers/usb/dwc3/dwc3-exynos.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include #include @@ -117,15 +116,6 @@ static int dwc3_exynos_probe(struct platform_device *pdev) if (!exynos) return -ENOMEM; - /* -* Right now device-tree probed devices don't get dma_mask set. -* Since shared usb code relies on it, set it here for now. -* Once we move to full device tree support this will vanish off. -*/ - ret = dma_coerce_mask_and_coherent(dev, DMA_BIT_MASK(32)); - if (ret) - return ret; - platform_set_drvdata(pdev, exynos); exynos->dev = dev; diff --git a/drivers/usb/dwc3/dwc3-st.c b/drivers/usb/dwc3/dwc3-st.c index aaaf256..dfbf464 100644 --- a/drivers/usb/dwc3/dwc3-st.c +++ b/drivers/usb/dwc3/dwc3-st.c @@ -219,7 +219,6 @@ static int st_dwc3_probe(struct platform_device *pdev) if (IS_ERR(regmap)) return PTR_ERR(regmap); - dma_set_coherent_mask(dev, dev->coherent_dma_mask); dwc3_data->dev = dev; dwc3_data->regmap = regmap; -- 2.1.0
[PATCH v4 1/6] usb: separate out sysdev pointer from usb_bus
From: Arnd Bergmann For xhci-hcd platform device, all the DMA parameters are not configured properly, notably dma ops for dwc3 devices. The idea here is that you pass in the parent of_node along with the child device pointer, so it would behave exactly like the parent already does. The difference is that it also handles all the other attributes besides the mask. sysdev will represent the physical device, as seen from firmware or bus. Splitting the usb_bus->controller field into the Linux-internal device (used for the sysfs hierarchy, for printks and for power management) and a new pointer (used for DMA, DT enumeration and phy lookup) probably covers all that we really need. Signed-off-by: Arnd Bergmann Signed-off-by: Sriram Dash Tested-by: Baolin Wang Cc: Felipe Balbi Cc: Grygorii Strashko Cc: Sinjan Kumar Cc: David Fisher Cc: Catalin Marinas Cc: "Thang Q. Nguyen" Cc: Yoshihiro Shimoda Cc: Stephen Boyd Cc: Bjorn Andersson Cc: Ming Lei Cc: Jon Masters Cc: Dann Frazier Cc: Peter Chen Cc: Leo Li --- Changes in v4: - No update Changes in v3: - use is_device_dma_capable() instead of directly accessing dma props.
Changes in v2: - Split the patch wrt driver drivers/usb/core/buffer.c | 12 ++-- drivers/usb/core/hcd.c| 48 --- drivers/usb/core/usb.c| 18 +- include/linux/usb.h | 1 + include/linux/usb/hcd.h | 3 +++ 5 files changed, 48 insertions(+), 34 deletions(-) diff --git a/drivers/usb/core/buffer.c b/drivers/usb/core/buffer.c index 98e39f9..a6cd44a 100644 --- a/drivers/usb/core/buffer.c +++ b/drivers/usb/core/buffer.c @@ -63,7 +63,7 @@ int hcd_buffer_create(struct usb_hcd *hcd) int i, size; if (!IS_ENABLED(CONFIG_HAS_DMA) || - (!hcd->self.controller->dma_mask && + (!is_device_dma_capable(hcd->self.sysdev) && !(hcd->driver->flags & HCD_LOCAL_MEM))) return 0; @@ -72,7 +72,7 @@ int hcd_buffer_create(struct usb_hcd *hcd) if (!size) continue; snprintf(name, sizeof(name), "buffer-%d", size); - hcd->pool[i] = dma_pool_create(name, hcd->self.controller, + hcd->pool[i] = dma_pool_create(name, hcd->self.sysdev, size, size, 0); if (!hcd->pool[i]) { hcd_buffer_destroy(hcd); @@ -127,7 +127,7 @@ void *hcd_buffer_alloc( /* some USB hosts just use PIO */ if (!IS_ENABLED(CONFIG_HAS_DMA) || - (!bus->controller->dma_mask && + (!is_device_dma_capable(bus->sysdev) && !(hcd->driver->flags & HCD_LOCAL_MEM))) { *dma = ~(dma_addr_t) 0; return kmalloc(size, mem_flags); @@ -137,7 +137,7 @@ void *hcd_buffer_alloc( if (size <= pool_max[i]) return dma_pool_alloc(hcd->pool[i], mem_flags, dma); } - return dma_alloc_coherent(hcd->self.controller, size, dma, mem_flags); + return dma_alloc_coherent(hcd->self.sysdev, size, dma, mem_flags); } void hcd_buffer_free( @@ -154,7 +154,7 @@ void hcd_buffer_free( return; if (!IS_ENABLED(CONFIG_HAS_DMA) || - (!bus->controller->dma_mask && + (!is_device_dma_capable(bus->sysdev) && !(hcd->driver->flags & HCD_LOCAL_MEM))) { kfree(addr); return; @@ -166,5 +166,5 @@ void hcd_buffer_free( return; } } - dma_free_coherent(hcd->self.controller, size, addr, dma); + dma_free_coherent(hcd->self.sysdev, size, addr, dma); } diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c 
index 479e223..f8feb08 100644 --- a/drivers/usb/core/hcd.c +++ b/drivers/usb/core/hcd.c @@ -1073,6 +1073,7 @@ static void usb_deregister_bus (struct usb_bus *bus) static int register_root_hub(struct usb_hcd *hcd) { struct device *parent_dev = hcd->self.controller; + struct device *sysdev = hcd->self.sysdev; struct usb_device *usb_dev = hcd->self.root_hub; const int devnum = 1; int retval; @@ -1119,7 +1120,7 @@ static int register_root_hub(struct usb_hcd *hcd) /* Did the HC die before the root hub was registered? */ if (HCD_DEAD(hcd)) usb_hc_died (hcd); /* This time clean up */ - usb_dev->dev.of_node = parent_dev->of_node; + usb_dev->dev.of_node = sysdev->of_node; } mutex_unlock(&usb_bus_idr_lock); @@ -1465,19 +1466,19 @@ void usb_hcd_unmap_urb_for_dma(struct usb_hcd *hcd, struct urb *urb) dir = usb_urb_dir_in(urb) ? DMA_FROM_DEVICE : DMA_TO_DEVICE; if (IS_ENABLED(CONFIG_HAS_DMA) && (urb->transfer_flags & URB_DMA_MAP_SG)) - dma_unmap_sg(hcd->self.controller, + dma_unmap_sg(hcd->self.sysdev,
[PATCH v4 0/6] inherit dma configuration from parent dev
For xhci-hcd platform device, all the DMA parameters are not configured properly, notably dma ops for dwc3 devices. The idea here is that you pass in the parent of_node along with the child device pointer, so it would behave exactly like the parent already does. The difference is that it also handles all the other attributes besides the mask. Arnd Bergmann (6): usb: separate out sysdev pointer from usb_bus usb: chipidea: use bus->sysdev for DMA configuration usb: ehci: fsl: use bus->sysdev for DMA configuration usb: xhci: use bus->sysdev for DMA configuration usb: dwc3: use bus->sysdev for DMA configuration usb: dwc3: Do not set dma coherent mask drivers/usb/chipidea/core.c| 3 --- drivers/usb/chipidea/host.c| 3 ++- drivers/usb/chipidea/udc.c | 10 + drivers/usb/core/buffer.c | 12 +-- drivers/usb/core/hcd.c | 48 +- drivers/usb/core/usb.c | 18 drivers/usb/dwc3/core.c| 27 drivers/usb/dwc3/core.h| 3 +++ drivers/usb/dwc3/dwc3-exynos.c | 10 - drivers/usb/dwc3/dwc3-pci.c| 11 ++ drivers/usb/dwc3/dwc3-st.c | 1 - drivers/usb/dwc3/ep0.c | 8 +++ drivers/usb/dwc3/gadget.c | 37 drivers/usb/dwc3/host.c| 16 ++ drivers/usb/host/ehci-fsl.c| 4 ++-- drivers/usb/host/xhci-mem.c| 12 +-- drivers/usb/host/xhci-plat.c | 33 +++-- drivers/usb/host/xhci.c| 15 + include/linux/usb.h| 1 + include/linux/usb/hcd.h| 3 +++ 20 files changed, 158 insertions(+), 117 deletions(-) -- 2.1.0
Re: [PATCH] drm/bridge: analogix_dp: return error if transfer none byte
On 11/15/2016 10:39 PM, Sean Paul wrote: On Thu, Nov 3, 2016 at 3:17 AM, Jianqun Xu wrote: Reference from drm_dp_aux description (about transfer): Upon success, the implementation should return the number of payload bytes that were transferred, or a negative error-code on failure. Helpers propagate errors from the .transfer() function, with the exception of the -EBUSY error, which causes a transaction to be retried. On a short, helpers will return -EPROTO to make it simpler to check for failure. The analogix_dp_transfer will return num_transferred, but if no bytes were transferred the return value will be 0, which means success; we should return an error code when no bytes were transferred. for (retry = 0; retry < 32; retry++) { err = aux->transfer(aux, &msg); if (err < 0) { if (err == -EBUSY) continue; goto unlock; } } Cc: zain wang Cc: Sean Paul Signed-off-by: Jianqun Xu Reviewed-by: Sean Paul queued to drm-misc. --- drivers/gpu/drm/bridge/analogix/analogix_dp_reg.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/bridge/analogix/analogix_dp_reg.c b/drivers/gpu/drm/bridge/analogix/analogix_dp_reg.c index cd37ac0..303083a 100644 --- a/drivers/gpu/drm/bridge/analogix/analogix_dp_reg.c +++ b/drivers/gpu/drm/bridge/analogix/analogix_dp_reg.c @@ -1162,5 +1162,5 @@ ssize_t analogix_dp_transfer(struct analogix_dp_device *dp, (msg->request & ~DP_AUX_I2C_MOT) == DP_AUX_NATIVE_READ) msg->reply = DP_AUX_NATIVE_REPLY_ACK; - return num_transferred; + return num_transferred > 0 ? num_transferred : -EBUSY; } -- 1.9.1 -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v2 2/8] perf tools: Support printing callchains with arrows
The EVSEL__PRINT_CALLCHAIN_ARROW options can be used to print callchains with arrows for readability. It will be used 'sched timehist' command like below: __schedule <- schedule <- schedule_timeout <- rcu_gp_kthread <- kthread <- ret_from_fork __schedule <- schedule <- schedule_timeout <- rcu_gp_kthread <- kthread <- ret_from_fork __schedule <- schedule <- worker_thread <- kthread <- ret_from_fork Suggested-by: Ingo Molnar Signed-off-by: Namhyung Kim --- tools/perf/util/evsel.h | 1 + tools/perf/util/evsel_fprintf.c | 6 ++ 2 files changed, 7 insertions(+) diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h index 8cd7cd227483..27fa3a343577 100644 --- a/tools/perf/util/evsel.h +++ b/tools/perf/util/evsel.h @@ -391,6 +391,7 @@ int perf_evsel__fprintf(struct perf_evsel *evsel, #define EVSEL__PRINT_ONELINE (1<<4) #define EVSEL__PRINT_SRCLINE (1<<5) #define EVSEL__PRINT_UNKNOWN_AS_ADDR (1<<6) +#define EVSEL__PRINT_CALLCHAIN_ARROW (1<<7) struct callchain_cursor; diff --git a/tools/perf/util/evsel_fprintf.c b/tools/perf/util/evsel_fprintf.c index ccb602397b60..53bb614feafb 100644 --- a/tools/perf/util/evsel_fprintf.c +++ b/tools/perf/util/evsel_fprintf.c @@ -108,7 +108,9 @@ int sample__fprintf_callchain(struct perf_sample *sample, int left_alignment, int print_oneline = print_opts & EVSEL__PRINT_ONELINE; int print_srcline = print_opts & EVSEL__PRINT_SRCLINE; int print_unknown_as_addr = print_opts & EVSEL__PRINT_UNKNOWN_AS_ADDR; + int print_arrow = print_opts & EVSEL__PRINT_CALLCHAIN_ARROW; char s = print_oneline ? 
' ' : '\t'; + bool first = true; if (sample->callchain) { struct addr_location node_al; @@ -124,6 +126,9 @@ int sample__fprintf_callchain(struct perf_sample *sample, int left_alignment, printed += fprintf(fp, "%-*.*s", left_alignment, left_alignment, " "); + if (print_arrow && !first) + printed += fprintf(fp, " <-"); + if (print_ip) printed += fprintf(fp, "%c%16" PRIx64, s, node->ip); @@ -158,6 +163,7 @@ int sample__fprintf_callchain(struct perf_sample *sample, int left_alignment, printed += fprintf(fp, "\n"); callchain_cursor_advance(cursor); + first = false; } } -- 2.10.1
[PATCH v2 1/8] perf symbol: Print symbol offsets conditionally
The __symbol__fprintf_symname_offs() always shows symbol offsets. So there's no difference between 'perf script -F ip,sym' and 'perf script -F ip,sym,symoff'. I don't think it's a desired behavior.. Signed-off-by: Namhyung Kim --- tools/perf/util/evsel_fprintf.c | 6 -- tools/perf/util/symbol.h | 3 ++- tools/perf/util/symbol_fprintf.c | 11 ++- 3 files changed, 12 insertions(+), 8 deletions(-) diff --git a/tools/perf/util/evsel_fprintf.c b/tools/perf/util/evsel_fprintf.c index 662a0a6182e7..ccb602397b60 100644 --- a/tools/perf/util/evsel_fprintf.c +++ b/tools/perf/util/evsel_fprintf.c @@ -137,7 +137,8 @@ int sample__fprintf_callchain(struct perf_sample *sample, int left_alignment, if (print_symoffset) { printed += __symbol__fprintf_symname_offs(node->sym, &node_al, - print_unknown_as_addr, fp); + print_unknown_as_addr, + true, fp); } else { printed += __symbol__fprintf_symname(node->sym, &node_al, print_unknown_as_addr, fp); @@ -188,7 +189,8 @@ int sample__fprintf_sym(struct perf_sample *sample, struct addr_location *al, printed += fprintf(fp, " "); if (print_symoffset) { printed += __symbol__fprintf_symname_offs(al->sym, al, - print_unknown_as_addr, fp); + print_unknown_as_addr, + true, fp); } else { printed += __symbol__fprintf_symname(al->sym, al, print_unknown_as_addr, fp); diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h index 2d0a905c879a..dec7e2d44885 100644 --- a/tools/perf/util/symbol.h +++ b/tools/perf/util/symbol.h @@ -282,7 +282,8 @@ int symbol__annotation_init(void); struct symbol *symbol__new(u64 start, u64 len, u8 binding, const char *name); size_t __symbol__fprintf_symname_offs(const struct symbol *sym, const struct addr_location *al, - bool unknown_as_addr, FILE *fp); + bool unknown_as_addr, + bool print_offsets, FILE *fp); size_t symbol__fprintf_symname_offs(const struct symbol *sym, const struct addr_location *al, FILE *fp); size_t __symbol__fprintf_symname(const struct symbol *sym, diff --git a/tools/perf/util/symbol_fprintf.c 
b/tools/perf/util/symbol_fprintf.c index a680bdaa65dc..7c6b33e8e2d2 100644 --- a/tools/perf/util/symbol_fprintf.c +++ b/tools/perf/util/symbol_fprintf.c @@ -15,14 +15,15 @@ size_t symbol__fprintf(struct symbol *sym, FILE *fp) size_t __symbol__fprintf_symname_offs(const struct symbol *sym, const struct addr_location *al, - bool unknown_as_addr, FILE *fp) + bool unknown_as_addr, + bool print_offsets, FILE *fp) { unsigned long offset; size_t length; if (sym && sym->name) { length = fprintf(fp, "%s", sym->name); - if (al) { + if (al && print_offsets) { if (al->addr < sym->end) offset = al->addr - sym->start; else @@ -40,19 +41,19 @@ size_t symbol__fprintf_symname_offs(const struct symbol *sym, const struct addr_location *al, FILE *fp) { - return __symbol__fprintf_symname_offs(sym, al, false, fp); + return __symbol__fprintf_symname_offs(sym, al, false, true, fp); } size_t __symbol__fprintf_symname(const struct symbol *sym, const struct addr_location *al, bool unknown_as_addr, FILE *fp) { - return __symbol__fprintf_symname_offs(sym, al, unknown_as_addr, fp); + return __symbol__fprintf_symname_offs(sym, al, unknown_as_addr, false, fp); } size_t symbol__fprintf_symname(const struct symbol *sym, FILE *fp) { - return __symbol__fprintf_symname_offs(sym, NULL, false, fp); + return __symbol__fprintf_symname_offs(sym, NULL, false, false, fp); } size_t dso__fprintf_symbols_by_name(struct d
[PATCH v2 8/8] perf sched: Add documentation for timehist options
From: David Ahern Add entry to perf-sched documentation for timehist command and its options. Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/Documentation/perf-sched.txt | 46 +++-- 1 file changed, 44 insertions(+), 2 deletions(-) diff --git a/tools/perf/Documentation/perf-sched.txt b/tools/perf/Documentation/perf-sched.txt index 1cc08cc47ac5..a0344643f008 100644 --- a/tools/perf/Documentation/perf-sched.txt +++ b/tools/perf/Documentation/perf-sched.txt @@ -8,11 +8,11 @@ perf-sched - Tool to trace/measure scheduler properties (latencies) SYNOPSIS [verse] -'perf sched' {record|latency|map|replay|script} +'perf sched' {record|latency|map|replay|script|timehist} DESCRIPTION --- -There are five variants of perf sched: +There are several variants of perf sched: 'perf sched record ' to record the scheduling events of an arbitrary workload. @@ -36,6 +36,11 @@ There are five variants of perf sched: are running on a CPU. A '*' denotes the CPU that had the event, and a dot signals an idle CPU. + 'perf sched timehist' to show context-switching analysis with times + between schedule-in, schedule delay, and run time. If callchains are + present stack trace is dumped at the end of the line. A summary of + run times can be shown as well. + OPTIONS --- -i:: @@ -66,6 +71,43 @@ OPTIONS for 'perf sched map' --color-pids:: Highlight the given pids. +OPTIONS for 'perf sched timehist' +- +-k:: +--vmlinux=:: +vmlinux pathname + +--kallsyms=:: +kallsyms pathname + +-s:: +--summary:: +Show only a summary of scheduling by thread with min, max, and average +run times (in sec) and relative stddev. + +-S:: +--with-summary:: +Show all scheduling events followed by a summary by thread with min, +max, and average run times (in sec) and relative stddev. + +-w:: +--wakeups:: +Show wakeup events. + +--call-graph:: +Display call chains. Default is on. Use --no-call-graph to suppress + +--max-stack:: +Set the stack depth limit when showing the callchains. 
Default: 5 + +--symfs=:: +Look for files with symbols relative to this directory. + +-V:: +--cpu-visual:: +Add a visual that better emphasizes activity by cpu. Idle times +are denoted with 'i'; schedule events with an 's'. + SEE ALSO linkperf:perf-record[1] -- 2.10.1
[PATCHSET 0/7] perf sched: Introduce timehist command, again (v2)
Hello, This patchset is a rebased version of David's sched timehist work [1]. I plan to improve perf sched command more and think that having timehist command before the work looks good. It seems David is busy these days, so I'm retrying it by myself. * changes in v2) - change name 'b/n time' to 'wait time' (Ingo) - show arrow between functions in the callchain (Ingo) - fix a bug in calculating initial run time This implements only basic feature and a few options. I just split the patch to make it easier to review and did some cosmetic changes. More patches will come later. The below is from the David's original description (w/ slight change): 8<- 'perf sched timehist' provides an analysis of scheduling events. Example usage: perf sched record -- sleep 1 perf sched timehist By default it shows the individual schedule events, including the time between sched-in events for the task, the task scheduling delay (time between wakeup and actually running) and run time for the task: time cpu task name[tid/pid] wait time sch delay run time - - - - 79371.874569 [11] gcc[31949] 0.014 0.000 1.148 79371.874591 [10] gcc[31951] 0.000 0.000 0.024 79371.874603 [10] migration/10[59] 3.350 0.004 0.011 79371.874604 [11]1.148 0.000 0.035 79371.874723 [05]0.016 0.000 1.383 79371.874746 [05] gcc[31949] 0.153 0.078 0.022 ... Times are in msec.usec. 
If callchains were recorded they are appended to the line with a default stack depth of 5: 79371.874569 [11] gcc[31949] 0.014 0.000 1.148 wait_for_completion_killable <- do_fork <- sys_vfork <- stub_vfork <- __vfork 79371.874591 [10] gcc[31951] 0.000 0.000 0.024 __cond_resched <- _cond_resched <- wait_for_completion <- stop_one_cpu <- sched_exec 79371.874603 [10] migration/10[59] 3.350 0.004 0.011 smpboot_thread_fn <- kthread <- ret_from_fork 79371.874604 [11]1.148 0.000 0.035 cpu_startup_entry <- start_secondary 79371.874723 [05]0.016 0.000 1.383 cpu_startup_entry <- start_secondary 79371.874746 [05] gcc[31949] 0.153 0.078 0.022 do_wait sys_wait4 <- system_call_fastpath <- __GI___waitpid --no-call-graph can be used to not show the callchains. --max-stack is used to control the number of frames shown (default of 5). -x/--excl options can be used to collapse redundant callchains to get more relevant data on screen. Similar to perf-trace -s and -S can be used to dump a statistical summary without or with events (respectively). Statistics include min run time, average run time and max run time. Stats are also shown for run time by cpu. The cpu-visual option provides a visual aid for sched switches by cpu: ... 79371.874569 [11]s gcc[31949] 0.014 0.000 1.148 79371.874591 [10] s gcc[31951] 0.000 0.000 0.024 79371.874603 [10] s migration/10[59]3.350 0.004 0.011 79371.874604 [11]i1.148 0.000 0.035 79371.874723 [05] i 0.016 0.000 1.383 79371.874746 [05] sgcc[31949] 0.153 0.078 0.022 ... 8<- This code is available at 'perf/timehist-v2' branch in my tree git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git Any feedback is welcomed. 
Thanks, Namhyung [1] https://lkml.org/lkml/2013/12/1/129 David Ahern (6): perf sched timehist: Introduce timehist command perf sched timehist: Add summary options perf sched timehist: Add -w/--wakeups option perf sched timehist: Add call graph options perf sched timehist: Add -V/--cpu-visual option perf sched: Add documentation for timehist options Namhyung Kim (2): perf symbol: Print symbol offsets conditionally perf tools: Support printing callchains with arrows tools/perf/Documentation/perf-sched.txt | 46 +- tools/perf/builtin-sched.c | 914 +++- tools/perf/util/evsel.h | 1 + tools/perf/util/evsel_fprintf.c | 12 +- tools/perf/util/symbol.h| 3 +- tools/perf/util/symbol_fprintf.c| 11 +- 6 files changed, 972 insertions(+), 15 deletions(-) -- 2.10.1
[PATCH v2 7/8] perf sched timehist: Add -V/--cpu-visual option
From: David Ahern The -V option provides a visual aid for sched switches by cpu: $ perf sched timehist -V timecpu 0123456789abc task name b/n time sch delay run time [tid/pid](msec) (msec) (msec) --- -- - - - - ... 2412598.429696 [0009] i 0.000 0.000 0.000 2412598.429767 [0002]sperf[7219]0.000 0.000 0.000 2412598.429783 [0009] s perf[7220]0.000 0.006 0.087 2412598.429794 [0010]i0.000 0.000 0.000 2412598.429795 [0009] s migration/9[53] 0.000 0.003 0.011 2412598.430370 [0010]ssleep[7220] 0.011 0.000 0.576 2412598.432584 [0003] i 0.000 0.000 0.000 ... Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/builtin-sched.c | 44 ++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c index 1f8731640809..829468defa07 100644 --- a/tools/perf/builtin-sched.c +++ b/tools/perf/builtin-sched.c @@ -201,6 +201,7 @@ struct perf_sched { boolsummary_only; boolshow_callchain; unsigned intmax_stack; + boolshow_cpu_visual; boolshow_wakeups; u64 skipped_samples; }; @@ -1783,10 +1784,23 @@ static char *timehist_get_commstr(struct thread *thread) return str; } -static void timehist_header(void) +static void timehist_header(struct perf_sched *sched) { + u32 ncpus = sched->max_cpu + 1; + u32 i, j; + printf("%15s %6s ", "time", "cpu"); + if (sched->show_cpu_visual) { + printf(" "); + for (i = 0, j = 0; i < ncpus; ++i) { + printf("%x", j++); + if (j > 15) + j = 0; + } + printf(" "); + } + printf(" %-20s %9s %9s %9s", "task name", "wait time", "sch delay", "run time"); @@ -1797,6 +1811,9 @@ static void timehist_header(void) */ printf("%15s %-6s ", "", ""); + if (sched->show_cpu_visual) + printf(" %*s ", ncpus, ""); + printf(" %-20s %9s %9s %9s\n", "[tid/pid]", "(msec)", "(msec)", "(msec)"); /* @@ -1804,6 +1821,9 @@ static void timehist_header(void) */ printf("%.15s %.6s ", graph_dotted_line, graph_dotted_line); + if (sched->show_cpu_visual) + printf(" %.*s ", ncpus, graph_dotted_line); + printf(" %.20s %.9s %.9s 
%.9s", graph_dotted_line, graph_dotted_line, graph_dotted_line, graph_dotted_line); @@ -1817,11 +1837,28 @@ static void timehist_print_sample(struct perf_sched *sched, struct thread *thread) { struct thread_runtime *tr = thread__priv(thread); + u32 max_cpus = sched->max_cpu + 1; char tstr[64]; timestamp__scnprintf_usec(sample->time, tstr, sizeof(tstr)); printf("%15s [%04d] ", tstr, sample->cpu); + if (sched->show_cpu_visual) { + u32 i; + char c; + + printf(" "); + for (i = 0; i < max_cpus; ++i) { + /* flag idle times with 'i'; others are sched events */ + if (i == sample->cpu) + c = (thread->tid == 0) ? 'i' : 's'; + else + c = ' '; + printf("%c", c); + } + printf(" "); + } + printf(" %-*s ", comm_width, timehist_get_commstr(thread)); print_sched_time(tr->dt_wait, 6); @@ -2095,6 +2132,8 @@ static void timehist_print_wakeup_event(struct perf_sched *sched, timestamp__scnprintf_usec(sample->time, tstr, sizeof(tstr)); printf("%15s [%04d] ", tstr, sample->cpu); + if (sched->show_cpu_visual) + printf(" %*s ", sched->max_cpu + 1, ""); printf(" %-*s ", comm_width, timehist_get_commstr(thread)); @@ -2458,7 +2497,7 @@ static int perf_sched__timehist(struct perf_sched *sched) sched->summary = sched->summary_only; if (!sched->summary_only) - timehist_header(); + timehist_header(sched); err = perf_session__process_events(session); if (err) { @@ -2842,6 +2881,7 @@ int cmd_sched(int argc, const char **argv, const char *prefix __maybe_unused) OPT_BOOLEAN('S', "with-summary", &sched.summary,
[PATCH v2 3/8] perf sched timehist: Introduce timehist command
From: David Ahern 'perf sched timehist' provides an analysis of scheduling events. Example usage: perf sched record -- sleep 1 perf sched timehist By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task: timecpu task name wait time sch delay run time [tid/pid](msec) (msec) (msec) -- -- - - - 79371.874569 [0011] gcc[31949]0.014 0.000 1.148 79371.874591 [0010] gcc[31951]0.000 0.000 0.024 79371.874603 [0010] migration/10[59] 3.350 0.004 0.011 79371.874604 [0011] 1.148 0.000 0.035 79371.874723 [0005] 0.016 0.000 1.383 79371.874746 [0005] gcc[31949]0.153 0.078 0.022 ... Times are in msec.usec. Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/builtin-sched.c | 594 - 1 file changed, 589 insertions(+), 5 deletions(-) diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c index fb3441211e4b..c0ac0c9557e8 100644 --- a/tools/perf/builtin-sched.c +++ b/tools/perf/builtin-sched.c @@ -13,12 +13,14 @@ #include "util/cloexec.h" #include "util/thread_map.h" #include "util/color.h" +#include "util/stat.h" #include #include "util/trace-event.h" #include "util/debug.h" +#include #include #include @@ -194,6 +196,29 @@ struct perf_sched { struct perf_sched_map map; }; +/* per thread run time data */ +struct thread_runtime { + u64 last_time; /* time of previous sched in/out event */ + u64 dt_run; /* run time */ + u64 dt_wait;/* time between CPU access (off cpu) */ + u64 dt_delay; /* time between wakeup and sched-in */ + u64 ready_to_run; /* time of wakeup */ + + struct stats run_stats; + u64 total_run_time; +}; + +/* per event run time data */ +struct evsel_runtime { + u64 *last_time; /* time this event was last seen per cpu */ + u32 ncpu; /* highest cpu slot allocated */ +}; + +/* track idle times per cpu */ +static struct thread **idle_threads; +static int idle_max_cpu; 
+static char idle_comm[] = ""; + static u64 get_nsecs(void) { struct timespec ts; @@ -1654,6 +1679,546 @@ static int perf_sched__read_events(struct perf_sched *sched) return rc; } +/* + * scheduling times are printed as msec.usec + */ +static inline void print_sched_time(unsigned long long nsecs, int width) +{ + unsigned long msecs; + unsigned long usecs; + + msecs = nsecs / NSEC_PER_MSEC; + nsecs -= msecs * NSEC_PER_MSEC; + usecs = nsecs / NSEC_PER_USEC; + printf("%*lu.%03lu ", width, msecs, usecs); +} + +/* + * returns runtime data for event, allocating memory for it the + * first time it is used. + */ +static struct evsel_runtime *perf_evsel__get_runtime(struct perf_evsel *evsel) +{ + struct evsel_runtime *r = evsel->priv; + + if (r == NULL) { + r = zalloc(sizeof(struct evsel_runtime)); + evsel->priv = r; + } + + return r; +} + +/* + * save last time event was seen per cpu + */ +static void perf_evsel__save_time(struct perf_evsel *evsel, + u64 timestamp, u32 cpu) +{ + struct evsel_runtime *r = perf_evsel__get_runtime(evsel); + + if (r == NULL) + return; + + if ((cpu >= r->ncpu) || (r->last_time == NULL)) { + int i, n = __roundup_pow_of_two(cpu+1); + void *p = r->last_time; + + p = realloc(r->last_time, n * sizeof(u64)); + if (!p) + return; + + r->last_time = p; + for (i = r->ncpu; i < n; ++i) + r->last_time[i] = (u64) 0; + + r->ncpu = n; + } + + r->last_time[cpu] = timestamp; +} + +/* returns last time this event was seen on the given cpu */ +static u64 perf_evsel__get_time(struct perf_evsel *evsel, u32 cpu) +{ + struct evsel_runtime *r = perf_evsel__get_runtime(evsel); + + if ((r == NULL) || (r->last_time == NULL) || (cpu >= r->ncpu)) + return 0; + + return r->last_time[cpu]; +} + +static int comm_width = 20; + +static char *timehist_get_commstr(struct thread *thread) +{ + static char str[32]; + const char *comm = thread__comm_str(thread); + pid_t tid = thread->tid; + pid_t pid = thread->pid_; + int n; + + if (pid == 0) + n = scnprintf(str, sizeof(str), "%s", 
comm); + + else if (tid != pid) + n = scnprintf(str, siz
[PATCH v2 4/8] perf sched timehist: Add summary options
From: David Ahern The -s/--summary option is to show process runtime statistics. And the -S/--with-summary option is to show the stats with the normal output. $ perf sched timehist -s Runtime summary comm parent sched-in run-timemin-run avg-run max-run stddev (count) (msec) (msec) (msec) (msec) % - ksoftirqd/0[3] 2 20.011 0.004 0.005 0.006 14.87 rcu_preempt[7] 2 110.071 0.002 0.006 0.017 20.23 watchdog/0[11] 2 10.002 0.002 0.002 0.0020.00 watchdog/1[12] 2 10.004 0.004 0.004 0.0040.00 ... Terminated tasks: sleep[7220]7219 30.770 0.087 0.256 0.576 62.28 Idle stats: CPU 0 idle for 2352.006 msec CPU 1 idle for 2764.497 msec CPU 2 idle for 2998.229 msec CPU 3 idle for 2967.800 msec Total number of unique tasks: 52 Total number of context switches: 2532 Total run time (msec): 218.036 Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/builtin-sched.c | 166 +++-- 1 file changed, 160 insertions(+), 6 deletions(-) diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c index c0ac0c9557e8..1e7d81ad5ec6 100644 --- a/tools/perf/builtin-sched.c +++ b/tools/perf/builtin-sched.c @@ -194,6 +194,11 @@ struct perf_sched { bool force; bool skip_merge; struct perf_sched_map map; + + /* options for timehist command */ + boolsummary; + boolsummary_only; + u64 skipped_samples; }; /* per thread run time data */ @@ -2010,12 +2015,15 @@ static struct thread *timehist_get_thread(struct perf_sample *sample, return thread; } -static bool timehist_skip_sample(struct thread *thread) +static bool timehist_skip_sample(struct perf_sched *sched, +struct thread *thread) { bool rc = false; - if (thread__is_filtered(thread)) + if (thread__is_filtered(thread)) { rc = true; + sched->skipped_samples++; + } return rc; } @@ -2045,7 +2053,7 @@ static int timehist_sched_wakeup_event(struct perf_tool *tool __maybe_unused, return 0; } -static int timehist_sched_change_event(struct perf_tool *tool __maybe_unused, +static int timehist_sched_change_event(struct perf_tool *tool, union 
perf_event *event, struct perf_evsel *evsel, struct perf_sample *sample, @@ -2056,6 +2064,7 @@ static int timehist_sched_change_event(struct perf_tool *tool __maybe_unused, struct thread_runtime *tr = NULL; u64 tprev; int rc = 0; + struct perf_sched *sched = container_of(tool, struct perf_sched, tool); if (machine__resolve(machine, &al, sample) < 0) { pr_err("problem processing %d event. skipping it\n", @@ -2070,7 +2079,7 @@ static int timehist_sched_change_event(struct perf_tool *tool __maybe_unused, goto out; } - if (timehist_skip_sample(thread)) + if (timehist_skip_sample(sched, thread)) goto out; tr = thread__get_runtime(thread); @@ -2082,7 +2091,8 @@ static int timehist_sched_change_event(struct perf_tool *tool __maybe_unused, tprev = perf_evsel__get_time(evsel, sample->cpu); timehist_update_runtime_stats(tr, sample->time, tprev); - timehist_print_sample(sample, thread); + if (!sched->summary_only) + timehist_print_sample(sample, thread); out: if (tr) { @@ -2122,6 +2132,131 @@ static int process_lost(struct perf_tool *tool __maybe_unused, } +static void print_thread_runtime(struct thread *t, +struct thread_runtime *r) +{ + double mean = avg_stats(&r->run_stats); + float stddev; + + printf("%*s %5d %9" PRIu64 " ", + comm_width, timehist_get_commstr(t), t->ppid, + (u64) r->run_stats.n); + + print_sched_time(r->total_run_time, 8); + stddev = rel_stddev_stats(stddev_stats(&r->run_stats), mean); + print_sched_time(r->run_stats.min, 6); + printf(" "); + print_sched_time((u64) mean, 6); + printf(" "); + print_sched_time(r->run_stats.max, 6); + printf(" "); + printf("%5.2f", stddev); + printf("\n"); +} + +struct total_run_stats { + u64 sched_count; +
[PATCH v2 6/8] perf sched timehist: Add call graph options
From: David Ahern If callchains were recorded they are appended to the line with a default stack depth of 5: 79371.874569 [0011] gcc[31949]0.0140.0001.148 wait_for_completion_killable <- do_fork <- sys_vfork <- stub_vfork <- __vfork 79371.874591 [0010] gcc[31951]0.0000.0000.024 __cond_resched <- _cond_resched <- wait_for_completion <- stop_one_cpu <- sched_exec 79371.874603 [0010] migration/10[59] 3.3500.0040.011 smpboot_thread_fn <- kthread <- ret_from_fork 79371.874604 [0011] 1.1480.0000.035 cpu_startup_entry <- start_secondary 79371.874723 [0005] 0.0160.0001.383 cpu_startup_entry <- start_secondary 79371.874746 [0005] gcc[31949]0.1530.0780.022 do_wait sys_wait4 <- system_call_fastpath <- __GI___waitpid --no-call-graph can be used to not show the callchains. --max-stack is used to control the number of frames shown (default of 5). -x/--excl options can be used to collapse redundant callchains to get more relevant data on screen. Signed-off-by: David Ahern Signed-off-by: Namhyung Kim --- tools/perf/builtin-sched.c | 88 ++ 1 file changed, 82 insertions(+), 6 deletions(-) diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c index 8fb7bcc2cb76..1f8731640809 100644 --- a/tools/perf/builtin-sched.c +++ b/tools/perf/builtin-sched.c @@ -14,6 +14,7 @@ #include "util/thread_map.h" #include "util/color.h" #include "util/stat.h" +#include "util/callchain.h" #include #include "util/trace-event.h" @@ -198,6 +199,8 @@ struct perf_sched { /* options for timehist command */ boolsummary; boolsummary_only; + boolshow_callchain; + unsigned intmax_stack; boolshow_wakeups; u64 skipped_samples; }; @@ -1810,6 +1813,7 @@ static void timehist_header(void) static void timehist_print_sample(struct perf_sched *sched, struct perf_sample *sample, + struct addr_location *al, struct thread *thread) { struct thread_runtime *tr = thread__priv(thread); @@ -1827,6 +1831,18 @@ static void timehist_print_sample(struct perf_sched *sched, if (sched->show_wakeups) printf(" %-*s", 
comm_width, ""); + if (thread->tid == 0) + goto out; + + if (sched->show_callchain) + printf(" "); + + sample__fprintf_sym(sample, al, 0, + EVSEL__PRINT_SYM | EVSEL__PRINT_ONELINE | + EVSEL__PRINT_CALLCHAIN_ARROW, + &callchain_cursor, stdout); + +out: printf("\n"); } @@ -1878,9 +1894,14 @@ static void timehist_update_runtime_stats(struct thread_runtime *r, r->total_run_time += r->dt_run; } -static bool is_idle_sample(struct perf_sample *sample, - struct perf_evsel *evsel) +static bool is_idle_sample(struct perf_sched *sched, + struct perf_sample *sample, + struct perf_evsel *evsel, + struct machine *machine) { + struct thread *thread; + struct callchain_cursor *cursor = &callchain_cursor; + /* pid 0 == swapper == idle task */ if (sample->pid == 0) return true; @@ -1889,6 +1910,25 @@ static bool is_idle_sample(struct perf_sample *sample, if (perf_evsel__intval(evsel, sample, "prev_pid") == 0) return true; } + + /* want main thread for process - has maps */ + thread = machine__findnew_thread(machine, sample->pid, sample->pid); + if (thread == NULL) { + pr_debug("Failed to get thread for pid %d.\n", sample->pid); + return false; + } + + if (!symbol_conf.use_callchain || sample->callchain == NULL) + return false; + + if (thread__resolve_callchain(thread, cursor, evsel, sample, + NULL, NULL, sched->max_stack) != 0) { + if (verbose) + error("Failed to resolve callchain. Skipping\n"); + + return false; + } + callchain_cursor_commit(cursor); return false; } @@ -1999,13 +2039,14 @@ static struct thread_runtime *thread__get_runtime(struct thread *thread) return tr; } -static struct thread *timehist_get_thread(struct perf_sample *sample, +static struct thread *timehist_get_thread(struct perf_sched *sched, + struct perf_sample *sample, struct machine *machine, struct perf_evsel *evsel) { s
Re: [net] 34fad54c25: kernel BUG at include/linux/skbuff.h:1935!
On 11/16, Duyck, Alexander H wrote:
>On Wed, 2016-11-16 at 05:20 +0800, kernel test robot wrote:
>From what I can tell it looks like the size of the frame is 0x160 hex,
>or 352. For whatever reason we are only pulling 8 bytes into the
>header which is giving us an skb->len of 352 (0x160), and a
>skb->data_len of 344 (0x158). When we go to pull the 14 bytes for the
>Ethernet header we end up at a skb->len of 338 (0x152), which is
>resulting in the panic.
>
>The question is how we are coming up with 8 instead of 14, which is the
>lowest limit supported by eth_get_headlen. My first thought was there
>is an incorrect sizeof(eth) instead of the sizeof(*eth) somewhere in
>the code, but I can't find anything like that anywhere.
>
>Is there any way you can provide me with the net/ethernet/eth.o and
>drivers/net/ethernet/igb/igb.o files? With that I can look over the

Is vmlinux ok for you? I've sent it to you separately.

Btw: I've also tried gcc-5 (Debian 5.4.1-2) 5.4.1 20160904, and the
result shows the same failures.

Thanks,
Xiaolong
Re: [PATCH v12 12/22] vfio: Add notifier callback to parent's ops structure of mdev
On 11/15/2016 11:19 PM, Alex Williamson wrote:
> On Tue, 15 Nov 2016 14:45:42 +0800
> Jike Song wrote:
>
>> On 11/14/2016 11:42 PM, Kirti Wankhede wrote:
>>> Add a notifier callback to parent's ops structure of mdev device so that per
>>> device notifier for vfio module is registered through vfio_mdev module.
>>>
>>> Signed-off-by: Kirti Wankhede
>>> Signed-off-by: Neo Jia
>>> Change-Id: Iafa6f1721aecdd6e50eb93b153b5621e6d29b637
>>> ---
>>>  drivers/vfio/mdev/vfio_mdev.c | 19 +++
>>>  include/linux/mdev.h | 9 +
>>>  2 files changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
>>> index ffc36758cb84..1694b1635607 100644
>>> --- a/drivers/vfio/mdev/vfio_mdev.c
>>> +++ b/drivers/vfio/mdev/vfio_mdev.c
>>> @@ -24,6 +24,15 @@
>>>  #define DRIVER_AUTHOR "NVIDIA Corporation"
>>>  #define DRIVER_DESC "VFIO based driver for Mediated device"
>>>
>>> +static int vfio_mdev_notifier(struct notifier_block *nb, unsigned long action,
>>> +			      void *data)
>>> +{
>>> +	struct mdev_device *mdev = container_of(nb, struct mdev_device, nb);
>>> +	struct parent_device *parent = mdev->parent;
>>> +
>>> +	return parent->ops->notifier(mdev, action, data);
>>> +}
>>> +
>>>  static int vfio_mdev_open(void *device_data)
>>>  {
>>>  	struct mdev_device *mdev = device_data;
>>> @@ -40,6 +49,11 @@ static int vfio_mdev_open(void *device_data)
>>>  	if (ret)
>>>  		module_put(THIS_MODULE);
>>>
>>> +	if (likely(parent->ops->notifier)) {
>>> +		mdev->nb.notifier_call = vfio_mdev_notifier;
>>> +		if (vfio_register_notifier(&mdev->dev, &mdev->nb))
>>> +			pr_err("Failed to register notifier for mdev\n");
>>> +	}
>>
>> Hi Kirti,
>>
>> Could you please move the notifier registration before parent->ops->open()?
>> As you might know, I'm extending your vfio_register_notifier to also include
>> the attaching/detaching events of vfio_group and kvm.
>> Basically if vfio_group is not attached to any kvm instance, the
>> parent->ops->open() should return -ENODEV to indicate the failure,
>> but to know whether kvm is available in open(), the notifier
>> registration should be earlier.
>
> It seems like you're giving general guidance for how a vendor driver
> open() function should work, yet a hard dependency on KVM should be
> discouraged. You're making a choice for your vendor driver alone.

I apologize for any confusion, but all I meant here was: if the real
world requires a vendor driver to indicate errors instead of false
success, it has to know some information before making the choice.

> I would also be very cautious about the coherency of signaling the KVM
> association relative to the user of the group. Is it possible that the
> association of one KVM instance by a user of the group can leak to the
> next user? Does vfio need to see a gratuitous un-set of the KVM
> association on group close()? etc. Thanks,

I failed to see how this is possible; per my understanding,
vfio_group_set_kvm gets called twice (once with kvm, another with NULL)
while kvm holds the group reference. Would you elaborate a bit more?

--
Thanks,
Jike
Re: [PATCH 2/3] thermal: hisilicon: fix for dependency
On Tue, Nov 15, 2016 at 08:24:55PM +0800, Zhang Rui wrote:
> On Sat, 2016-11-12 at 20:05 +0800, Leo Yan wrote:
> > Hi Rui, Eduardo,
> >
> > On Wed, Aug 31, 2016 at 04:50:16PM +0800, Leo Yan wrote:
> > >
> > > The thermal driver is a standalone driver which is used to enable
> > > thermal sensors, so it can be used with any cooling device and
> > > should not bind with the CPU cooling device driver.
> > >
> > > This original patch was suggested by Amit Kucheria; it polishes
> > > the dependency in Kconfig and removes the dependency on
> > > CPU_THERMAL.
> >
> > Could you help review this patch? Or do I need to resend it? Sorry,
> > I have not tracked these patches well before; this is one missed
> > patch for the 96boards HiKey.
>
> as it still applies cleanly, the patch is queued for 4.10.

Thanks a lot.

> thanks,
> rui

> > Thanks,
> > Leo Yan
> >
> > > Signed-off-by: Leo Yan
> > > ---
> > >  drivers/thermal/Kconfig | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > index 2d702ca..91ebab3 100644
> > > --- a/drivers/thermal/Kconfig
> > > +++ b/drivers/thermal/Kconfig
> > > @@ -177,8 +177,10 @@ config THERMAL_EMULATION
> > >
> > >  config HISI_THERMAL
> > >  	tristate "Hisilicon thermal driver"
> > > -	depends on (ARCH_HISI && CPU_THERMAL && OF) || COMPILE_TEST
> > > +	depends on ARCH_HISI || COMPILE_TEST
> > >  	depends on HAS_IOMEM
> > > +	depends on OF
> > > +	default y
> > >  	help
> > >  	  Enable this to plug hisilicon's thermal sensor driver into the
> > >  	  Linux thermal framework. cpufreq is used as the cooling device
> > >  	  to throttle
Re: [PATCH] tpm: drop chip->is_open and chip->duration_adjusted
On Mon, Nov 14, 2016 at 09:11:54PM -0800, Jarkko Sakkinen wrote:
> How strong is your opposition here? I do not see any exceptional damage
> done but see some subtle but still significant benefits.

It seems OK, but I never like seeing locking made less clear - this
should be manageable, and there isn't a performance concern with tpm
either.

Jason
Re: [PATCH v4 2/2] mailbox: Add Tegra HSP driver
On Tue, Nov 15, 2016 at 9:18 PM, Thierry Reding wrote: > + > +struct tegra_hsp_channel; > +struct tegra_hsp; > + > +struct tegra_hsp_channel_ops { > + int (*send_data)(struct tegra_hsp_channel *channel, void *data); > + int (*startup)(struct tegra_hsp_channel *channel); > + void (*shutdown)(struct tegra_hsp_channel *channel); > + bool (*last_tx_done)(struct tegra_hsp_channel *channel); > +}; > + > +struct tegra_hsp_channel { > + struct tegra_hsp *hsp; > + const struct tegra_hsp_channel_ops *ops; > + struct mbox_chan *chan; > + void __iomem *regs; > +}; > + > +static struct tegra_hsp_channel *to_tegra_hsp_channel(struct mbox_chan *chan) > +{ > + return chan->con_priv; > +} > + It seems channel = to_tegra_hsp_channel(chan); is no simpler ritual than channel = chan->con_priv; ? > +struct tegra_hsp_doorbell { > + struct tegra_hsp_channel channel; > + struct list_head list; > + const char *name; > + unsigned int master; > + unsigned int index; > +}; > + > +static struct tegra_hsp_doorbell * > +to_tegra_hsp_doorbell(struct tegra_hsp_channel *channel) > +{ > + if (!channel) > + return NULL; > + > + return container_of(channel, struct tegra_hsp_doorbell, channel); > +} > + But you don't check for NULL returned, before dereferencing the pointer 'db' > +struct tegra_hsp_db_map { > + const char *name; > + unsigned int master; > + unsigned int index; > +}; > + > +struct tegra_hsp_soc { > + const struct tegra_hsp_db_map *map; > +}; > + > +struct tegra_hsp { > + const struct tegra_hsp_soc *soc; > + struct mbox_controller mbox; > + void __iomem *regs; > + unsigned int irq; > + unsigned int num_sm; > + unsigned int num_as; > + unsigned int num_ss; > + unsigned int num_db; > + unsigned int num_si; > + spinlock_t lock; > + > + struct list_head doorbells; > +}; > + > +static inline struct tegra_hsp * > +to_tegra_hsp(struct mbox_controller *mbox) > +{ > + return container_of(mbox, struct tegra_hsp, mbox); > +} > + > +static inline u32 tegra_hsp_readl(struct tegra_hsp *hsp, unsigned 
int offset) > +{ > + return readl(hsp->regs + offset); > +} > + > +static inline void tegra_hsp_writel(struct tegra_hsp *hsp, u32 value, > + unsigned int offset) > +{ > + writel(value, hsp->regs + offset); > +} > + > +static inline u32 tegra_hsp_channel_readl(struct tegra_hsp_channel *channel, > + unsigned int offset) > +{ > + return readl(channel->regs + offset); > +} > + > +static inline void tegra_hsp_channel_writel(struct tegra_hsp_channel > *channel, > + u32 value, unsigned int offset) > +{ > + writel(value, channel->regs + offset); > +} > + > +static bool tegra_hsp_doorbell_can_ring(struct tegra_hsp_doorbell *db) > +{ > + u32 value; > + > + value = tegra_hsp_channel_readl(&db->channel, HSP_DB_ENABLE); > + > + return (value & BIT(TEGRA_HSP_DB_MASTER_CCPLEX)) != 0; > +} > + > +static struct tegra_hsp_doorbell * > +__tegra_hsp_doorbell_get(struct tegra_hsp *hsp, unsigned int master) > +{ > + struct tegra_hsp_doorbell *entry; > + > + list_for_each_entry(entry, &hsp->doorbells, list) > + if (entry->master == master) > + return entry; > + > + return NULL; > +} > + > +static struct tegra_hsp_doorbell * > +tegra_hsp_doorbell_get(struct tegra_hsp *hsp, unsigned int master) > +{ > + struct tegra_hsp_doorbell *db; > + unsigned long flags; > + > + spin_lock_irqsave(&hsp->lock, flags); > + db = __tegra_hsp_doorbell_get(hsp, master); > + spin_unlock_irqrestore(&hsp->lock, flags); > + > + return db; > +} > + . 
> + > +static int tegra_hsp_doorbell_send_data(struct tegra_hsp_channel *channel, > + void *data) > +{ > + tegra_hsp_channel_writel(channel, 1, HSP_DB_TRIGGER); > + > + return 0; > +} > + > +static int tegra_hsp_doorbell_startup(struct tegra_hsp_channel *channel) > +{ > + struct tegra_hsp_doorbell *db = to_tegra_hsp_doorbell(channel); > + struct tegra_hsp *hsp = channel->hsp; > + struct tegra_hsp_doorbell *ccplex; > + unsigned long flags; > + u32 value; > + > + if (db->master >= hsp->mbox.num_chans) { > + dev_err(hsp->mbox.dev, > + "invalid master ID %u for HSP channel\n", > + db->master); > + return -EINVAL; > + } > + > + ccplex = tegra_hsp_doorbell_get(hsp, TEGRA_HSP_DB_MASTER_CCPLEX); > + if (!ccplex) > + return -ENODEV; > + > + if (!tegra_hsp_doorbell_can_ring(db)) > + return -ENODEV; > + > + spin_lock_irqsave(&hsp->lock, flags); > + > + value = tegra_hs
Re: [RFC PATCH 2/2] ptr_ring_ll: pop/push multiple objects at once
On Tue, Nov 15, 2016 at 08:42:03PM -0800, John Fastabend wrote:
> On 16-11-14 03:06 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 10, 2016 at 08:44:32PM -0800, John Fastabend wrote:
> >> Signed-off-by: John Fastabend
> >
> > This will naturally reduce the cache line bounce
> > costs, but so will a _many API for ptr-ring,
> > doing lock-add many-unlock.
> >
> > the number of atomics also scales better with the lock:
> > one per push instead of one per queue.
> >
> > Also, when can qdisc use a _many operation?
>
> On dequeue we can pull off many skbs instead of one at a time and
> then either (a) pass them down as an array to the driver (I started
> to write this on top of ixgbe and it seems like a win) or (b) pass
> them one by one down to the driver and set the xmit_more bit correctly.
>
> Passing them one by one also seems like a win because we avoid the
> lock per skb.
>
> On the enqueue qdisc side it's a bit more invasive to start doing this.
>
> [...]

I see. So we could wrap __ptr_ring_consume and implement
__skb_array_consume. You can call that in a loop under a lock. I would
limit it to something small, like 16 pointers, to make sure lock
contention is not an issue.

-- 
MST
Re: linux-next: build warning after merge of the pstore tree
On Tue, Nov 15, 2016 at 4:35 PM, Kees Cook wrote: > On Tue, Nov 15, 2016 at 4:27 PM, Stephen Rothwell > wrote: >> Hi Kees, >> >> After merging the pstore tree, today's linux-next build (x86_64 >> allmodconfig) produced this warning: >> >> In file included from include/linux/rcupdate.h:38:0, >> from include/linux/idr.h:18, >> from include/linux/kernfs.h:14, >> from include/linux/sysfs.h:15, >> from include/linux/kobject.h:21, >> from include/linux/device.h:17, >> from fs/pstore/ram_core.c:17: >> fs/pstore/ram_core.c: In function 'buffer_size_add': >> include/linux/spinlock.h:246:3: warning: 'flags' may be used uninitialized >> in this function [-Wmaybe-uninitialized] >>_raw_spin_unlock_irqrestore(lock, flags); \ >>^ >> fs/pstore/ram_core.c:78:16: note: 'flags' was declared here >> unsigned long flags; >> ^ >> In file included from include/linux/rcupdate.h:38:0, >> from include/linux/idr.h:18, >> from include/linux/kernfs.h:14, >> from include/linux/sysfs.h:15, >> from include/linux/kobject.h:21, >> from include/linux/device.h:17, >> from fs/pstore/ram_core.c:17: >> fs/pstore/ram_core.c: In function 'buffer_start_add': >> include/linux/spinlock.h:246:3: warning: 'flags' may be used uninitialized >> in this function [-Wmaybe-uninitialized] >>_raw_spin_unlock_irqrestore(lock, flags); \ >>^ >> fs/pstore/ram_core.c:56:16: note: 'flags' was declared here >> unsigned long flags; >> ^ >> >> Introduced by commit >> >> 95937ddce59a ("pstore: Allow prz to control need for locking") >> >> They appear to be a very noisy false positives. :-( > > Hah. Ironically, I ran sparse against this code to make sure it would > be happy with the conditional locking, and totally missed the flags > bit. I'll switch it to explicitly initialize flags to silence this. > Ah! False positive. Thanks a lot Kees for fixing it. Regards, Joel
Re: [alsa-devel] [PATCH v2] clkdev: add devm_of_clk_get()
Hi Rob, Michael, Russell

What is the conclusion on this patch? Should we not add
devm_of_clk_get(), or can I continue?

The problem is that the current [devm_]clk_get() handles *dev only, but
I need to get clocks from a DT node, not a dev:

	sound_soc {
		...
		cpu {
			...
			=> clocks = <&xxx>;
		};
		codec {
			...
			=> clocks = <&xxx>;
		};
	};

> > Thank you for your feedback
> >
> > > > struct clk *clk_get(struct device *dev, const char *con_id)
> > > > {
> > > > 	...
> > > > 	if (dev) {
> > > > 		clk = __of_clk_get_by_name(dev->of_node, dev_id, con_id);
> > > > 		...
> > > > 	}
> > > > }
> > > >
> > > > I would like to select specific device_node.
> > >
> > > Do you have access to the struct device that you want to target? Can you
> > > pass that device into either clk_get or devm_clk_get?
> >
> > If my understanding was correct, I think I can't.
> > In the below case, "sound_soc" has its *dev, but "cpu" and "codec"
> > don't have *dev, they have nodes only. Thus, we are using of_clk_get()
> > for these now.
> >
> > 	clk = of_clk_get(cpu, xxx);
> > 	clk = of_clk_get(codec, xxx);
> >
> > 	sound_soc {
> > 		...
> > 		cpu {
> > 			...
> > 			=> clocks = <&xxx>;
> > 		};
> > 		codec {
> > 			...
> > 			=> clocks = <&xxx>;
> > 		};
> > 	};

Best regards
---
Kuninori Morimoto
[patch v5 repost 1/1] i2c: add master driver for mellanox systems
From: Vadim Pasternak Device driver for Mellanox I2C controller logic, implemented in Lattice CPLD device. Device supports: - Master mode - One physical bus - Polling mode The Kconfig currently controlling compilation of this code is: drivers/i2c/busses/Kconfig:config I2C_MLXCPLD Signed-off-by: Michael Shych Signed-off-by: Vadim Pasternak Reviewed-by: Jiri Pirko Reviewed-by: Vladimir Zapolskiy --- v4->v5: Comments pointed out by Vladimir: - Remove "default n" from Kconfig; - Fix the comments for timeout and pool time; - Optimize error flow in mlxcpld_i2c_probe; v3->v4: Comments pointed out by Vladimir: - Set default to no in Kconfig; - Make mlxcpld_i2c_plat_dev static and add empty line before the declaration; - In function mlxcpld_i2c_invalid_len remove (msg->len < 0), since len is unsigned; - Remove unused symbol mlxcpld_i2c_plat_dev; - Remove extra spaces in comments to mlxcpld_i2c_check_msg_params; - Remove unnecessary round braces in mlxcpld_i2c_set_transf_data; - Remove the assignment of 'i' variable in mlxcpld_i2c_wait_for_tc; - Add extra line in mlxcpld_i2c_xfer; - Move assignment of the adapter's fields retries and nr inside mlxcpld_i2c_adapter declaration; v2->v3: Comments pointed out by Vladimir: - Use tab symbol as indentation in Kconfig - Add the Kconfig section preserving the alphabetical order - added within "Other I2C/SMBus bus drivers" after I2C_ELEKTOR (but after this sections others are not follow alphabetical); - Change license to dual; - Replace ADRR with ADDR in macros; - Remove unused macros: MLXCPLD_LPCI2C_LPF_DFLT, MLXCPLD_LPCI2C_HALF_CYC_100, MLXCPLD_LPCI2C_I2C_HOLD_100, MLXCPLD_LPCI2C_HALF_CYC_REG, MLXCPLD_LPCI2C_I2C_HOLD_REG; - Fix checkpatch warnings (**/ and the end of comment); - Add empty line before structures mlxcpld_i2c_regs, mlxcpld_i2c_curr_transf, mlxcpld_i2c_priv; - Remove unused structure mlxcpld_i2c_regs; - Remove from mlxcpld_i2c_priv the next fields: retr_num, poll_time, block_sz, xfer_to; use instead macros respectively: 
  MLXCPLD_I2C_RETR_NUM, MLXCPLD_I2C_POLL_TIME, MLXCPLD_I2C_DATA_REG_SZ,
  MLXCPLD_I2C_XFER_TO;
- In mlxcpld_i2c_invalid_len remove unnecessary else;
- Optimize mlxcpld_i2c_set_transf_data;
- mlxcpld_i2c_reset - add empty lines after/before mutex lock/unlock;
- mlxcpld_i2c_wait_for_free - cover case timeout is equal MLXCPLD_I2C_XFER_TO;
- mlxcpld_i2c_wait_for_tc:
  - Do not assign err in declaration (also err is removed);
  - Insert empty line before case MLXCPLD_LPCI2C_ACK_IND;
  - inside case MLXCPLD_LPCI2C_ACK_IND - avoid unnecessary indentation;
  - Remove case MLXCPLD_LPCI2C_ERR_IND and remove this macro;
- Add empty lines in mlxcpld_i2c_xfer before/after mutex_lock/mutex_unlock;
- In mlxcpld_i2c_probe add empty line after platform_set_drvdata;
- Replace platform handle pdev in mlxcpld_i2c_priv with the pointer to
  the structure device;
- Place assignment of base_addr near the others;
- Enclose e-mail with <>;
Fixes added by Vadim:
- Change structure description format according to
  Documentation/kernel-documentation.rst guideline;
- mlxcpld_i2c_wait_for_tc: return error if status reaches default case;
v1->v2
Fixes added by Vadim:
- Put new record in Makefile in alphabetic order;
- Remove http://www.mellanox.com from MAINTAINERS record;
---
 Documentation/i2c/busses/i2c-mlxcpld |  47 +++
 MAINTAINERS                          |   8 +
 drivers/i2c/busses/Kconfig           |  11 +
 drivers/i2c/busses/Makefile          |   1 +
 drivers/i2c/busses/i2c-mlxcpld.c     | 551 +++
 5 files changed, 618 insertions(+)
 create mode 100644 Documentation/i2c/busses/i2c-mlxcpld
 create mode 100644 drivers/i2c/busses/i2c-mlxcpld.c

diff --git a/Documentation/i2c/busses/i2c-mlxcpld b/Documentation/i2c/busses/i2c-mlxcpld
new file mode 100644
index 000..0f8678a
--- /dev/null
+++ b/Documentation/i2c/busses/i2c-mlxcpld
@@ -0,0 +1,47 @@
+Driver i2c-mlxcpld
+
+Author: Michael Shych
+
+This is a driver for the Mellanox I2C controller logic, implemented in
+Lattice CPLD device.
+Device supports:
+ - Master mode.
+ - One physical bus.
+ - Polling mode.
+
+This controller is equipped within the following Mellanox systems:
+"msx6710", "msx6720", "msb7700", "msn2700", "msx1410", "msn2410", "msb7800",
+"msn2740", "msn2100".
+
+The following transaction types are supported:
+ - Receive Byte/Block.
+ - Send Byte/Block.
+ - Read Byte/Block.
+ - Write Byte/Block.
+
+Registers:
+CTRL     0x1 - control reg.
+             Resets all the registers.
+HALF_CYC 0x4 - cycle reg.
+             Configure the width of I2C SCL half clock cycle (in 4 LPC_CLK
+             units).
+I2C_HOLD 0x5 - hold reg.
+             OE (output enable) is delayed by value set to this register
+             (in LPC_CLK units)
+CMD      0x6 - command reg.
+             Bit 7(lsb), 0 = write, 1 = r
Re: [PATCH] rcu: Avoid unnecessary contention of rcu node lock
On Wed, Nov 09, 2016 at 05:57:13PM +0900, Byungchul Park wrote:
> It's unnecessary to try to print stacks of blocked tasks in the case
> that ndetected == 0. Furthermore, calling rcu_print_detail_task_stall()
> causes rnp locks to be acquired as many times as the number of leaf
> nodes plus one for the root node. That's unnecessary in this case.

Hello,

I have two questions. Could you answer them?

1. What do you think about this patch?
2. Is there a tree where patches about rcu are pulled into, before being
   pulled into the mainline tree? For example, the tip tree in the case
   of scheduler patches.

It would be appreciated if you answer them.

Thank you in advance,
Byungchul

>
> Signed-off-by: Byungchul Park
> ---
>  kernel/rcu/tree.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 287f468..ab2f743 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1374,6 +1374,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
>  	       (long)rsp->gpnum, (long)rsp->completed, totqlen);
>  	if (ndetected) {
>  		rcu_dump_cpu_stacks(rsp);
> +
> +		/* Complain about tasks blocking the grace period. */
> +		rcu_print_detail_task_stall(rsp);
>  	} else {
>  		if (READ_ONCE(rsp->gpnum) != gpnum ||
>  		    READ_ONCE(rsp->completed) == gpnum) {
> @@ -1390,9 +1393,6 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
>  		}
>  	}
>
> -	/* Complain about tasks blocking the grace period. */
> -	rcu_print_detail_task_stall(rsp);
> -
>  	rcu_check_gp_kthread_starvation(rsp);
>
>  	panic_on_rcu_stall();
> --
> 1.9.1
Re: [RFC PATCH 2/2] ptr_ring_ll: pop/push multiple objects at once
On 16-11-14 03:06 PM, Michael S. Tsirkin wrote: > On Thu, Nov 10, 2016 at 08:44:32PM -0800, John Fastabend wrote: >> Signed-off-by: John Fastabend > > This will naturally reduce the cache line bounce > costs, but so will a _many API for ptr-ring, > doing lock-add many-unlock. > > the number of atomics also scales better with the lock: > one per push instead of one per queue. > > Also, when can qdisc use a _many operation? > On dequeue we can pull off many skbs instead of one at a time and then either (a) pass them down as an array to the driver (I started to write this on top of ixgbe and it seems like a win) or (b) pass them one by one down to the driver and set the xmit_more bit correctly. Passing them one by one also seems like a win because we avoid the lock per skb. On the enqueue (qdisc) side it's a bit more invasive to start doing this. [...] >> +++ b/net/sched/sch_generic.c >> @@ -571,7 +571,7 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, >> struct Qdisc *qdisc, >> struct skb_array_ll *q = band2list(priv, band); >> int err; >> >> -err = skb_array_ll_produce(q, skb); >> +err = skb_array_ll_produce(q, &skb); >> >> if (unlikely(err)) { >> net_warn_ratelimited("drop a packet from fast enqueue\n"); > > I don't see a pop many operation here. > Patches need a bit of cleanup; looks like this was part of another patch. .John
[PATCH] mm: don't cap request size based on read-ahead setting
Hi, We ran into a funky issue, where someone doing 256K buffered reads saw 128K requests at the device level. Turns out it is read-ahead capping the request size, since we use 128K as the default setting. This doesn't make a lot of sense - if someone is issuing 256K reads, they should see 256K reads, regardless of the read-ahead setting, if the underlying device can support a 256K read in a single command. To make matters more confusing, there's an odd interaction with the fadvise hint setting. If we tell the kernel we're doing sequential IO on this file descriptor, we can get twice the read-ahead size. But if we tell the kernel that we are doing random IO, hence disabling read-ahead, we do get nice 256K requests at the lower level. This is because ondemand and forced read-ahead behave differently, with the latter doing the right thing. An application developer will be, rightfully, scratching his head at this point, wondering wtf is going on. A good one will dive into the kernel source, and silently weep. This patch introduces a bdi hint, io_pages. This is the soft max IO size for the lower level, I've hooked it up to the bdev settings here. Read-ahead is modified to issue the maximum of the user request size, and the read-ahead max size, but capped to the max request size on the device side. The latter is done to avoid reading ahead too much, if the application asks for a huge read. With this patch, the kernel behaves like the application expects. 
Signed-off-by: Jens Axboe diff --git a/block/blk-settings.c b/block/blk-settings.c index f679ae122843..65f16cf4f850 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -249,6 +249,7 @@ void blk_queue_max_hw_sectors(struct request_queue *q, unsigned int max_hw_secto max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors); max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS); limits->max_sectors = max_sectors; + q->backing_dev_info.io_pages = max_sectors >> (PAGE_SHIFT - 9); } EXPORT_SYMBOL(blk_queue_max_hw_sectors); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 9cc8d7c5439a..ea374e820775 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -212,6 +212,7 @@ queue_max_sectors_store(struct request_queue *q, const char *page, size_t count) spin_lock_irq(q->queue_lock); q->limits.max_sectors = max_sectors_kb << 1; + q->backing_dev_info.io_pages = max_sectors_kb >> (PAGE_SHIFT - 10); spin_unlock_irq(q->queue_lock); return ret; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index c357f27d5483..b8144b2d59ce 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -136,6 +136,7 @@ struct bdi_writeback { struct backing_dev_info { struct list_head bdi_list; unsigned long ra_pages; /* max readahead in PAGE_SIZE units */ + unsigned long io_pages; /* max allowed IO size */ unsigned int capabilities; /* Device capabilities */ congested_fn *congested_fn; /* Function pointer if device is md/dm */ void *congested_data; /* Pointer to aux data for congested func */ diff --git a/mm/readahead.c b/mm/readahead.c index c8a955b1297e..4eaec947c7c9 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -369,10 +369,25 @@ ondemand_readahead(struct address_space *mapping, bool hit_readahead_marker, pgoff_t offset, unsigned long req_size) { - unsigned long max = ra->ra_pages; + unsigned long io_pages, max_pages; pgoff_t prev_offset; /* +* If bdi->io_pages is set, that indicates 
the (soft) max IO size +* per command for that device. If we have that available, use +* that as the max suitable read-ahead size for this IO. Instead of +* capping read-ahead at ra_pages if req_size is larger, we can go +* up to io_pages. If io_pages isn't set, fall back to using +* ra_pages as a safe max. +*/ + io_pages = inode_to_bdi(mapping->host)->io_pages; + if (io_pages) { + max_pages = max_t(unsigned long, ra->ra_pages, req_size); + io_pages = min(io_pages, max_pages); + } else + max_pages = ra->ra_pages; + + /* * start of file */ if (!offset) @@ -385,7 +400,7 @@ ondemand_readahead(struct address_space *mapping, if ((offset == (ra->start + ra->size - ra->async_size) || offset == (ra->start + ra->size))) { ra->start += ra->size; - ra->size = get_next_ra_size(ra, max); + ra->size = get_next_ra_size(ra, max_pages); ra->async_size = ra->size; goto readit; } @@ -400,16 +415,16 @@ ondemand_readahead(struct address_space *mapping, pgoff_t start; rcu_read_lock(); - start = page_cache_next_hole(mapping, offset +
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On Wed, 16 Nov 2016 09:46:20 +0530 Kirti Wankhede wrote: > On 11/16/2016 9:28 AM, Alex Williamson wrote: > > On Wed, 16 Nov 2016 09:13:37 +0530 > > Kirti Wankhede wrote: > > > >> On 11/16/2016 8:55 AM, Alex Williamson wrote: > >>> On Tue, 15 Nov 2016 20:16:12 -0700 > >>> Alex Williamson wrote: > >>> > On Wed, 16 Nov 2016 08:16:15 +0530 > Kirti Wankhede wrote: > > > On 11/16/2016 3:49 AM, Alex Williamson wrote: > >> On Tue, 15 Nov 2016 20:59:54 +0530 > >> Kirti Wankhede wrote: > >> > > ... > > > >>> @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu > >>> *iommu, > >>>*/ > >>> if (dma->task->mm != current->mm) > >>> break; > >>> + > >>> unmapped += dma->size; > >>> + > >>> + if (iommu->external_domain && > >>> !RB_EMPTY_ROOT(&dma->pfn_list)) { > >>> + struct vfio_iommu_type1_dma_unmap nb_unmap; > >>> + > >>> + nb_unmap.iova = dma->iova; > >>> + nb_unmap.size = dma->size; > >>> + > >>> + /* > >>> + * Notifier callback would call > >>> vfio_unpin_pages() which > >>> + * would acquire iommu->lock. Release lock here > >>> and > >>> + * reacquire it again. > >>> + */ > >>> + mutex_unlock(&iommu->lock); > >>> + blocking_notifier_call_chain(&iommu->notifier, > >>> + > >>> VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >>> + &nb_unmap); > >>> + mutex_lock(&iommu->lock); > >>> + if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > >>> + break; > >>> + } > >> > >> > >> Why exactly do we need to notify per vfio_dma rather than per unmap > >> request? If we do the latter we can send the notify first, limiting us > >> to races where a page is pinned between the notify and the locking, > >> whereas here, even our dma pointer is suspect once we re-acquire the > >> lock, we don't technically know if another unmap could have removed > >> that already. Perhaps something like this (untested): > >> > > > > There are checks to validate unmap request, like v2 check and who is > > calling unmap and is it allowed for that task to unmap. 
Before these > > checks its not sure that unmap region range which asked for would be > > unmapped all. Notify call should be at the place where its sure that the > > range provided to notify call is definitely going to be removed. My > > change do that. > > Ok, but that does solve the problem. What about this (untested): > >>> > >>> s/does/does not/ > >>> > >>> BTW, I like how the retries here fill the gap in my previous proposal > >>> where we could still race re-pinning. We've given it an honest shot or > >>> someone is not participating if we've retried 10 times. I don't > >>> understand why the test for iommu->external_domain was there, clearly > >>> if the list is not empty, we need to notify. Thanks, > >>> > >> > >> Ok. Retry is good to give a chance to unpin all. But is it really > >> required to use BUG_ON() that would panic the host. I think WARN_ON > >> should be fine and then when container is closed or when the last group > >> is removed from the container, vfio_iommu_type1_release() is called and > >> we have a chance to unpin it all. > > > > See my comments on patch 10/22, we need to be vigilant that the vendor > > driver is participating. I don't think we should be cleaning up after > > the vendor driver on release, if we need to do that, it implies we > > already have problems in multi-mdev containers since we'll be left with > > pfn_list entries that no longer have an owner. Thanks, > > > > If any vendor driver doesn't clean its pinned pages and there are > entries in pfn_list with no owner, that would be indicated by WARN_ON, > which should be fixed by that vendor driver. I still feel it shouldn't > cause host panic. > When such warning is seen with multiple mdev devices in container, it is > easy to isolate and find which vendor driver is not cleaning their > stuff, same warning would be seen with single mdev device in a > container. To isolate and find which vendor driver is culprit check with > one mdev device at a time. 
> Finally, we have a chance to clean all residue from > vfio_iommu_type1_release() so that vfio_iommu_type1 module doesn't leave > any leaks. How can we claim that we've resolved anything by unpinning the residue? In fact, is it actually safe to unpin any residue left by the vendor driver
Re: [PATCH v5] drm/mediatek: fixed the calc method of data rate per lane
Hi Jitao, On Wed, Nov 16, 2016 at 11:20 AM, Jitao Shi wrote: > > Tune dsi frame rate by pixel clock, dsi add some extra signal (i.e. > Tlpx, Ths-prepare, Ths-zero, Ths-trail,Ths-exit) when enter and exit LP > mode, those signals will cause h-time larger than normal and reduce FPS. > So need to multiply a coefficient to offset the extra signal's effect. > coefficient = ((htotal*bpp/lane_number)+Tlpx+Ths_prep+Ths_zero+ > Ths_trail+Ths_exit)/(htotal*bpp/lane_number) > > Signed-off-by: Jitao Shi For this patch, Reviewed-by: Daniel Kurtz But, one more clean up suggestion for another patch, below... > --- > Change since v4: > - tune the calc comment more clear. > - define the phy timings as constants. > > Chnage since v3: > - wrapp the commit msg. > - fix alignment of some lines. > > Change since v2: > - move phy timing back to dsi_phy_timconfig. > > Change since v1: > - phy_timing2 and phy_timing3 refer clock cycle time. > - define values of LPX HS_PRPR HS_ZERO HS_TRAIL TA_GO TA_SURE TA_GET > DA_HS_EXIT. > --- > drivers/gpu/drm/mediatek/mtk_dsi.c | 64 > +++- > 1 file changed, 48 insertions(+), 16 deletions(-) > > diff --git a/drivers/gpu/drm/mediatek/mtk_dsi.c > b/drivers/gpu/drm/mediatek/mtk_dsi.c > index 28b2044..eaa5a22 100644 > --- a/drivers/gpu/drm/mediatek/mtk_dsi.c > +++ b/drivers/gpu/drm/mediatek/mtk_dsi.c > @@ -86,7 +86,7 @@ > > #define DSI_PHY_TIMECON0 0x110 > #define LPX(0xff << 0) > -#define HS_PRPR(0xff << 8) > +#define HS_PREP(0xff << 8) > #define HS_ZERO(0xff << 16) > #define HS_TRAIL (0xff << 24) > > @@ -102,10 +102,16 @@ > #define CLK_TRAIL (0xff << 24) > > #define DSI_PHY_TIMECON3 0x11c > -#define CLK_HS_PRPR(0xff << 0) > +#define CLK_HS_PREP(0xff << 0) > #define CLK_HS_POST(0xff << 8) > #define CLK_HS_EXIT(0xff << 16) > > +#define T_LPX 5 > +#define T_HS_PREP 6 > +#define T_HS_TRAIL 8 > +#define T_HS_EXIT 7 > +#define T_HS_ZERO 10 > + > #define NS_TO_CYCLE(n, c)((n) / (c) + (((n) % (c)) ? 
1 : 0)) > > struct phy; > @@ -161,20 +167,18 @@ static void mtk_dsi_mask(struct mtk_dsi *dsi, u32 > offset, u32 mask, u32 data) > static void dsi_phy_timconfig(struct mtk_dsi *dsi) > { > u32 timcon0, timcon1, timcon2, timcon3; > - unsigned int ui, cycle_time; > - unsigned int lpx; > + u32 ui, cycle_time; > > ui = 1000 / dsi->data_rate + 0x01; > cycle_time = 8000 / dsi->data_rate + 0x01; > - lpx = 5; > > - timcon0 = (8 << 24) | (0xa << 16) | (0x6 << 8) | lpx; > - timcon1 = (7 << 24) | (5 * lpx << 16) | ((3 * lpx) / 2) << 8 | > - (4 * lpx); > + timcon0 = T_LPX | T_HS_PREP << 8 | T_HS_ZERO << 16 | T_HS_TRAIL << 24; > + timcon1 = 4 * T_LPX | (3 * T_LPX / 2) << 8 | 5 * T_LPX << 16 | > + T_HS_EXIT << 24; > timcon2 = ((NS_TO_CYCLE(0x64, cycle_time) + 0xa) << 24) | > (NS_TO_CYCLE(0x150, cycle_time) << 16); > - timcon3 = (2 * lpx) << 16 | NS_TO_CYCLE(80 + 52 * ui, cycle_time) << > 8 | > - NS_TO_CYCLE(0x40, cycle_time); > + timcon3 = NS_TO_CYCLE(0x40, cycle_time) | (2 * T_LPX) << 16 | > + NS_TO_CYCLE(80 + 52 * ui, cycle_time) << 8; > > writel(timcon0, dsi->regs + DSI_PHY_TIMECON0); > writel(timcon1, dsi->regs + DSI_PHY_TIMECON1); > @@ -202,19 +206,47 @@ static int mtk_dsi_poweron(struct mtk_dsi *dsi) > { > struct device *dev = dsi->dev; > int ret; > + u64 pixel_clock, total_bits; > + u32 htotal, htotal_bits, bit_per_pixel, overhead_cycles, > overhead_bits; > > if (++dsi->refcount != 1) > return 0; > > + switch (dsi->format) { > + case MIPI_DSI_FMT_RGB565: > + bit_per_pixel = 16; > + break; > + case MIPI_DSI_FMT_RGB666_PACKED: > + bit_per_pixel = 18; > + break; > + case MIPI_DSI_FMT_RGB666: > + case MIPI_DSI_FMT_RGB888: > + default: > + bit_per_pixel = 24; > + break; > + } > + > /** > -* data_rate = (pixel_clock / 1000) * pixel_dipth * mipi_ratio; > -* pixel_clock unit is Khz, data_rata unit is MHz, so need divide > 1000. > -* mipi_ratio is mipi clk coefficient for balance the pixel clk in > mipi. > -* we set mipi_ratio is 1.05. 
> +* vm.pixelclock is in kHz, pixel_clock unit is Hz, so multiply by > 1000 > +* htotal_time = htotal * byte_per_pixel / num_lanes > +* overhead_time = lpx + hs_prepare + hs_zero + hs_trail + hs_exit > +* mipi_ratio = (htotal_time + overhead_time) / htotal_time > +* data_rate = p
Re: [PATCH] nvmem: qfprom: Fix to support single byte read/write
Hi Stephen, On Wed, Nov 16, 2016 at 12:29 AM, Stephen Boyd wrote: > On 11/15, Vivek Gautam wrote: >> @@ -53,7 +53,7 @@ static int qfprom_remove(struct platform_device *pdev) >> static struct nvmem_config econfig = { >> .name = "qfprom", >> .owner = THIS_MODULE, >> - .stride = 4, >> + .stride = 1, > > Are we certain that all qfproms support byte accesses? I have tested on 8916 and 8996. I will give it a try on older targets as well. For reference: we had been using reg_stride = 1 before removing regmap support for nvmem access [1]. [1] 382c62f nvmem: qfprom: remove nvmem regmap dependency Thanks Vivek -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [RFC PATCH 1/2] net: use cmpxchg instead of spinlock in ptr rings
On 16-11-14 03:01 PM, Michael S. Tsirkin wrote: > On Thu, Nov 10, 2016 at 08:44:08PM -0800, John Fastabend wrote: >> >> --- >> include/linux/ptr_ring_ll.h | 136 >> +++ >> include/linux/skb_array.h | 25 >> 2 files changed, 161 insertions(+) >> create mode 100644 include/linux/ptr_ring_ll.h >> >> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h >> new file mode 100644 >> index 000..bcb11f3 >> --- /dev/null >> +++ b/include/linux/ptr_ring_ll.h >> @@ -0,0 +1,136 @@ >> +/* >> + * Definitions for the 'struct ptr_ring_ll' datastructure. >> + * >> + * Author: >> + * John Fastabend >> + * >> + * Copyright (C) 2016 Intel Corp. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License as published by the >> + * Free Software Foundation; either version 2 of the License, or (at your >> + * option) any later version. >> + * >> + * This is a limited-size FIFO maintaining pointers in FIFO order, with >> + * one CPU producing entries and another consuming entries from a FIFO. >> + * extended from ptr_ring_ll to use cmpxchg over spin lock. > > So when is each one (ptr-ring/ptr-ring-ll) a win? _ll suffix seems to > imply this gives a better latency, OTOH for a ping/pong I suspect > ptr-ring would be better as it avoids index cache line bounces. My observation under qdisc testing with pktgen is that I get better pps numbers with this code vs ptr_ring using spinlock. I actually wrote this implementation before the skb_array code was around though and haven't done a thorough analysis of the two yet only pktgen benchmarks. In my pktgen benchmarks I test 1:1 producer/consumer and many to one producer/consumer tests. I'll post some numbers later this week. [...] 
>> + */ >> +static inline int __ptr_ring_ll_produce(struct ptr_ring_ll *r, void *ptr) >> +{ >> +u32 ret, head, tail, next, slots, mask; >> + >> +do { >> +head = READ_ONCE(r->prod_head); >> +mask = READ_ONCE(r->prod_mask); >> +tail = READ_ONCE(r->cons_tail); >> + >> +slots = mask + tail - head; >> +if (slots < 1) >> +return -ENOMEM; >> + >> +next = head + 1; >> +ret = cmpxchg(&r->prod_head, head, next); >> +} while (ret != head); > > > So why is this preferable to a lock? > > I suspect it's nothing else than the qspinlock fairness > and polling code complexity. It's all not very useful if you > 1. are just doing a couple of instructions under the lock > and > 2. use a finite FIFO which is unfair anyway > > > How about this hack (lifted from virt_spin_lock): > > static inline void quick_spin_lock(struct qspinlock *lock) > { > do { > while (atomic_read(&lock->val) != 0) > cpu_relax(); > } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0); > } > > Or maybe we should even drop the atomic_read in the middle - > worth profiling and comparing: > > static inline void quick_spin_lock(struct qspinlock *lock) > { > while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0) > cpu_relax(); > } > > > Then, use quick_spin_lock instead of spin_lock everywhere in > ptr_ring - will that make it more efficient? > I think this could be the case. I'll give it a test later this week I am working on the xdp bits for virtio at the moment. To be honest though for my qdisc patchset first I need to resolve a bug and then probably in the first set just use the existing skb_array implementation. Its fun to micro-optimize this stuff but really any implementation will show improvement over existing code. Thanks, John
linux-next: Tree for Nov 16
Hi all, Changes since 20161115: Non-merge commits (relative to Linus' tree): 5617 5979 files changed, 365287 insertions(+), 131171 deletions(-) I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master. You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig (this fails its final link) and pseries_le_defconfig and i386, sparc and sparc64 defconfig. Below is a summary of the state of the merge. I am currently merging 244 trees (counting Linus' and 34 trees of bug fix patches pending for the current merge release). Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html . Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds. Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes. 
-- Cheers, Stephen Rothwell $ git checkout master $ git reset --hard stable Merging origin/master (81bcfe5e48f9 Merge tag 'trace-v4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace) Merging fixes/master (30066ce675d3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6) Merging kbuild-current/rc-fixes (cc6acc11cad1 kbuild: be more careful about matching preprocessed asm ___EXPORT_SYMBOL) Merging arc-current/for-curr (a25f0944ba9b Linux 4.9-rc5) Merging arm-current/fixes (6127d124ee4e ARM: wire up new pkey syscalls) Merging m68k-current/for-linus (7e251bb21ae0 m68k: Fix ndelay() macro) Merging metag-fixes/fixes (35d04077ad96 metag: Only define atomic_dec_if_positive conditionally) Merging powerpc-fixes/fixes (c0a36013639b powerpc/64: Fix setting of AIL in hypervisor mode) Merging sparc/master (87a349f9cc09 sparc64: fix compile warning section mismatch in find_node()) Merging net/master (f6c365fad103 net: ethernet: Fix SGMII unable to switch speed and autonego failure) Merging ipsec/master (7f92083eb58f vti6: flush x-netns xfrm cache when vti interface is removed) Merging netfilter/master (9b6c14d51bd2 net: tcp response should set oif only if it is L3 master) Merging ipvs/master (9b6c14d51bd2 net: tcp response should set oif only if it is L3 master) Merging wireless-drivers/master (d3532ea6ce4e brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap) Merging mac80211/master (4fb7f8af1f4c mac80211_hwsim: fix beacon delta calculation) Merging sound-current/for-linus (6ff1a25318eb ALSA: usb-audio: Fix use-after-free of usb_device at disconnect) Merging pci-current/for-linus (bc79c9851a76 PCI: VMD: Update filename to reflect move) Merging driver-core.current/driver-core-linus (a25f0944ba9b Linux 4.9-rc5) Merging tty.current/tty-linus (a909d3e63699 Linux 4.9-rc3) Merging usb.current/usb-linus (a5d906bb261c usb: chipidea: move the lock initialization to core file) Merging usb-gadget-fixes/fixes 
(fd9afd3cbe40 usb: gadget: u_ether: remove interrupt throttling) Merging usb-serial-fixes/usb-linus (9bfef729a3d1 USB: serial: ftdi_sio: add support for TI CC3200 LaunchPad) Merging usb-chipidea-fixes/ci-for-usb-stable (c7fbb09b2ea1 usb: chipidea: move the lock initialization to core file) Merging phy/fixes (4320f9d4c183 phy: sun4i: check PMU presence when poking unknown bit of pmu) Merging staging.current/staging-linus (a25f0944ba9b Linux 4.9-rc5) Merging char-misc.current/char-misc-linus (a25f0944ba9b Linux 4.9-rc5) Merging input-current/for-linus (324ae0958cab Input: psmouse - cleanup Focaltech code) Merging crypto-current/master (83d2c9a9c17b crypto: caam - do not register AES-XTS mode on LP units) Merging ide/master (797cee982eef Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit) Merging vfio-fixes/for-linus (05692d7005a3 vfio/pci: Fix integer overflows, bitmask check) Merging kselftest-fixes/fixes (1001354ca341 Linux 4.9-rc1) Merging backlight-fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: Handle EPROBE_DEF
Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
On 2016/11/15 23:47, Peter Zijlstra wrote: On Wed, Nov 02, 2016 at 05:08:33AM -0400, Pan Xinhui wrote: diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0f400c0..38c3bb7 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -310,6 +310,8 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); + + bool (*vcpu_is_preempted)(int cpu); }; So that ends up with a full function call in the native case. I did something like the below on top, completely untested, not been near a compiler etc.. Hi, Peter. I think we can avoid a function call in a simpler way. How about the below? static inline bool vcpu_is_preempted(int cpu) { /* only set in pv case */ if (pv_lock_ops.vcpu_is_preempted) return pv_lock_ops.vcpu_is_preempted(cpu); return false; } It doesn't get rid of the branch, but at least it avoids the function call, and hardware should have no trouble predicting a constant condition. Also, it looks like you end up not setting vcpu_is_preempted when KVM doesn't support steal clock, which would end up in an instant NULL deref. Fixed that too. Maybe not true. There is .vcpu_is_preempted = native_vcpu_is_preempted when we define pv_lock_ops. Your patch is a good example for anyone who wants to add a native/pv function.
:) thanks xinhui --- --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,6 +673,11 @@ static __always_inline void pv_kick(int PVOP_VCALL1(pv_lock_ops.kick, cpu); } +static __always_inline void pv_vcpu_is_prempted(int cpu) +{ + PVOP_VCALLEE1(pv_lock_ops.vcpu_is_preempted, cpu); +} + #endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - bool (*vcpu_is_preempted)(int cpu); + struct paravirt_callee_save vcpu_is_preempted; }; /* This contains all the paravirt structures: we get a convenient --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -32,6 +32,12 @@ static inline void queued_spin_unlock(st { pv_queued_spin_unlock(lock); } + +#define vcpu_is_preempted vcpu_is_preempted +static inline bool vcpu_is_preempted(int cpu) +{ + return pv_vcpu_is_preempted(cpu); +} #else static inline void queued_spin_unlock(struct qspinlock *lock) { --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -26,14 +26,6 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); -#ifdef CONFIG_PARAVIRT_SPINLOCKS -#define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) -{ - return pv_lock_ops.vcpu_is_preempted(cpu); -} -#endif - #include /* --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -415,15 +415,6 @@ void kvm_disable_steal_time(void) wrmsr(MSR_KVM_STEAL_TIME, 0, 0); } -static bool kvm_vcpu_is_preempted(int cpu) -{ - struct kvm_steal_time *src; - - src = &per_cpu(steal_time, cpu); - - return !!src->preempted; -} - #ifdef CONFIG_SMP static void __init kvm_smp_prepare_boot_cpu(void) { @@ -480,9 +471,6 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { has_steal_clock = 1; 
pv_time_ops.steal_clock = kvm_steal_clock; -#ifdef CONFIG_PARAVIRT_SPINLOCKS - pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted; -#endif } if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) @@ -604,6 +592,14 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +static bool __kvm_vcpu_is_preempted(int cpu) +{ + struct kvm_steal_time *src = &per_cpu(steal_time, cpu); + + return !!src->preempted; +} +PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ @@ -620,6 +616,12 @@ void __init kvm_spinlock_init(void) pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; + pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted); + + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { + pv_lock_ops.vcpu_is_preempted = + PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); + } } static __init int kvm_spinlock_init_jump(void) --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -12,7 +12,6 @@ __visible void __native_queued_spin_unlo { native_queued_spin_unlock(lock); } - PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock); bool pv_is_native_spi
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On 11/16/2016 9:28 AM, Alex Williamson wrote: > On Wed, 16 Nov 2016 09:13:37 +0530 > Kirti Wankhede wrote: > >> On 11/16/2016 8:55 AM, Alex Williamson wrote: >>> On Tue, 15 Nov 2016 20:16:12 -0700 >>> Alex Williamson wrote: >>> On Wed, 16 Nov 2016 08:16:15 +0530 Kirti Wankhede wrote: > On 11/16/2016 3:49 AM, Alex Williamson wrote: >> On Tue, 15 Nov 2016 20:59:54 +0530 >> Kirti Wankhede wrote: >> > ... > >>> @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu >>> *iommu, >>> */ >>> if (dma->task->mm != current->mm) >>> break; >>> + >>> unmapped += dma->size; >>> + >>> + if (iommu->external_domain && >>> !RB_EMPTY_ROOT(&dma->pfn_list)) { >>> + struct vfio_iommu_type1_dma_unmap nb_unmap; >>> + >>> + nb_unmap.iova = dma->iova; >>> + nb_unmap.size = dma->size; >>> + >>> + /* >>> +* Notifier callback would call >>> vfio_unpin_pages() which >>> +* would acquire iommu->lock. Release lock here >>> and >>> +* reacquire it again. >>> +*/ >>> + mutex_unlock(&iommu->lock); >>> + blocking_notifier_call_chain(&iommu->notifier, >>> + >>> VFIO_IOMMU_NOTIFY_DMA_UNMAP, >>> + &nb_unmap); >>> + mutex_lock(&iommu->lock); >>> + if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) >>> + break; >>> + } >> >> >> Why exactly do we need to notify per vfio_dma rather than per unmap >> request? If we do the latter we can send the notify first, limiting us >> to races where a page is pinned between the notify and the locking, >> whereas here, even our dma pointer is suspect once we re-acquire the >> lock, we don't technically know if another unmap could have removed >> that already. Perhaps something like this (untested): >> > > There are checks to validate unmap request, like v2 check and who is > calling unmap and is it allowed for that task to unmap. Before these > checks its not sure that unmap region range which asked for would be > unmapped all. Notify call should be at the place where its sure that the > range provided to notify call is definitely going to be removed. 
My > change do that. Ok, but that does solve the problem. What about this (untested): >>> >>> s/does/does not/ >>> >>> BTW, I like how the retries here fill the gap in my previous proposal >>> where we could still race re-pinning. We've given it an honest shot or >>> someone is not participating if we've retried 10 times. I don't >>> understand why the test for iommu->external_domain was there, clearly >>> if the list is not empty, we need to notify. Thanks, >>> >> >> Ok. Retry is good to give a chance to unpin all. But is it really >> required to use BUG_ON() that would panic the host. I think WARN_ON >> should be fine and then when container is closed or when the last group >> is removed from the container, vfio_iommu_type1_release() is called and >> we have a chance to unpin it all. > > See my comments on patch 10/22, we need to be vigilant that the vendor > driver is participating. I don't think we should be cleaning up after > the vendor driver on release, if we need to do that, it implies we > already have problems in multi-mdev containers since we'll be left with > pfn_list entries that no longer have an owner. Thanks, > If any vendor driver doesn't clean its pinned pages and there are entries in pfn_list with no owner, that would be indicated by WARN_ON, which should be fixed by that vendor driver. I still feel it shouldn't cause host panic. When such warning is seen with multiple mdev devices in container, it is easy to isolate and find which vendor driver is not cleaning their stuff, same warning would be seen with single mdev device in a container. To isolate and find which vendor driver is culprit check with one mdev device at a time. Finally, we have a chance to clean all residue from vfio_iommu_type1_release() so that vfio_iommu_type1 module doesn't leave any leaks. Thanks, Kirti
[PATCH v2] staging: slicoss: fix different address space warnings
This patch fix the following sparse warnings in slicoss driver: warning: incorrect type in assignment (different address spaces) Changes in v2: * Remove IOMEM_GET_FIELDADDR macro * Add ioread64 and iowrite64 defines Signed-off-by: Sergio Paracuellos --- drivers/staging/slicoss/slicoss.c | 111 ++ 1 file changed, 76 insertions(+), 35 deletions(-) diff --git a/drivers/staging/slicoss/slicoss.c b/drivers/staging/slicoss/slicoss.c index d2929b9..d68a463 100644 --- a/drivers/staging/slicoss/slicoss.c +++ b/drivers/staging/slicoss/slicoss.c @@ -128,6 +128,35 @@ MODULE_DEVICE_TABLE(pci, slic_pci_tbl); +#ifndef ioread64 +#ifdef readq +#define ioread64 readq +#else +#define ioread64 _ioread64 +static inline u64 _ioread64(void __iomem *mmio) +{ + u64 low, high; + + low = ioread32(mmio); + high = ioread32(mmio + sizeof(u32)); + return low | (high << 32); +} +#endif +#endif + +#ifndef iowrite64 +#ifdef writeq +#define iowrite64 writeq +#else +#define iowrite64 _iowrite64 +static inline void _iowrite64(u64 val, void __iomem *mmio) +{ + iowrite32(val, mmio); + iowrite32(val >> 32, mmio + sizeof(u32)); +} +#endif +#endif + static void slic_mcast_set_bit(struct adapter *adapter, char *address) { unsigned char crcpoly; @@ -923,8 +952,8 @@ static int slic_upr_request(struct adapter *adapter, static void slic_link_upr_complete(struct adapter *adapter, u32 isr) { struct slic_shmemory *sm = &adapter->shmem; - struct slic_shmem_data *sm_data = sm->shmem_data; - u32 lst = sm_data->lnkstatus; + struct slic_shmem_data __iomem *sm_data = sm->shmem_data; + u32 lst = ioread32(&sm_data->lnkstatus); uint linkup; unsigned char linkspeed; unsigned char linkduplex; @@ -1003,8 +1032,8 @@ static void slic_upr_request_complete(struct adapter *adapter, u32 isr) switch (upr->upr_request) { case SLIC_UPR_STATS: { struct slic_shmemory *sm = &adapter->shmem; - struct slic_shmem_data *sm_data = sm->shmem_data; - struct slic_stats *stats = &sm_data->stats; + struct slic_shmem_data __iomem *sm_data = 
sm->shmem_data; + struct slic_stats __iomem *stats = &sm_data->stats; struct slic_stats *old = &adapter->inicstats_prev; struct slicnet_stats *stst = &adapter->slic_stats; @@ -1014,50 +1043,62 @@ static void slic_upr_request_complete(struct adapter *adapter, u32 isr) break; } - UPDATE_STATS_GB(stst->tcp.xmit_tcp_segs, stats->xmit_tcp_segs, + UPDATE_STATS_GB(stst->tcp.xmit_tcp_segs, + ioread64(&stats->xmit_tcp_segs), old->xmit_tcp_segs); - UPDATE_STATS_GB(stst->tcp.xmit_tcp_bytes, stats->xmit_tcp_bytes, + UPDATE_STATS_GB(stst->tcp.xmit_tcp_bytes, + ioread64(&stats->xmit_tcp_bytes), old->xmit_tcp_bytes); - UPDATE_STATS_GB(stst->tcp.rcv_tcp_segs, stats->rcv_tcp_segs, + UPDATE_STATS_GB(stst->tcp.rcv_tcp_segs, + ioread64(&stats->rcv_tcp_segs), old->rcv_tcp_segs); - UPDATE_STATS_GB(stst->tcp.rcv_tcp_bytes, stats->rcv_tcp_bytes, + UPDATE_STATS_GB(stst->tcp.rcv_tcp_bytes, + ioread64(&stats->rcv_tcp_bytes), old->rcv_tcp_bytes); - UPDATE_STATS_GB(stst->iface.xmt_bytes, stats->xmit_bytes, + UPDATE_STATS_GB(stst->iface.xmt_bytes, + ioread64(&stats->xmit_bytes), old->xmit_bytes); - UPDATE_STATS_GB(stst->iface.xmt_ucast, stats->xmit_unicasts, + UPDATE_STATS_GB(stst->iface.xmt_ucast, + ioread64(&stats->xmit_unicasts), old->xmit_unicasts); - UPDATE_STATS_GB(stst->iface.rcv_bytes, stats->rcv_bytes, + UPDATE_STATS_GB(stst->iface.rcv_bytes, + ioread64(&stats->rcv_bytes), old->rcv_bytes); - UPDATE_STATS_GB(stst->iface.rcv_ucast, stats->rcv_unicasts, + UPDATE_STATS_GB(stst->iface.rcv_ucast, + ioread64(&stats->rcv_unicasts), old->rcv_unicasts); - UPDATE_STATS_GB(stst->iface.xmt_errors, stats->xmit_collisions, + UPDATE_STATS_GB(stst->iface.xmt_errors, + ioread64(&stats->xmit_collisions), old->xmit_collisions); UPDATE_STATS_GB(stst->iface.xmt_errors, - stats->xmit_excess_collisions, +
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On Wed, 16 Nov 2016 09:13:37 +0530 Kirti Wankhede wrote: > On 11/16/2016 8:55 AM, Alex Williamson wrote: > > On Tue, 15 Nov 2016 20:16:12 -0700 > > Alex Williamson wrote: > > > >> On Wed, 16 Nov 2016 08:16:15 +0530 > >> Kirti Wankhede wrote: > >> > >>> On 11/16/2016 3:49 AM, Alex Williamson wrote: > On Tue, 15 Nov 2016 20:59:54 +0530 > Kirti Wankhede wrote: > > >>> ... > >>> > > @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu > > *iommu, > > */ > > if (dma->task->mm != current->mm) > > break; > > + > > unmapped += dma->size; > > + > > + if (iommu->external_domain && > > !RB_EMPTY_ROOT(&dma->pfn_list)) { > > + struct vfio_iommu_type1_dma_unmap nb_unmap; > > + > > + nb_unmap.iova = dma->iova; > > + nb_unmap.size = dma->size; > > + > > + /* > > +* Notifier callback would call > > vfio_unpin_pages() which > > +* would acquire iommu->lock. Release lock here > > and > > +* reacquire it again. > > +*/ > > + mutex_unlock(&iommu->lock); > > + blocking_notifier_call_chain(&iommu->notifier, > > + > > VFIO_IOMMU_NOTIFY_DMA_UNMAP, > > + &nb_unmap); > > + mutex_lock(&iommu->lock); > > + if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > > + break; > > + } > > > Why exactly do we need to notify per vfio_dma rather than per unmap > request? If we do the latter we can send the notify first, limiting us > to races where a page is pinned between the notify and the locking, > whereas here, even our dma pointer is suspect once we re-acquire the > lock, we don't technically know if another unmap could have removed > that already. Perhaps something like this (untested): > > >>> > >>> There are checks to validate unmap request, like v2 check and who is > >>> calling unmap and is it allowed for that task to unmap. Before these > >>> checks its not sure that unmap region range which asked for would be > >>> unmapped all. Notify call should be at the place where its sure that the > >>> range provided to notify call is definitely going to be removed. My > >>> change do that. 
> >> > >> Ok, but that does solve the problem. What about this (untested): > > > > s/does/does not/ > > > > BTW, I like how the retries here fill the gap in my previous proposal > > where we could still race re-pinning. We've given it an honest shot or > > someone is not participating if we've retried 10 times. I don't > > understand why the test for iommu->external_domain was there, clearly > > if the list is not empty, we need to notify. Thanks, > > > > Ok. Retry is good to give a chance to unpin all. But is it really > required to use BUG_ON() that would panic the host. I think WARN_ON > should be fine and then when container is closed or when the last group > is removed from the container, vfio_iommu_type1_release() is called and > we have a chance to unpin it all. See my comments on patch 10/22, we need to be vigilant that the vendor driver is participating. I don't think we should be cleaning up after the vendor driver on release, if we need to do that, it implies we already have problems in multi-mdev containers since we'll be left with pfn_list entries that no longer have an owner. Thanks, Alex
[PATCH 2/2] ahci: qoriq: report warning when ecc register is missing
From: Tang Yuantian For ls1021a and ls1046a socs, sata ecc must be disabled. If ecc register is not found in sata node in dts, report a warning. Signed-off-by: Tang Yuantian --- drivers/ata/ahci_qoriq.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/ata/ahci_qoriq.c b/drivers/ata/ahci_qoriq.c index 45c88de..66eb4b5 100644 --- a/drivers/ata/ahci_qoriq.c +++ b/drivers/ata/ahci_qoriq.c @@ -158,6 +158,7 @@ static int ahci_qoriq_phy_init(struct ahci_host_priv *hpriv) switch (qpriv->type) { case AHCI_LS1021A: + WARN_ON(!qpriv->ecc_addr); writel(SATA_ECC_DISABLE, qpriv->ecc_addr); writel(AHCI_PORT_PHY_1_CFG, reg_base + PORT_PHY1); writel(LS1021A_PORT_PHY2, reg_base + PORT_PHY2); @@ -185,6 +186,7 @@ static int ahci_qoriq_phy_init(struct ahci_host_priv *hpriv) break; case AHCI_LS1046A: + WARN_ON(!qpriv->ecc_addr); writel(LS1046A_SATA_ECC_DIS, qpriv->ecc_addr); writel(AHCI_PORT_PHY_1_CFG, reg_base + PORT_PHY1); writel(AHCI_PORT_TRANS_CFG, reg_base + PORT_TRANS); -- 2.1.0.27.g96db324
[PATCH 1/2] ahci: qoriq: added a condition to enable dma coherence
From: Tang Yuantian Enable DMA coherence in SATA controller on condition that dma-coherent property exists in sata node in DTS. Signed-off-by: Tang Yuantian --- drivers/ata/ahci_qoriq.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/ata/ahci_qoriq.c b/drivers/ata/ahci_qoriq.c index 9884c8c..45c88de 100644 --- a/drivers/ata/ahci_qoriq.c +++ b/drivers/ata/ahci_qoriq.c @@ -59,6 +59,7 @@ struct ahci_qoriq_priv { struct ccsr_ahci *reg_base; enum ahci_qoriq_type type; void __iomem *ecc_addr; + bool is_dmacoherent; }; static const struct of_device_id ahci_qoriq_of_match[] = { @@ -164,26 +165,31 @@ static int ahci_qoriq_phy_init(struct ahci_host_priv *hpriv) writel(LS1021A_PORT_PHY4, reg_base + PORT_PHY4); writel(LS1021A_PORT_PHY5, reg_base + PORT_PHY5); writel(AHCI_PORT_TRANS_CFG, reg_base + PORT_TRANS); - writel(AHCI_PORT_AXICC_CFG, reg_base + LS1021A_AXICC_ADDR); + if (qpriv->is_dmacoherent) + writel(AHCI_PORT_AXICC_CFG, + reg_base + LS1021A_AXICC_ADDR); break; case AHCI_LS1043A: writel(AHCI_PORT_PHY_1_CFG, reg_base + PORT_PHY1); writel(AHCI_PORT_TRANS_CFG, reg_base + PORT_TRANS); - writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); + if (qpriv->is_dmacoherent) + writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); break; case AHCI_LS2080A: writel(AHCI_PORT_PHY_1_CFG, reg_base + PORT_PHY1); writel(AHCI_PORT_TRANS_CFG, reg_base + PORT_TRANS); - writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); + if (qpriv->is_dmacoherent) + writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); break; case AHCI_LS1046A: writel(LS1046A_SATA_ECC_DIS, qpriv->ecc_addr); writel(AHCI_PORT_PHY_1_CFG, reg_base + PORT_PHY1); writel(AHCI_PORT_TRANS_CFG, reg_base + PORT_TRANS); - writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); + if (qpriv->is_dmacoherent) + writel(AHCI_PORT_AXICC_CFG, reg_base + PORT_AXICC); break; } @@ -221,6 +227,7 @@ static int ahci_qoriq_probe(struct platform_device *pdev) if (IS_ERR(qoriq_priv->ecc_addr)) return PTR_ERR(qoriq_priv->ecc_addr); 
} + qoriq_priv->is_dmacoherent = of_property_read_bool(np, "dma-coherent"); rc = ahci_platform_enable_resources(hpriv); if (rc) -- 2.1.0.27.g96db324
Re: [PATCH v3] cpufreq: conservative: Decrease frequency faster when the update deferred
On 15-11-16, 23:25, Stratos Karafotis wrote: > diff --git a/drivers/cpufreq/cpufreq_conservative.c > b/drivers/cpufreq/cpufreq_conservative.c > index 0681fcf..808cc4d 100644 > --- a/drivers/cpufreq/cpufreq_conservative.c > +++ b/drivers/cpufreq/cpufreq_conservative.c > @@ -66,6 +66,7 @@ static unsigned int cs_dbs_update(struct cpufreq_policy > *policy) > struct dbs_data *dbs_data = policy_dbs->dbs_data; > struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; > unsigned int load = dbs_update(policy); > + unsigned int freq_step; > > /* >* break out if we 'cannot' reduce the speed as the user might > @@ -82,6 +83,22 @@ static unsigned int cs_dbs_update(struct cpufreq_policy > *policy) > if (requested_freq > policy->max || requested_freq < policy->min) > requested_freq = policy->cur; > > + freq_step = get_freq_step(cs_tuners, policy); > + > + /* > + * Decrease requested_freq one freq_step for each idle period that > + * we didn't update the frequency. > + */ > + if (policy_dbs->idle_periods < UINT_MAX) { > + unsigned int freq_steps = policy_dbs->idle_periods * freq_step; > + > + if (requested_freq > freq_steps) > + requested_freq -= freq_steps; > + else > + requested_freq = policy->min; Need a blank line here. 
> + policy_dbs->idle_periods = UINT_MAX; > + } > + > /* Check for frequency increase */ > if (load > dbs_data->up_threshold) { > dbs_info->down_skip = 0; > @@ -90,7 +107,7 @@ static unsigned int cs_dbs_update(struct cpufreq_policy > *policy) > if (requested_freq == policy->max) > goto out; > > - requested_freq += get_freq_step(cs_tuners, policy); > + requested_freq += freq_step; > if (requested_freq > policy->max) > requested_freq = policy->max; > > @@ -106,14 +123,12 @@ static unsigned int cs_dbs_update(struct cpufreq_policy > *policy) > > /* Check for frequency decrease */ > if (load < cs_tuners->down_threshold) { > - unsigned int freq_step; > /* >* if we cannot reduce the frequency anymore, break out early >*/ > if (requested_freq == policy->min) > goto out; > > - freq_step = get_freq_step(cs_tuners, policy); > if (requested_freq > freq_step) > requested_freq -= freq_step; > else > diff --git a/drivers/cpufreq/cpufreq_governor.c > b/drivers/cpufreq/cpufreq_governor.c > index 3729474..9780f50 100644 > --- a/drivers/cpufreq/cpufreq_governor.c > +++ b/drivers/cpufreq/cpufreq_governor.c > @@ -117,7 +117,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > struct policy_dbs_info *policy_dbs = policy->governor_data; > struct dbs_data *dbs_data = policy_dbs->dbs_data; > unsigned int ignore_nice = dbs_data->ignore_nice_load; > - unsigned int max_load = 0; > + unsigned int max_load = 0, idle_periods = UINT_MAX; > unsigned int sampling_rate, io_busy, j; > > /* > @@ -214,10 +214,17 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > } > j_cdbs->prev_load = load; > } Here as well.. > + if (time_elapsed > 2 * sampling_rate) { > + unsigned int periods = time_elapsed / sampling_rate; > > + if (periods < idle_periods) > + idle_periods = periods; > + } Here too > if (load > max_load) > max_load = load; > } And here.. 
> + policy_dbs->idle_periods = idle_periods; > + > return max_load; > } > EXPORT_SYMBOL_GPL(dbs_update); > diff --git a/drivers/cpufreq/cpufreq_governor.h > b/drivers/cpufreq/cpufreq_governor.h > index 9660cc6..10a3e0a 100644 > --- a/drivers/cpufreq/cpufreq_governor.h > +++ b/drivers/cpufreq/cpufreq_governor.h > @@ -97,6 +97,7 @@ struct policy_dbs_info { > struct list_head list; > /* Multiplier for increasing sample delay temporarily. */ > unsigned int rate_mult; > + unsigned int idle_periods; > /* Status indicators */ > bool is_shared; /* This object is used by multiple CPUs */ > bool work_in_progress; /* Work is being queued up or in progress */ And after fixing this trivial things, you can add Acked-by: Viresh Kumar -- viresh
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
Just as an FYI, the linker bug has been fixed in binutils. On Fri, Nov 11, 2016 at 5:07 PM, Richard Weinberger wrote: > On 11.11.2016 22:03, Keno Fischer wrote: >> Did you have CONFIG_INET set? I'm attaching my full .config. This is >> on vanilla Ubuntu 16.10. > > Yes, CONFIG_INET is set. Let my try on Ubuntu. ;-\ > >> I did see the same error when building with `CONFIG_STATIC_LINK=y`. >> Note that I also, separately, ran into a linker problem, though I >> believe it is unrelated to this patch >> (though perhaps is related to the problem you're seeing?): >> https://sourceware.org/bugzilla/show_bug.cgi?id=20800. > > This seems to be an UML<->glibc issue. > memmove() is now an ifunc and for whatever reason it does not work with UML. > >> I'd also be happy to provide you with ssh access to the machine that >> I'm seeing this on if that >> would be helpful. > > Okay, let me try myself first. I think I'm able to install Ubuntu. :) > > Thanks, > //richard
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On 11/16/2016 8:55 AM, Alex Williamson wrote: > On Tue, 15 Nov 2016 20:16:12 -0700 > Alex Williamson wrote: > >> On Wed, 16 Nov 2016 08:16:15 +0530 >> Kirti Wankhede wrote: >> >>> On 11/16/2016 3:49 AM, Alex Williamson wrote: On Tue, 15 Nov 2016 20:59:54 +0530 Kirti Wankhede wrote: >>> ... >>> > @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu > *iommu, >*/ > if (dma->task->mm != current->mm) > break; > + > unmapped += dma->size; > + > + if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) { > + struct vfio_iommu_type1_dma_unmap nb_unmap; > + > + nb_unmap.iova = dma->iova; > + nb_unmap.size = dma->size; > + > + /* > + * Notifier callback would call vfio_unpin_pages() which > + * would acquire iommu->lock. Release lock here and > + * reacquire it again. > + */ > + mutex_unlock(&iommu->lock); > + blocking_notifier_call_chain(&iommu->notifier, > + VFIO_IOMMU_NOTIFY_DMA_UNMAP, > + &nb_unmap); > + mutex_lock(&iommu->lock); > + if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > + break; > + } Why exactly do we need to notify per vfio_dma rather than per unmap request? If we do the latter we can send the notify first, limiting us to races where a page is pinned between the notify and the locking, whereas here, even our dma pointer is suspect once we re-acquire the lock, we don't technically know if another unmap could have removed that already. Perhaps something like this (untested): >>> >>> There are checks to validate unmap request, like v2 check and who is >>> calling unmap and is it allowed for that task to unmap. Before these >>> checks its not sure that unmap region range which asked for would be >>> unmapped all. Notify call should be at the place where its sure that the >>> range provided to notify call is definitely going to be removed. My >>> change do that. >> >> Ok, but that does solve the problem. 
What about this (untested): > > s/does/does not/ > > BTW, I like how the retries here fill the gap in my previous proposal > where we could still race re-pinning. We've given it an honest shot or > someone is not participating if we've retried 10 times. I don't > understand why the test for iommu->external_domain was there, clearly > if the list is not empty, we need to notify. Thanks, > Ok. Retry is good to give a chance to unpin all. But is it really required to use BUG_ON() that would panic the host. I think WARN_ON should be fine and then when container is closed or when the last group is removed from the container, vfio_iommu_type1_release() is called and we have a chance to unpin it all. Thanks, Kirti > Alex > >> diff --git a/drivers/vfio/vfio_iommu_type1.c >> b/drivers/vfio/vfio_iommu_type1.c >> index ee9a680..50cafdf 100644 >> --- a/drivers/vfio/vfio_iommu_type1.c >> +++ b/drivers/vfio/vfio_iommu_type1.c >> @@ -782,9 +782,9 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, >> struct vfio_iommu_type1_dma_unmap *unmap) >> { >> uint64_t mask; >> -struct vfio_dma *dma; >> +struct vfio_dma *dma, *dma_last = NULL; >> size_t unmapped = 0; >> -int ret = 0; >> +int ret = 0, retries; >> >> mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1; >> >> @@ -794,7 +794,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, >> return -EINVAL; >> >> WARN_ON(mask & PAGE_MASK); >> - >> +again: >> mutex_lock(&iommu->lock); >> >> /* >> @@ -851,11 +851,16 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, >> if (dma->task->mm != current->mm) >> break; >> >> -unmapped += dma->size; >> - >> -if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) { >> +if (!RB_EMPTY_ROOT(&dma->pfn_list)) { >> struct vfio_iommu_type1_dma_unmap nb_unmap; >> >> +if (dma_last == dma) { >> +BUG_ON(++retries > 10); >> +} else { >> +dma_last = dma; >> +retries = 0; >> +} >> + >> nb_unmap.iova = dma->iova; >> nb_unmap.size = dma->size; >> >> @@ -868,11 +873,11 @@ static 
int vfio_dma_do_unmap(struct vfio_iommu *iommu, >> blocking_notifier_call_chain(&iommu->notifier, >> VFIO_IOMMU_
Re: [PATCH net-next v8 0/9] dpaa_eth: Add the QorIQ DPAA Ethernet driver
From: Madalin Bucur Date: Tue, 15 Nov 2016 10:41:00 +0200 > This patch series adds the Ethernet driver for the Freescale > QorIQ Data Path Acceleration Architecture (DPAA). Series applied, thanks.
Re: [PATCH net-next] tcp: allow to enable the repair mode for non-listening sockets
From: Andrei Vagin Date: Mon, 14 Nov 2016 18:15:14 -0800 > The repair mode is used to get and restore sequence numbers and > data from queues. It used to checkpoint/restore connections. > > Currently the repair mode can be enabled for sockets in the established > and closed states, but for other states we have to dump the same socket > properties, so lets allow to enable repair mode for these sockets. > > The repair mode reveals nothing more for sockets in other states. > > Signed-off-by: Andrei Vagin Applied.
Re: [TEST PATCH] WIP: Test OPP multi regulator support with ti-opp-domain driver
Thanks for this Dave :) On 15-11-16, 16:10, Dave Gerlach wrote: > NOT FOR MERGE! > > Introduce a test version of a 'ti-opp-domain' driver that will use new > multiple regulator support introduced to the OPP core by Viresh [1]. > Tested on v4.9-rc1 with that series applied. This is needed on TI > platforms like DRA7/AM57 in order to control both CPU regulator and > Adaptive Body Bias (ABB) regulator as described by Nishanth Menon here > [2]. These regulators must be scaled in sequence during an OPP > transition depending on whether or not the frequency is being scaled up > or down. Based on the new functionality provided by Viresh this driver > does the following: > > * Call dev_pm_opp_set_regulators with the names of the two regulators > that feed the CPU: > * vdd is the 'cpu-supply' commonly used for cpufreq-dt but > renamed so the cpufreq-dt driver doesn't use it directly. > Note that this is supplied in board dts as it's external to > SoC. I think I can fix this somehow.. Lemme check. > * vbb for the ABB regulator. This is provided in SoC dtsi as it > is internal to the SoC. > * Provide a platform set_opp function using > dev_pm_opp_register_set_opp_helper that is called when an OPP > transition is requested. > * Allow cpufreq-dt to probe which will work because no cpu-supply > regulator is found so the driver proceeds and calls > dev_pm_opp_set_rate which through the OPP core invokes the platform > set_opp call we provided > * Platform set_opp call provided by this driver checks to see if we are > scaling frequency up or down and based on this, scales vbb before vdd > for up or the other way around for down. > > In addition to that, this driver implements AVS Class 0 as described in > section 18.4.6.12 of AM572x TRM [3] using the same platform set_rate > hook added to the OPP core. 
There are registers that define the optimal > voltage for that specific piece of silicon for an OPP so this driver > simply looks up this optimal value and programs that for an OPP instead > of the nominal value. > > Missing from this is a good way to ensure that cpufreq-dt does not just > proceed if no cpu-supply regulator is found but we were intending to > rely on a platform set_opp and multiple regulators. > > [1] https://marc.info/?l=linux-pm&m=147746362402994&w=2 > [2] https://marc.info/?l=linux-pm&m=145684495832764&w=2 > [3] http://www.ti.com/lit/ug/spruhz6g/spruhz6g.pdf > > Signed-off-by: Dave Gerlach > --- > arch/arm/boot/dts/am57xx-beagle-x15-common.dtsi | 2 +- > arch/arm/boot/dts/dra7.dtsi | 46 ++- > drivers/soc/ti/Makefile | 2 + > drivers/soc/ti/ti-opp-domain.c | 427 > I would rather ask you to move this to drivers/base/power/opp/ -- viresh
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On Tue, 15 Nov 2016 20:16:12 -0700 Alex Williamson wrote: > On Wed, 16 Nov 2016 08:16:15 +0530 > Kirti Wankhede wrote: > > > On 11/16/2016 3:49 AM, Alex Williamson wrote: > > > On Tue, 15 Nov 2016 20:59:54 +0530 > > > Kirti Wankhede wrote: > > > > > ... > > > > >> @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu > > >> *iommu, > > >> */ > > >> if (dma->task->mm != current->mm) > > >> break; > > >> + > > >> unmapped += dma->size; > > >> + > > >> +if (iommu->external_domain && > > >> !RB_EMPTY_ROOT(&dma->pfn_list)) { > > >> +struct vfio_iommu_type1_dma_unmap nb_unmap; > > >> + > > >> +nb_unmap.iova = dma->iova; > > >> +nb_unmap.size = dma->size; > > >> + > > >> +/* > > >> + * Notifier callback would call > > >> vfio_unpin_pages() which > > >> + * would acquire iommu->lock. Release lock here > > >> and > > >> + * reacquire it again. > > >> + */ > > >> +mutex_unlock(&iommu->lock); > > >> +blocking_notifier_call_chain(&iommu->notifier, > > >> + > > >> VFIO_IOMMU_NOTIFY_DMA_UNMAP, > > >> +&nb_unmap); > > >> +mutex_lock(&iommu->lock); > > >> +if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > > >> +break; > > >> +} > > > > > > > > > Why exactly do we need to notify per vfio_dma rather than per unmap > > > request? If we do the latter we can send the notify first, limiting us > > > to races where a page is pinned between the notify and the locking, > > > whereas here, even our dma pointer is suspect once we re-acquire the > > > lock, we don't technically know if another unmap could have removed > > > that already. Perhaps something like this (untested): > > > > > > > There are checks to validate unmap request, like v2 check and who is > > calling unmap and is it allowed for that task to unmap. Before these > > checks its not sure that unmap region range which asked for would be > > unmapped all. Notify call should be at the place where its sure that the > > range provided to notify call is definitely going to be removed. My > > change do that. 
> > Ok, but that does solve the problem. What about this (untested): s/does/does not/ BTW, I like how the retries here fill the gap in my previous proposal where we could still race re-pinning. We've given it an honest shot or someone is not participating if we've retried 10 times. I don't understand why the test for iommu->external_domain was there, clearly if the list is not empty, we need to notify. Thanks, Alex > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index ee9a680..50cafdf 100644 > --- a/drivers/vfio/vfio_iommu_type1.c > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -782,9 +782,9 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, >struct vfio_iommu_type1_dma_unmap *unmap) > { > uint64_t mask; > - struct vfio_dma *dma; > + struct vfio_dma *dma, *dma_last = NULL; > size_t unmapped = 0; > - int ret = 0; > + int ret = 0, retries; > > mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1; > > @@ -794,7 +794,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > return -EINVAL; > > WARN_ON(mask & PAGE_MASK); > - > +again: > mutex_lock(&iommu->lock); > > /* > @@ -851,11 +851,16 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > if (dma->task->mm != current->mm) > break; > > - unmapped += dma->size; > - > - if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) { > + if (!RB_EMPTY_ROOT(&dma->pfn_list)) { > struct vfio_iommu_type1_dma_unmap nb_unmap; > > + if (dma_last == dma) { > + BUG_ON(++retries > 10); > + } else { > + dma_last = dma; > + retries = 0; > + } > + > nb_unmap.iova = dma->iova; > nb_unmap.size = dma->size; > > @@ -868,11 +873,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > blocking_notifier_call_chain(&iommu->notifier, > VFIO_IOMMU_NOTIFY_DMA_UNMAP, > &nb_unmap); > - mutex_lock(&iommu->lock); > - if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > -
Re: [PATCH v10 01/11] remoteproc: st_slim_rproc: add a slimcore rproc driver
On Mon, Nov 14, 2016 at 11:42:16AM +, Peter Griffin wrote: > Hi Vinod, > > On Mon, 14 Nov 2016, Vinod Koul wrote: > > > On Mon, Nov 07, 2016 at 01:57:35PM +, Peter Griffin wrote: > > > > > > > > As you now make changes to the entire remoteproc Kconfig file, rather > > > > than simply add a Kconfig symbol we can't bring this in via Vinod's tree > > > > without providing Linus with a messy merge conflict. > > > > > > > > So the remoteproc parts now has to go through my tree. > > > > > > OK, I think the best approach is for Vinod to create an immutable > > > branch with the entire fdma series on, and then both of you merge that > > > branch into > > > your respective trees. > > > > my topic/st_fdma is immutable branch. You cna merge it, if you need a signed > > tag, please do let me know > > OK. > > > > > > > > > That way there won't be any conflicts and you can both accept further > > > changes > > > for v4.9 release. Trying to take half the series via rproc, and half via > > > dma trees won't work > > > because they have dependencies on each other. > > > > > > I will send a v11 series in a moment which includes the feedback in this > > > email > > > and also include the additional fixes which Vinod has applied since the > > > driver > > > has been in linux-next. > > > > WHY.. Stuff is already merged twice! > > When the feedback is "there is an unrelated change in this patch", the only > way > you can fix that is by having a new version of the patch. It can be reverted and clean patch applied.. -- ~Vinod
Re: [PATCH] reset: hisilicon: add a polarity cell for reset line specifier
Hi Philipp, On 2016/11/15 18:43, Philipp Zabel wrote: > Hi Jiancheng, > > Am Dienstag, den 15.11.2016, 15:09 +0800 schrieb Jiancheng Xue: >> Add a polarity cell for reset line specifier. If the reset line >> is asserted when the register bit is 1, the polarity is >> normal. Otherwise, it is inverted. >> >> Signed-off-by: Jiancheng Xue >> --- Thank you very much for replying so soon. Please allow me to describe the reason why this patch exists first. All bits in the reset controller were designed to be active-high. But in a recent chip only one bit was implemented to be active-low :( >> .../devicetree/bindings/clock/hisi-crg.txt | 11 --- >> arch/arm/boot/dts/hi3519.dtsi | 2 +- >> drivers/clk/hisilicon/reset.c | 36 >> -- >> 3 files changed, 33 insertions(+), 16 deletions(-) >> >> diff --git a/Documentation/devicetree/bindings/clock/hisi-crg.txt >> b/Documentation/devicetree/bindings/clock/hisi-crg.txt >> index e3919b6..fcbb4f3 100644 >> --- a/Documentation/devicetree/bindings/clock/hisi-crg.txt >> +++ b/Documentation/devicetree/bindings/clock/hisi-crg.txt >> @@ -25,19 +25,20 @@ to specify the clock which they consume. >> >> All these identifier could be found in . >> >> -- #reset-cells: should be 2. >> +- #reset-cells: should be 3. >> >> A reset signal can be controlled by writing a bit register in the CRG >> module. >> -The reset specifier consists of two cells. The first cell represents the >> +The reset specifier consists of three cells. The first cell represents the >> register offset relative to the base address. The second cell represents the >> -bit index in the register. >> +bit index in the register. The third cell represents the polarity of the >> reset >> +line (0 for normal, 1 for inverted). > > What is normal and what is inverted? Please specify which is active-high > and which is active-low. > OK. I'll use active-high and active-low instead. 
>> Example: CRG nodes >> CRG: clock-reset-controller@1201 { >> compatible = "hisilicon,hi3519-crg"; >> reg = <0x1201 0x1>; >> #clock-cells = <1>; >> -#reset-cells = <2>; >> +#reset-cells = <3>; >> }; >> >> Example: consumer nodes >> @@ -45,5 +46,5 @@ i2c0: i2c@1211 { >> compatible = "hisilicon,hi3519-i2c"; >> reg = <0x1211 0x1000>; >> clocks = <&CRG HI3519_I2C0_RST>; >> -resets = <&CRG 0xe4 0>; >> +resets = <&CRG 0xe4 0 0>; >> }; >> diff --git a/arch/arm/boot/dts/hi3519.dtsi b/arch/arm/boot/dts/hi3519.dtsi >> index 5729ecf..b7cb182 100644 >> --- a/arch/arm/boot/dts/hi3519.dtsi >> +++ b/arch/arm/boot/dts/hi3519.dtsi >> @@ -50,7 +50,7 @@ >> crg: clock-reset-controller@1201 { >> compatible = "hisilicon,hi3519-crg"; >> #clock-cells = <1>; >> -#reset-cells = <2>; >> +#reset-cells = <3>; > > That is a backwards incompatible change. Which I think in this case > could be tolerated, because there are no users yet of the reset > controller. Or are there any hi3519 based device trees that use the > resets out in the wild? If there are, the driver must continue to > support old device trees with two reset-cells. Which would not be > trivial because currently the core checks in reset_control_get that > rcdev->of_n_reset_cells is equal to the #reset-cells value from DT. I understand the backwards compatibility is very important. As it can be basically confirmed that the possibility of using hi3519 based device trees is very low, to keep the code simple, I chose to give up the backwards compatibility. Maybe it is not very convincing. If you think it's better to keep backwards compatibility here, I can only change reset-cells to 3 for chipsets except Hi3519. > One possibility to get around changing the binding would be to stuff the > polarity bit into low bits of the register address cell. > It's also a solution. But I feel it's not very clear for the reset consumer to composite this information together as an index number. Maybe I'm wrong. 
> Either way, I'm not very happy with blowing up the complexity of the > reset phandles at the reset consumer side. By complexity, do you mean that the consumer side doesn't need to know the detailed information of the implementation of the reset controller even though only in the device tree? > If you do change the binding, is there any way you could change from a > register address + bit offset binding to an index based binding with the > information about reset bit positions and polarities contained in the > driver, or in the crg node, similarly to the ti-syscon-reset bindings? > That would also improve consistency with clock bindings, which already > use a number as identifier. > I agree that this solution is more modular. But the device node of the reset controller will get more complicated. Actually, the device tree of the SoC part is provided by the SoC vendor. In my opinion, we need a balance between
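For what it's worth, the polarity handling under discussion is simple to sketch in C. This is a hypothetical illustration of how a reset driver could interpret a third polarity cell (0 = active-high, 1 = active-low); the function and macro names are made up for the sketch and are not the actual hisilicon reset.c code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical polarity flag taken from the third specifier cell:
 * 0 = active-high (write 1 to assert), 1 = active-low (write 0 to assert). */
#define RESET_POLARITY_INVERTED	1

/* Compute the new register value for asserting/deasserting one reset bit. */
static uint32_t reset_apply(uint32_t reg, unsigned int bit,
			    unsigned int polarity, bool assert_line)
{
	bool set = assert_line ^ (polarity == RESET_POLARITY_INVERTED);

	if (set)
		return reg | (1u << bit);
	return reg & ~(1u << bit);
}
```

With something like this, the one active-low bit in the recent chip goes through the same code path as all the active-high bits, which is the point of carrying the polarity in the specifier (or, per Philipp's alternative, in a driver-side table keyed by index).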
[PATCH v5] drm/mediatek: fixed the calc method of data rate per lane
Tune the dsi frame rate by pixel clock. dsi adds some extra signals (i.e. Tlpx, Ths-prepare, Ths-zero, Ths-trail, Ths-exit) when entering and exiting LP mode; those signals make h-time larger than normal and reduce the FPS. So we need to multiply by a coefficient to offset the extra signals' effect. coefficient = ((htotal*bpp/lane_number)+Tlpx+Ths_prep+Ths_zero+ Ths_trail+Ths_exit)/(htotal*bpp/lane_number) Signed-off-by: Jitao Shi --- Change since v4: - make the calc comment clearer. - define the phy timings as constants. Change since v3: - wrap the commit msg. - fix alignment of some lines. Change since v2: - move phy timing back to dsi_phy_timconfig. Change since v1: - phy_timing2 and phy_timing3 refer to clock cycle time. - define values of LPX HS_PRPR HS_ZERO HS_TRAIL TA_GO TA_SURE TA_GET DA_HS_EXIT. --- drivers/gpu/drm/mediatek/mtk_dsi.c | 64 +++- 1 file changed, 48 insertions(+), 16 deletions(-) diff --git a/drivers/gpu/drm/mediatek/mtk_dsi.c b/drivers/gpu/drm/mediatek/mtk_dsi.c index 28b2044..eaa5a22 100644 --- a/drivers/gpu/drm/mediatek/mtk_dsi.c +++ b/drivers/gpu/drm/mediatek/mtk_dsi.c @@ -86,7 +86,7 @@ #define DSI_PHY_TIMECON0 0x110 #define LPX(0xff << 0) -#define HS_PRPR(0xff << 8) +#define HS_PREP(0xff << 8) #define HS_ZERO(0xff << 16) #define HS_TRAIL (0xff << 24) @@ -102,10 +102,16 @@ #define CLK_TRAIL (0xff << 24) #define DSI_PHY_TIMECON3 0x11c -#define CLK_HS_PRPR(0xff << 0) +#define CLK_HS_PREP(0xff << 0) #define CLK_HS_POST(0xff << 8) #define CLK_HS_EXIT(0xff << 16) +#define T_LPX 5 +#define T_HS_PREP 6 +#define T_HS_TRAIL 8 +#define T_HS_EXIT 7 +#define T_HS_ZERO 10 + #define NS_TO_CYCLE(n, c)((n) / (c) + (((n) % (c)) ?
1 : 0)) struct phy; @@ -161,20 +167,18 @@ static void mtk_dsi_mask(struct mtk_dsi *dsi, u32 offset, u32 mask, u32 data) static void dsi_phy_timconfig(struct mtk_dsi *dsi) { u32 timcon0, timcon1, timcon2, timcon3; - unsigned int ui, cycle_time; - unsigned int lpx; + u32 ui, cycle_time; ui = 1000 / dsi->data_rate + 0x01; cycle_time = 8000 / dsi->data_rate + 0x01; - lpx = 5; - timcon0 = (8 << 24) | (0xa << 16) | (0x6 << 8) | lpx; - timcon1 = (7 << 24) | (5 * lpx << 16) | ((3 * lpx) / 2) << 8 | - (4 * lpx); + timcon0 = T_LPX | T_HS_PREP << 8 | T_HS_ZERO << 16 | T_HS_TRAIL << 24; + timcon1 = 4 * T_LPX | (3 * T_LPX / 2) << 8 | 5 * T_LPX << 16 | + T_HS_EXIT << 24; timcon2 = ((NS_TO_CYCLE(0x64, cycle_time) + 0xa) << 24) | (NS_TO_CYCLE(0x150, cycle_time) << 16); - timcon3 = (2 * lpx) << 16 | NS_TO_CYCLE(80 + 52 * ui, cycle_time) << 8 | - NS_TO_CYCLE(0x40, cycle_time); + timcon3 = NS_TO_CYCLE(0x40, cycle_time) | (2 * T_LPX) << 16 | + NS_TO_CYCLE(80 + 52 * ui, cycle_time) << 8; writel(timcon0, dsi->regs + DSI_PHY_TIMECON0); writel(timcon1, dsi->regs + DSI_PHY_TIMECON1); @@ -202,19 +206,47 @@ static int mtk_dsi_poweron(struct mtk_dsi *dsi) { struct device *dev = dsi->dev; int ret; + u64 pixel_clock, total_bits; + u32 htotal, htotal_bits, bit_per_pixel, overhead_cycles, overhead_bits; if (++dsi->refcount != 1) return 0; + switch (dsi->format) { + case MIPI_DSI_FMT_RGB565: + bit_per_pixel = 16; + break; + case MIPI_DSI_FMT_RGB666_PACKED: + bit_per_pixel = 18; + break; + case MIPI_DSI_FMT_RGB666: + case MIPI_DSI_FMT_RGB888: + default: + bit_per_pixel = 24; + break; + } + /** -* data_rate = (pixel_clock / 1000) * pixel_dipth * mipi_ratio; -* pixel_clock unit is Khz, data_rata unit is MHz, so need divide 1000. -* mipi_ratio is mipi clk coefficient for balance the pixel clk in mipi. -* we set mipi_ratio is 1.05. 
+* vm.pixelclock is in kHz, pixel_clock unit is Hz, so multiply by 1000 +* htotal_time = htotal * byte_per_pixel / num_lanes +* overhead_time = lpx + hs_prepare + hs_zero + hs_trail + hs_exit +* mipi_ratio = (htotal_time + overhead_time) / htotal_time +* data_rate = pixel_clock * bit_per_pixel * mipi_ratio / num_lanes; */ - dsi->data_rate = dsi->vm.pixelclock * 3 * 21 / (1 * 1000 * 10); + pixel_clock = dsi->vm.pixelclock * 1000; + htotal = dsi->vm.hactive + dsi->vm.hback_porch + dsi->vm.hfront_porch + + dsi->vm.hsync_len; + htotal_bits = htotal * bit_per_pixel; + + overhead_cycles = T_LPX + T_HS_PREP + T_HS_Z
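As a sanity check on the formula in the commit message, here is a hedged, stand-alone C sketch of the data-rate calculation. It is not the driver code itself (the driver's rounding and unit handling differ); the T_* values mirror the constants defined in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane HS entry/exit overhead in bit times, mirroring the patch. */
enum { T_LPX = 5, T_HS_PREP = 6, T_HS_ZERO = 10, T_HS_TRAIL = 8, T_HS_EXIT = 7 };

/*
 * data_rate = pixel_clock * bpp * mipi_ratio / lanes, where
 * mipi_ratio = (htotal_bits + overhead) / htotal_bits and
 * htotal_bits = htotal * bpp / lanes (per-lane bits in one line).
 */
static uint64_t dsi_data_rate(uint64_t pixel_clock_hz, unsigned int bpp,
			      unsigned int lanes, unsigned int htotal)
{
	uint64_t htotal_bits = (uint64_t)htotal * bpp / lanes;
	uint64_t overhead = T_LPX + T_HS_PREP + T_HS_ZERO + T_HS_TRAIL + T_HS_EXIT;

	return pixel_clock_hz * bpp * (htotal_bits + overhead) /
	       (htotal_bits * lanes);
}
```

The result always comes out slightly above the naive pixel_clock * bpp / lanes; that margin is exactly what absorbs the LP-mode entry/exit signals so the frame rate no longer drops.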
linux-next: build warnings after merge of the rpmsg tree
Hi Bjorn, After merging the rpmsg tree, today's linux-next build (arm multi_v7_defconfig) produced these warnings: drivers/remoteproc/Kconfig:3:error: recursive dependency detected! For a resolution refer to Documentation/kbuild/kconfig-language.txt subsection "Kconfig recursive dependency limitations" drivers/remoteproc/Kconfig:3: symbol REMOTEPROC is selected by QCOM_ADSP_PIL For a resolution refer to Documentation/kbuild/kconfig-language.txt subsection "Kconfig recursive dependency limitations" drivers/remoteproc/Kconfig:81: symbol QCOM_ADSP_PIL depends on REMOTEPROC Introduced by commit b9e718e950c3 ("remoteproc: Introduce Qualcomm ADSP PIL") -- Cheers, Stephen Rothwell
Re: [PATCH V3 1/9] PM / OPP: Reword binding supporting multiple regulators per device
On 15-11-16, 16:11, Dave Gerlach wrote: > On 11/15/2016 12:56 PM, Stephen Boyd wrote: > >On 11/15, Viresh Kumar wrote: > >>There are two important pieces of information we need for multiple > >>regulator support: > >>- Which regulator in the consumer node corresponds to which entry in > >> the OPP table. As Mark mentioned earlier, DT should be able to get > >> us this. > > > >This is also possible from C code though. Or is there some case > >where it isn't possible if we're sharing the same table with two > >devices? I'm lost on when this would ever happen. > > > >It feels like trying to keep the OPP table agnostic of the > >consuming device and the device's binding is more trouble than > >it's worth. Especially considering we have opp-shared and *-name > >now. > > I agree with this, I do not like having to pass a list of regulator names to > the opp core that I *hope* the device I am controlling has provided. What do you mean by that? Are you saying this from DT's point of view or of the code? i.e. Are you saying that you don't like the dev_pm_opp_set_regulators() API ? > The > intent seems to be to use the cpufreq-dt driver as is and not pass any I would like to kill all regulators code from cpufreq-dt sometime soon. All that is left there is making sure we have a regulator in place, but I strongly feel OPP core is the right place for doing that now. > cpu-supply anymore so the cpufreq-dt driver has no knowledge of what > regulators are present (it operates as it would today on a system with no > regulator required). But as is it will move forward regardless of whether or > not we actually intended to provide a multi regulator set up or platform > set_opp helper, and this probably isn't ideal. Yes and that's why I am more inclined towards my above comment. We shall make it consistent. 
> I would think cpufreq-dt/opp > core should be have knowledge of what regulators are needed to achieve these > opp transitions and make sure everything is in place before moving ahead. The last patch in my series does what you are looking for: PM / OPP: Don't assume platform doesn't have regulators Isn't it ? -- viresh
Re: [PATCH v13 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP
On Wed, 16 Nov 2016 08:16:15 +0530 Kirti Wankhede wrote: > On 11/16/2016 3:49 AM, Alex Williamson wrote: > > On Tue, 15 Nov 2016 20:59:54 +0530 > > Kirti Wankhede wrote: > > > ... > > >> @@ -854,7 +857,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, > >> */ > >>if (dma->task->mm != current->mm) > >>break; > >> + > >>unmapped += dma->size; > >> + > >> + if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) { > >> + struct vfio_iommu_type1_dma_unmap nb_unmap; > >> + > >> + nb_unmap.iova = dma->iova; > >> + nb_unmap.size = dma->size; > >> + > >> + /* > >> + * Notifier callback would call vfio_unpin_pages() which > >> + * would acquire iommu->lock. Release lock here and > >> + * reacquire it again. > >> + */ > >> + mutex_unlock(&iommu->lock); > >> + blocking_notifier_call_chain(&iommu->notifier, > >> + VFIO_IOMMU_NOTIFY_DMA_UNMAP, > >> + &nb_unmap); > >> + mutex_lock(&iommu->lock); > >> + if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) > >> + break; > >> + } > > > > > > Why exactly do we need to notify per vfio_dma rather than per unmap > > request? If we do the latter we can send the notify first, limiting us > > to races where a page is pinned between the notify and the locking, > > whereas here, even our dma pointer is suspect once we re-acquire the > > lock, we don't technically know if another unmap could have removed > > that already. Perhaps something like this (untested): > > > > There are checks to validate unmap request, like v2 check and who is > calling unmap and is it allowed for that task to unmap. Before these > checks its not sure that unmap region range which asked for would be > unmapped all. Notify call should be at the place where its sure that the > range provided to notify call is definitely going to be removed. My > change do that. Ok, but that doesn't solve the problem.
What about this (untested): diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index ee9a680..50cafdf 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -782,9 +782,9 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, struct vfio_iommu_type1_dma_unmap *unmap) { uint64_t mask; - struct vfio_dma *dma; + struct vfio_dma *dma, *dma_last = NULL; size_t unmapped = 0; - int ret = 0; + int ret = 0, retries; mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1; @@ -794,7 +794,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, return -EINVAL; WARN_ON(mask & PAGE_MASK); - +again: mutex_lock(&iommu->lock); /* @@ -851,11 +851,16 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, if (dma->task->mm != current->mm) break; - unmapped += dma->size; - - if (iommu->external_domain && !RB_EMPTY_ROOT(&dma->pfn_list)) { + if (!RB_EMPTY_ROOT(&dma->pfn_list)) { struct vfio_iommu_type1_dma_unmap nb_unmap; + if (dma_last == dma) { + BUG_ON(++retries > 10); + } else { + dma_last = dma; + retries = 0; + } + nb_unmap.iova = dma->iova; nb_unmap.size = dma->size; @@ -868,11 +873,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu, blocking_notifier_call_chain(&iommu->notifier, VFIO_IOMMU_NOTIFY_DMA_UNMAP, &nb_unmap); - mutex_lock(&iommu->lock); - if (WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list))) - break; + goto again: } + unmapped += dma->size; vfio_remove_dma(iommu, dma); + } unlock:
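The structure of the (untested) sketch above — drop the lock, let the notifier unpin, then restart the walk with a bounded per-entry retry count — can be modelled in isolation. This is an illustrative user-space model with made-up names, not the vfio code:

```c
#include <assert.h>

#define MAX_RETRIES 10

static int pins_remaining;		/* stands in for dma->pfn_list */

/* Stands in for blocking_notifier_call_chain(): unpins one page per call. */
static void notifier_unpin(void)
{
	if (pins_remaining > 0)
		pins_remaining--;
}

/*
 * Model of the unmap loop: while pins remain we "drop the lock", notify,
 * and restart from the top; if the same entry still has pins after
 * MAX_RETRIES rounds we give up (the kernel sketch BUG()s instead).
 */
static int unmap_with_retries(int pins, int *retries_used)
{
	int retries = 0;

	pins_remaining = pins;
again:
	if (pins_remaining > 0) {
		if (++retries > MAX_RETRIES)
			return -1;
		/* mutex_unlock(&iommu->lock) ... mutex_lock() in the real code */
		notifier_unpin();
		goto again;
	}
	*retries_used = retries;
	return 0;			/* now safe to vfio_remove_dma() */
}
```

The reason for restarting from `again:` rather than continuing in place is the point Alex makes above: once the lock has been dropped, the dma pointer from the previous iteration may already have been removed by a concurrent unmap, so nothing from that iteration can be trusted.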
[PATCH -v5 1/9] mm, swap: Make swap cluster size same of THP size on x86_64
From: Huang Ying In this patch, the size of the swap cluster is changed to that of the THP (Transparent Huge Page) on the x86_64 architecture (512). This is for the THP swap support on x86_64, where one swap cluster will be used to hold the contents of each swapped-out THP, and some information about the swapped-out THP (such as the compound map count) will be recorded in the swap_cluster_info data structure. For other architectures which want THP swap support, ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for the architecture. In effect, this will double the swap cluster size on x86_64, which may make it harder to find a free cluster when the swap space becomes fragmented, and so may reduce continuous swap space allocation and sequential writes in theory. The performance test in 0day shows no regressions caused by this. Cc: Hugh Dickins Cc: Shaohua Li Cc: Minchan Kim Cc: Rik van Riel Suggested-by: Andrew Morton Signed-off-by: "Huang, Ying" --- arch/x86/Kconfig | 1 + mm/Kconfig | 13 + mm/swapfile.c| 4 3 files changed, 18 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8b93519..59dc488 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -165,6 +165,7 @@ config X86 select HAVE_STACK_VALIDATIONif X86_64 select ARCH_USES_HIGH_VMA_FLAGS if X86_INTEL_MEMORY_PROTECTION_KEYS select ARCH_HAS_PKEYS if X86_INTEL_MEMORY_PROTECTION_KEYS + select ARCH_USES_THP_SWAP_CLUSTER if X86_64 config INSTRUCTION_DECODER def_bool y diff --git a/mm/Kconfig b/mm/Kconfig index 86e3e0e..5a63c87 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -503,6 +503,19 @@ config FRONTSWAP If unsure, say Y to enable frontswap. +config ARCH_USES_THP_SWAP_CLUSTER + bool + default n + +config THP_SWAP_CLUSTER + bool + depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER + default y + help + Use one swap cluster to hold the contents of the THP + (Transparent Huge Page) swapped out. The size of the swap + cluster will be same as that of THP.
+ config CMA bool "Contiguous Memory Allocator" depends on HAVE_MEMBLOCK && MMU diff --git a/mm/swapfile.c b/mm/swapfile.c index f304389..34888e5b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, } } +#ifdef CONFIG_THP_SWAP_CLUSTER +#define SWAPFILE_CLUSTER HPAGE_PMD_NR +#else #define SWAPFILE_CLUSTER 256 +#endif #define LATENCY_LIMIT 256 static inline void cluster_set_flag(struct swap_cluster_info *info, -- 2.10.2
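The arithmetic behind the cluster-size change is worth making explicit. A quick sketch, assuming the usual x86_64 constants (4 KiB base pages, 2 MiB PMD-sized huge pages):

```c
#include <assert.h>

#define PAGE_SHIFT	12				/* 4 KiB base page */
#define PMD_SHIFT	21				/* 2 MiB huge page */
#define HPAGE_PMD_NR	(1 << (PMD_SHIFT - PAGE_SHIFT))	/* subpages per THP */

/* Bytes of swap space covered by one cluster of the given page count. */
static unsigned long cluster_bytes(unsigned long cluster_pages)
{
	return cluster_pages << PAGE_SHIFT;
}
```

So the patch grows a cluster from 256 pages (1 MiB) to HPAGE_PMD_NR = 512 pages (2 MiB), doubling the contiguous span a free-cluster search has to find — which is the fragmentation concern the commit message raises.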
[PATCH -v5 2/9] mm, memcg: Support to charge/uncharge multiple swap entries
From: Huang Ying This patch makes it possible to charge or uncharge a set of continuous swap entries in the swap cgroup. The number of swap entries is specified via an added parameter. This will be used for the THP (Transparent Huge Page) swap support, where a swap cluster backing a THP may be allocated and freed as a whole. So a set of (HPAGE_PMD_NR) continuous swap entries backing one THP needs to be charged or uncharged together. This will batch the cgroup operations for the THP swap too. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Vladimir Davydov Cc: Johannes Weiner Cc: Michal Hocko Cc: Tejun Heo Cc: cgro...@vger.kernel.org Signed-off-by: "Huang, Ying" --- include/linux/swap.h| 12 ++ include/linux/swap_cgroup.h | 6 +++-- mm/memcontrol.c | 55 + mm/shmem.c | 2 +- mm/swap_cgroup.c| 40 - mm/swap_state.c | 2 +- mm/swapfile.c | 2 +- 7 files changed, 76 insertions(+), 43 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index bfee1af..35484c9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -555,8 +555,10 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem) #ifdef CONFIG_MEMCG_SWAP extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry); -extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry); -extern void mem_cgroup_uncharge_swap(swp_entry_t entry); +extern int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry, + unsigned int nr_entries); +extern void mem_cgroup_uncharge_swap(swp_entry_t entry, +unsigned int nr_entries); extern long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg); extern bool mem_cgroup_swap_full(struct page *page); #else @@ -565,12 +567,14 @@ static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry) } static inline int mem_cgroup_try_charge_swap(struct page *page, -swp_entry_t entry) +swp_entry_t entry, +unsigned int nr_entries) { return 0; } -static inline void mem_cgroup_uncharge_swap(swp_entry_t entry) +static inline void
mem_cgroup_uncharge_swap(swp_entry_t entry, + unsigned int nr_entries) { } diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h index 145306b..b2b8ec7 100644 --- a/include/linux/swap_cgroup.h +++ b/include/linux/swap_cgroup.h @@ -7,7 +7,8 @@ extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent, unsigned short old, unsigned short new); -extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id); +extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, +unsigned int nr_ents); extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent); extern int swap_cgroup_swapon(int type, unsigned long max_pages); extern void swap_cgroup_swapoff(int type); @@ -15,7 +16,8 @@ extern void swap_cgroup_swapoff(int type); #else static inline -unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id) +unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id, + unsigned int nr_ents) { return 0; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 91dfc7c..a025dce 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2394,10 +2394,9 @@ void mem_cgroup_split_huge_fixup(struct page *head) #ifdef CONFIG_MEMCG_SWAP static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, -bool charge) + int nr_entries) { - int val = (charge) ? 1 : -1; - this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val); + this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], nr_entries); } /** @@ -2423,8 +2422,8 @@ static int mem_cgroup_move_swap_account(swp_entry_t entry, new_id = mem_cgroup_id(to); if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) { - mem_cgroup_swap_statistics(from, false); - mem_cgroup_swap_statistics(to, true); + mem_cgroup_swap_statistics(from, -1); + mem_cgroup_swap_statistics(to, 1); return 0; } return -EINVAL; @@ -5444,7 +5443,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg, * let's not wait for it. 
The page already received a * memory+swap charge, drop the swap entry duplicate. */ - mem_cgroup_uncharge_swap(entry); + mem_cgroup_uncharge_swap(entry, nr_pages); }
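The batched swap_cgroup_record() above can be illustrated with a toy model: stamp one cgroup id across nr_ents contiguous slots and return the previous id, so charging a THP's 512 entries becomes one call instead of 512. This ignores the real code's page-boundary walking and locking and is purely illustrative:

```c
#include <assert.h>

#define NSLOTS 16
static unsigned short slot_id[NSLOTS];	/* toy swap_cgroup array */

/* Record `id` over nr_ents contiguous entries starting at `first`; return
 * the old id of the range (a batch shares one id in this model). */
static unsigned short record_batch(unsigned int first, unsigned short id,
				   unsigned int nr_ents)
{
	unsigned short old = slot_id[first];
	unsigned int i;

	for (i = 0; i < nr_ents; i++)
		slot_id[first + i] = id;
	return old;
}
```

Charging writes the memcg id over the whole range; uncharging writes 0 back and gets the owning id out, which is what lets the THP swap path do one cgroup bookkeeping operation per cluster.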
[PATCH -v5 9/9] mm, THP, swap: Delay splitting THP during swap out
From: Huang Ying In this patch, splitting the huge page is delayed from almost the first step of swapping out to after allocating the swap space for the THP (Transparent Huge Page) and adding the THP into the swap cache. This will reduce lock acquiring/releasing for the locks used for swap cache management. This is the first step for THP swap support. The plan is to delay splitting the THP step by step and finally avoid splitting the THP altogether. The advantages of THP swap support include: - Batch the swap operations for the THP to reduce lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help to improve the THP swap performance. - The THP swap space read/write will be 2M sequential IO. It is particularly helpful for swap reads, which are usually 4k random IO. This will help to improve the THP swap performance too. - It will help memory fragmentation, especially when the THP is heavily used by the applications. The 2M of continuous pages will be freed up after the THP is swapped out. - It will improve THP utilization on systems with swap turned on, because the speed at which khugepaged collapses normal pages into a THP is quite slow. After the THP is split during swapping out, it will take quite a long time for the normal pages to collapse back into a THP after being swapped in. High THP utilization helps the efficiency of page-based memory management too. There are some concerns regarding THP swap-in, mainly because the possibly enlarged read/write IO size (for swap in/out) may put more overhead on the storage device. To deal with that, THP swap-in should be turned on only when necessary. For example, it can be selected via "always/never/madvise" logic, to be turned on globally, turned off globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.
With the patchset, the swap out throughput improved 12.1% (from 1.12GB/s to 1.25GB/s) in the vm-scalability swap-w-seq test case with 16 processes. The test is done on a Xeon E5 v3 system. The RAM simulated PMEM (persistent memory) device is used as the swap device. To test sequential swapping out, the test case uses 16 processes to sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used up. The detailed comparison result is as follows: base base+patchset -- %stddev %change %stddev \ |\ 1118821 ± 0% +12.1% 1254241 ± 1% vmstat.swap.so 2460636 ± 1% +10.6% 2720983 ± 1% vm-scalability.throughput 308.79 ± 1% -7.9% 284.53 ± 1% vm-scalability.time.elapsed_time 1639 ± 4% +232.3% 5446 ± 1% meminfo.SwapCached 0.70 ± 3% +8.7% 0.77 ± 5% perf-stat.ipc 9.82 ± 8% -31.6% 6.72 ± 2% perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list Signed-off-by: "Huang, Ying" --- mm/swap_state.c | 65 ++--- 1 file changed, 62 insertions(+), 3 deletions(-) diff --git a/mm/swap_state.c b/mm/swap_state.c index 13fb1c5..2db8359 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -17,6 +17,7 @@ #include #include #include +#include #include @@ -173,12 +174,53 @@ void __delete_from_swap_cache(struct page *page) ADD_CACHE_INFO(del_total, nr); } +#ifdef CONFIG_THP_SWAP_CLUSTER +int add_to_swap_trans_huge(struct page *page, struct list_head *list) +{ + swp_entry_t entry; + int ret = 0; + + /* cannot split, which may be needed during swap in, skip it */ + if (!can_split_huge_page(page, NULL)) + return -EBUSY; + /* fallback to split huge page firstly if no PMD map */ + if (!compound_mapcount(page)) + return 0; + entry = get_huge_swap_page(); + if (!entry.val) + return 0; + if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) { + __swapcache_free(entry, true); + return -EOVERFLOW; + } + ret = add_to_swap_cache(page, entry, + __GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN); + /* -ENOMEM radix-tree allocation
failure */ + if (ret) { + __swapcache_free(entry, true); + return 0; + } + ret = split_huge_page_to_list(page, list); + if (ret) { + delete_from_swap_cache(page); + return -EBUSY; + } + return 1; +} +#else +static inline int add_to_swap_trans_huge(struct page *page, +struct list_head *list) +{ + return 0; +} +#endif + /** * add_to_swap - allocate swap space for a page * @page: page we want to move to swap * * Alloca
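The return-value contract of add_to_swap_trans_huge() in the hunk above is easy to misread, so here it is as a stand-alone decision model. This is a hypothetical simplification of the quoted code (it folds the add_to_swap_cache and split-failure paths into the same outcomes and inlines the errno values):

```c
#include <assert.h>
#include <stdbool.h>

/*
 *  1 -> THP added to swap cache as a whole (then split in place),
 *  0 -> caller should split the huge page first and retry as 4K pages,
 * <0 -> skip this page for now.
 */
static int add_to_swap_trans_huge_model(bool can_split, bool pmd_mapped,
					bool got_cluster, bool memcg_ok)
{
	if (!can_split)
		return -16;	/* -EBUSY: couldn't be split at swap-in time */
	if (!pmd_mapped)
		return 0;	/* no PMD map: fall back to splitting first */
	if (!got_cluster)
		return 0;	/* no free huge swap cluster: fall back */
	if (!memcg_ok)
		return -75;	/* -EOVERFLOW: over the memcg swap limit */
	return 1;
}
```

The key asymmetry: resource shortages (no cluster, no radix-tree memory) fall back to the old split-first path and still make progress, while conditions that would break swap-in (unsplittable page, memcg limit) skip the page entirely.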
[PATCH -v5 7/9] mm, THP: Add can_split_huge_page()
From: Huang Ying Separate the check of whether we can split the huge page out of split_huge_page_to_list() into a function. This will help to check that before really splitting the THP (Transparent Huge Page). This will be used for delaying splitting the THP during swapping out, where for a THP we will allocate a swap cluster, add the THP into the swap cache, then split the THP. To avoid unnecessary operations for an un-splittable THP, we will check that first. There is no functionality change in this patch. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Ebru Akagunduz Signed-off-by: "Huang, Ying" --- include/linux/huge_mm.h | 7 +++ mm/huge_memory.c| 17 ++--- 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 31f2c32..1ccb49d 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp, extern void prep_transhuge_page(struct page *page); extern void free_transhuge_page(struct page *page); +bool can_split_huge_page(struct page *page, int *pextra_pins); int split_huge_page_to_list(struct page *page, struct list_head *list); static inline int split_huge_page(struct page *page) { @@ -176,6 +177,12 @@ static inline void prep_transhuge_page(struct page *page) {} #define thp_get_unmapped_area NULL +static inline bool +can_split_huge_page(struct page *page, int *pextra_pins) +{ + BUILD_BUG(); + return false; +} static inline int split_huge_page_to_list(struct page *page, struct list_head *list) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2d1d6bb..e894154 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2017,6 +2017,19 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount) return ret; } +/* Racy check whether the huge page can be split */ +bool can_split_huge_page(struct page *page, int *pextra_pins) +{ + int extra_pins = 0; + + /* Additional pins from radix tree */ + if
(!PageAnon(page)) + extra_pins = HPAGE_PMD_NR; + if (pextra_pins) + *pextra_pins = extra_pins; + return total_mapcount(page) == page_count(page) - extra_pins - 1; +} + /* * This function splits huge page into normal pages. @page can point to any * subpage of huge page to split. Split doesn't change the position of @page. @@ -2077,8 +2090,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) goto out; } - /* Addidional pins from radix tree */ - extra_pins = HPAGE_PMD_NR; anon_vma = NULL; i_mmap_lock_read(mapping); } @@ -2087,7 +2098,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) * Racy check if we can split the page, before freeze_page() will * split PMDs */ - if (total_mapcount(head) != page_count(head) - extra_pins - 1) { + if (!can_split_huge_page(head, &extra_pins)) { ret = -EBUSY; goto out_unlock; } -- 2.10.2
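The pin accounting behind can_split_huge_page() can be checked numerically. A hedged model of the invariant (the caller holds one pin; page-cache pages additionally carry one radix-tree pin per subpage):

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512

/* Racy split check: every reference must be accounted for by a mapping,
 * the caller's own pin, or (for non-anonymous pages) one radix-tree pin
 * per subpage; any unexplained extra reference blocks the split. */
static bool can_split(int total_mapcount, int page_count, bool anon)
{
	int extra_pins = anon ? 0 : HPAGE_PMD_NR;

	return total_mapcount == page_count - extra_pins - 1;
}
```

Patch 8/9 below extends extra_pins for anonymous pages that sit in the swap cache, but the shape of the check stays the same: the split only proceeds when every reference is one the splitter knows how to redistribute.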
[PATCH -v5 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache
From: Huang Ying With this patch, a THP (Transparent Huge Page) can be added/deleted to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This will be used for the THP (Transparent Huge Page) swap support, where one THP may be added/deleted to/from the swap cache. This will batch the swap cache operations to reduce the lock acquire/release times for the THP swap too. Cc: Hugh Dickins Cc: Shaohua Li Cc: Minchan Kim Cc: Rik van Riel Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Signed-off-by: "Huang, Ying" --- include/linux/page-flags.h | 2 +- mm/swap_state.c| 64 ++ 2 files changed, 43 insertions(+), 23 deletions(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 74e4dda..f5bcbea 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem) #endif #ifdef CONFIG_SWAP -PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND) +PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL) #else PAGEFLAG_FALSE(SwapCache) #endif diff --git a/mm/swap_state.c b/mm/swap_state.c index d3f047b..13fb1c5 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -43,6 +43,7 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = { }; #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) +#define ADD_CACHE_INFO(x, nr) do { swap_cache_info.x += (nr); } while (0) static struct { unsigned long add_total; @@ -80,39 +81,52 @@ void show_swap_cache_info(void) */ int __add_to_swap_cache(struct page *page, swp_entry_t entry) { - int error; + int error, i, nr = hpage_nr_pages(page); struct address_space *address_space; + struct page *cur_page; + swp_entry_t cur_entry; VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageSwapCache(page), page); VM_BUG_ON_PAGE(!PageSwapBacked(page), page); - get_page(page); + page_ref_add(page, nr); SetPageSwapCache(page); - set_page_private(page, entry.val); address_space = swap_address_space(entry); + cur_page = page; + cur_entry.val = entry.val; spin_lock_irq(&address_space->tree_lock);
- error = radix_tree_insert(&address_space->page_tree, - swp_offset(entry), page); - if (likely(!error)) { - address_space->nrpages++; - __inc_node_page_state(page, NR_FILE_PAGES); - INC_CACHE_INFO(add_total); + for (i = 0; i < nr; i++, cur_page++, cur_entry.val++) { + set_page_private(cur_page, cur_entry.val); + error = radix_tree_insert(&address_space->page_tree, + swp_offset(cur_entry), cur_page); + if (unlikely(error)) + break; } - spin_unlock_irq(&address_space->tree_lock); - - if (unlikely(error)) { + if (likely(!error)) { + address_space->nrpages += nr; + __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); + ADD_CACHE_INFO(add_total, nr); + } else { /* * Only the context which have set SWAP_HAS_CACHE flag * would call add_to_swap_cache(). * So add_to_swap_cache() doesn't returns -EEXIST. */ VM_BUG_ON(error == -EEXIST); - set_page_private(page, 0UL); + set_page_private(cur_page, 0UL); + while (i--) { + cur_page--; + cur_entry.val--; + radix_tree_delete(&address_space->page_tree, + swp_offset(cur_entry)); + set_page_private(cur_page, 0UL); + } ClearPageSwapCache(page); - put_page(page); + page_ref_sub(page, nr); } + spin_unlock_irq(&address_space->tree_lock); return error; } @@ -122,7 +136,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) { int error; - error = radix_tree_maybe_preload(gfp_mask); + error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page)); if (!error) { error = __add_to_swap_cache(page, entry); radix_tree_preload_end(); @@ -138,6 +152,7 @@ void __delete_from_swap_cache(struct page *page) { swp_entry_t entry; struct address_space *address_space; + int i, nr = hpage_nr_pages(page); VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(!PageSwapCache(page), page); @@ -145,12 +160,17 @@ void __delete_from_swap_cache(struct page *page) entry.val = page_private(page); address_space = swap_address_space(entry); - radix_tree_delete(&address_space->page_tree, swp_offset(entry)); - 
set_page_private(page, 0); + for (
[PATCH -v5 8/9] mm, THP, swap: Support to split THP in swap cache
From: Huang Ying This patch enhances split_huge_page_to_list() to work properly for a THP (Transparent Huge Page) in the swap cache during swapping out. This is used for delaying splitting the THP during swapping out, where for a THP to be swapped out we will allocate a swap cluster, add the THP into the swap cache, then split the THP. The page lock will be held during this process. So in code paths other than swapping out, if the THP needs to be split, PageSwapCache(THP) will always be false. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Ebru Akagunduz Signed-off-by: "Huang, Ying" --- mm/huge_memory.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e894154..4e0d8b7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1835,7 +1835,7 @@ static void __split_huge_page_tail(struct page *head, int tail, * atomic_set() here would be safe on all archs (and not only on x86), * it's safer to use atomic_inc()/atomic_add().
*/ - if (PageAnon(head)) { + if (PageAnon(head) && !PageSwapCache(head)) { page_ref_inc(page_tail); } else { /* Additional pin to radix tree */ @@ -1846,6 +1846,7 @@ static void __split_huge_page_tail(struct page *head, int tail, page_tail->flags |= (head->flags & ((1L << PG_referenced) | (1L << PG_swapbacked) | +(1L << PG_swapcache) | (1L << PG_mlocked) | (1L << PG_uptodate) | (1L << PG_active) | @@ -1908,7 +1909,11 @@ static void __split_huge_page(struct page *page, struct list_head *list, ClearPageCompound(head); /* See comment in __split_huge_page_tail() */ if (PageAnon(head)) { - page_ref_inc(head); + /* Additional pin to radix tree of swap cache */ + if (PageSwapCache(head)) + page_ref_add(head, 2); + else + page_ref_inc(head); } else { /* Additional pin to radix tree */ page_ref_add(head, 2); @@ -2020,10 +2025,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount) /* Racy check whether the huge page can be split */ bool can_split_huge_page(struct page *page, int *pextra_pins) { - int extra_pins = 0; + int extra_pins; /* Additional pins from radix tree */ - if (!PageAnon(page)) + if (PageAnon(page)) + extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0; + else extra_pins = HPAGE_PMD_NR; if (pextra_pins) *pextra_pins = extra_pins; @@ -2078,7 +2085,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) ret = -EBUSY; goto out; } - extra_pins = 0; mapping = NULL; anon_vma_lock_write(anon_vma); } else { -- 2.10.2
[PATCH -v5 3/9] mm, THP, swap: Add swap cluster allocate/free functions
From: Huang Ying The swap cluster allocation/free functions are added based on the existing swap cluster management mechanism for SSD. These functions don't work for rotating hard disks because the existing swap cluster management mechanism doesn't work for them. Hard disk support may be added later if someone really needs it, but that need not be included in this patchset. This will be used for the THP (Transparent Huge Page) swap support, where one swap cluster will hold the contents of each THP swapped out. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Hugh Dickins Cc: Shaohua Li Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: "Huang, Ying" --- mm/swapfile.c | 203 +- 1 file changed, 146 insertions(+), 57 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index fe0a559..15c89f8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -326,6 +326,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, schedule_work(&si->discard_work); } +static void __free_cluster(struct swap_info_struct *si, unsigned long idx) +{ + struct swap_cluster_info *ci = si->cluster_info; + + cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); + cluster_list_add_tail(&si->free_clusters, ci, idx); +} + /* * Doing discard actually. After a cluster discard is finished, the cluster * will be added to free cluster list. caller should hold si->lock.
@@ -345,8 +353,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si) SWAPFILE_CLUSTER); spin_lock(&si->lock); - cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&si->free_clusters, info, idx); + __free_cluster(si, idx); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); } @@ -363,6 +370,34 @@ static void swap_discard_work(struct work_struct *work) spin_unlock(&si->lock); } +static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) +{ + struct swap_cluster_info *ci = si->cluster_info; + + VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); + cluster_list_del_first(&si->free_clusters, ci); + cluster_set_count_flag(ci + idx, 0, 0); +} + +static void free_cluster(struct swap_info_struct *si, unsigned long idx) +{ + struct swap_cluster_info *ci = si->cluster_info + idx; + + VM_BUG_ON(cluster_count(ci) != 0); + /* +* If the swap is discardable, prepare discard the cluster +* instead of free it immediately. The cluster will be freed +* after discard. +*/ + if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == + (SWP_WRITEOK | SWP_PAGE_DISCARD)) { + swap_cluster_schedule_discard(si, idx); + return; + } + + __free_cluster(si, idx); +} + /* * The cluster corresponding to page_nr will be used. The cluster will be * removed from free cluster list and its usage counter will be increased. 
@@ -374,11 +409,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p, if (!cluster_info) return; - if (cluster_is_free(&cluster_info[idx])) { - VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx); - cluster_list_del_first(&p->free_clusters, cluster_info); - cluster_set_count_flag(&cluster_info[idx], 0, 0); - } + if (cluster_is_free(&cluster_info[idx])) + alloc_cluster(p, idx); VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER); cluster_set_count(&cluster_info[idx], @@ -402,21 +434,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p, cluster_set_count(&cluster_info[idx], cluster_count(&cluster_info[idx]) - 1); - if (cluster_count(&cluster_info[idx]) == 0) { - /* -* If the swap is discardable, prepare discard the cluster -* instead of free it immediately. The cluster will be freed -* after discard. -*/ - if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == -(SWP_WRITEOK | SWP_PAGE_DISCARD)) { - swap_cluster_schedule_discard(p, idx); - return; - } - - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&p->free_clusters, cluster_info, idx); - } + if (cluster_count(&cluster_info[idx]) == 0) + free_cluster(p, idx); } /* @@ -497,6 +516,69 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, *scan_base = tmp; } +#ifdef CONFIG_THP_SWAP_CLUSTER +static inline unsigned int huge_cluster_nr_entries(bool huge) +{ + return huge ? SWAPFILE_CLUSTER
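The alloc_cluster()/free_cluster() pairing that this patch factors out can be modeled in userspace as a simple FIFO free list. This is a toy model, not the kernel implementation; locking and the SWP_PAGE_DISCARD path are deliberately omitted:

```c
#include <assert.h>

#define NR_CLUSTERS 8

/*
 * Toy FIFO model of the free-cluster list managed by __free_cluster()
 * and alloc_cluster() in mm/swapfile.c: freeing appends to the tail
 * (cluster_list_add_tail), allocating always takes the head
 * (cluster_list_del_first).
 */
static int free_list[NR_CLUSTERS];
static int head, tail, count;

static void model_free_cluster(int idx)
{
    assert(count < NR_CLUSTERS);
    free_list[tail] = idx;
    tail = (tail + 1) % NR_CLUSTERS;
    count++;
}

static int model_alloc_cluster(void)
{
    int idx;

    if (count == 0)
        return -1;    /* no free cluster: caller must fall back */
    idx = free_list[head];
    head = (head + 1) % NR_CLUSTERS;
    count--;
    return idx;
}
```

The FIFO discipline matters for wear and locality: clusters freed earliest are reused first, which is the behavior the cluster list helpers give the real code.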
[PATCH -v5 0/9] THP swap: Delay splitting THP during swapping out
From: Huang Ying This patchset is to optimize the performance of Transparent Huge Page (THP) swap. Hi, Andrew, could you help me check whether the overall design is reasonable? Hi, Hugh, Shaohua, Minchan and Rik, could you help me review the swap part of the patchset? Especially [1/9], [3/9], [4/9], [5/9], [6/9], [9/9]. Hi, Andrea and Kirill, could you help me review the THP part of the patchset? Especially [2/9], [7/9] and [8/9]. Hi, Johannes, Michal and Vladimir, I am not very confident about the memory cgroup part, especially [2/9]. Could you help me review it? And for all, any comment is welcome! Recently, the performance of storage devices has improved so fast that we cannot saturate the disk bandwidth with a single logical CPU when swapping pages out, even on a high-end server machine, because storage performance has improved faster than that of a single logical CPU, and it seems that this trend will not change in the near future. On the other hand, THP is becoming more and more popular because of increased memory sizes. So it becomes necessary to optimize THP swap performance. The advantages of THP swap support include: - Batching the swap operations for the THP to reduce lock acquiring/releasing, including allocating/freeing the swap space, adding/deleting to/from the swap cache, and writing/reading the swap space, etc. This will help improve the performance of THP swap. - The THP swap space read/write will be 2M sequential IO. This is particularly helpful for swap reads, which are usually 4k random IO. This will improve the performance of THP swap too. - It will help with memory fragmentation, especially when the THP is heavily used by applications. The 2M contiguous pages will be freed up after THP swap-out. - It will improve THP utilization on systems with swap turned on, because the speed at which khugepaged collapses normal pages into a THP is quite slow.
After the THP is split during swap-out, it will take quite a long time for the normal pages to collapse back into a THP after being swapped in. High THP utilization helps the efficiency of page-based memory management too. There are some concerns regarding THP swap-in, mainly because the possibly enlarged read/write IO size (for swap in/out) may put more load on the storage device. To deal with that, THP swap-in should be turned on only when necessary. For example, it could be selected via "always/never/madvise" logic: turned on globally, turned off globally, or turned on only for VMAs with MADV_HUGEPAGE, etc. This patchset is based on the 11/08 head of mmotm/master. This patchset is the first step of THP swap support. The plan is to delay splitting the THP step by step, finally avoiding the split entirely during swap-out and swapping the THP out/in as a whole. As the first step, in this patchset, splitting the huge page is delayed from almost the first step of swap-out to after allocating the swap space for the THP and adding the THP into the swap cache. This will reduce lock acquiring/releasing for the locks used for swap cache management. With the patchset, the swap-out throughput improves by 12.1% (from about 1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test case with 16 processes. The test is done on a Xeon E5 v3 system. The swap device used is a RAM-simulated PMEM (persistent memory) device. To test sequential swap-out, the test case uses 16 processes, which sequentially allocate and write to anonymous pages until the RAM and part of the swap device are used up.
The detailed compare result is as follows (base vs. base+patchset, with %stddev):

 1118821 ± 0%    +12.1%    1254241 ± 1%  vmstat.swap.so
 2460636 ± 1%    +10.6%    2720983 ± 1%  vm-scalability.throughput
  308.79 ± 1%     -7.9%     284.53 ± 1%  vm-scalability.time.elapsed_time
    1639 ± 4%   +232.3%       5446 ± 1%  meminfo.SwapCached
    0.70 ± 3%     +8.7%       0.77 ± 5%  perf-stat.ipc
    9.82 ± 8%    -31.6%       6.72 ± 2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

From the swap-out throughput numbers, we can see that even when tested on a RAM-simulated PMEM (Persistent Memory) device, the swap-out throughput reaches only about 1.1GB/s. Meanwhile, in file IO tests, the sequential write throughput of an Intel P3700 SSD can reach about 1.8GB/s steadily. And according to the following URL, https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html the sequential write throughput of an Intel P3608 SSD can reach about 3.0GB/s, while the random read IOPS can reach about 850k. It is clear that the bottleneck has moved from the disk to the kernel swap component itself.
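As a quick sanity check on the table above, the reported %change values follow from the raw base and patched columns by plain arithmetic (nothing kernel-specific here):

```c
#include <assert.h>

/*
 * Percentage change between a base and a patched measurement, as
 * reported in the %change column of the comparison table above.
 */
static double pct_change(double base, double patched)
{
    return (patched - base) / base * 100.0;
}
```

For example, 1118821 -> 1254241 for vmstat.swap.so works out to the +12.1% the table reports.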
[PATCH -v5 4/9] mm, THP, swap: Add get_huge_swap_page()
From: Huang Ying A variation of get_swap_page(), get_huge_swap_page(), is added to allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap cluster allocation function. A fairly simple algorithm is used: only the first swap device in the priority list will be tried for the swap cluster allocation. The function fails if that attempt is unsuccessful, and the caller falls back to allocating a single swap slot instead. This works well enough for normal cases. This will be used for the THP (Transparent Huge Page) swap support, where get_huge_swap_page() will be used to allocate one swap cluster for each THP swapped out. Because of the algorithm adopted, if the number of free swap clusters differs significantly among multiple swap devices, it is possible that some THPs are split earlier than necessary; for example, this could be caused by a big size difference among the swap devices. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Hugh Dickins Cc: Shaohua Li Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: "Huang, Ying" --- include/linux/swap.h | 24 +++- mm/swapfile.c| 18 -- 2 files changed, 35 insertions(+), 7 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 35484c9..1df1e23 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -404,7 +404,7 @@ static inline long get_nr_swap_pages(void) } extern void si_swapinfo(struct sysinfo *); -extern swp_entry_t get_swap_page(void); +extern swp_entry_t __get_swap_page(bool huge); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); @@ -424,6 +424,23 @@ extern bool reuse_swap_page(struct page *, int *); extern int try_to_free_swap(struct page *); struct backing_dev_info; +static inline swp_entry_t get_swap_page(void) +{ + return __get_swap_page(false); +} + +#ifdef CONFIG_THP_SWAP_CLUSTER +static inline swp_entry_t get_huge_swap_page(void) +{ + return
__get_swap_page(true); +} +#else +static inline swp_entry_t get_huge_swap_page(void) +{ + return (swp_entry_t) {0}; +} +#endif + #else /* CONFIG_SWAP */ #define swap_address_space(entry) (NULL) @@ -530,6 +547,11 @@ static inline swp_entry_t get_swap_page(void) return entry; } +static inline swp_entry_t get_huge_swap_page(void) +{ + return (swp_entry_t) {0}; +} + #endif /* CONFIG_SWAP */ #ifdef CONFIG_MEMCG diff --git a/mm/swapfile.c b/mm/swapfile.c index 15c89f8..6d9dffb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -760,14 +760,15 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si) } #endif -swp_entry_t get_swap_page(void) +swp_entry_t __get_swap_page(bool huge) { struct swap_info_struct *si, *next; pgoff_t offset; + int nr_pages = huge_cluster_nr_entries(huge); - if (atomic_long_read(&nr_swap_pages) <= 0) + if (atomic_long_read(&nr_swap_pages) < nr_pages) goto noswap; - atomic_long_dec(&nr_swap_pages); + atomic_long_sub(nr_pages, &nr_swap_pages); spin_lock(&swap_avail_lock); @@ -795,10 +796,15 @@ swp_entry_t get_swap_page(void) } /* This is called for allocating swap entry for cache */ - offset = scan_swap_map(si, SWAP_HAS_CACHE); + if (likely(nr_pages == 1)) + offset = scan_swap_map(si, SWAP_HAS_CACHE); + else + offset = swap_alloc_huge_cluster(si); spin_unlock(&si->lock); if (offset) return swp_entry(si->type, offset); + else if (unlikely(nr_pages != 1)) + goto fail_alloc; pr_debug("scan_swap_map of si %d failed to find offset\n", si->type); spin_lock(&swap_avail_lock); @@ -818,8 +824,8 @@ swp_entry_t get_swap_page(void) } spin_unlock(&swap_avail_lock); - - atomic_long_inc(&nr_swap_pages); +fail_alloc: + atomic_long_add(nr_pages, &nr_swap_pages); noswap: return (swp_entry_t) {0}; } -- 2.10.2
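The caller-side fallback described in the commit message — try a whole cluster, otherwise split the THP and take single slots — can be sketched in userspace. The allocator below is a stand-in, not the kernel API:

```c
#include <assert.h>

enum swap_out_path { SWAPPED_AS_HUGE, SPLIT_AND_SWAPPED };

/*
 * Stand-in allocator: pretend exactly one free swap cluster exists,
 * then fail, the way get_huge_swap_page() fails once the first swap
 * device has no free cluster left.
 */
static int huge_clusters_left = 1;

static int try_get_huge_swap_page(void)
{
    if (huge_clusters_left > 0) {
        huge_clusters_left--;
        return 1;
    }
    return 0;    /* allocation failed: caller must fall back */
}

/*
 * The caller-side policy: try a whole cluster first; if that fails,
 * split the THP and swap it out as single slots.
 */
static enum swap_out_path swap_out_thp(void)
{
    if (try_get_huge_swap_page())
        return SWAPPED_AS_HUGE;
    return SPLIT_AND_SWAPPED;
}
```

Failing fast and letting the caller fall back keeps the huge allocation path simple at the cost of occasionally splitting a THP earlier than strictly necessary, as the commit message notes.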
[PATCH -v5 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
From: Huang Ying __swapcache_free() is added to clear the SWAP_HAS_CACHE flag for a huge page. For now this frees the specified swap cluster, because the function is currently called only in the error path, to free a swap cluster that has just been allocated. So each corresponding swap_map[i] == SWAP_HAS_CACHE, that is, the swap count is 0. This makes the implementation simpler than that for an ordinary swap entry. This will be used to delay splitting a THP (Transparent Huge Page) during swap-out: for one THP to be swapped out, we will allocate a swap cluster, add the THP into the swap cache, then split the THP. If anything fails after allocating the swap cluster and before splitting the THP successfully, swapcache_free_trans_huge() will be used to free the swap space allocated. Cc: Andrea Arcangeli Cc: Kirill A. Shutemov Cc: Hugh Dickins Cc: Shaohua Li Cc: Minchan Kim Cc: Rik van Riel Signed-off-by: "Huang, Ying" --- include/linux/swap.h | 9 +++-- mm/swapfile.c| 33 +++-- 2 files changed, 38 insertions(+), 4 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1df1e23..cd1dc5c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -411,7 +411,7 @@ extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t); extern void swap_free(swp_entry_t); -extern void swapcache_free(swp_entry_t); +extern void __swapcache_free(swp_entry_t, bool); extern int free_swap_and_cache(swp_entry_t); extern int swap_type_of(dev_t, sector_t, struct block_device **); extern unsigned int count_swap_pages(int, int); @@ -483,7 +483,7 @@ static inline void swap_free(swp_entry_t swp) { } -static inline void swapcache_free(swp_entry_t swp) +static inline void __swapcache_free(swp_entry_t swp, bool huge) { } @@ -554,6 +554,11 @@ static inline swp_entry_t get_huge_swap_page(void) #endif /* CONFIG_SWAP */ +static inline void swapcache_free(swp_entry_t entry) +{ + __swapcache_free(entry, false);
+} + #ifdef CONFIG_MEMCG static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg) { diff --git a/mm/swapfile.c b/mm/swapfile.c index 6d9dffb..e8d64ef 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -732,6 +732,27 @@ static void swap_free_huge_cluster(struct swap_info_struct *si, __swap_entry_free(si, offset, true); } +/* + * Caller should hold si->lock. + */ +static void swapcache_free_trans_huge(struct swap_info_struct *si, + swp_entry_t entry) +{ + unsigned long offset = swp_offset(entry); + unsigned long idx = offset / SWAPFILE_CLUSTER; + unsigned char *map; + unsigned int i; + + map = si->swap_map + offset; + for (i = 0; i < SWAPFILE_CLUSTER; i++) { + VM_BUG_ON(map[i] != SWAP_HAS_CACHE); + map[i] &= ~SWAP_HAS_CACHE; + } + /* Cluster size is same as huge page size */ + mem_cgroup_uncharge_swap(entry, HPAGE_PMD_NR); + swap_free_huge_cluster(si, idx); +} + static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si) { unsigned long idx; @@ -758,6 +779,11 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si) { return 0; } + +static inline void swapcache_free_trans_huge(struct swap_info_struct *si, +swp_entry_t entry) +{ +} #endif swp_entry_t __get_swap_page(bool huge) @@ -949,13 +975,16 @@ void swap_free(swp_entry_t entry) /* * Called after dropping swapcache to decrease refcnt to swap entries. */ -void swapcache_free(swp_entry_t entry) +void __swapcache_free(swp_entry_t entry, bool huge) { struct swap_info_struct *p; p = swap_info_get(entry); if (p) { - swap_entry_free(p, entry, SWAP_HAS_CACHE); + if (unlikely(huge)) + swapcache_free_trans_huge(p, entry); + else + swap_entry_free(p, entry, SWAP_HAS_CACHE); spin_unlock(&p->lock); } } -- 2.10.2
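The error-path invariant this patch relies on — every slot of the just-allocated cluster still holds exactly SWAP_HAS_CACHE — can be modeled in userspace. The flag value is borrowed from the kernel headers, but everything else is a toy:

```c
#include <assert.h>

#define SWAPFILE_CLUSTER 512
#define SWAP_HAS_CACHE   0x40    /* flag value borrowed from the kernel */

/*
 * Userspace model of swapcache_free_trans_huge(): in the error path
 * every slot of the just-allocated cluster still holds exactly
 * SWAP_HAS_CACHE (swap count 0), so the whole cluster can be released
 * by clearing that flag in one pass. The VM_BUG_ON in the patch
 * becomes a plain assert here.
 */
static void model_free_huge_swapcache(unsigned char *map)
{
    unsigned int i;

    for (i = 0; i < SWAPFILE_CLUSTER; i++) {
        assert(map[i] == SWAP_HAS_CACHE);
        map[i] &= ~SWAP_HAS_CACHE;
    }
}
```

That invariant is exactly why the huge variant can stay much simpler than swap_entry_free() for ordinary entries: there is no swap count to decrement, only the cache flag to drop.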
Re: [PATCH] net/phy/vitesse: Configure RGMII skew on VSC8601, if needed
From: Alex Date: Mon, 14 Nov 2016 13:54:57 -0800 > > > On 11/14/2016 01:25 PM, Florian Fainelli wrote: >> On 11/14/2016 01:18 PM, David Miller wrote: >>> From: Alexandru Gagniuc >>> Date: Sat, 12 Nov 2016 15:32:13 -0800 >>> + if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID) + ret = vsc8601_add_skew(phydev); >>> >>> I think you should use phy_interface_is_rgmii() here. >>> >> >> This would include all RGMII modes, here I think the intent is to >> check >> for PHY_INTERFACE_MODE_RGMII_ID and PHY_INTERFACE_MODE_RGMII_TXID (or >> RXID), > > That is correct. > >> Alexandru, what direction does the skew settings apply to? > > It applies a skew in both TX and RX directions. Please repost your patch, making the intent clear either in the commit message or a code comment. Thanks.
[PATCH v4 3/3] Make core_pattern support namespace
From: Zhao Lei Currently, each container shares one copy of the coredump setting with the host system; if the host system changes the setting, every running container is affected. The same happens when a container changes core_pattern: both the host and the other containers are affected. For containers based on the namespace design, it is good to allow each container to keep its own coredump setting. That brings the following benefits: 1: Each container can change its own coredump setting by operating on /proc/sys/kernel/core_pattern. 2: A coredump setting change in the host will not affect running containers. 3: Both the "put coredump in guest" and "put coredump in host" cases are supported. Namespace-based software (lxc, docker, ...) can use this function to customize its dump setting, and it makes each container work as a separate system, which fits the design goal of namespaces. Test(in lxc): # In the host # # echo host_core >/proc/sys/kernel/core_pattern # cat /proc/sys/kernel/core_pattern host_core # ulimit -c 1024000 # ./make_dump Segmentation fault (core dumped) # ls -l -rw--- 1 root root 331776 Feb 4 18:02 host_core.2175 -rwxr-xr-x 1 root root 759731 Feb 4 18:01 make_dump # # In the container # # cat /proc/sys/kernel/core_pattern host_core # echo container_core >/proc/sys/kernel/core_pattern # ./make_dump Segmentation fault (core dumped) # ls -l -rwxr-xr-x1 root root 759731 Feb 4 10:45 make_dump -rw---1 root root 331776 Feb 4 10:45 container_core.16 # # Return to host # # cat /proc/sys/kernel/core_pattern host_core # ls host_core.2175 make_dump make_dump.c # rm -f host_core.2175 # ./make_dump Segmentation fault (core dumped) # ls -l -rw--- 1 root root 331776 Feb 4 18:49 host_core.2351 -rwxr-xr-x 1 root root 759731 Feb 4 18:01 make_dump # Signed-off-by: Zhao Lei --- fs/coredump.c | 25 -- include/linux/pid_namespace.h | 3 +++ kernel/pid.c | 2 ++ kernel/pid_namespace.c| 2 ++ kernel/sysctl.c | 50 ++- 5 files changed, 70 insertions(+), 12 deletions(-) diff
--git a/fs/coredump.c b/fs/coredump.c index aa2ef6c..f97a987 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -49,7 +49,6 @@ int core_uses_pid; unsigned int core_pipe_limit; -char core_pattern[CORENAME_MAX_SIZE] = "core"; static int core_name_size = CORENAME_MAX_SIZE; struct core_name { @@ -57,8 +56,6 @@ struct core_name { int used, size; }; -/* The maximal length of core_pattern is also specified in sysctl.c */ - static int expand_corename(struct core_name *cn, int size) { char *corename = krealloc(cn->corename, size, GFP_KERNEL); @@ -183,10 +180,10 @@ static int cn_print_exe_file(struct core_name *cn) * name into corename, which must have space for at least * CORENAME_MAX_SIZE bytes plus one byte for the zero terminator. */ -static int format_corename(struct core_name *cn, struct coredump_params *cprm) +static int format_corename(struct core_name *cn, const char *pat_ptr, + struct coredump_params *cprm) { const struct cred *cred = current_cred(); - const char *pat_ptr = core_pattern; int ispipe = (*pat_ptr == '|'); int pid_in_pattern = 0; int err = 0; @@ -663,6 +660,8 @@ void do_coredump(const siginfo_t *siginfo) */ .mm_flags = mm->flags, }; + struct pid_namespace *pid_ns; + char core_pattern[CORENAME_MAX_SIZE]; audit_core_dumps(siginfo->si_signo); @@ -672,6 +671,18 @@ void do_coredump(const siginfo_t *siginfo) if (!__get_dumpable(cprm.mm_flags)) goto fail; + pid_ns = task_active_pid_ns(current); + spin_lock(&pid_ns->core_pattern_lock); + while (pid_ns != &init_pid_ns) { + if (pid_ns->core_pattern[0]) + break; + spin_unlock(&pid_ns->core_pattern_lock); + pid_ns = pid_ns->parent, + spin_lock(&pid_ns->core_pattern_lock); + } + strcpy(core_pattern, pid_ns->core_pattern); + spin_unlock(&pid_ns->core_pattern_lock); + cred = prepare_creds(); if (!cred) goto fail; @@ -693,7 +704,7 @@ void do_coredump(const siginfo_t *siginfo) old_cred = override_creds(cred); - ispipe = format_corename(&cn, &cprm); + ispipe = format_corename(&cn, core_pattern, &cprm); if (ispipe) { int 
dump_count; @@ -740,7 +751,7 @@ void do_coredump(const siginfo_t *siginfo) } rcu_read_lock(); - vinit_task = find_task_by_vpid(1); + vinit_task = find_task_by_pid_ns(1, pid_ns); rcu_read_unlock(); if (!vinit_task) { print
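The core of the patch above is the lookup loop added to do_coredump(): walk from the task's pid namespace toward init_pid_ns and use the first namespace whose core_pattern is non-empty. That can be sketched in userspace (the spinlocks taken in the real code are omitted, and the struct layout here is a toy, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the per-namespace core_pattern lookup: each namespace
 * may override the pattern; an empty pattern means "inherit from the
 * parent", and the walk stops at init_pid_ns (parent == NULL).
 */
struct model_pid_ns {
    struct model_pid_ns *parent;    /* NULL for init_pid_ns */
    char core_pattern[128];
};

static const char *effective_core_pattern(struct model_pid_ns *ns)
{
    while (ns->parent != NULL && ns->core_pattern[0] == '\0')
        ns = ns->parent;
    return ns->core_pattern;
}
```

This matches the test transcript in the commit message: a container that never writes core_pattern resolves to the host's "host_core", while one that writes "container_core" uses its own value.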
[PATCH v4 1/3] Make call_usermodehelper_exec possible to set namespaces
Currently call_usermodehelper_work() cannot set namespaces for the executed program. This patch adds that capability. init_intermediate is introduced for initialization work that should be done before fork(), so that we have a way to set namespaces for children. cleanup_intermediate is introduced to clean up what was done in init_intermediate, like switching the namespace back. This function is helpful for coredump to run the pipe program in a specific container environment. Signed-off-by: Cao Shufeng Co-author-by: Zhao Lei --- fs/coredump.c | 3 ++- include/linux/kmod.h| 4 init/do_mounts_initrd.c | 3 ++- kernel/kmod.c | 43 +++ lib/kobject_uevent.c| 3 ++- security/keys/request_key.c | 4 ++-- 6 files changed, 47 insertions(+), 13 deletions(-) diff --git a/fs/coredump.c b/fs/coredump.c index 281b768..52f2ed6 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -641,7 +641,8 @@ void do_coredump(const siginfo_t *siginfo) retval = -ENOMEM; sub_info = call_usermodehelper_setup(helper_argv[0], helper_argv, NULL, GFP_KERNEL, - umh_pipe_setup, NULL, &cprm); + NULL, NULL, umh_pipe_setup, + NULL, &cprm); if (sub_info) retval = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC); diff --git a/include/linux/kmod.h b/include/linux/kmod.h index fcfd2bf..994e429 100644 --- a/include/linux/kmod.h +++ b/include/linux/kmod.h @@ -61,6 +61,8 @@ struct subprocess_info { char **envp; int wait; int retval; + void (*init_intermediate)(struct subprocess_info *info); + void (*cleanup_intermediate)(struct subprocess_info *info); int (*init)(struct subprocess_info *info, struct cred *new); void (*cleanup)(struct subprocess_info *info); void *data; @@ -71,6 +73,8 @@ call_usermodehelper(char *path, char **argv, char **envp, int wait); extern struct subprocess_info * call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask, + void (*init_intermediate)(struct subprocess_info *info), + void (*cleanup_intermediate)(struct subprocess_info *info), int
(*init)(struct subprocess_info *info, struct cred *new), void (*cleanup)(struct subprocess_info *), void *data); diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c index a1000ca..59d11c9 100644 --- a/init/do_mounts_initrd.c +++ b/init/do_mounts_initrd.c @@ -72,7 +72,8 @@ static void __init handle_initrd(void) current->flags |= PF_FREEZER_SKIP; info = call_usermodehelper_setup("/linuxrc", argv, envp_init, -GFP_KERNEL, init_linuxrc, NULL, NULL); +GFP_KERNEL, NULL, NULL, init_linuxrc, +NULL, NULL); if (!info) return; call_usermodehelper_exec(info, UMH_WAIT_PROC); diff --git a/kernel/kmod.c b/kernel/kmod.c index 0277d12..42f5a74 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -91,7 +91,8 @@ static int call_modprobe(char *module_name, int wait) argv[4] = NULL; info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL, -NULL, free_modprobe_argv, NULL); +NULL, NULL, NULL, free_modprobe_argv, + NULL); if (!info) goto free_module_name; @@ -301,6 +302,9 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info) /* Restore default kernel sig handler */ kernel_sigaction(SIGCHLD, SIG_IGN); + if(sub_info->cleanup_intermediate) { + sub_info->cleanup_intermediate(sub_info); + } umh_complete(sub_info); } @@ -322,6 +326,9 @@ static void call_usermodehelper_exec_work(struct work_struct *work) { struct subprocess_info *sub_info = container_of(work, struct subprocess_info, work); + if(sub_info->init_intermediate) { + sub_info->init_intermediate(sub_info); + } if (sub_info->wait & UMH_WAIT_PROC) { call_usermodehelper_exec_sync(sub_info); @@ -334,6 +341,11 @@ static void call_usermodehelper_exec_work(struct work_struct *work) */ pid = kernel_thread(call_usermodehelper_exec_async, sub_info, CLONE_PARENT | SIGCHLD); + + if(sub_info->cleanup_intermediate) { + sub_info->cleanup_i
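The init_intermediate/cleanup_intermediate hook pair this patch introduces brackets the helper-spawning step in the workqueue thread, so state set up before the child is created (namespaces, in the patch) is restored afterwards. A minimal userspace model of that pattern (field names mirror struct subprocess_info; everything else is a toy):

```c
#include <assert.h>
#include <stddef.h>

struct model_subprocess_info {
    void (*init_intermediate)(struct model_subprocess_info *info);
    void (*cleanup_intermediate)(struct model_subprocess_info *info);
};

static int trace;    /* records hook ordering for the demo */

static void enter_ns(struct model_subprocess_info *info)
{
    (void)info;
    trace = trace * 10 + 1;    /* "switched into target namespaces" */
}

static void leave_ns(struct model_subprocess_info *info)
{
    (void)info;
    trace = trace * 10 + 2;    /* "switched back" */
}

/* Mirrors call_usermodehelper_exec_work(): init hook, spawn, cleanup. */
static void model_exec_work(struct model_subprocess_info *sub_info)
{
    if (sub_info->init_intermediate)
        sub_info->init_intermediate(sub_info);
    /* ... fork + exec of the usermode helper would happen here ... */
    if (sub_info->cleanup_intermediate)
        sub_info->cleanup_intermediate(sub_info);
}
```

Both hooks are optional (NULL-checked), which is why all existing call_usermodehelper_setup() callers can simply pass NULL, NULL, as the diff shows.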
[PATCH v4 0/3] Make core_pattern support namespace
This patchset includes the following function points: 1: Make it possible for the usermodehelper functions to set the pid namespace, done by: [PATCH v4 1/3] Make call_usermodehelper_exec possible to set pid namespace. 2: Let a pipe-type core_pattern write the dump into the container's rootfs, done by: [PATCH v4 2/3] Limit dump_pipe program's permission to init for container. 3: Make a separate core_pattern setting for each container, done by: [PATCH v4 3/3] Make core_pattern support namespace 4: Compatibility with the current system, also included in: [PATCH v4 3/3] Make core_pattern support namespace If a container hasn't changed the core_pattern setting, it keeps the same setting as the host. Test: 1: Passed a test script for each function of this patchset ## TEST IN HOST ## [root@kerneldev dumptest]# ./test_host Set file core_pattern: OK ./test_host: line 41: 2366 Segmentation fault (core dumped) "$SCRIPT_BASE_DIR"/make_dump Checking dumpfile: OK Set file core_pattern: OK ./test_host: line 41: 2369 Segmentation fault (core dumped) "$SCRIPT_BASE_DIR"/make_dump Checking dump_pipe triggered: OK Checking rootfs: OK Checking dumpfile: OK Checking namespace: OK Checking process list: OK Checking capabilities: OK ## TEST IN GUEST ## # ./test Segmentation fault (core dumped) Checking dump_pipe triggered: OK Checking rootfs: OK Checking dumpfile: OK Checking namespace: OK Checking process list: OK Checking cg pids: OK Checking capabilities: OK [ 64.940734] make_dump[2432]: segfault at 0 ip 0040049d sp 00007ffc4af025f0 error 6 in make_dump[40+a6000] # 2: Passed other tests (which are not easy to do in a script) by hand. Changelog v3.1-v4: 1. Remove an extra fork, pointed out by: Andrei Vagin Changelog v3-v3.1: 1. Switch "pwd" of the pipe program to the container's root fs. 2. Rebase on top of v4.9-rc1 Changelog v2->v3: 1: Fix a problem with setting the pid namespace, pointed out by: Andrei Vagin Changelog v1(RFC)->v2: 1: Add [PATCH 2/2] which was a todo in [RFC v1]. 2: Pass a test script for each function. 3: Rebase on top of v4.7.
Suggested-by: Eric W. Biederman Suggested-by: KOSAKI Motohiro Signed-off-by: Zhao Lei Signed-off-by: Cao Shufeng Cao Shufeng (2): Make call_usermodehelper_exec possible to set namespaces Limit dump_pipe program's permission to init for container Zhao Lei (1): Make core_pattern support namespace fs/coredump.c | 150 +++--- include/linux/binfmts.h | 2 + include/linux/kmod.h | 4 ++ include/linux/pid_namespace.h | 3 + init/do_mounts_initrd.c | 3 +- kernel/kmod.c | 43 +--- kernel/pid.c | 2 + kernel/pid_namespace.c| 2 + kernel/sysctl.c | 50 -- lib/kobject_uevent.c | 3 +- security/keys/request_key.c | 4 +- 11 files changed, 241 insertions(+), 25 deletions(-) -- 2.7.4
[PATCH v4 2/3] Limit dump_pipe program's permission to init for container
Currently, when we set core_pattern to a pipe, the pipe program is forked by a kthread running with root's permissions, and it writes the dumpfile into the host's filesystem. The same happens for a container: the dumper and dumpfile are also in the host (not in the container). It has the following problems: 1: Inconsistent with the file-type core_pattern When we set core_pattern to a file, the container will write the dump into the container's filesystem instead of the host's. 2: Not safe for privileged containers In a privileged container, a user can destroy the host system with the following commands: # # In a container # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern # make_dump This patch switches the dumper program's environment to that of the init task, so for a container the dumper program has the same environment as the init task in the container. This puts the dumper program in the container's filesystem and makes it write the coredump into the container's filesystem. The dumper's permissions are also limited to a subset of the container's init process. Suggested-by: Eric W. Biederman Suggested-by: KOSAKI Motohiro Signed-off-by: Cao ShuFeng --- fs/coredump.c | 126 +++- include/linux/binfmts.h | 2 + 2 files changed, 126 insertions(+), 2 deletions(-) diff --git a/fs/coredump.c b/fs/coredump.c index 52f2ed6..aa2ef6c 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -502,6 +502,45 @@ static void wait_for_dump_helpers(struct file *file) } /* + * umh_ns_setup + * set the namespaces to the base task of a container. + * we need to switch back to the original namespaces + * so that the thread of workqueue is not influenced. + * + * this method runs in workqueue kernel thread.
+ */
+static void umh_ns_setup(struct subprocess_info *info)
+{
+	struct coredump_params *cp = (struct coredump_params *)info->data;
+	struct task_struct *base_task = cp->base_task;
+
+	if (base_task) {
+		cp->current_task_nsproxy = current->nsproxy;
+		/* prevent the current namespace from being freed */
+		get_nsproxy(current->nsproxy);
+		/* Set namespaces to base_task */
+		get_nsproxy(base_task->nsproxy);
+		switch_task_namespaces(current, base_task->nsproxy);
+	}
+}
+
+/*
+ * umh_ns_cleanup
+ * clean up what we have done in umh_ns_setup.
+ *
+ * this method runs in a workqueue kernel thread.
+ */
+static void umh_ns_cleanup(struct subprocess_info *info)
+{
+	struct coredump_params *cp = (struct coredump_params *)info->data;
+	struct nsproxy *current_task_nsproxy = cp->current_task_nsproxy;
+
+	if (current_task_nsproxy) {
+		/* switch the workqueue's original namespace back */
+		switch_task_namespaces(current, current_task_nsproxy);
+	}
+}
+
+/*
  * umh_pipe_setup
  * helper function to customize the process used
  * to collect the core in userspace.
  * Specifically
@@ -516,6 +555,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 {
 	struct file *files[2];
 	struct coredump_params *cp = (struct coredump_params *)info->data;
+	struct task_struct *base_task;
+
 	int err = create_pipe_files(files, 0);
 	if (err)
 		return err;
@@ -524,10 +565,76 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	err = replace_fd(0, files[0], 0);
 	fput(files[0]);
+	if (err)
+		return err;
+
 	/* and disallow core files too */
 	current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};

-	return err;
+	base_task = cp->base_task;
+	if (base_task) {
+		const struct cred *base_cred;
+
+		/* Set fs_root to base_task */
+		spin_lock(&base_task->fs->lock);
+		set_fs_root(current->fs, &base_task->fs->root);
+		set_fs_pwd(current->fs, &base_task->fs->pwd);
+		spin_unlock(&base_task->fs->lock);
+
+		/* Set cgroup to base_task */
+		current->flags &= ~PF_NO_SETAFFINITY;
+		err = cgroup_attach_task_all(base_task, current);
+		if (err < 0)
+			return err;
+
+		/* Set cred to base_task */
+		base_cred = get_task_cred(base_task);
+
+		new->uid   = base_cred->uid;
+		new->gid   = base_cred->gid;
+		new->suid  = base_cred->suid;
+		new->sgid  = base_cred->sgid;
+		new->euid  = base_cred->euid;
+		new->egid  = base_cred->egid;
+		new->fsuid = base_cred->fsuid;
+		new->fsgid = base_cred->fsgid;
+
+		new->securebits = base_cred->securebits;
+
+		new->cap_inheritable = base_cred->cap_inheritable;
+		new->cap_permitted   = base_cred->cap_permitted;
+		new->cap_effective   = base_cred->cap_effective;
+		new->cap_bset        = base_cred->cap_
Re: [PATCH V3 1/9] PM / OPP: Reword binding supporting multiple regulators per device
On 15-11-16, 10:56, Stephen Boyd wrote:
> This is also possible from C code though.

Right, and this is what this patchset is doing right now. To make it
clear, the order of regulator names in the call
dev_pm_opp_set_regulators() is now used to communicate the order in which
entries are present in the OPP table.

> Or is there some case
> where it isn't possible if we're sharing the same table with two
> devices?

Even in that case it will be possible to set regulators separately, so
that's not a problem.

> I'm lost on when this would ever happen.

It would happen in the case of Krait for example, where CPUs manage DVFS
separately but their tables may all be the same.

> It feels like trying to keep the OPP table agnostic of the
> consuming device and the device's binding is more trouble than
> it's worth. Especially considering we have opp-shared and *-name
> now.

Right.

> > - The order in which the supplies need to be programmed. We have all
> >   agreed to do this in code instead of inferring it from DT and this
> >   patch series already does that.
>
> Agreed. Encoding a sequence into DT doesn't sound very feasible.
> How is this going to be handled though? I don't see any users of
> the code we're reviewing here, so it's hard to grasp how things
> will work. It would be really useful if we had some user of the
> code included in the patch series to get the big picture.

The TI guys would be doing it soon. The sequence will be handled by
platform-specific set_opp() callbacks now. So, there is nothing in the
core for that.

> > So, are you saying that the way this patchset does it is fine with you
> > ?
>
> That's just to handle the ordering of operations?

Not just that. The blocking question here is: "Do we want to know the
sequence in which the entries for multiple regulators are present in the
OPP nodes from the DT? Or is it fine to handle that in code?"
And AFAIU, you are saying that we'd better handle that in code, as
handling it in DT is going to be a nightmare without a new ugly property.

-- 
viresh
linux-next: build warning after merge of the phy-next tree
Hi Kishon,

After merging the phy-next tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

drivers/phy/phy-rockchip-inno-usb2.c: In function 'rockchip_chg_detect_work':
drivers/phy/phy-rockchip-inno-usb2.c:717:7: warning: 'tmout' may be used uninitialized in this function [-Wmaybe-uninitialized]
   if (tmout) {
       ^

Introduced by commit

  0c42fe48fd23 ("phy: rockchip-inno-usb2: support otg-port for rk3399")

-- 
Cheers,
Stephen Rothwell
Re: [PATCH] usb: dwc3: core: Disable USB2.0 phy suspend when dwc3 acts as host role
Hi,

On 15 November 2016 at 18:49, Felipe Balbi wrote:
>
> Hi,
>
> Baolin Wang writes:
>> When the dwc3 controller acts in the host role with a slow speed device
>> attached (like a mouse or keypad), then if we plug out the slow speed
>> device, it will time out running the deconfiguration endpoint command
>> to drop the endpoint's resources. Some xHCI command timeout log as
>> below when disconnecting one slow device:
>>
>> [ 99.807739] c0 xhci-hcd.0.auto: Port Status Change Event for port 1
>> [ 99.814699] c0 xhci-hcd.0.auto: resume root hub
>> [ 99.819992] c0 xhci-hcd.0.auto: handle_port_status: starting port polling.
>> [ 99.827808] c0 xhci-hcd.0.auto: get port status, actual port 0 status = 0x202a0
>> [ 99.835903] c0 xhci-hcd.0.auto: Get port status returned 0x10100
>> [ 99.850052] c0 xhci-hcd.0.auto: clear port connect change, actual port 0 status = 0x2a0
>> [ 99.859313] c0 xhci-hcd.0.auto: Cancel URB ffc01ed6cd00, dev 1, ep 0x81, starting at offset 0xc406d210
>> [ 99.869645] c0 xhci-hcd.0.auto: // Ding dong!
>> [ 99.874776] c0 xhci-hcd.0.auto: Stopped on Transfer TRB
>> [ 99.880713] c0 xhci-hcd.0.auto: Removing canceled TD starting at 0xc406d210 (dma).
>> [ 99.889012] c0 xhci-hcd.0.auto: Finding endpoint context
>> [ 99.895069] c0 xhci-hcd.0.auto: Cycle state = 0x1
>> [ 99.900519] c0 xhci-hcd.0.auto: New dequeue segment = ffc1112f0880 (virtual)
>> [ 99.908655] c0 xhci-hcd.0.auto: New dequeue pointer = 0xc406d220 (DMA)
>> [ 99.915927] c0 xhci-hcd.0.auto: Set TR Deq Ptr cmd, new deq seg = ffc1112f0880 (0xc406d000 dma), new deq ptr = ff8002175220 (0xc406d220 dma), new cycle = 1
>> [ 99.931242] c0 xhci-hcd.0.auto: // Ding dong!
>> [ 99.936360] c0 xhci-hcd.0.auto: Successful Set TR Deq Ptr cmd, deq = @c406d220
>> [ 99.944458] c0 xhci-hcd.0.auto: xhci_hub_status_data: stopping port polling.
>> [ 100.047619] c0 xhci-hcd.0.auto: xhci_drop_endpoint called for udev ffc01ae08800
>> [ 100.057002] c0 xhci-hcd.0.auto: drop ep 0x81, slot id 1, new drop flags = 0x8, new add flags = 0x0
>> [ 100.067878] c0 xhci-hcd.0.auto: xhci_check_bandwidth called for udev ffc01ae08800
>> [ 100.076868] c0 xhci-hcd.0.auto: New Input Control Context:
>>
>> ..
>>
>> [ 100.427252] c0 xhci-hcd.0.auto: // Ding dong!
>> [ 105.430728] c0 xhci-hcd.0.auto: Command timeout
>> [ 105.436029] c0 xhci-hcd.0.auto: Abort command ring
>> [ 113.558223] c0 xhci-hcd.0.auto: Command completion event does not match command
>> [ 113.569778] c0 xhci-hcd.0.auto: Timeout while waiting for configure endpoint command
>>
>> The reason is that it will suspend the USB phy to disable the phy clock
>> when disconnecting the slow USB device, which will hang the execution
>> of the xHCI commands that depend on the phy clock.
>>
>> Thus we should disable the USB 2.0 phy suspend feature when dwc3 acts
>> in the host role.
>>
>> Signed-off-by: Baolin Wang
>> ---
>>  drivers/usb/dwc3/core.c | 14 ++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index 9a4a5e4..0b646cf 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -565,6 +565,20 @@ static int dwc3_phy_setup(struct dwc3 *dwc)
>>  	if (dwc->revision > DWC3_REVISION_194A)
>>  		reg |= DWC3_GUSB2PHYCFG_SUSPHY;
>>
>> +	/*
>> +	 * When the dwc3 controller acts in the host role with one slow speed
>> +	 * device attached (like a mouse or keypad), then if we plug out the
>> +	 * slow speed device, it will time out running the deconfiguration
>> +	 * endpoint command. The reason is that it will suspend the USB phy
>> +	 * to disable the phy clock when disconnecting the slow speed device,
>> +	 * which will affect the execution of the xHCI commands.
>> +	 *
>> +	 * Thus we should disable the USB 2.0 phy suspend feature when dwc3
>> +	 * acts in the host role.
>> +	 */
>> +	if (dwc->dr_mode == USB_DR_MODE_HOST || dwc->dr_mode == USB_DR_MODE_OTG)
>> +		reg &= ~DWC3_GUSB2PHYCFG_SUSPHY;
>
> which version of the core you're using? Recent version (since 1.94A,

My version is 2.80a.

> IIRC) can manage core suspend automatically. Also, this patch of yours
> will cause a power consumption regression.

Yes, it can manage core suspend automatically, and that is the problem.
When plugging out one mouse or keypad device, the phy will suspend
automatically to disable the phy clock. But now the dis
Re: [PATCH -RFC] moduleparam: introduce core_param_named macro for non-modular code
Paul Gortmaker writes:
> We have the case where module_param_named() in file "foo.c" for
> parameter myparam translates that into the bootarg for the
> non-modular use case as "foo.myparam=..."
>
> The problem exists where the use case with the filename and the
> dot prefix is established, but the code is then realized to be 100%
> non-modular, or is converted to non-modular. Both of the existing
> macros like core_param() or setup_param() do not append such a
> prefix, so a straight conversion to either will break the existing
> use cases.

IMHO you should keep using moduleparam. I originally called everything
simply param(), but there was a name clash. Linus' answer was basically
that "everything is a module, even if it's not a .ko". And it's his
tree, so he must be right!

Cheers,
Rusty.