Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, Haavard Skinnemoen wrote: On Wed, 30 Jan 2008 16:26:27 +0100 michael <[EMAIL PROTECTED]> wrote: I have no idea. Could you post some more specifics about what you modified, for example a diff? ... /* The interrupt handler does not take the lock */ spin_lock_irqsave(&port->lock, flags); atmel_tx_chars(port); spin_unlock_irqrestore(&port->lock, flags); Sorry, this isn't going to work. Please post a diff with the changes you did to the driver, and whatever output you got when it crashed. It's really difficult to help you when I don't know (a) what code you're actually running, or (b) anything about the crash. Ok, but the problem is that I have some added code for using the uart with smart card in iso mode, (is never called) and the patch is not so clean. Now I return to the original patch without the spin_lock_irqsave and with the fix of buffer allocation,and I don't see the crash anymore. In full preemptive all works with threading hardirqs and sofirqs. I will do other testing before posting again. The atmel_tx_chars using the serial device registers like the interrupt routine and so I think that it is possible to have interference during send operation. No, it's only called from the tasklet, and the interrupt handler doesn't touch the TX data register. There shouldn't be any need to disable interrupts around the call to atmel_tx_chars(). In fact, this may very well be the cause of the overruns you're seeing. Haavard The overrun still remain. An lrz receive session is impossible using full preemption. I will try the dma patch too and test in iso mode for smart card. Thanks -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi Haavard diff --git a/drivers/serial/atmel_serial.c b/drivers/serial/atmel_serial.c index 477950f..c61fcc3 100644 --- a/drivers/serial/atmel_serial.c +++ b/drivers/serial/atmel_serial.c @@ -337,9 +337,12 @@ atmel_buffer_rx_char(struct uart_port *port, unsigned int status, struct circ_buf *ring = &atmel_port->rx_ring; struct atmel_uart_char *c; - if (!CIRC_SPACE(ring->head, ring->tail, ATMEL_SERIAL_RINGSIZE)) + if (!CIRC_SPACE(ring->head, ring->tail, ATMEL_SERIAL_RINGSIZE)) { + dev_err(port->dev, "RX ring buffer full, dropping data\n"); + /* Buffer overflow, ignore char */ return; + } c = &((struct atmel_uart_char *)ring->buf)[ring->head]; c->status= status; I have already tried that but I have never seen the buffer full. So tomorrow I can do other tests with the serial device. I think the the atmel_interrupt handler must check the pass_counter before return IRQ_HANDLED. Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
hi, diff --git a/drivers/serial/atmel_serial.c b/drivers/serial/atmel_serial.c index cb70cc5..f310a80 100644 --- a/drivers/serial/atmel_serial.c +++ b/drivers/serial/atmel_serial.c @@ -552,7 +552,7 @@ static irqreturn_t atmel_interrupt(int irq, void *dev_id) atmel_handle_transmit(port, pending); } while (pass_counter++ < ATMEL_ISR_PASS_LIMIT); - return IRQ_HANDLED; + return pass_counter ? IRQ_HANDLED : IRQ_NONE; } /* Just one question: Receiving with hardware handshake works without PDC? Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, All works now for me with preempt-rt. The problem is using hrtimer. I think that hrtimer are executed with interrupts disabled so, if this happen when I must receive a char, i have an overrun. The only solution was the dma support to serial device. Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, Remy Bohmer wrote: Hello All, All works now for me with preempt-rt. The problem is using hrtimer. I think that hrtimer are executed with interrupts disabled so, if this happen when I must receive a char, i have an overrun. No, they share the same interrupt line... I think that the hrtimer use and other interrupt line. The AT91SAM9260_ID_TC2. So, while the timer interrupt handler is running, the serial handler has to wait until the timer interrupt handler has finished. Notice that the HRT interrupt handler is quite heavy and takes a long time to complete. The problem is the heavy of HRT interrupt handler of course. And, as I already mentioned, related to the 1 byte FIFO and a interrupt latency of about 85us (without HRT), it is logical that you can get an overrun at the higher serial speeds... (>=115200bps) I don't have the same problem without the hrtimer, I suppose the the timer latency is not so heavy. The only solution was the dma support to serial device. Or, use flow control? Yes :) Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Add write verify on dataflash.
Add the write verification buffer to the dataflash. The mtd_dataflash has the CONFIG_DATAFLASH_WRITE_VERIFY so is better a change to Kconfig. Signed-off-by: Michael Trimarchi <[EMAIL PROTECTED]> --- fs/jffs2/wbuf.c | 12 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/fs/jffs2/wbuf.c b/fs/jffs2/wbuf.c index d1d4f27..ba49f19 100644 --- a/fs/jffs2/wbuf.c +++ b/fs/jffs2/wbuf.c @@ -1236,12 +1236,24 @@ int jffs2_dataflash_setup(struct jffs2_sb_info *c) { if (!c->wbuf) return -ENOMEM; +#ifdef CONFIG_JFFS2_FS_WBUF_VERIFY + c->wbuf_verify = kmalloc(c->wbuf_pagesize, GFP_KERNEL); + if (!c->wbuf_verify) { + kfree(c->oobbuf); + kfree(c->wbuf); + return -ENOMEM; + } +#endif + printk(KERN_INFO "JFFS2 write-buffering enabled buffer (%d) erasesize (%d)\n", c->wbuf_pagesize, c->sector_size); return 0; } void jffs2_dataflash_cleanup(struct jffs2_sb_info *c) { +#ifdef CONFIG_JFFS2_FS_WBUF_VERIFY + kfree(c->wbuf_verify); +#endif kfree(c->wbuf); } -- 1.5.2.1.174.gcd03 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[Fwd: Re: JFFS2 as rootfs on DataFlash]
--- Begin Message --- hi, I have build JFFS2 for dataflash using buildroot. It was some time ago though. You do need to have the -s 0x210 and -e 0x2100. Note that using old mtd S/W will break the JFFS2, since old MTD assumes that pages are 2^n. You need to fix the mtd S/W in the linux kernel/root fs AND you need to change the mtd used to generate the file system. I'm using the 2.6.24 tree. Take a look: Inode node at 0x2724, totlen 0x0115, #ino 21, version16, isize 461972, csize 209, dsize 1056, offset 14784 Inode node at 0x283c, totlen 0x0117, #ino 21, version17, isize 461972, csize 211, dsize 1056, offset 15840 Inode node at 0x2954, totlen 0x0334, #ino 21, version18, isize 461972, csize 752, dsize 1056, offset 16896 This a short dump of an image create with mkfs.jffs2 using your setting. Inode node at 0x002a8ae8, totlen 0x07a6, #ino700, version 4, isize 8192, csize 1890, dsize 4096, offset 4096 Inode node at 0x002a9290, totlen 0x0607, #ino700, version 5, isize12288, csize 1475, dsize 4096, offset 8192 Inode node at 0x002a9898, totlen 0x0549, #ino700, version 6, isize16384, csize 1285, dsize 4096, offset 12288 Inode node at 0x002a9de4, totlen 0x06b7, #ino700, version 7, isize20480, csize 1651, dsize 4096, offset 16384 This is created by linux, filling it just copy file. Inode node at 0x00042000, totlen 0x0a40, #ino 21, version75, isize 461972, csize 2556, dsize 4096, offset 229376 Inode node at 0x00042a40, totlen 0x0abe, #ino 21, version76, isize 461972, csize 2682, dsize 4096, offset 233472 Inode node at 0x00043500, totlen 0x0363, #ino 21, version77, isize 461972, csize 799, dsize 986, offset 237568 This is created by me using only the erase blocksize. As you can see the inode data can be 4096 and it is wrong to have 1056. Regards Michael --- End Message ---
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, the serial driver works fine. The problem seems to be related to the tclib, when I use it as a clocksource. The numbers of overruns depends on the type of files too. It is possible? Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, Haavard Skinnemoen wrote: On Tue, 05 Feb 2008 13:29:35 +0100 michael <[EMAIL PROTECTED]> wrote: Just one question: Receiving with hardware handshake works without PDC? I don't know...I haven't tried. These patches shouldn't change anything though. Haavard -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ I refer to this part of documentation: "The USART behavior when hardware handshaking is enabled is the same as the behavior in standard synchronous or asynchronous mode, except that the receiver drives the RTS pin as described below and the level on the CTS pin modifies the behavior of the transmitter as described below. Using this mode requires using the PDC channel for reception. The transmitter can handle hardware handshaking in any case." Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
jffs2 summary buffer
0x0/0x38c) from [] (jffs2_garbage_collect_live+0x3b8/0x105c) [] (jffs2_garbage_collect_live+0x0/0x105c) from [] (jffs2_garbage_collect_pass+0x65c/0x714) [] (jffs2_garbage_collect_pass+0x0/0x714) from [] (jffs2_flush_wbuf_gc+0xc4/0x198) [] (jffs2_flush_wbuf_gc+0x0/0x198) from [] (jffs2_write_super+0x44/0x48) r7:c0222874 r6:c1b28000 r5: r4:c1a81e00 [] (jffs2_write_super+0x0/0x48) from [] (sync_supers+0x74/0xac) r5:c1bf703c r4:c1bf7000 [] (sync_supers+0x0/0xac) from [] (wb_kupdate+0x50/0x140) r5:c02267ec r4:c1b29fb0 [] (wb_kupdate+0x0/0x140) from [] (pdflush+0x114/0x1d8) r5:c02267ec r4:c1b29fb0 [] (pdflush+0x0/0x1d8) from [] (kthread+0x5c/0x90) r7: r6:c0060c90 r5: r4:c1b28000 [] (kthread+0x0/0x90) from [] (do_exit+0x0/0x744) r6: r5: r4: Code: e59f0084 e1a01005 eb004440 e3a03000 (e5833000) WARNING: at kernel/exit.c:892 do_exit() [] (dump_stack+0x0/0x14) from [] (do_exit+0x44/0x744) [] (do_exit+0x0/0x744) from [] (die+0x2a0/0x2fc) [] (die+0x0/0x2fc) from [] (__do_kernel_fault+0x6c/0x7c) [] (__do_kernel_fault+0x0/0x7c) from [] (do_page_fault+0x1f8/0x214) r7:c1b29b70 r6:c02c38a0 r5:c022013c r4: [] (do_page_fault+0x0/0x214) from [] (do_DataAbort+0x3c/0xa0) [] (do_DataAbort+0x0/0xa0) from [] (__dabt_svc+0x40/0x60) Exception stack(0xc1b29b70 to 0xc1b29bb8) 9b60: 0028 0001 0001 9b80: c286f0b0 c286f4d0 0001 c024c610 0001 c024c5f8 c1b29bd0 9ba0: c0221be4 c1b29bb8 c0221be4 c0027540 6013 r8:0001 r7:c024c610 r6:0001 r5:c1b29ba4 r4: [] (consistent_sync+0x0/0xe4) from [] (spi_transfer+0x104/0x1c0) r6:c02f9500 r5:c286 r4:2286f0b0 [] (spi_transfer+0x0/0x1c0) from [] (do_spi_transfer+0x54/0x5c) [] (do_spi_transfer+0x0/0x5c) from [] (at91_dataflash_write+0x1b8/0x2a0) [] (at91_dataflash_write+0x0/0x2a0) from [] (part_write+0xa8/0xb0) [] (part_write+0x0/0xb0) from [] (jffs2_flash_writev+0x288/0x448) r6:c1a81e00 r5:41b0 r4:2100 [] (jffs2_flash_writev+0x4/0x448) from [] (jffs2_sum_write_sumnode+0x320/0x3e0) [] (jffs2_sum_write_sumnode+0x0/0x3e0) from [] (jffs2_do_reserve_space+0x80/0x354) [] (jffs2_do_reserve_space+0x0/0x354) from [] (jffs2_reserve_space_gc+0x34/0x58) [] (jffs2_reserve_space_gc+0x0/0x58) from [] (jffs2_garbage_collect_pristine+0x5c/0x38c) r7:c1a81e00 r6:c02ee458 r5:083c r4:c1fb952c [] (jffs2_garbage_collect_pristine+0x0/0x38c) from [] (jffs2_garbage_collect_live+0x3b8/0x105c) [] (jffs2_garbage_collect_live+0x0/0x105c) from [] (jffs2_garbage_collect_pass+0x65c/0x714) [] (jffs2_garbage_collect_pass+0x0/0x714) from [] (jffs2_flush_wbuf_gc+0xc4/0x198) [] (jffs2_flush_wbuf_gc+0x0/0x198) from [] (jffs2_write_super+0x44/0x48) r7:c0222874 r6:c1b28000 r5: r4:c1a81e00 [] (jffs2_write_super+0x0/0x48) from [] (sync_supers+0x74/0xac) r5:c1bf703c r4:c1bf7000 [] (sync_supers+0x0/0xac) from [] (wb_kupdate+0x50/0x140) r5:c02267ec r4:c1b29fb0 [] (wb_kupdate+0x0/0x140) from [] (pdflush+0x114/0x1d8) r5:c02267ec r4:c1b29fb0 [] (pdflush+0x0/0x1d8) from [] (kthread+0x5c/0x90) r7: r6:c0060c90 r5: r4:c1b28000 [] (kthread+0x0/0x90) from [] (do_exit+0x0/0x744) r6: r5: r4: The solution was to copy in a safely buffer. Can be this affect others drivers? Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.2.18pre2aa2 and patches for 2.2.18pre3
I hate to post just to say me too, but we couldn't run 2.2.16 for more than a few hours and even 2.2.17 would stop responding with a load average >200 right around the time of our heaviest usage and never come back. Assuming 2.2.18pre2aa2 doesn't crash in the next 2 weeks (the original problem we upgraded to fix) then we'll probably never change our kernel :) (and my associate who believes in windows won't have anything left to complain about). - Michael On Sun, 10 Sep 2000, Roeland Th. Jansen wrote: > On Thu, Sep 07, 2000 at 11:26:56PM +1100, Matthew Hawkins wrote: > > I'd like to advocate the inclusion of the majority of these patches of > > Andrea's. I've been patching most of them in for a while now simply > > because I've found my SMP system much more stable and useable. > > > I also takled with Andrea and Alan about this. 2.2.16 will kill itself > within hours on my system. With Andrea's patches, it lives for long > times. > > > -- > Grobbebol's Home | Don't give in to spammers. -o) > http://www.xs4all.nl/~bengel | Use your real e-mail address /\ > Linux 2.2.16 SMP 2x466MHz / 256 MB |on Usenet. _\_v > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
intel etherpro100 on 2.2.18p21 vs 2.2.18p17
We have several Supermicro 370DL3 boards (scsi, built into epro100, dual pentium iii) - which are giving the following ethernet card error on 2.2.18p21, but not on 2.2.18p17. This error has happened on 3 out of 4 boards with this configuration. Oct 18 12:17:34 db1 kernel: eth0: card reports no RX buffers. The above message repeats itself and the ethernet card does not work. On bootup: Oct 18 12:17:34 db1 kernel: scsi : detected 1 SCSI disk total. Oct 18 12:17:34 db1 kernel: SCSI device sda: hdwr sector= 512 bytes. Sectors= 35843671 [17501 MB] [17.5 GB] Oct 18 12:17:34 db1 kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro10$ Oct 18 12:17:34 db1 kernel: eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey V. Savochkin <[EMAIL PROTECTED]$ Oct 18 12:17:34 db1 kernel: eth0: Intel PCI EtherExpress Pro100 82557, 00:30:48:21:2F:9E, I/O at 0xd400, IRQ 31. Oct 18 12:17:34 db1 kernel: Board assembly 00-000, Physical connectors present: RJ45 Oct 18 12:17:34 db1 kernel: Primary interface chip i82555 PHY #1. Oct 18 12:17:34 db1 kernel: General self-test: passed. Oct 18 12:17:34 db1 kernel: Serial sub-system self-test: passed. Oct 18 12:17:34 db1 kernel: Internal registers self-test: passed. Oct 18 12:17:34 db1 kernel: ROM checksum self-test: passed (0x04f4518b). Oct 18 12:17:34 db1 kernel: Receiver lock-up workaround activated. Oct 18 12:17:34 db1 kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html Oct 18 12:17:34 db1 kernel: eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 Modified by Andrey V. Savochkin <[EMAIL PROTECTED]> and o$ Oct 18 12:17:34 db1 kernel: Partition check: Oct 18 12:17:34 db1 kernel: sda: sda1 sda2 < sda5 sda6 sda7 > Oct 18 12:17:34 db1 kernel: VFS: Mounted root (ext2 filesystem) readonly. Oct 18 12:17:34 db1 kernel: Freeing unused kernel memory: 52k freed Oct 18 12:17:34 db1 kernel: Adding Swap: 136512k swap-space (priority -1) Oct 18 12:17:34 db1 kernel: eth0: card reports no RX buffers. I believe this has been an ongoing issue for these intel nics? --Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
RE: intel etherpro100 on 2.2.18p21 vs 2.2.18p17
We have the SUPER 370DL3 SuperMicro boards w/ the integrated Intel NIC, unfortunately a warm boot does not help. The problem also seems to happen when I turn on the alias ip feature in the kernel under network options. On Fri, 10 Nov 2000, Allen, David B wrote: > FWIW, I have a dual-proc SuperMicro motherboard P3DM3 with integrated > Adaptec SCSI and Intel 8255x built-in NIC. > > Sometimes on a cold boot I get the "kernel: eth0: card reports no RX > buffers" that repeats, but if I follow it with a warm boot the message > doesn't appear (even on subsequent warm boots). So this is definitely > reproducible, but it doesn't happen every time. > > I can't offer much more than that, but at least you know you're not the only > one experiencing this. > > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] > Sent: Friday, November 10, 2000 9:00 AM > To: [EMAIL PROTECTED] > Subject: intel etherpro100 on 2.2.18p21 vs 2.2.18p17 > > We have several Supermicro 370DL3 boards (scsi, built into epro100, dual > pentium iii) - which are giving the following ethernet card error on > 2.2.18p21, but not on 2.2.18p17. This error has happened on 3 out of 4 > boards with this configuration. > > Oct 18 12:17:34 db1 kernel: eth0: card reports no RX buffers. > The above message repeats itself and the ethernet card does not work. > > On bootup: > > Oct 18 12:17:34 db1 kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker > http://cesdis.gsfc.nasa.gov/linux/drivers/eepro10$ > Oct 18 12:17:34 db1 kernel: eepro100.c: $Revision: 1.20.2.10 $ 2000/05/31 > Modified by Andrey V. Savochkin <[EMAIL PROTECTED]$ > Oct 18 12:17:34 db1 kernel: eth0: Intel PCI EtherExpress Pro100 82557, > 00:30:48:21:2F:9E, I/O at 0xd400, IRQ 31. > Oct 18 12:17:34 db1 kernel: Board assembly 00-000, Physical > connectors present: RJ45 > Oct 18 12:17:34 db1 kernel: Primary interface chip i82555 PHY #1. > Oct 18 12:17:34 db1 kernel: General self-test: passed. > Oct 18 12:17:34 db1 kernel: Serial sub-system self-test: passed. > Oct 18 12:17:34 db1 kernel: Internal registers self-test: passed. > Oct 18 12:17:34 db1 kernel: ROM checksum self-test: passed (0x04f4518b). > Oct 18 12:17:34 db1 kernel: Receiver lock-up workaround activated. > > Oct 18 12:17:34 db1 kernel: eth0: card reports no RX buffers. > > > I believe this has been an ongoing issue for these intel nics? > > --Michael > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
USB mouse wheel breakage was Re: Linux 2.4.5-ac5
On Wed, May 30, 2001 at 09:30:39PM +0100, Alan Cox wrote: > 2.4.5-ac4 > o Update USB hid drivers (Vojtech Pavlik) I think these changes have broken my USB wheel mouse. Events seems to be getting lost (/dev/input/mice) It only scrolls when either the scroll direction has changed or if other mouse events occur (e.g. you need to wiggle mouse from side to side to scroll down a long page in mozilla) problems seems to be in drivers/usb/hid-core.c hid_input_field line 772 for (n = 0; n < count; n++) { if (HID_MAIN_ITEM_VARIABLE & field->flags) { if ((field->flags & HID_MAIN_ITEM_RELATIVE) && !value[n]) continue; The next 2 lines are dropping the scroll wheel events (which appear in the input code as type:2, code: 8, value -1 or 1 depending on direction) if (value[n] == field->value[n]) continue; hid_process_event(hid, field, &field->usage[n], value[n]); continue; } Works fine in ac3. -- Michael. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: USB mouse wheel breakage was Re: Linux 2.4.5-ac5
On Fri, Jun 01, 2001 at 05:32:26PM -0400, Robert M. Love wrote: > I and another user thought the problem was in hid_input_field, but upon > looking I now think not. It is, check against hid.c in 2.4.5, the new code &&'s the first 2 if statements and so it now checks non-zero HID_MAIN_INPUT_RELATIVE values for new and old being the same, which AFAICT, they can and often will be. Patch against ac6 reverts back to original hid.c check :- --- ../linux.orig/drivers/usb/hid-core.cSat Jun 2 21:47:35 2001 +++ drivers/usb/hid-core.c Sat Jun 2 21:46:00 2001 @@ -773,10 +773,11 @@ if (HID_MAIN_ITEM_VARIABLE & field->flags) { - if ((field->flags & HID_MAIN_ITEM_RELATIVE) && !value[n]) - continue; - if (value[n] == field->value[n]) - continue; + if (field->flags & HID_MAIN_ITEM_RELATIVE) { + if (!value[n]) continue; + } else { + if (value[n] == field->value[n]) continue; + } hid_process_event(hid, field, &field->usage[n], value[n]); continue; } -- Michael. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Linux-cluster] GFS - updated patches
I have the same question as I asked before, how can I see GFS in "make menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel? Michael On 8/11/05, David Teigland <[EMAIL PROTECTED]> wrote: > Thanks for all the review and comments. This is a new set of patches that > incorporates the suggestions we've received. > > http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050811/broken-out/ > > Dave > > -- > Linux-cluster mailing list > [EMAIL PROTECTED] > http://www.redhat.com/mailman/listinfo/linux-cluster > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: GFS - updated patches
yes, after apply dlm.patch, I saw it! although I don't know what's "-mm". Thanks, Michael On 8/11/05, David Teigland <[EMAIL PROTECTED]> wrote: > On Thu, Aug 11, 2005 at 04:21:04PM +0800, Michael wrote: > > I have the same question as I asked before, how can I see GFS in "make > > menuconfig", after I patch gfs2-full.patch into a 2.6.12.2 kernel? > > You need to select the dlm under drivers. It's in -mm, or apply > http://redhat.com/~teigland/dlm.patch > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Linux-cluster] GFS - updated patches
Hi, Dave, I quickly applied gfs2 and dlm patches in kernel 2.6.12.2, it passed compiling but has some warning log, see attachment. maybe helpful to you. Thanks, Michael On 8/11/05, David Teigland <[EMAIL PROTECTED]> wrote: > Thanks for all the review and comments. This is a new set of patches that > incorporates the suggestions we've received. > > http://redhat.com/~teigland/gfs2/20050811/gfs2-full.patch > http://redhat.com/~teigland/gfs2/20050811/broken-out/ > > Dave > > -- > Linux-cluster mailing list > [EMAIL PROTECTED] > http://www.redhat.com/mailman/listinfo/linux-cluster > [EMAIL PROTECTED] kernel-gfs2-full-2.6.12.2]$ make SUBDIRS=fs/gfs2 LD fs/gfs2/built-in.o CC [M] fs/gfs2/acl.o CC [M] fs/gfs2/bits.o CC [M] fs/gfs2/bmap.o fs/gfs2/bmap.c: In function `find_metapath': fs/gfs2/bmap.c:320: warning: implicit declaration of function `kzalloc' fs/gfs2/bmap.c:320: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/daemon.o CC [M] fs/gfs2/dir.o fs/gfs2/dir.c: In function `leaf_dealloc': fs/gfs2/dir.c:1910: warning: implicit declaration of function `kzalloc' fs/gfs2/dir.c:1910: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/eaops.o CC [M] fs/gfs2/eattr.o CC [M] fs/gfs2/glock.o CC [M] fs/gfs2/glops.o CC [M] fs/gfs2/inode.o CC [M] fs/gfs2/ioctl.o CC [M] fs/gfs2/jdata.o CC [M] fs/gfs2/lm.o CC [M] fs/gfs2/log.o fs/gfs2/log.c: In function `gfs2_log_get_buf': fs/gfs2/log.c:363: warning: implicit declaration of function `kzalloc' fs/gfs2/log.c:363: warning: assignment makes pointer from integer without a cast fs/gfs2/log.c: In function `gfs2_log_fake_buf': fs/gfs2/log.c:393: warning: assignment makes pointer from integer without a cast fs/gfs2/log.c: In function `gfs2_log_flush_i': fs/gfs2/log.c:524: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/lops.o CC [M] fs/gfs2/lvb.o CC [M] fs/gfs2/main.o CC [M] fs/gfs2/meta_io.o CC [M] fs/gfs2/mount.o CC [M] fs/gfs2/ondisk.o CC [M] fs/gfs2/ops_address.o CC [M] fs/gfs2/ops_dentry.o CC [M] fs/gfs2/ops_export.o CC [M] fs/gfs2/ops_file.o fs/gfs2/ops_file.c: In function `readdir_bad': fs/gfs2/ops_file.c:1052: warning: implicit declaration of function `kzalloc' fs/gfs2/ops_file.c:1052: warning: assignment makes pointer from integer without a cast fs/gfs2/ops_file.c: In function `gfs2_open': fs/gfs2/ops_file.c:1218: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/ops_fstype.o CC [M] fs/gfs2/ops_inode.o CC [M] fs/gfs2/ops_super.o CC [M] fs/gfs2/ops_vm.o CC [M] fs/gfs2/page.o CC [M] fs/gfs2/proc.o CC [M] fs/gfs2/quota.o fs/gfs2/quota.c: In function `qd_alloc': fs/gfs2/quota.c:51: warning: implicit declaration of function `kzalloc' fs/gfs2/quota.c:51: warning: assignment makes pointer from integer without a cast fs/gfs2/quota.c: In function `gfs2_quota_init': fs/gfs2/quota.c:1058: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/resize.o CC [M] fs/gfs2/recovery.o CC [M] fs/gfs2/rgrp.o fs/gfs2/rgrp.c: In function `gfs2_ri_update': fs/gfs2/rgrp.c:300: warning: implicit declaration of function `kzalloc' fs/gfs2/rgrp.c:300: warning: assignment makes pointer from integer without a cast fs/gfs2/rgrp.c: In function `gfs2_alloc_get': fs/gfs2/rgrp.c:530: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/super.o fs/gfs2/super.c: In function `gfs2_jindex_hold': fs/gfs2/super.c:306: warning: implicit declaration of function `kzalloc' fs/gfs2/super.c:306: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/trans.o fs/gfs2/trans.c: In function `gfs2_trans_begin_i': fs/gfs2/trans.c:38: warning: implicit declaration of function `kzalloc' fs/gfs2/trans.c:38: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/unlinked.o fs/gfs2/unlinked.c: In function `ul_alloc': fs/gfs2/unlinked.c:154: warning: implicit declaration of function `kzalloc' fs/gfs2/unlinked.c:154: warning: assignment makes pointer from integer without a cast fs/gfs2/unlinked.c: In function `gfs2_unlinked_init': fs/gfs2/unlinked.c:342: warning: assignment makes pointer from integer without a cast CC [M] fs/gfs2/util.o LD [M] fs/gfs2/gfs2.o LD fs/gfs2/locking/dlm/built-in.o CC [M] fs/gfs2/locking/dlm/lock.o CC [M] fs/gfs2/locking/dlm/main.o CC [M] fs/gfs2/locking/dlm/mount.o CC [M] fs/gfs2/locking/dlm/sysfs.o CC [M] fs/gfs2/locking/dlm/thread.o LD [M] fs/gfs2/locking/dlm/lock_dlm.o LD fs/gfs2/locking/harness/built-in.o CC [M] fs/gfs2/locking/harness/main.o LD [M] fs/gfs2/locking/harness/lock_harness.o LD fs/gfs2/locking/nolock/built-in.o CC [M] fs/gfs2/locking/nolo
Re: Disturbing wide variation in execution time
Sheo Shanker Prasad <[EMAIL PROTECTED]> writes: [...] > When the repaired machine was started, I began to notice the disturbing wide > variation and the frequect significant slow down of the machine as exhibited > by the factor of 2 to 2.5 increased execution time of the test program as > described above. Sometimes it would be quite fast (executing at the original > 2m 40s) and sometime a factor of 2.5 slow, and sometimes with speed in > between. Something stuffed on the CPU heatsink causing thermal speed throttling? Just a wild guess. Michael. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
at91sam9260 wakeup on serial port
Hi, I implement a little patch (ndr just for a try) for the atmel serial driver atmel_serial.c to wakeup the system when it is in suspend-ram state. I reconfigure the RXD pin as a gpio in suspend function and restore it in the resume function. It is the correct way? Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, From: Remy Bohmer <[EMAIL PROTECTED]> This patch splits up the interrupt handler of the serial port into a interrupt top-half and a tasklet. The goal is to get the interrupt top-half as short as possible to minimize latencies on interrupts. But the old code also does some calls in the interrupt handler that are not allowed on preempt-RT in IRQF_NODELAY context. This handler is executed in this context because of the interrupt sharing with the timer interrupt. +static void +atmel_buffer_rx_char(struct uart_port *port, unsigned int status, +unsigned int ch) +{ + struct atmel_uart_port *atmel_port = (struct atmel_uart_port *)port; + struct circ_buf *ring = &atmel_port->rx_ring; + struct atmel_uart_char *c; I'm testing this patch on an at91sam9260 on 2.6.24-rt. I'm using this patch with the tclib support for hrtimer and the clocksource pit_clk. These are the results: - Voluntary Kernel Preemption the system (crash) - Preemptible Kernel (crash) /* * Drop the lock here since it might end up calling * uart_start(), which takes the lock. spin_unlock(&port->lock); */ tty_flip_buffer_push(port->info->tty); /* spin_lock(&port->lock); */ The same code with this comments out runs Complete Preemption (Real-Time) ok but the serials is just unusable due to too many overruns (just using lrz) The system is stable and doesn't crash reverting this patch. I don't test with the thread hardirqs active. + + if (!CIRC_SPACE(ring->head, ring->tail, ATMEL_SERIAL_RINGSIZE)) + /* Buffer overflow, ignore char */ + return; + + c = &((struct atmel_uart_char *)ring->buf)[ring->head]; + c->status= status; + c->ch= ch; + + /* Make sure the character is stored before we update head. */ + smp_wmb(); + + ring->head = (ring->head + 1) & (ATMEL_SERIAL_RINGSIZE - 1); +} + ... + port = &atmel_ports[pdev->id]; atmel_init_port(port, pdev); + ret = -ENOMEM; + data = kmalloc(ATMEL_SERIAL_RINGSIZE, GFP_KERNEL); + if (!data) + goto err_alloc_ring; + port->rx_ring.buf = data; + ret = uart_add_one_port(&atmel_uart, &port->uart); Is the kmalloc correct? maybe: data = kmalloc(ATMEL_SERIAL_RINGSIZE * sizeof(struct atmel_uart_char), GFP_KERNEL); if (ret) goto err_add_port; @@ -1013,6 +1142,9 @@ static int __devinit atmel_serial_probe(struct platform_device *pdev) return 0; err_add_port: + kfree(port->rx_ring.buf); + port->rx_ring.buf = NULL; +err_alloc_ring: if (!atmel_is_console_port(&port->uart)) { clk_disable(port->clk); clk_put(port->clk); @@ -1033,6 +1165,9 @@ static int __devexit atmel_serial_remove(struct platform_device *pdev) ret = uart_remove_one_port(&atmel_uart, port); + tasklet_kill(&atmel_port->tasklet); + kfree(atmel_port->rx_ring.buf); + Why the tasklet_kill is not done in atmel_shutdown? Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: at91sam9260 wakeup on serial port
Hi, On Mon, 28 Jan 2008 10:21:57 -0800 David Brownell <[EMAIL PROTECTED]> wrote: There's a separate WAKE_N pin that is completely asynchronous, so with some external logic, we can probably wake up the CPU all the way from Static mode if a given input state is present. But that's definitely "board specific" territory, and starting the oscillators take a _long_ time on the AP7000 (especially the 32 kHz, but then again, it barely consumes any power, so we might as well keep it running and keep the RTC going as well.) Maybe is possible to create a generic device based on the gpio to provide wakeup solutions on suspend-ram state to the peripherals that registered to him serial->register_gpio_wakeup x_driver->register_gpio_wakeup serial->suspend x_driver || || \--> gpio_power->suspend <---/ serial->resume x_driver || || \--> gpio_power->resume </ |request_irq n1 gpio_power-|request_irq n2 |request_irq n3 Create an attribute on the sysfs to add a wakeup reason to the user space. Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi On Wed, 30 Jan 2008 00:12:23 +0100 michael <[EMAIL PROTECTED]> wrote: - Voluntary Kernel Preemption the system (crash) - Preemptible Kernel (crash) Ouch. I'm assuming this is with DMA disabled? Yes, is with DMA disabled /* * Drop the lock here since it might end up calling * uart_start(), which takes the lock. spin_unlock(&port->lock); */ tty_flip_buffer_push(port->info->tty); /* spin_lock(&port->lock); */ The same code with this comments out runs Now, _that_ is strange. I can't see anything that needs protection across that call; in fact, I think we can lock a lot less than what we currently do. I explain it bad: - with spin_lock the system seems, there is no problem with Valuntary Preeption and Preemptible Kernel - with full preemption it runs but the serial line can't be used for receiving at high bit rate (using lrz) Complete Preemption (Real-Time) ok but the serials is just unusable due to too many overruns (just using lrz) Is it worse than before? IIRC Remy mentioned something about IRQF_NODELAY being the reason for moving all this code to softirq context in the first place; does the interrupt handler run in hardirq context? In the complete preemption yes. The system is stable and doesn't crash reverting this patch. I don't test with the thread hardirqs active. Ok. Is the kmalloc correct? maybe: data = kmalloc(ATMEL_SERIAL_RINGSIZE * sizeof(struct atmel_uart_char), GFP_KERNEL); I think you're right. Can you change it and see if it helps? I just change it because I have corruption on receiving buffer. All my test are done with this fix I guess I didn't test it thoroughly enough with DMA disabled...slub_debug ought to catch such things, but not until we receive enough data to actually overflow the buffer. I just test it I don't have buffer overflow. I protect with a spinlock the access to the register when we sending from the tasklet. It is correct? Why should it be? If it should, we must move the call to tasklet_init into atmel_startup too, and I don't really see the point. Ok Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, A few questions arise here to me: * What serial port is used here? (DBGU, or something else) * No DMA was used, was flow-control enabled? (cannot with DBGU) * If some other UART, why not using DMA? DBGU, so no flow control Notice that the DBGU has no flow control, and just a 1 byte FIFO (thus no fifo at all). At high speeds (e.g. >=115200) it is _likely_ that you will miss characters, nothing can prevent that. DBGU should only be used at lower speeds, or just as text console. 115200 is running fine here as text-console. Overrun are admitted using DBGU and UART1..n without flow control, but with the old version of the driver I can send a file using lrz and with the new and full preemption is impossible. I would not expect that the behaviour is worse than without the patchset, because without it it does not work at all on Preempt-RT, but also: there was done much more in interrupt context previously, so the chance of buffer overruns was much more likely in the old situation. The real interrupt handler (doing the reading from the fifo) must be as short as possible, to be able to keep up with the data flow. A simple calculation: 115200bps results in approx. 11520 bytes per second. This means that the interrupt handler must be capable of handling each byte on DBGU within 87us. With a worst case interrupt latency of about 85us, and average between 2us and 54us (on Preempt-RT and AT91RM9200), you can simply understand that this will not match, how good/fast the interrupt handling will ever be. So, I suggest to either use flow-control, or DMA for bulkdata... (thus not DBGU) Kind Regards, Remy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH -mm v4 6/9] atmel_serial: Split the interrupt handler
Hi, On Wed, 30 Jan 2008 11:29:59 +0100 michael <[EMAIL PROTECTED]> wrote: Now, _that_ is strange. I can't see anything that needs protection across that call; in fact, I think we can lock a lot less than what we currently do. I explain it bad: - with spin_lock the system seems, there is no problem with Valuntary Preeption and Preemptible Kernel - with full preemption it runs but the serial line can't be used for receiving at high bit rate (using lrz) ...but if you drop the spinlock across the call to tty_flip_buffer_push, you get an Oops? Could you post the Oops? So this code /* * Drop the lock here since it might end up calling * uart_start(), which takes the lock. */ spin_unlock(&port->lock); tty_flip_buffer_push(port->info->tty); spin_lock(&port->lock); Works with: CONFIG_PREEMPT_RT=y CONFIG_PREEMPT=y CONFIG_PREEMPT_SOFTIRQS=y CONFIG_PREEMPT_HARDIRQS=y CONFIG_PREEMPT_BKL=y but crash with: # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT_DESKTOP is not set # CONFIG_PREEMPT_RT is not set CONFIG_PREEMPT_SOFTIRQS=y # CONFIG_PREEMPT_HARDIRQS is not set # CONFIG_PREEMPT_BKL is not set CONFIG_CLASSIC_RCU=y Seems to work in the last config if I comment the spin_lock & spin_unlock call. /* * Drop the lock here since it might end up calling * uart_start(), which takes the lock. spin_unlock(&port->lock); */ tty_flip_buffer_push(port->info->tty); /* spin_lock(&port->lock); */ It is not readable because I can't compile it with debugging information (poor memory system) Complete Preemption (Real-Time) ok but the serials is just unusable due to too many overruns (just using lrz) Is it worse than before? IIRC Remy mentioned something about IRQF_NODELAY being the reason for moving all this code to softirq context in the first place; does the interrupt handler run in hardirq context? In the complete preemption yes. Which question did you answer "yes" to? That it's worse than before or that the interrupt handler runs in hardirq context (i.e. IRQF_NODELAY)? The interrupt handler run in IRQF_NODELAY context. I think you're right. Can you change it and see if it helps? I just change it because I have corruption on receiving buffer. All my test are done with this fix Ok. I guess I didn't test it thoroughly enough with DMA disabled...slub_debug ought to catch such things, but not until we receive enough data to actually overflow the buffer. I just test it I don't have buffer overflow. No, I'd expect your allocation fix to take care of that. Or did you by any chance test without the fix and with slub_debug enabled? I just meant that the buffer never fills up to its size. I protect with a spinlock the access to the register when we sending from the tasklet. It is correct? I have no idea. Could you post some more specifics about what you modified, for example a diff? ... /* The interrupt handler does not take the lock */ spin_lock_irqsave(&port->lock, flags); atmel_tx_chars(port); spin_unlock_irqrestore(&port->lock, flags); spin_lock(&port->lock); ... The atmel_tx_chars using the serial device registers like the interrupt routine and so I think that it is possible to have interference during send operation. Most of the tasklet is already protected by the spinlock, so you must be careful to avoid any lock recursion. Haavard Regards Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Oops] 2.4.5-ac14/2.4.6-pre3+Athlon+gcc3-prerelease+VIAKT133A
On Sat, Jun 16, 2001 at 04:18:01PM +0800, Richard Chan wrote: > Here's an oops from > > 1. Athlon kernel, gcc3 prerelease 14 June compiled > 2. Kernel version 2.4.5-ac14 > 3. Mobo: Soltek 75KAV (VT133A disaster??) with Athlon 1.2G C > > Any ideas? Bad compiler or bad kernel? > The problems occur in kmem_cache_. > > On this mobo and chipset I have had no luck with locally compiled > Athlon kernels at all (whether stock or -ac, RedHat gcc or gcc3-prerelease). > Me thinks something is seriously wrong with this mobo/chipset or is it > the Athlon code in gcc? FWIW, I've got 2 of these boards (with duron 800 chips) I use gcc2.95.4 in debian sid. Got it about the same time the 686b patch went into ac1 and its run flawlessly with every ac version I've used since. Didn't compile ac14, went straight to 15. -- Michael. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux crashes when i try to burn audio cd's
I'm using kernel 2.2.16-22 w/ RedHat 7.0 w/ cdrecord 1.9. I have a P133 w/ 64M RAM w/ a Smart & Friendly 2006 Plus SCSI CD-R. It burns data discs without problem but when I try to burn an audio disc Linux comes to a complete halt. I can't get any console or network response and no error messages appear or get logged. I've tried both wav and cdr sound files. In Windows the audio discs burn without problems. Thanks. *^*^*^* Michael McGlothlin <[EMAIL PROTECTED]> http://www.kavlon.com - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: i810 audio problem
"Delio Brignoli" <[EMAIL PROTECTED]> writes: > Switching from 2.4.2 to 2.4.5 breaks i810_audio on my intel MX440 based notebook: > > After some (in fact a few) seconds of playback it gets stuck until the app closes >and reopens /dev/dsp. (I do NOT use esd) [..] > It goes on until I kill the app, then it says: > > Jun 18 13:59:42 argo kernel: i810_audio: drain_dac, dma timeout? > > Any idea(s), suggestions ... What a co-incidence. I just hit this problem a few days ago. The problem here is that: 1. the dma buffer drains to zero. 2. interrupt handler set LVI to CIV. 3. app write more than a buffer size of data to dma buffer. 4. LVI is un-changed! There's a kludgey work-around I used, (never use more than 31 segments of the DMA buffer). (I.e. never use the last dmabuf->fragsize of the dma buffer). This cures the hang but it isn't an optimal solutions. --- i810_audio.c.oldTue Jun 19 11:22:56 2001 +++ i810_audio.cTue Jun 19 11:24:02 2001 @@ -1194,6 +1194,10 @@ cnt = dmabuf->dmasize - swptr; if(cnt > (dmabuf->dmasize - dmabuf->count)) cnt = dmabuf->dmasize - dmabuf->count; + + if (cnt >= dmabuf->fragsize && (dmabuf->count + cnt) >= +dmabuf->dmasize) + cnt -= dmabuf->fragsize; + spin_unlock_irqrestore(&state->card->lock, flags); if (cnt > count) A better fix _may_ be to set CIV to LVI instead of the other way around. (This assumes CIV is writeable). No testing at all; may not be a fix. Something like: diff -u i810_audio.c.old i810_audio.c --- i810_audio.c.oldTue Jun 19 11:22:56 2001 +++ i810_audio.cTue Jun 19 11:26:14 2001 @@ -807,11 +807,11 @@ * means no data on read, handle appropriately */ if(!rec && dmabuf->count == 0) { - outb(inb(port+OFF_CIV),port+OFF_LVI); + outb(inb(port+OFF_LVI),port+OFF_CIV); return; } if(rec && dmabuf->count == dmabuf->dmasize) { - outb(inb(port+OFF_CIV),port+OFF_LVI); + outb(inb(port+OFF_LVI),port+OFF_CIV); return; } /* swptr - 1 is the tail of our transfer */ but with testing and a glance at the docs. :) Michael. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.21-rc3-mm2
Hello Andrew, I found a little "hickup" in the mm kernel series since 2.6.21-rc2-mm1/mm2. 1.) appeared while boot (no VFS mounted at time) BUG: at arch/i386/mm/highmem.c:61 kmap_atomic() [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] === BUG: at arch/i386/mm/highmem.c:61 kmap_atomic() [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] === 2.) some time after when I run some ups i hit this BUG: atomic counter underflow at: [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] === Then in 2.6.21-rc3-mm2 Now for rc3-mm2 the bug of under 1.) of rc2-mm1/mm2 gone. But still here -> underflow..*huh* BUG: atomic counter underflow at: [] [] [] [] [] [] [] [] [] [] [] [] === Also I found some mis-beheviour of the Attansic "atl1 driver" Maybe I address it wrong but I don't know (sure) who is the real maintainer. Well I looked at atl1_main.c but to be honest there aren't obvious information to whom/where I should address such issues. Could you please so kind to address it to the right person? atl1: hw csum wrong pkt_flag:1600, err_flag:80 All these hickups never appeared in the latest vanilla kernel 2.6.21-rc2 + even the last git-updates 2.6.21-rc3-git3 there is also not such behaviour Thanks for your patience. Best regards Michael - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: set up new kernel with grub
Hi, Dick, Your steps work beautifully. Thanks. If you could explain a little about what happens in each step, that would be even better. # cd /usr/src/linux-2.6.20.3 If your current kernel is 2.6.20.3, edit the Makefile to add some character after "EXTRAVERSION" as EXTRAVERSION= 3x # cp .config .. # make distclean # cp ../.config . # make oldconfig # make # make modules_install # make install Regards, Mike - Original Message - From: "linux-os (Dick Johnson)" <[EMAIL PROTECTED]> To: "Michael" <[EMAIL PROTECTED]> Cc: Sent: Wednesday, April 04, 2007 10:53 AM Subject: Re: set up new kernel with grub On Wed, 4 Apr 2007, Michael wrote: Hi, I compiled a new kernel: 2.6.20.3, and hope to test it without removing my old kernel. You don't need to remove your old kernel. Log in as root. # cd /usr/src/linux-2.6.20.3 If your current kernel is 2.6.20.3, edit the Makefile to add some character after "EXTRAVERSION" as EXTRAVERSION= 3x # cp .config .. # make distclean # cp ../.config . # make oldconfig # make # make modules_install # make install The result will put the new kernel in the GRUB menu so you can always go back to the old one if the new one doesn't work. Cheers, Dick Johnson Penguin : Linux version 2.6.16.24 on an i686 machine (5592.65 BogoMips). New book: http://www.AbominableFirebug.com/ _ The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to [EMAIL PROTECTED] - and destroy all copies of this information, including any attachments, without reading or disclosing them. Thank you. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: set up new kernel with grub
Hi, Wang, Thanks for replying. So which step is doing the compilation of each module , "make oldconfig" or "make" ? For compilation, I mean the step to compile the source code to .o file. Regards, Mike - Original Message - From: "WANG Cong" <[EMAIL PROTECTED]> To: "Michael" <[EMAIL PROTECTED]> Cc: "linux-os (Dick Johnson)" <[EMAIL PROTECTED]>; Sent: Thursday, April 05, 2007 10:15 PM Subject: Re: set up new kernel with grub On Thu, Apr 05, 2007 at 12:28:03PM -0500, Michael wrote: Hi, Dick, Your steps work beautifully. Thanks. If you could explain a little about what happens in each step, that would be even better. # cd /usr/src/linux-2.6.20.3 If your current kernel is 2.6.20.3, edit the Makefile to add some character after "EXTRAVERSION" as EXTRAVERSION= 3x # cp .config .. Save your existing config file in the parent directory. # make distclean Clean the files generated by last compiling. # cp ../.config . Copy your .config back here. # make oldconfig "The make oldconfig command causes the kernel configuration process to read in your existing configuration information and then prompt you for a value for any kernel configuration variables that were not provided set the existing kernel configuration file." # make Check all changed object files, and do the final kernel image link. # make modules_install Reinstall the newly-compiled modules. # make install Copy the kernel image and system.map to /boot and modify /boot/grub/menu.lst (or lilo.conf) properly. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
set up new kernel with grub
Hi, I compiled a new kernel: 2.6.20.3, and hope to test it without removing my old kernel. Here is what I did by following http://searchenterpriselinux.techtarget.com/tip/0,289483,sid39_gci1204148,00.html : I built the kernel in a local directory, say "mydir", with kernel "bzImage-2.6.20.3" and map file "System.map". su (#become a root) cp mydir/bzImage-2.6.20.3 /boot/bzImage-2.6.20.3 cp mydir/System.map /boot/System.map-2.6.20.3 cd /boot ln -s /boot/bzImage-2.6.20.3 /boot/bzImage ln -s /boot/System.map-2.6.20.3 /boot/System.map vi grub/menu.lst add a new entry as ** title Red Hat Linux (2.6.20.3) root (hd0, 0) kernel /boot/bzImage ro root=LABEL=/ initrd /boot/initrd-2.6.20.3.img ** create an initial ram disk image by mkinitrd -k /boot/bzImage -i /boot/initrd-2.6.20.3 It reports wrong usage of "mkinitrd", and no initrd-2.6.20.3.img is created. Any idea about what's wrong with the "mkinitrd" command ? Thanks. Mike - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PCI device function not being enumerated [Was: PCMCIA not working on Panasonic Toughbook CF-29]
Have just had confirmation that the mmc_ricoh_mmc change works and both PCMCIA slots now work as intended on Panasonic Toughbook CF-29 Mk 4 and 5. Thank you to all who have made suggestions for this, your dedication to Linux is amazing and your help with this is appreciated. Stay safe. Michael. On 28/07/2020, Michael . wrote: > I have just compiled and uploaded a kernel to test for this issue, > members of the Toughbook community have been provided with the link, > though a forum discussion, to download the kernel and test it. > Hopefully we will get positive results and can confirm the > MMC_RICOH_MMC flag is the culprit. > Regards. > Stay safe. > Michael. > > On 27/02/2020, bluerocksadd...@willitsonline.com > wrote: >> Somewhere in these messages is a cluein that SD reader was involved. >> >> MK 4 and 5 have SD whilst MK 1, 2 and three do not. >> >> >> >> On 2020-02-25 22:10, Michael . wrote: >>>> Someone with access to real hardware could >>>> easily experiment with changing that magic value and seeing if it >>>> changes which function is disabled. >>> >>> One of our members has offered to supply a machine to a dev that can >>> use it to test any theory. >>> >>> It is nearly beyond the scope of the majority of us to do much more >>> than just testing. We appreciate all the effort the devs put in and >>> are willing to help in anyway we can but we aren't kernel devs. >>> >>> I, personally, use Debian. Others use Debian based distros such as MX >>> and Mint. We have been able to test many different distros such as >>> those listed in other comments but don't have the skills or expertise >>> to do much more. It is our hope that this discussion and subsequent >>> effort may enable others who prefer distros other than Debian based >>> distros can use a CF-29 (and possibly earlier) Toughbook with the >>> distro of their choice without having to rebuild a kernel so they can >>> use hardware that worked back in 2010. To do this the fix needs to be >>> at the kernel dev level not a local enthusiast level because while I >>> can rebuild a Debian kernel I can't rebuild a Fedora or Arch or >>> Slackware kernel. >>> >>> I did a search about this issue before I made initial contact late >>> last year and the issue was discovered on more than Toughbooks and >>> posted about on various sites not long after distros moved from >>> 2.6.32. It seems back then people just got new machines that didn't >>> have a 2nd slot so the search for an answer stopped. Us Toughbook >>> users are a loyal group we use our machines because they are exactly >>> what we need and they take alot of "punishment" taht other machines >>> simply cannot handle. Our machines are used rather than recycled or >>> worse still just left to sit in waste management facilities in a >>> country that the western world dumps its rubbish in, we are Linux and >>> Toughbook enthusiasts and hope to be able to keep our machines running >>> for many years to come with all their native capabilities working as >>> they were designed to but using a modern Linux instead of Windows XP >>> or Windows 7. (that wasn't a pep talk, its just an explanation of why >>> we are passionate about this). >>> >>> Let us know what you need us to do, we will let you know if we are >>> capable of it and give you any feedback you ask for. Over the weekend >>> I will try to rebuild a Debian kernel with the relevant option >>> disabled, provide it to my peers for testing and report back here what >>> the outcome is. >>> >>> Thank you all for all your time and effort, it is truly appreciated. >>> Cheers. >>> Michael. >>> >>> On 26/02/2020, Philip Langdale wrote: >>>> On Tue, 25 Feb 2020 23:51:05 -0500 >>>> Arvind Sankar wrote: >>>> >>>>> On Tue, Feb 25, 2020 at 09:12:48PM -0600, Trevor Jacobs wrote: >>>>> > That's correct, I tested a bunch of the old distros including >>>>> > slackware, and 2.6.32 is where the problem began. >>>>> > >>>>> > Also, the Panasonic Toughbook CF-29s effected that we tested are >>>>> > the later marks, MK4 and MK5 for certain. The MK2 CF-29 worked just >>>>> > fine because it has different hardware supporting the PCMCIA slots. >>>>> > I have not tested a MK3 but suspect it would work ok as it also >>>>
ATTN:
Two million dollars donated to you, contact donor on (michaeldun...@yeah.net) via email. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
ATTN:
two million dollars donated to you, contact donor on (michaeldun...@yeah.net) via email. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Problem when function alarmtimer_suspend returns 0 if time delta is zero
Thomas, thank you very much for your patch. Unfortunately currently I can only test it with a kernel 4.1.52 but i've tried to patch your new logic into my older kernel version. There seem to be rare cases where the "delta" value becomes negative. Therefore I added if(unlikely(delta < 0)) { delta = 0; } before min-check. Currently I still get returns here in the new code + if (min == KTIME_MAX) return 0; where the board afterwards is not woken up.So I think there is still something missing. I'm doing further tests and keep you informed. Again Thanks! Michael On 02.09.2019 12:57, Thomas Gleixner wrote: Michael, On Mon, 2 Sep 2019, Alexandre Belloni wrote: On 31/08/2019 20:32:06+0200, Michael wrote: currently I have a problem with the alarmtimer i'm using to cyclically wake up my i.MX6 ULL board from suspend to RAM. The problem is that in principle the timer wake ups work fine but seem to be not 100% stable. In about 1 percent the wake up alarm from suspend is missing. In my error case the alarm wake up always fails if the path "if(min==0)" is entered. If I understand this code correctly that means that when ever one of the timers in the list has a remaining tick time of zero, the function just returns 0 and continues the suspend process until it reaches suspend mode. No. That code is simply broken because it tries to handle the case where a alarmtimer nanosleep got woken up by the freezer. That's broken because it makes the delta = 0 assumption which leads to the issue you discovered. That whole cruft can be removed by switching alarmtimer nanosleep to use freezable_schedule(). That keeps the timer queued and avoids all the issues. Completely untested patch below. Thanks, tglx 8<-- kernel/time/alarmtimer.c | 57 +++ 1 file changed, 4 insertions(+), 53 deletions(-) --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -46,14 +46,6 @@ static struct alarm_base { clockid_t base_clockid; } alarm_bases[ALARM_NUMTYPE]; -#if defined(CONFIG_POSIX_TIMERS) || defined(CONFIG_RTC_CLASS) -/* freezer information to handle clock_nanosleep triggered wakeups */ -static enum alarmtimer_type freezer_alarmtype; -static ktime_t freezer_expires; -static ktime_t freezer_delta; -static DEFINE_SPINLOCK(freezer_delta_lock); -#endif - #ifdef CONFIG_RTC_CLASS static struct wakeup_source *ws; @@ -241,19 +233,12 @@ EXPORT_SYMBOL_GPL(alarm_expires_remainin */ static int alarmtimer_suspend(struct device *dev) { - ktime_t min, now, expires; + ktime_t now, expires, min = KTIME_MAX; int i, ret, type; struct rtc_device *rtc; unsigned long flags; struct rtc_time tm; - spin_lock_irqsave(&freezer_delta_lock, flags); - min = freezer_delta; - expires = freezer_expires; - type = freezer_alarmtype; - freezer_delta = 0; - spin_unlock_irqrestore(&freezer_delta_lock, flags); - rtc = alarmtimer_get_rtcdev(); /* If we have no rtcdev, just return */ if (!rtc) @@ -271,13 +256,13 @@ static int alarmtimer_suspend(struct dev if (!next) continue; delta = ktime_sub(next->expires, base->gettime()); - if (!min || (delta < min)) { + if (delta < min) { expires = next->expires; min = delta; type = i; } } - if (min == 0) + if (min == KTIME_MAX) return 0; if (ktime_to_ns(min) < 2 * NSEC_PER_SEC) { @@ -479,38 +464,6 @@ u64 alarm_forward_now(struct alarm *alar EXPORT_SYMBOL_GPL(alarm_forward_now); #ifdef CONFIG_POSIX_TIMERS - -static void alarmtimer_freezerset(ktime_t absexp, enum alarmtimer_type type) -{ - struct alarm_base *base; - unsigned long flags; - ktime_t delta; - - switch(type) { - case ALARM_REALTIME: - base = &alarm_bases[ALARM_REALTIME]; - type = ALARM_REALTIME_FREEZER; - break; - case ALARM_BOOTTIME: - base = &alarm_bases[ALARM_BOOTTIME]; - type = ALARM_BOOTTIME_FREEZER; - break; - default: - WARN_ONCE(1, "Invalid alarm type: %d\n", type); - return; - } - - delta = ktime_sub(absexp, base->gettime()); - - spin_lock_irqsave(&freezer_delta_lock, flags); - if (!freezer_delta || (delta < freezer_delta)) { - freezer_delta = delta; - freezer_expires = absexp; - freezer_alarmtype = type; - } - spin_unlock_irqrestore(&freezer_delta_lock, flags); -} - /** * clock2alarm - helper that converts from cl
Re: PCMCIA not working on Panasonic Toughbook CF-29
Thank you for your prompt reply Dominik, I have asked everyone in the discussion on Notebook review to gather the information required and either post it there so I can reply or post it here in the list if it is from someone in the CC list. Also thank you for replying to us all and not just on-list, none of us are subscribed to teh list so if a reply is only on-list none of us will receive it. Cheers. Michael. On 15/10/2019, Dominik Brodowski wrote: > On Tue, Oct 15, 2019 at 05:04:28PM +1100, Michael . wrote: >> Good afternoon kernel developers >> Please accept my apology for contacting you directly about this. A >> small group of friends, some of whom are CCed here, have come together >> to try and find a solution to a problem that originated with the >> demise of kernel 2.6:32. First some background to the issue. We are >> all users of Panasonic Toughbook CF-29 models (ranging from Mark 1 >> through to Mark 5). These Toughbooks have 2 PCMCIA card slots which >> are used by a variety of people for different purposes. On the CF-29 >> Mark 1 through to Mark 3 these slots work without problem. On the >> CF-29 Mark 4 and Mark 5 the last known kernel the top slot worked with >> was 2.6:32. This has been confirmed all all major distros by most of >> the small group of friends I mentioned earlier. >> >> Thinking it was just a kernel config issue I did some comparisons >> between Debian 6 (Squeeze), Debian 7 (Wheezy), Ubuntu 10.04, and >> Ubuntu 10.10. On all machines both slots functioned as they should >> with Debian 6 and Ubuntu 10.04 but the top slot stopped working on >> Mark 4 and Mark 5 machines on the next release with the next kernel. I >> also tested Ubuntu 10.04 and 10.10 with the 2.6:32 and 2.6:35 kernel >> and both slots worked with the 2.6:32 kernel but not the 2.6:35 >> kernel.With my comparisons I merged the config from 2.6:32 into the >> current kernel for Debian 4.19 and rebuilt the kernel, no matter what >> configuration changes I made the top slot still doesn't function on >> Mark 4 and Mark 5 machines. >> >> This issue, and its apparent start, has been confirmed on all major >> distro family groups. So this brings me, actually the small group of >> dedicated Linux users who own Panasonic Toughbook CF-29s, to contact >> you to ask for help in resolving this issue. I have some questions, >> and I realise the 2.6:32 kernel is long gone now but I'm hoping this >> is not a lost cause, what changes would have occurred between 2.6:32 >> and 2.6:33 that would have stopped the hardware working on Mark 4 and >> Mark 5 CF-29 Toughbooks but not Mark 1 through to Mark 3? Would it be >> possible to correct the problem so that the hardware on our machines >> works as it should. While we are not kernel devs or even programmers >> we are enthusiasts who love Linux and our machines and we are hoping >> that together with you and the kernel dev group we can fix this issue >> together. >> >> I have attached various tar.gz files with ls* outputs so you can see >> the information we have so far. Thank you for taking the time to read >> this. > > Is this with 16-bit PCMCIA cards, or with 32-bit CardBus cards? Either > case, > please provide the output of > > dmesg > > lspci -vvv > > and > > lspcmcia -v -v > > (ideally all for a working and non-working configuration). > > Thanks, > Dominik >
Re: PCI device function not being enumerated [Was: PCMCIA not working on Panasonic Toughbook CF-29]
Thank you Dominik for looking at this for us and passing it on. Good morning Bjorn, thank you also for looking into this for us and thank you for CCing us into this as non of us are on the mailing list. One question how do we apply this patch or is this for Dominik to try? Cheers. Michael On 22/10/2019, Bjorn Helgaas wrote: > On Sun, Oct 20, 2019 at 11:08:00AM +0200, Dominik Brodowski wrote: >> On the basis of the additional information (thanks), there might be a >> more specific path to investigate: It is that the PCI code does not >> enumerate the second cardbus bridge PCI function in the more recent 4.19 >> kernel compared to the anvient (and working) 2.6 kernel. >> >> Namely, only one CardBus bridge is recognized >> >> ... >> 06:01.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev 8b) >> 06:01.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host >> Adapter (rev 11) >> 06:02.0 Network controller: Intel Corporation PRO/Wireless 2915ABG >> [Calexico2] Network Connection (rev 05) >> ... >> >> instead of the two which really should be present: >> >> ... >> 06:01.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev 8b) >> 06:01.1 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev 8b) >> 06:01.2 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host >> Adapter (rev 11) >> 06:02.0 Network controller: Intel Corporation PRO/Wireless 2915ABG >> [Calexico2] Network Connection (rev 05) >> ... >> >> To the PCI folks: any idea on what may cause the second cardbus bridge >> PCI >> device function to be missed? Are there any command line options the >> users >> who reported this issue[*] may try? > > Thanks for the report. Could you try disabling > ricoh_mmc_fixup_rl5c476(), e.g., with the patch below (this is based > on v5.4-rc1, but you can use v4.9 if that's easier for you)? This > isn't a fix; it's just something that looks like it might be related. > >> [*] For more information, see this thread: >> >> https://lore.kernel.org/lkml/cafjuqni+knsb9wvqoahcvfyxsiqoggwom7z1aqdbebnzp_-...@mail.gmail.com/ > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c > index 320255e5e8f8..7a1e1a242506 100644 > --- a/drivers/pci/quirks.c > +++ b/drivers/pci/quirks.c > @@ -3036,38 +3036,6 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_HINT, 0x0020, > quirk_hotplug_bridge); > * #1, and this will confuse the PCI core. > */ > #ifdef CONFIG_MMC_RICOH_MMC > -static void ricoh_mmc_fixup_rl5c476(struct pci_dev *dev) > -{ > - u8 write_enable; > - u8 write_target; > - u8 disable; > - > - /* > - * Disable via CardBus interface > - * > - * This must be done via function #0 > - */ > - if (PCI_FUNC(dev->devfn)) > - return; > - > - pci_read_config_byte(dev, 0xB7, &disable); > - if (disable & 0x02) > - return; > - > - pci_read_config_byte(dev, 0x8E, &write_enable); > - pci_write_config_byte(dev, 0x8E, 0xAA); > - pci_read_config_byte(dev, 0x8D, &write_target); > - pci_write_config_byte(dev, 0x8D, 0xB7); > - pci_write_config_byte(dev, 0xB7, disable | 0x02); > - pci_write_config_byte(dev, 0x8E, write_enable); > - pci_write_config_byte(dev, 0x8D, write_target); > - > - pci_notice(dev, "proprietary Ricoh MMC controller disabled (via CardBus > function)\n"); > - pci_notice(dev, "MMC cards are now supported by standard SDHCI > controller\n"); > -} > -DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_RICOH, PCI_DEVICE_ID_RICOH_RL5C476, > ricoh_mmc_fixup_rl5c476); > -DECLARE_PCI_FIXUP_RESUME_EARLY(PCI_VENDOR_ID_RICOH, > PCI_DEVICE_ID_RICOH_RL5C476, ricoh_mmc_fixup_rl5c476); > - > static void ricoh_mmc_fixup_r5c832(struct pci_dev *dev) > { > u8 write_enable; >
Re: PCI device function not being enumerated [Was: PCMCIA not working on Panasonic Toughbook CF-29]
Thanks Domunik I'll get onto this and report back the results. On 22/10/2019, Dominik Brodowski wrote: > On Tue, Oct 22, 2019 at 05:17:12AM +1100, Michael . wrote: >> Thank you Dominik for looking at this for us and passing it on. >> >> Good morning Bjorn, thank you also for looking into this for us and >> thank you for CCing us into this as non of us are on the mailing list. >> One question how do we apply this patch or is this for Dominik to try? > > That's for you and/or other users of this hardware; I cannot test this > myself, sorry. As to how to apply the patch: you'd need to apply the patch > for the linux kernel sources, and then build a custom kernel. Some hints on > that (details depend on the distribtion): > > https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel > https://wiki.ubuntu.com/KernelTeam/GitKernelBuild > https://wiki.archlinux.org/index.php/Kernels/Arch_Build_System > https://kernelnewbies.org/KernelBuild > > Best, > Dominik >
Re: PCI device function not being enumerated [Was: PCMCIA not working on Panasonic Toughbook CF-29]
I have just compiled and uploaded a kernel to test for this issue, members of the Toughbook community have been provided with the link, though a forum discussion, to download the kernel and test it. Hopefully we will get positive results and can confirm the MMC_RICOH_MMC flag is the culprit. Regards. Stay safe. Michael. On 27/02/2020, bluerocksadd...@willitsonline.com wrote: > Somewhere in these messages is a cluein that SD reader was involved. > > MK 4 and 5 have SD whilst MK 1, 2 and three do not. > > > > On 2020-02-25 22:10, Michael . wrote: >>> Someone with access to real hardware could >>> easily experiment with changing that magic value and seeing if it >>> changes which function is disabled. >> >> One of our members has offered to supply a machine to a dev that can >> use it to test any theory. >> >> It is nearly beyond the scope of the majority of us to do much more >> than just testing. We appreciate all the effort the devs put in and >> are willing to help in anyway we can but we aren't kernel devs. >> >> I, personally, use Debian. Others use Debian based distros such as MX >> and Mint. We have been able to test many different distros such as >> those listed in other comments but don't have the skills or expertise >> to do much more. It is our hope that this discussion and subsequent >> effort may enable others who prefer distros other than Debian based >> distros can use a CF-29 (and possibly earlier) Toughbook with the >> distro of their choice without having to rebuild a kernel so they can >> use hardware that worked back in 2010. To do this the fix needs to be >> at the kernel dev level not a local enthusiast level because while I >> can rebuild a Debian kernel I can't rebuild a Fedora or Arch or >> Slackware kernel. >> >> I did a search about this issue before I made initial contact late >> last year and the issue was discovered on more than Toughbooks and >> posted about on various sites not long after distros moved from >> 2.6.32. It seems back then people just got new machines that didn't >> have a 2nd slot so the search for an answer stopped. Us Toughbook >> users are a loyal group we use our machines because they are exactly >> what we need and they take alot of "punishment" taht other machines >> simply cannot handle. Our machines are used rather than recycled or >> worse still just left to sit in waste management facilities in a >> country that the western world dumps its rubbish in, we are Linux and >> Toughbook enthusiasts and hope to be able to keep our machines running >> for many years to come with all their native capabilities working as >> they were designed to but using a modern Linux instead of Windows XP >> or Windows 7. (that wasn't a pep talk, its just an explanation of why >> we are passionate about this). >> >> Let us know what you need us to do, we will let you know if we are >> capable of it and give you any feedback you ask for. Over the weekend >> I will try to rebuild a Debian kernel with the relevant option >> disabled, provide it to my peers for testing and report back here what >> the outcome is. >> >> Thank you all for all your time and effort, it is truly appreciated. >> Cheers. >> Michael. >> >> On 26/02/2020, Philip Langdale wrote: >>> On Tue, 25 Feb 2020 23:51:05 -0500 >>> Arvind Sankar wrote: >>> >>>> On Tue, Feb 25, 2020 at 09:12:48PM -0600, Trevor Jacobs wrote: >>>> > That's correct, I tested a bunch of the old distros including >>>> > slackware, and 2.6.32 is where the problem began. >>>> > >>>> > Also, the Panasonic Toughbook CF-29s effected that we tested are >>>> > the later marks, MK4 and MK5 for certain. The MK2 CF-29 worked just >>>> > fine because it has different hardware supporting the PCMCIA slots. >>>> > I have not tested a MK3 but suspect it would work ok as it also >>>> > uses the older hardware. >>>> > >>>> > Thanks for your help guys! >>>> > Trevor >>>> > >>>> >>>> Right, the distros probably all enabled MMC_RICOH_MMC earlier than >>>> upstream. Can you test a custom kernel based off your distro kernel >>>> but just disabling that config option? That's probably the easiest >>>> fix >>>> currently, even though not ideal. Perhaps there should be a command >>>> line option to disable specific pci quirks to make this easier. >>>> >>>> An ide
Re: [PATCH RFC V2 0/5] Separate consigned (expected steal) from steal time.
On 10/22/2012 10:33 AM, Rik van Riel wrote: On 10/16/2012 10:23 PM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. How do s390 and Power systems deal with reporting that kind of information? IMHO it would be good to see what those do, so we do not end up re-inventing the wheel, and confusing admins with yet another way of reporting the information... Sorry for the delay in the response. I'm assuming you are asking about s390 and Power lpars. In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg. Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Alter steal time reporting in KVM
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + *
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 434d378..4794c95 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -353,6 +355,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 683b531..c91f4c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) struct static_key kvm_no_apic_vcpu __read_mostly; +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; +
[PATCH 5/5] Add an ioctl to communicate the consign limit to the host.
Add an ioctl to communicate the consign limit to the host. Signed-off-by: Michael Wolf --- arch/x86/kvm/x86.c |6 ++ include/linux/kvm_host.h |2 ++ include/uapi/linux/kvm.h |2 ++ virt/kvm/kvm_main.c |7 +++ 4 files changed, 17 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c91f4c9..5d57469 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, return 0; } +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement) +{ + vcpu->arch.consigned_limit = entitlement; + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { struct i387_fxsave_struct *fxsave = diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0e2212f..de13648 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void); void kvm_arch_check_processor_compat(void *rtn); int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, + long entitlement); void kvm_free_physmem(struct kvm *kvm); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0a6d6ba..86f24bb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping { #define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg) /* VM is being stopped by host */ #define KVM_KVMCLOCK_CTRL_IO(KVMIO, 0xad) +/* Set the consignment limit which will be used to separete steal time */ +#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 << 0) #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index be70035..c712fe5 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2062,6 +2062,13 @@ out_free2: r = 0; break; } + case KVM_SET_ENTITLEMENT: { + r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg); + if (r) + goto out; + r = 0; + break; + } default: r = kvm_arch_vcpu_ioctl(filp, ioctl, arg); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/5] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu capping. The amount consigned_time will be passed in using an ioctl. The time will be expressed in the number of nanoseconds to be lost in during the fixed period. The fixed period is currently 1/10th of a second. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 1865b1f..e5978b0 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/5] Alter stealtime reporting in KVM
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. CREDITS|5 Documentation/arm64/memory.txt | 12 Documentation/cgroups/memory.txt |4 .../devicetree/bindings/net/mdio-gpio.txt |9 Documentation/filesystems/proc.txt | 16 Documentation/hwmon/fam15h_power |2 Documentation/kernel-parameters.txt| 20 Documentation/networking/netdev-features.txt |2 Documentation/scheduler/numa-problem.txt | 20 MAINTAINERS| 87 + Makefile |2 arch/alpha/kernel/osf_sys.c|6 arch/arm/boot/Makefile | 10 arch/arm/boot/dts/tegra30.dtsi |4 arch/arm/include/asm/io.h |4 arch/arm/include/asm/sched_clock.h |2 arch/arm/include/asm/vfpmacros.h | 12 arch/arm/include/uapi/asm/hwcap.h |3 arch/arm/kernel/sched_clock.c | 18 arch/arm/mach-at91/at91rm9200_devices.c|2 arch/arm/mach-at91/at91sam9260_devices.c |2 arch/arm/mach-at91/at91sam9261_devices.c |2 arch/arm/mach-at91/at91sam9263_devices.c |2 arch/arm/mach-at91/at91sam9g45_devices.c | 12 arch/arm/mach-davinci/dm644x.c |3 arch/arm/mach-highbank/system.c|3 arch/arm/mach-imx/clk-gate2.c |2 arch/arm/mach-imx/ehci-imx25.c |2 arch/arm/mach-imx/ehci-imx35.c |2 arch/arm/mach-omap2/board-igep0020.c |5 arch/arm/mach-omap2/clockdomains44xx_data.c|2 arch/arm/mach-omap2/devices.c | 79 + arch/arm/mach-omap2/omap_hwmod.c | 63 + arch/arm/mach-omap2/omap_hwmod_44xx_data.c | 36 arch/arm/mach-omap2/twl-common.c |3 arch/arm/mach-omap2/vc.c |2 arch/arm/mach-pxa/hx4700.c |8 arch/arm/mach-pxa/spitz_pm.c |8 arch/arm/mm/alignment.c|2 arch/arm/plat-omap/include/plat/omap_hwmod.h |6 arch/arm/tools/Makefile|2 arch/arm/vfp/vfpmodule.c |9 arch/arm/xen/enlighten.c | 11 arch/arm/xen/hypercall.S | 14 arch/arm64/Kconfig |1 arch/arm64/include/asm/elf.h |5 arch/arm64/include/asm/fpsimd.h|5 arch/arm64/include/asm/io.h| 10 arch/arm64/include/asm/pgtable-hwdef.h |6 arch/arm64/include/asm/pgtable.h | 40 - arch/arm64/include/asm/processor.h |2 arch/arm64/include/asm/unistd.h|1 arch/arm64/kernel/perf_event.c | 10 arch/arm64/kernel/process.c| 18 arch/arm64/kernel/smp.c|3 arch/arm64/mm/init.c |2 arch/frv/Kconfig |1 arch/frv/boot/Makefile | 10 arch/frv/include/asm/unistd.h |1 arch/frv/kernel/entry.S| 28 arch/frv/kernel/process.c |5 arch/frv/mb93090-mb00/pci-dma-nommu.c |1 arch/h8300/include/asm/cache.h |3 arch/ia64/mm/init.c|1 arch/m68k/include/asm/signal.h |6 arch/mips/cavium-octeon/executive/cvmx-l2c.c | 900 arch/unicore32/include/asm/byteorder.h | 24 arch
[PATCH 2/5] Expand the steal time msr to also contain the consigned time.
Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 4180a87..ac357b3 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c2e077c..b21d92d 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 8d859da..593b647 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/5] Add the code to send the consigned time from the host to the guest
Add the code to send the consigned time from the host to the guest. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/kvm_para.h |3 ++- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/kernel/kvm.c |3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h|2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b2e11f4..434d378 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -426,6 +426,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index eb3e9d8..1763369 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a5f9f30..d39e8d0 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index ac357b3..4439a5c 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1eefebe..683b531 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e5978b0..91afaa3 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); extern void account_process_tick(struct task_struct *, int user); diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 593b647..53bd0be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset, } /* + * This accounts for the time that is split out of steal time. + *
[PATCH 4/5] Add a timer to allow the separation of consigned from steal time.
Add a timer to the host. This will define the period. During a period the first n ticks will go into the consigned bucket. Any other ticks that occur within the period will be placed in the stealtime bucket. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |2 +- arch/x86/kvm/x86.c | 42 ++- 3 files changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 434d378..4794c95 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -41,6 +41,8 @@ #define KVM_PIO_PAGE_OFFSET 1 #define KVM_COALESCED_MMIO_PAGE_OFFSET 2 +#define KVM_STEAL_TIMER_DELAY 1UL + #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ | X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \ @@ -353,6 +355,14 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + u64 consigned_limit; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index d39e8d0..6db79f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,7 +196,7 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 683b531..c91f4c9 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_limit; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu) struct static_key kvm_no_apic_vcpu __read_mostly; +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, KVM_STEAL_TIMER_DELAY)); + return HRTIMER_RESTART; +} + int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) { struct page *page; struct kvm *kvm; int r; + ktime_t ktime; BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; @@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) kvm_async_pf_hash_reset(vcpu); kvm_pmu_init(vcpu); + /* Initialize and start a timer to capture steal and consigned time */ + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); return 0; fail_free_mce_banks: @@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { int idx; +
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: "In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg." Yes, but we still get questions from users asking what is steal time? why am I seeing this? Now, that is a *way* more sensible thing to do. Much more. "Confusing users" is something extremely subjective. This is specially true about concepts that are know for quite some time, like steal time. If you out of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC then qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c
Re: [PATCH v2] drivers/of: Constify device_node->name and ->path_component_name
> Neither of these should ever be changed once set. Make them const and > fix up the users that try to modify it in-place. In one case > kmalloc+memcpy is replaced with kstrdup() to avoid modifying the string. > > Build tested with defconfigs on ARM, PowerPC, Sparc, MIPS, x86 among > others. Grant, This breaks powerpc chroma_defconfig in next-20121127 with: arch/powerpc/sysdev/scom.c:160:17: error: assignment discards 'const' qualifier from pointer target type [-Werror] The following fixes it. The change is to generic code, so I'm not sure it's the right fix as it may break other configs/archs. diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h index 66c434f..77f64e4 100644 --- a/include/linux/debugfs.h +++ b/include/linux/debugfs.h @@ -23,7 +23,7 @@ struct file_operations; struct debugfs_blob_wrapper { - void *data; + const void *data; unsigned long size; }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND 1/3] printk: convert byte-buffer to variable-length record buffer
Hi Kay, On Thu, May 3, 2012 at 2:29 AM, Kay Sievers wrote: > From: Kay Sievers [...] > case SYSLOG_ACTION_SIZE_UNREAD: > - error = log_end - log_start; > + raw_spin_lock_irq(&logbuf_lock); > + if (syslog_seq < log_first_seq) { > + /* messages are gone, move to first one */ > + syslog_seq = log_first_seq; > + syslog_idx = log_first_idx; > + } > + if (from_file) { > + /* > +* Short-cut for poll(/"proc/kmsg") which simply > checks > +* for pending data, not the size; return the count of > +* records, not the length. > +*/ > + error = log_next_idx - syslog_idx; > + } else { > + u64 seq; > + u32 idx; > + > + error = 0; > + seq = syslog_seq; > + idx = syslog_idx; > + while (seq < log_next_seq) { > + error += syslog_print_line(idx, NULL, 0); > + idx = log_next(idx); > + seq++; > + } > + } > + raw_spin_unlock_irq(&logbuf_lock); > break; [...] It looks as though the changes here have broken SYSLOG_ACTION_SIZE_UNREAD. On a 2.6.31 system, immediately after SYSLOG_ACTION_READ_CLEAR, a SYSLOG_ACTION_SIZE_UNREAD returns 0. On 3.5, immediately after SYSLOG_ACTION_READ_CLEAR, the value returned by SYSLOG_ACTION_SIZE_UNREAD is unchanged (i.e., assuming that the value returned was non-zero before SYSLOG_ACTION_SIZE_UNREAD, it is still nonzero afterward), even though a subsequent SYSLOG_ACTION_READ_CLEAR indicates that there are zero bytes to read. (All tests conducted with (r)syslogd stopped.) Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.
On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote: On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote: Add a consigned field. This field will hold the time lost due to capping or overcommit. The rest of the time will still show up in the steal-time field. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index a0facf3..a5f9f30 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline u64 paravirt_steal_clock(int cpu, u64 *steal) So its u64 here. { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); But not u64 here? Any particular reason? It should be void everywhere, thanks for catching that I will change the code. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable to not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allottment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. --- Michael Wolf (5): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. Add an ioctl to communicate the consign limit to the host. arch/x86/include/asm/kvm_host.h | 11 +++ arch/x86/include/asm/kvm_para.h |3 +- arch/x86/include/asm/paravirt.h |4 +-- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/kernel/kvm.c |8 ++--- arch/x86/kernel/paravirt.c|4 +-- arch/x86/kvm/x86.c| 50 - fs/proc/stat.c|9 +- include/linux/kernel_stat.h |2 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |2 + kernel/sched/core.c | 10 ++- kernel/sched/cputime.c| 21 +- kernel/sched/sched.h |2 + virt/kvm/kvm_main.c |7 + 15 files changed, 120 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:45 AM, Glauber Costa wrote: On 11/27/2012 07:10 PM, Michael Wolf wrote: On 11/27/2012 02:48 AM, Glauber Costa wrote: Hi, On 11/27/2012 12:36 AM, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. The consignment limit passed to the host will be the amount of steal time expected within a fixed period of time. Any other steal time accruing during that period will show as the traditional steal time. If you submit this again, please include a version number in your series. Will do. The patchset was sent twice yesterday by mistake. Got an error the first time and didn't think the patches went out. This has been corrected. It would also be helpful to include a small changelog about what changed between last version and this version, so we could focus on that. yes, will do that. When I took the RFC off the patches I was looking at it as a new patchset which was a mistake. I will make sure to add a changelog when I submit again. As for the rest, I answered your previous two submissions saying I don't agree with the concept. If you hadn't changed anything, resending it won't change my mind. I could of course, be mistaken or misguided. But I had also not seen any wave of support in favor of this previously, so basically I have no new data to make me believe I should see it any differently. Let's try this again: * Rik asked you in your last submission how does ppc handle this. You said, and I quote: "In the case of lpar on POWER systems they simply report steal time and do not alter it in any way. They do however report how much processor is assigned to the partition and that information is in /proc/ppc64/lparcfg." Yes, but we still get questions from users asking what is steal time? why am I seeing this? Now, that is a *way* more sensible thing to do. Much more. "Confusing users" is something extremely subjective. This is specially true about concepts that are know for quite some time, like steal time. If you out of a sudden change the meaning of this, it is sure to confuse a lot more users than it would clarify. Something like this could certainly be done. But when I was submitting the patch set as an RFC then qemu was passing a cpu percentage that would be used by the guest kernel to adjust the steal time. This percentage was being stored on the guest as a sysctl value. Avi stated he didn't like that kind of coupling, and that the value could get out of sync. Anthony stated "The guest shouldn't need to know it's entitlement. Or at least, it's up to a management tool to report that in a way that's meaningful for the guest." So perhaps I misunderstood what they were suggesting, but I took it to mean that they did not want the guest to know what the entitlement was. That the host should take care of it and just report the already adjusted data to the guest. So in this version of the code the host would use a set period for a timer and be passed essentially a number of ticks of expected steal time. The host would then use the timer to break out the steal time into consigned and steal buckets which would be reported to the guest. Both the consigned and the steal would be reported via /proc/stat. So anyone needing to see total time away could add the two fields together. The user, however, when using tools like top or vmstat would see the usage based on what the guest is entitled to. Do you have suggestions for how I can build consensus around one of the two approaches? Before I answer this, can you please detail which mechanism are you using to enforce the entitlement? Is it the cgroup cpu controller, or something else? It is setup using cpu overcommit. But the request was for something that would work in both the overcommit environment as well as when hard capping is being used. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/5] Alter steal time reporting in KVM
On 11/28/2012 02:55 PM, Glauber Costa wrote: On 11/28/2012 10:43 PM, Michael Wolf wrote: On 11/27/2012 05:24 PM, Marcelo Tosatti wrote: On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. The definition of stolen time is 'time during which the virtual CPU is runnable to not running'. Overcommit is the main scenario which steal time helps to detect. Can you describe the 'capped' case? In the capped case, the time that the guest spends waiting due to it having used its full allottment of time shows up as steal time. The way my patchset currently stands is that you would set up the bandwidth control and you would have to pass it a matching value from qemu. In the future, it would be possible to have something parse the bandwidth setting and automatically adjust the setting in the host used for steal time reporting. Ok, so correct me if I am wrong, but I believe you would be using something like the bandwidth capper in the cpu cgroup to set those entitlements, right? Yes, in the context above I'm referring to the cfs bandwidth control. Some time has passed since I last looked into it, but IIRC, after you get are out of your quota, you should be out of the runqueue. In the lovely world of KVM, we approximate steal time as runqueue time: arch/x86/kvm/x86.c: delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; vcpu->arch.st.accum_steal = delta; include/linux/sched.h: unsigned long long run_delay; /* time spent waiting on a runqueue */ So if you are out of the runqueue, you won't get steal time accounted, and then I truly fail to understand what you are doing. So I looked at something like this in the past. To make sure things haven't changed I set up a cgroup on my test server running a kernel built from the latest tip tree. [root]# cat cpu.cfs_quota_us 5 [root]# cat cpu.cfs_period_us 10 [root]# cat cpuset.cpus 1 [root]# cat cpuset.mems 0 Next I put the PID from the cpu thread into tasks. When I start a script that will hog the cpu I see the following in top on the guest Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si, 49.8%st So the steal time here is in line with the bandwidth control settings. In case I am wrong, and run_delay also includes the time you can't run because you are out of capacity, then maybe what we should do, is to just subtract it from run_delay in kvm/x86.c before we pass it on. In summary: About a year ago I was playing with this patch. It is out of date now but will give you an idea of what I was looking at. kernel/sched_fair.c |4 ++-- kernel/sched_stats.h |7 ++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c index 5c9e679..a837e4e 100644 --- a/kernel/sched_fair.c +++ b/kernel/sched_fair.c @@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) #ifdef CONFIG_FAIR_GROUP_SCHED /* we need this in update_cfs_load and load-balance functions below */ -static inline int throttled_hierarchy(struct cfs_rq *cfs_rq); +inline int throttled_hierarchy(struct cfs_rq *cfs_rq); # ifdef CONFIG_SMP static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq, int global_update) @@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq) } /* check whether cfs_rq, or any parent, is throttled */ -static inline int throttled_hierarchy(struct cfs_rq *cfs_rq) +inline int throttled_hierarchy(struct cfs_rq *cfs_rq) { return cfs_rq->throttle_count; } diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h index 87f9e36..e30ff26 100644 --- a/kernel/sched_stats.h +++ b/kernel/sched_stats.h @@ -213,14 +213,19 @@ static inline void sched_info_queued(struct task_struct *t) * sched_info_queued() to mark that it has now again started waiting on * the runqueue. */ +extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq); static inline void sched_info_depart(struct task_struct *t) { +struct task_group *tg = task_group(t); +struct cfs_rq *cfs_rq; unsigned long long delta = task_rq(t)->clock - t->sched_info.last_arrival; +cfs_rq = tg->cfs_rq[smp_processor_id()]; rq_sched_info_depart(task_rq(t), delta); -if (t->state == TASK_RUNNING) + +if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq)) sched_info_queued(t); } So then the steal time did not show on the guest. You have no value that needs to be passed around. What I did not like about this approach was * only works for cfs bandwidth control. If another type of hard limit was added to the kernel the code wou
Re: [PATCH] ARM: Fix page counting in mem_init and show_mem
On Thu, Nov 29, 2012 at 8:08 AM, Russell King - ARM Linux wrote: > On Mon, Oct 22, 2012 at 09:34:51PM -0400, Michael Spang wrote: >> for_each_bank (i, mi) { >> struct membank *bank = &mi->bank[i]; >> - unsigned int pfn1, pfn2; >> - struct page *page, *end; >> + unsigned int start, end, pfn; >> >> - pfn1 = bank_pfn_start(bank); >> - pfn2 = bank_pfn_end(bank); >> + start = bank_pfn_start(bank); >> + end = bank_pfn_end(bank); >> >> - page = pfn_to_page(pfn1); >> - end = pfn_to_page(pfn2 - 1) + 1; >> + for (pfn = start; pfn < end; pfn++) { >> + struct page *page; >> + >> + if (!pfn_valid(pfn)) >> + continue; > > This is not a very good fix; what this means is that we end up calling > pfn_valid() for each and every page in the system, and as pfn_valid() > may not be a simple test (but a search) we should avoid that when we're > iterating over all pages in the system. > > Firstly, the mem blank information is assumed from the very beginning > to be aligned with the sparsemem split-up. This comes from the previous > discontiguous implementation where this was an absolute requirement. We > continue to require that. Little confused here. On my system, there are 2 membanks and 8 sparsemem sections. Obviously, the banks have been further divided into sections by sparsemem. My problem occurs because this code assumes there's a single struct page array for the whole bank, when really there are multiple. Each struct page array is allocated in a separate call to bootmem. It's disastrous if bootmem can't allocate them contiguously. This happens on one of my devices with certain kernel options. > > Secondly, if you're worred about the stolen memory, then we need to be > iterating over the memblock information instead of the membank information. > This is slightly more complex because memblock will merge neighbouring > regions into one contiguous entry - and this needs to be split up here. > This is why I persisted with the membank stuff here as that _should_ > already be appropriately split. > > In the long run though, moving to memblock and dealing better with the > split memory maps (rather than looking up each and every page using > pfn_to_page()) is the right way to go. Thanks, Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] sched: wakeup buddy
[snip] >> It could bring the same benefit but at lower overhead, what's the point >> of computing the same value over and over again? Also, the rate limit >> thing naturally works for the soft/hard-irq case. > > Just try to confirm my understanding, so we are going to do something > like: > > if (now - wakee->last > time_limit) && wakeup_affine() > wakee->last = now > select_idle_sibling(curr_cpu) > else > select_idle_sibling(prev_cpu) > > And time_limit is some static value respect to the rate of load balance, > is that correct? > > Currently I haven't found regression by reduce the rate, but if we found > such benchmark, we may still need a way (knob or CONFIG) to disable this > limitation. I've done some fast tests on this proposal, on my 12 cpu box, the pgbench 32 clients test, for a 1000ms time_limit, the benefit is just like the 8 ref wakeup buddy, when adopt 10ms time_limit, the benefit dropped half, when time_limit is 1ms, the benefit is less than 10%. tps original43404 wakeup-buddy63024 +45.20% 1s-limit62359 +43.67% 100ms-limit 57547 +32.58% 10ms-limit 52258 +20.40% 1ms-limit 46535 +7.21% Other test items of pgbench are corresponding, and other benchmarks still inert to the changes. I'm planning to make a new patch for this approach later, in which time_limit is a knob with the default value 1ms (usually the initial value of balance_interval and the value of min_interval), that will based on the latest tip tree. Regards, Michael Wang -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] alpha: makefile: don't enforce small data model for kernel builds
On 18/03/2013, at 10:48 AM, Will Deacon wrote: Due to all of the goodness being packed into today's kernels, the resulting image isn't as slim as it once was. In light of this, don't pass -msmall-data to the tools, which results in link failures due to impossible relocations when compiling anything but the most trivial configurations. I think many of us have been using -mlarge-data when compiling with gcc-4.6 or later so maybe it is time to get the change upstream. The interesting thing is that the kernel still compiles fine with gcc-4.5 and the relocation errors only appear if compiling with gcc-4.6 or later. I had asked before on this forum what had changed with gcc-4.6 that results in the extra usage of the small data area but never got an answer. I am still curious to know. BTW, the phrase "to the tools" in the commit message makes me think immediately of the tools directory (containing perf, etc.) which is not what is intended. Matt: Are you able to collect up this and the other patches of Will and get them sent to Linus? Cheers Michael. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] hrtimer: Don't reinitialize a cpu_base's lock on CPU_UP
The current code makes the assumption that a cpu_base lock won't be held if the CPU corresponding to that cpu_base is offline, which isn't always true. If a hrtimer is not queued, then it will not be migrated by migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's cpu_base may still point to a CPU which has subsequently gone offline if the timer wasn't enqueued at the time the CPU went down. Normally this wouldn't be a problem, but a cpu_base's lock is blindly reinitialized each time a CPU is brought up. If a CPU is brought online during the period that another thread is performing a hrtimer operation on a stale hrtimer, then the lock will be reinitialized under its feet, and a SPIN_BUG() like the following will be observed: <0>[ 28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0 <0>[ 28.087078] lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: /-1, .owner_cpu: -1 <4>[ 42.451150] [] (unwind_backtrace+0x0/0x120) from [] (do_raw_spin_unlock+0x44/0xdc) <4>[ 42.460430] [] (do_raw_spin_unlock+0x44/0xdc) from [] (_raw_spin_unlock+0x8/0x30) <4>[ 42.469632] [] (_raw_spin_unlock+0x8/0x30) from [] (__hrtimer_start_range_ns+0x1e4/0x4f8) <4>[ 42.479521] [] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [] (hrtimer_start+0x20/0x28) <4>[ 42.489247] [] (hrtimer_start+0x20/0x28) from [] (rcu_idle_enter_common+0x1ac/0x320) <4>[ 42.498709] [] (rcu_idle_enter_common+0x1ac/0x320) from [] (rcu_idle_enter+0xa0/0xb8) <4>[ 42.508259] [] (rcu_idle_enter+0xa0/0xb8) from [] (cpu_idle+0x24/0xf0) <4>[ 42.516503] [] (cpu_idle+0x24/0xf0) from [] (rest_init+0x88/0xa0) <4>[ 42.524319] [] (rest_init+0x88/0xa0) from [] (start_kernel+0x3d0/0x434) As an example, this particular crash occurred when hrtimer_start() was executed on CPU #0. The code locked the hrtimer's current cpu_base corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's cpu_base to an optimal CPU which was online. In this case, it selected the cpu_base corresponding to CPU #3. Before it could proceed, CPU #1 came online and reinitialized the spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock which was reinitialized. When CPU #0 finally ended up unlocking the old cpu_base corresponding to CPU #1 so that it could switch to CPU #3, we hit this SPIN_BUG() above while in switch_hrtimer_base(). CPU #0CPU #1 ... hrtimer_start() lock_hrtimer_base(base #1) ... init_hrtimers_cpu() switch_hrtimer_base() ... ... raw_spin_lock_init(&cpu_base->lock) raw_spin_unlock(&cpu_base->lock) ... Signed-off-by: Michael Bohan --- kernel/hrtimer.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c index cc47812..14be27f 100644 --- a/kernel/hrtimer.c +++ b/kernel/hrtimer.c @@ -63,6 +63,7 @@ DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) = { + .lock = __RAW_SPIN_LOCK_UNLOCKED(hrtimer_bases.lock), .clock_base = { { @@ -1642,8 +1643,6 @@ static void __cpuinit init_hrtimers_cpu(int cpu) struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu); int i; - raw_spin_lock_init(&cpu_base->lock); - for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) { cpu_base->clock_base[i].cpu_base = cpu_base; timerqueue_init_head(&cpu_base->clock_base[i].active); -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 2/2] sched/fair: prefer a CPU in the "lowest" idle state
On 02/03/2013 01:50 AM, Sebastian Andrzej Siewior wrote: > On 01/31/2013 03:12 AM, Michael Wang wrote: >> I'm not sure, but just concern about this case: >> >> group 0 cpu 0 cpu 1 >> least idle 4 task >> >> group 1 cpu 2 cpu 3 >> 1 task 1 task >> >> The previous logical will pick group 1 and now it will take group 0, and >> that cause more imbalance, doesn't it? > > That depends on load of CPU 0 + 1 vs CPU 2 + 3. If the four tasks on > CPU1 are idle then the previous code should return group 0. > If the four tasks are running at 100% each then two of them should be > migrated to CPU0 and this point the idle state does not matter :) Hmm...may be I should make it more clear like this: Prev find_idlest_group(): cpu 0 is the least idle cpu 1 has 4 tasks on it's running queue cpu 2 has 1 task(current task) on it's running queue cpu 3 has 1 task on it's running queue and suppose no changes happen during the search, and this sd only contain 2 groups: group 0 has cpu 0 and 1 group 1 has cpu 2 and 3 So in the old world, group 0 has load 4096 (if all the task are nice 0, and let's put down the revise), group 1 has load 2048, so find_idlest_group() will return group 1 since it's the idlest. But now, since we directly using the idle group, that will be group 0, and after applied, group 0 will has 5120 load while group 1 only has 2048, and that's cause more imbalance (than 4096 : 3072). That's just flash in my mind when I saw the patch, may be not a good case or missed some thing, but since find_idlest_group() is trying to balance the load, if we want to override the rule, we need proof by logical or benchmarks. > >> May be check that state in find_idlest_cpu() will be better? > > You say to move this from find_idlest_group() to find_idlest_cpu()? Yes, since we already make sure the balance by find_idlest_group(), we only need to add some check like below in find_idlest_cpu(): if (load < min_load || (load == min_load && i == this_cpu)) { if (power state of 'idlest' < power state of 'i') continue; min_load = load; idlest = i; } That will get very limited benefit (only the case when there are multiple different power state idle cpu in the group), but is very easy to be proved by logical, doesn't it? And Namhyung mentioned some interesting implementation which may need no changes to the code in select, please take a look :) Regards, Michael Wang > >> Regards, >> Michael Wang > > Sebastian > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Enable multiple MSI feature in pSeries
On Tue, 2013-01-15 at 15:38 +0800, Mike Qiu wrote: > Currently, multiple MSI feature hasn't been enabled in pSeries, > These patches try to enbale this feature. Hi Mike, > These patches have been tested by using ipr driver, and the driver patch > has been made by Wen Xiong : So who wrote these patches? Normally we would expect the original author to post the patches if at all possible. > [PATCH 0/7] Add support for new IBM SAS controllers I would like to see the full series, including the driver enablement. > Test platform: One partition of pSeries with one cpu core(4 SMTs) and >RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) in POWER7 > OS version: SUSE Linux Enterprise Server 11 SP2 (ppc64) with 3.8-rc3 kernel > > IRQ 21 and 22 are assigned to the ipr device which support 2 mutiple MSI. > > The test results is shown by 'cat /proc/interrups': > CPU0 CPU1 CPU2 CPU3 > 21: 6 5 5 5 XICS Level host1-0 > 22:817814816813 XICS Level host1-1 This shows that you are correctly configuring two MSIs. But the key advantage of using multiple interrupts is to distribute load across CPUs and improve performance. So I would like to see some performance numbers that show that there is a real benefit for all the extra complexity in the code. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/3] Enable multiple MSI feature in pSeries
On Mon, 2013-02-04 at 11:49 +0800, Mike Qiu wrote: > > On Tue, 2013-01-15 at 15:38 +0800, Mike Qiu wrote: > > > Currently, multiple MSI feature hasn't been enabled in pSeries, > > > These patches try to enbale this feature. > > Hi Mike, > > > > > These patches have been tested by using ipr driver, and the driver patch > > > has been made by Wen Xiong : > > So who wrote these patches? Normally we would expect the original author > > to post the patches if at all possible. > Hi Michael > > These Multiple MSI patches were wrote by myself, you know this feature > has not enabled > and it need device driver to test whether it works suitable. So I test > my patches use > Wen Xiong's ipr patches, which has been send out to the maillinglist. > > I'm the original author :) Ah OK, sorry, that was more or less clear from your mail but I just misunderstood. > > > [PATCH 0/7] Add support for new IBM SAS controllers > > I would like to see the full series, including the driver enablement. > Yep, but the driver patches were wrote by Wen Xiong and has been send > out. OK, you mean this series? http://thread.gmane.org/gmane.linux.scsi/79639 > I just use her patches to test my patches. all device support Multiple > MSI can use my feature not only IBM SAS controllers, I also test my > patches use the broadcom wireless card tg3, and also works OK. You mean drivers/net/ethernet/broadcom/tg3.c ? I don't see where it calls pci_enable_msi_block() ? All devices /can/ use it, but the driver needs to be updated. Currently we have two drivers that do so (in Linus' tree), plus the updated IPR. > > > Test platform: One partition of pSeries with one cpu core(4 SMTs) and > > >RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) in > > > POWER7 > > > OS version: SUSE Linux Enterprise Server 11 SP2 (ppc64) with 3.8-rc3 > > > kernel > > > > > > IRQ 21 and 22 are assigned to the ipr device which support 2 mutiple MSI. > > > > > > The test results is shown by 'cat /proc/interrups': > > > CPU0 CPU1 CPU2 CPU3 > > > 21: 6 5 5 5 XICS Level > > > host1-0 > > > 22:817814816813 XICS Level > > > host1-1 > > This shows that you are correctly configuring two MSIs. > > > > But the key advantage of using multiple interrupts is to distribute load > > across CPUs and improve performance. So I would like to see some > > performance numbers that show that there is a real benefit for all the > > extra complexity in the code. > Yes, the system just has suport two MSIs. Anyway, I will try to do > some proformance test, to show the real benefit. > But actually it needs the driver to do so. As the data show above, it > seems there is some problems in use the interrupt, the irq 21 use few, > most use 22, I will discuss with the driver author to see why and if > she fixed, I will give out the proformance result. Yeah that would be good. I really dislike that we have a separate API for multi-MSI vs MSI-X, and pci_enable_msi_block() also pushes the contiguous power-of-2 allocation into the irq domain layer, which is unpleasant. So if we really must do multi-MSI I would like to do it differently. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3.8: possible circular locking dependency detected
On Sun, 2013-02-03 at 21:09 -0800, Christian Kujau wrote: > Hi, > > similar to what I reported earlier [0] for 3.8.0-rc1, this happens during > "ifup wlan0" (which in effect starts wpa_supplicant to bring up a Broadcom > b43 wifi network interface). The interface is working though and continues > to work over several ifup/ifdown iterations. > > The backtrace looks awfully similar to the earlier[0] report, but this > time it had b43* stuff in it so I thought I should report it. Full dmesg > and .config here: http://nerdbynature.de/bits/3.8.0-rc6/ >[0] https://lkml.org/lkml/2013/1/3/543 Actually the backtrace looks very different, that was fb vs console_lock. This one should probably go to linux-wirel...@vger.kernel.org (CC-ed). IIUI it's actually the work handling that is the problem. In b43_wireless_core_stop() it calls cancel_work_sync() with the RTNL held, but the work function (b43_request_firmware()) also takes the RTNL via ieee80211_register_hw() and wiphy_register(). eg: > [ 807.767412]CPU0CPU1 > [ 807.768561] > [ 807.769690] lock(rtnl_mutex); > [ 807.770822]process_one_work(..., > &wl->firmware_load); > [ 807.771970]lock(rtnl_mutex); > [ 807.773115] cancel_work_sync(&wl->firmware_load); cheers > [ 807.693791] > [ 807.695519] == > [ 807.697198] [ INFO: possible circular locking dependency detected ] > [ 807.698890] 3.8.0-rc6-8-g8b31849 #1 Not tainted > [ 807.700573] --- > [ 807.702255] wpa_supplicant/4129 is trying to acquire lock: > [ 807.703925] ((&wl->firmware_load)){+.+.+.}, at: [] > flush_work+0x0/0x2b0 > [ 807.705696] > [ 807.705696] but task is already holding lock: > [ 807.709023] (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x1c/0x2c > [ 807.710743] > [ 807.710743] which lock already depends on the new lock. > [ 807.710743] > [ 807.715541] > [ 807.715541] the existing dependency chain (in reverse order) is: > [ 807.718691] > [ 807.718691] -> #1 (rtnl_mutex){+.+.+.}: > [ 807.721903][] mutex_lock_nested+0x6c/0x2bc > [ 807.723533][] rtnl_lock+0x1c/0x2c > [ 807.725138][] wiphy_register+0x510/0x53c [cfg80211] > [ 807.726798][] ieee80211_register_hw+0x3f8/0x82c > [mac80211] > [ 807.728431][] b43_request_firmware+0x8c/0x198 [b43] > [ 807.730025][] process_one_work+0x1a4/0x498 > [ 807.731549][] worker_thread+0x17c/0x428 > [ 807.733025][] kthread+0xa8/0xac > [ 807.734439][] ret_from_kernel_thread+0x64/0x6c > [ 807.735810] > [ 807.735810] -> #0 ((&wl->firmware_load)){+.+.+.}: > [ 807.738389][] lock_acquire+0x50/0x6c > [ 807.739694][] flush_work+0x3c/0x2b0 > [ 807.740980][] __cancel_work_timer+0x94/0xec > [ 807.742271][] b43_wireless_core_stop+0x5c/0x234 [b43] > [ 807.743574][] b43_op_stop+0x4c/0x88 [b43] > [ 807.744888][] ieee80211_stop_device+0x4c/0x8c [mac80211] > [ 807.746240][] ieee80211_do_stop+0x2c0/0x5dc [mac80211] > [ 807.747582][] ieee80211_stop+0x18/0x2c [mac80211] > [ 807.748925][] __dev_close_many+0xb0/0x100 > [ 807.750257][] __dev_close+0x2c/0x4c > [ 807.751560][] __dev_change_flags+0x124/0x178 > [ 807.752868][] dev_change_flags+0x1c/0x64 > [ 807.754177][] devinet_ioctl+0x69c/0x74c > [ 807.755459][] inet_ioctl+0xcc/0xf8 > [ 807.756709][] sock_ioctl+0x70/0x2e8 > [ 807.757948][] do_vfs_ioctl+0xa4/0x7c0 > [ 807.759182][] sys_ioctl+0x44/0x70 > [ 807.760407][] ret_from_syscall+0x0/0x38 > [ 807.761622] > [ 807.761622] other info that might help us debug this: > [ 807.761622] > [ 807.765107] Possible unsafe locking scenario: > [ 807.765107] > [ 807.767412]CPU0CPU1 > [ 807.768561] > [ 807.769690] lock(rtnl_mutex); > [ 807.770822]lock((&wl->firmware_load)); > [ 807.771970]lock(rtnl_mutex); > [ 807.773115] lock((&wl->firmware_load)); > [ 807.774244] > [ 807.774244] *** DEADLOCK *** > [ 807.774244] > [ 807.777405] 1 lock held by wpa_supplicant/4129: > [ 807.778475] #0: (rtnl_mutex){+.+.+.}, at: [] > rtnl_lock+0x1c/0x2c > [ 807.779628] > [ 807.779628] stack backtrace: > [ 807.781720] Call Trace: > [ 807.782765] [eea2db20] [c0009160] show_stack+0x48/0x15c (unreliable) > [ 807.784087] [eea2db60] [c04fae24] print_circular_bug+0x2b0/0x2c8 > [ 807.785169] [eea2db90] [c0071300] __lock_acquire+0x14f4/0x18c8 > [ 807.786254] [eea2dc30] [c0071b38] lock_acquire+0x50/0x6c > [ 807.787335] [eea2dc50] [c0049d58] flush_work+0x3c/0x2b0 > [ 807.788418] [eea2dcc0] [c004c30c] __cancel_work_timer+0x94/0xec > [ 807.789516] [eea2dcf0] [f64
Re: Opteron 6276 Corrected ECC errors
> On Wed, Jan 30, 2013 at 11:29:47AM -0500, Michael Madore wrote: >> Supermicro H8QGi-F server board (AMD SR5690/SR5670/SP5100 Chipset) >> 4 X AMD Opteron 6276 processors >> 32 X 8GB (256GB) DDR3-1600 ECC Registered memory >> Debian with kernel 3.2.35-2 >> >> We have received the following two hardware errors: >> >> 9/10/12 >> >> [591006.120039] [Hardware Error]: CPU:58 >> MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x9842c00c0176 >> [591006.120048] [Hardware Error]: Combined Unit Error: VB Data/ECC error. >> [591006.120052] [Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV >> >> 1/21/12 >> >> [549004.336097] [Hardware Error]: CPU:40 >> MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c3444e0001f010b >> [549004.336111] [Hardware Error]: MC4_ADDR: 0xe480 >> [549004.336117] [Hardware Error]: Northbridge Error (node 5): ECC >> Error in the Probe Filter directory. >> [549004.336125] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN >> >> If I understand correctly, both of these errors represent single bit >> corrected errors in the CPU cache. > > Internal CPU structures, victim buffer the first and the second in the > probe filter which is part of L3. > >> On both occasions the system continued to function normally after the >> error was reported. > > As expected; both are single-bit ECC errors which were corrected and > system state wasn't influenced. > >> Is receiving two such errors (on different CPUs) over such a time span >> cause for concern? > > Not really. I'd say, only if the error rate starts increasing over time > and the error types keep repeating. > >> The end user is concerned there is a serious hardware problem. I'm >> reluctant to start replacing CPUs, however, without seeing a repeated >> pattern of errors. > > Yes, no need to replace, simply watch the error rates. Maybe check the > temperature of the CPUs, possibly improve cooling are some of the things > that come to mind. Hi Boris, Thank you for the information. The system has just received a third error: [573603.432036] [Hardware Error]: CPU:32 MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9c43ccb0011c017b [573603.432045] [Hardware Error]: MC4_ADDR: 0x002782598940 [573603.432048] [Hardware Error]: Northbridge Error (node 4): L3 ECC data cache error. [573603.432054] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: EV This is on a different node than the previous two errors. And each node has it's own L3, correct? Would you still advocate watching and waiting? Thanks, Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/4] Alter steal-time reporting in the guest
In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. Tthe steal time will only be altered if hard limits (cfs bandwidth control) is used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. Changes from V2: * Dropped the ioctl that allowed qemu to send the entitlement value to the guest. * Added code to get the entitlement period and quota from cfs bandwidth. Changes from V1: * Removed the steal time allowed percentage from the guest * Moved the separation of consigned (expected steal) and steal time to the host. * No longer include a sysctl interface. --- Michael Wolf (4): Alter the amount of steal time reported by the guest. Expand the steal time msr to also contain the consigned time. Add the code to send the consigned time from the host to the guest Add a timer to allow the separation of consigned from steal time. arch/x86/include/asm/kvm_host.h | 10 + arch/x86/include/asm/paravirt.h |4 +- arch/x86/include/asm/paravirt_types.h |2 + arch/x86/include/uapi/asm/kvm_para.h |3 +- arch/x86/kernel/kvm.c |8 ++-- arch/x86/kernel/paravirt.c|4 +- arch/x86/kvm/x86.c| 64 - fs/proc/stat.c|9 - include/linux/kernel_stat.h |2 + kernel/sched/core.c | 30 +++ kernel/sched/cputime.c| 21 ++- kernel/sched/sched.h |2 + 12 files changed, 142 insertions(+), 17 deletions(-) -- Signature -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/4] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc interface. Steal time will now be broken down into steal_time and consigned_time. Consigned_time will represent the amount of time that is expected to be lost due to overcommitment of the physical cpu or by using cpu hard capping. Signed-off-by: Michael Wolf --- fs/proc/stat.c |9 +++-- include/linux/kernel_stat.h |1 + 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/proc/stat.c b/fs/proc/stat.c index e296572..cb7fe80 100644 --- a/fs/proc/stat.c +++ b/fs/proc/stat.c @@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v) int i, j; unsigned long jif; u64 user, nice, system, idle, iowait, irq, softirq, steal; - u64 guest, guest_nice; + u64 guest, guest_nice, consign; u64 sum = 0; u64 sum_softirq = 0; unsigned int per_softirq_sums[NR_SOFTIRQS] = {0}; @@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v) user = nice = system = idle = iowait = irq = softirq = steal = 0; - guest = guest_nice = 0; + guest = guest_nice = consign = 0; getboottime(&boottime); jif = boottime.tv_sec; + for_each_possible_cpu(i) { user += kcpustat_cpu(i).cpustat[CPUTIME_USER]; nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE]; @@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v) steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; sum += kstat_cpu_irqs_sum(i); sum += arch_irq_stat_cpu(i); @@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); for_each_online_cpu(i) { @@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v) steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL]; guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST]; guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE]; + consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN]; seq_printf(p, "cpu%d", i); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice)); @@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v) seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest)); seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice)); + seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign)); seq_putc(p, '\n'); } seq_printf(p, "intr %llu", (unsigned long long)sum); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 66b7078..e352052 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -28,6 +28,7 @@ enum cpu_usage_stat { CPUTIME_STEAL, CPUTIME_GUEST, CPUTIME_GUEST_NICE, + CPUTIME_CONSIGN, NR_STATS, }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/4] Expand the steal time msr to also contain the consigned time.
Expand the steal time msr to also contain the consigned time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 5edd174..9b753ea 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline void paravirt_steal_clock(int cpu, u64 *steal) { - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 142236e..5d4fc8b 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -95,7 +95,7 @@ struct pv_lazy_ops { struct pv_time_ops { unsigned long long (*sched_clock)(void); - unsigned long long (*steal_clock)(int cpu); + void (*steal_clock)(int cpu, unsigned long long *steal); unsigned long (*get_tsc_khz)(void); }; diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index fe75a28..89e5468 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static u64 kvm_steal_clock(int cpu) +static void kvm_steal_clock(int cpu, u64 *steal) { - u64 steal; struct kvm_steal_time *src; int version; @@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu) do { version = src->version; rmb(); - steal = src->steal; + *steal = src->steal; rmb(); } while ((version & 1) || (version != src->version)); - - return steal; } void kvm_disable_steal_time(void) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 26058d0..efc2652 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) */ #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING) s64 steal = 0, irq_delta = 0; + u64 consigned = 0; #endif #ifdef CONFIG_IRQ_TIME_ACCOUNTING irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time; @@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { u64 st; + u64 cs; - steal = paravirt_steal_clock(cpu_of(rq)); + paravirt_steal_clock(cpu_of(rq), &steal, &consigned); + /* +* since we are not assigning the steal time to cpustats +* here, just combine the steal and consigned times to +* do the rest of the calculations. +*/ + steal += consigned; steal -= rq->prev_steal_time_rq; if (unlikely(steal > delta)) diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 825a956..0b4f1ec 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void) if (static_key_false(¶virt_steal_enabled)) { u64 steal, st = 0; - steal = paravirt_steal_clock(smp_processor_id()); + paravirt_steal_clock(smp_processor_id(), &steal); steal -= this_rq()->prev_steal_time; st = steal_ticks(steal); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/4] Add the code to send the consigned time from the host to the guest
Change the paravirt calls that retrieve the steal-time information from the host. Add to it getting the consigned value as well as the steal time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/uapi/asm/kvm_para.h |3 ++- arch/x86/kernel/kvm.c|3 ++- arch/x86/kernel/paravirt.c |4 ++-- arch/x86/kvm/x86.c |2 ++ include/linux/kernel_stat.h |1 + kernel/sched/cputime.c | 21 +++-- kernel/sched/sched.h |2 ++ 9 files changed, 33 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dc87b65..fe5a37b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -428,6 +428,7 @@ struct kvm_vcpu_arch { u64 msr_val; u64 last_steal; u64 accum_steal; + u64 accum_consigned; struct gfn_to_hva_cache stime; struct kvm_steal_time steal; } st; diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 9b753ea..77f05e7 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline void paravirt_steal_clock(int cpu, u64 *steal) +static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned) { - PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); + PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned); } static inline unsigned long long paravirt_read_pmc(int counter) diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..55d617f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; + __u64 consigned; __u32 version; __u32 flags; - __u32 pad[12]; + __u32 pad[10]; }; #define KVM_STEAL_ALIGNMENT_BITS 5 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 89e5468..fb52f8a 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; -static void kvm_steal_clock(int cpu, u64 *steal) +static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned) { struct kvm_steal_time *src; int version; @@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal) version = src->version; rmb(); *steal = src->steal; + *consigned = src->consigned; rmb(); } while ((version & 1) || (version != src->version)); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 17fff18..3797683 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr) struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; -static u64 native_steal_clock(int cpu) +static void native_steal_clock(int cpu, u64 *steal, u64 *consigned) { - return 0; + *steal = *consigned = 0; } /* These are in entry.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c243b81..51b63d1 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu) return; vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal; + vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned; vcpu->arch.st.steal.version += 2; vcpu->arch.st.accum_steal = 0; + vcpu->arch.st.accum_consigned = 0; kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime, &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)); diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index e352052..f58ed0f 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t); extern void account_steal_time(cputime_t); +extern void account_consigned_time(cputime_t); extern void account_idle_time(cputime_t); #ifdef CONFIG_VIRT_CPU_ACCOUNTING diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c index 0b4f1ec..2a2d4be 100644 --- a/kernel/sched/cputime.c +++ b/kernel/sched/cputime.c @@ -244,6 +244,18 @@ void account_
[PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/kvm_host.h |9 ++ arch/x86/kvm/x86.c | 62 ++- kernel/sched/core.c | 20 + 3 files changed, 90 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index fe5a37b..9518613 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -355,6 +355,15 @@ struct kvm_vcpu_arch { bool tpr_access_reporting; /* +* timer used to determine if the time should be counted as +* steal time or consigned time. +*/ + struct hrtimer steal_timer; + u64 current_consigned; + s64 consigned_quota; + s64 consigned_period; + + /* * Paging state of the vcpu * * If the vcpu runs in guest mode with two level paging this still saves diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 51b63d1..79d144d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu) static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; + u64 steal_delta; + u64 consigned_delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; vcpu->arch.st.last_steal = current->sched_info.run_delay; - vcpu->arch.st.accum_steal = delta; + + /* split the delta into steal and consigned */ + if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) { + vcpu->arch.current_consigned += delta; + if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) { + steal_delta = vcpu->arch.current_consigned + - vcpu->arch.consigned_quota; + consigned_delta = delta - steal_delta; + } else { + consigned_delta = delta; + steal_delta = 0; + } + } else { + consigned_delta = 0; + steal_delta = delta; + } + vcpu->arch.st.accum_steal = steal_delta; + vcpu->arch.st.accum_consigned = consigned_delta; } static void record_steal_time(struct kvm_vcpu *vcpu) @@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu) !(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY); } +extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota, + s64 *period); +enum hrtimer_restart steal_timer_fn(struct hrtimer *data) +{ + struct kvm_vcpu *vcpu; + struct kvm *kvm; + int num_vcpus; + ktime_t now; + + vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer); + kvm = vcpu->kvm; + num_vcpus = atomic_read(&kvm->online_vcpus); + sched_use_hard_capping(vcpu->cpu, num_vcpus, + &vcpu->arch.consigned_quota, + &vcpu->arch.consigned_period); + vcpu->arch.current_consigned = 0; + now = ktime_get(); + hrtimer_forward(&vcpu->arch.steal_timer, now, + ktime_set(0, vcpu->arch.consigned_period)); + + return HRTIMER_RESTART; +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { + struct kvm *kvm; + int num_vcpus; + ktime_t ktime; + /* Address WBINVD may be executed by guest */ if (need_emulate_wbinvd(vcpu)) { if (kvm_x86_ops->has_wbinvd_exit()) @@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) kvm_migrate_timers(vcpu); vcpu->cpu = cpu; } + /* Initialize and start a timer to capture steal and consigned time */ + kvm = vcpu->kvm; + num_vcpus = atomic_read(&kvm->online_vcpus); + num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus; + sched_use_hard_capping(vcpu->cpu, num_vcpus, + &vcpu->arch.consigned_quota, + &vcpu->arch.consigned_period); + hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC, + HRTIMER_MODE_REL); + vcpu->arch.steal_timer.function = &steal_timer_fn; + ktime = ktime_set(0, vcpu->arch.consigned_period); + hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL); accumulate_steal_time(vcpu); kvm_make_request(KV
[PATCH 3/6] kvm tools: Rework stdio/stdout handling to support redirection
Currently if you redirect the output from "lkvm run" to a file then term_init() will fail, because it can't call the terminal ioctls. So check if stdin and stdout are ttys, if either is not then skip the rest of the terminal setup. Redirecting one but not the other is a little odd, but does work. Note that we skip registering the cleanup routines, so we don't need to modify them. Signed-off-by: Michael Ellerman --- tools/kvm/term.c | 15 +-- 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/tools/kvm/term.c b/tools/kvm/term.c index 4413450..fa85e4a 100644 --- a/tools/kvm/term.c +++ b/tools/kvm/term.c @@ -140,6 +140,15 @@ int term_init(struct kvm *kvm) struct termios term; int i, r; + for (i = 0; i < 4; i++) + if (term_fds[i][TERM_FD_IN] == 0) { + term_fds[i][TERM_FD_IN] = STDIN_FILENO; + term_fds[i][TERM_FD_OUT] = STDOUT_FILENO; + } + + if (!isatty(STDIN_FILENO) || !isatty(STDOUT_FILENO)) + return 0; + r = tcgetattr(STDIN_FILENO, &orig_term); if (r < 0) { pr_warning("unable to save initial standard input settings"); @@ -151,12 +160,6 @@ int term_init(struct kvm *kvm) term.c_lflag &= ~(ICANON | ECHO | ISIG); tcsetattr(STDIN_FILENO, TCSANOW, &term); - for (i = 0; i < 4; i++) - if (term_fds[i][TERM_FD_IN] == 0) { - term_fds[i][TERM_FD_IN] = STDIN_FILENO; - term_fds[i][TERM_FD_OUT] = STDOUT_FILENO; - } - signal(SIGTERM, term_sig_cleanup); atexit(term_cleanup); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 4/6] kvm tools: powerpc: Fix buglet in xics_init() handling of nrcpus
In xics_init() we set the maximum server to kvm->nrcpus, and then set the nr_servers using maximum server + 1. That is off by one, in the harmless direction. Simplify it to just set nr_servers = kvm->nrcpus. Signed-off-by: Michael Ellerman --- tools/kvm/powerpc/xics.c |5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/tools/kvm/powerpc/xics.c b/tools/kvm/powerpc/xics.c index d4b5caa..cf64a08 100644 --- a/tools/kvm/powerpc/xics.c +++ b/tools/kvm/powerpc/xics.c @@ -445,16 +445,13 @@ static void rtas_int_on(struct kvm_cpu *vcpu, uint32_t token, static int xics_init(struct kvm *kvm) { - int max_server_num; unsigned int i; struct icp_state *icp; struct ics_state *ics; int j; - max_server_num = kvm->nrcpus; - icp = malloc(sizeof(*icp)); - icp->nr_servers = max_server_num + 1; + icp->nr_servers = kvm->nrcpus; icp->ss = malloc(icp->nr_servers * sizeof(struct icp_server_state)); for (i = 0; i < icp->nr_servers; i++) { -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/6] kvm tools: More error handling in the ipc code
Add perror() calls to a couple of exit paths, to ease debugging. There are also two places where we print "Failed starting IPC thread", but one is really an epoll failure, so make that obvious. Signed-off-by: Michael Ellerman --- tools/kvm/kvm-ipc.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/tools/kvm/kvm-ipc.c b/tools/kvm/kvm-ipc.c index bdcc0d1..7897519 100644 --- a/tools/kvm/kvm-ipc.c +++ b/tools/kvm/kvm-ipc.c @@ -49,18 +49,25 @@ static int kvm__create_socket(struct kvm *kvm) } s = socket(AF_UNIX, SOCK_STREAM, 0); - if (s < 0) + if (s < 0) { + perror("socket"); return s; + } + local.sun_family = AF_UNIX; strlcpy(local.sun_path, full_name, sizeof(local.sun_path)); len = strlen(local.sun_path) + sizeof(local.sun_family); r = bind(s, (struct sockaddr *)&local, len); - if (r < 0) + if (r < 0) { + perror("bind"); goto fail; + } r = listen(s, 5); - if (r < 0) + if (r < 0) { + perror("listen"); goto fail; + } return s; @@ -430,6 +437,7 @@ int kvm_ipc__init(struct kvm *kvm) epoll_fd = epoll_create(KVM_IPC_MAX_MSGS); if (epoll_fd < 0) { + perror("epoll_create"); ret = epoll_fd; goto err; } @@ -437,13 +445,14 @@ int kvm_ipc__init(struct kvm *kvm) ev.events = EPOLLIN | EPOLLET; ev.data.fd = sock; if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock, &ev) < 0) { - pr_err("Failed starting IPC thread"); + pr_err("Failed adding socket to epoll"); ret = -EFAULT; goto err_epoll; } stop_fd = eventfd(0, 0); if (stop_fd < 0) { + perror("eventfd"); ret = stop_fd; goto err_epoll; } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/6] kvm tools: Return error status in lkvm list
Currently list always returns 0, even if there was an error. Instead have it accumulate any errors and return that. Signed-off-by: Michael Ellerman --- tools/kvm/builtin-list.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/tools/kvm/builtin-list.c b/tools/kvm/builtin-list.c index 9299f17..c35be93 100644 --- a/tools/kvm/builtin-list.c +++ b/tools/kvm/builtin-list.c @@ -123,7 +123,7 @@ static void parse_setup_options(int argc, const char **argv) int kvm_cmd_list(int argc, const char **argv, const char *prefix) { - int r; + int status, r; parse_setup_options(argc, argv); @@ -133,17 +133,23 @@ int kvm_cmd_list(int argc, const char **argv, const char *prefix) printf("%6s %-20s %s\n", "PID", "NAME", "STATE"); printf("\n"); + status = 0; + if (run) { r = kvm_list_running_instances(); if (r < 0) perror("Error listing instances"); + + status |= r; } if (rootfs) { r = kvm_list_rootfs(); if (r < 0) perror("Error listing rootfs"); + + status |= r; } - return 0; + return status; } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/6] kvm tools: powerpc: Only emit TB freq if it's non-zero
The kernel can handle a missing timebase-frequency property much better than one that claims zero. Signed-off-by: Michael Ellerman --- tools/kvm/powerpc/kvm.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/kvm/powerpc/kvm.c b/tools/kvm/powerpc/kvm.c index dc9f89d..b4b9f82 100644 --- a/tools/kvm/powerpc/kvm.c +++ b/tools/kvm/powerpc/kvm.c @@ -389,7 +389,9 @@ static int setup_fdt(struct kvm *kvm) _FDT(fdt_property_cell(fdt, "dcache-block-size", cpu_info->d_bsize)); _FDT(fdt_property_cell(fdt, "icache-block-size", cpu_info->i_bsize)); - _FDT(fdt_property_cell(fdt, "timebase-frequency", cpu_info->tb_freq)); + if (cpu_info->tb_freq) + _FDT(fdt_property_cell(fdt, "timebase-frequency", cpu_info->tb_freq)); + /* Lies, but safeish lies! */ _FDT(fdt_property_cell(fdt, "clock-frequency", 0xddbab200)); -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 5/6] kvm tools: powerpc: Add cpu info entry for POWER8
We should hard-code less of this stuff, but for now this works. Signed-off-by: Michael Ellerman --- tools/kvm/powerpc/cpu_info.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/tools/kvm/powerpc/cpu_info.c b/tools/kvm/powerpc/cpu_info.c index 11ca14e..a9dfe39 100644 --- a/tools/kvm/powerpc/cpu_info.c +++ b/tools/kvm/powerpc/cpu_info.c @@ -35,6 +35,20 @@ static struct cpu_info cpu_power7_info = { }, }; +/* POWER8 */ + +static struct cpu_info cpu_power8_info = { + .name = "POWER8", + .tb_freq = 51200, + .d_bsize = 128, + .i_bsize = 128, + .flags = CPUINFO_FLAG_DFP | CPUINFO_FLAG_VSX | CPUINFO_FLAG_VMX, + .mmu_info = { + .flags = KVM_PPC_PAGE_SIZES_REAL | KVM_PPC_1T_SEGMENTS, + .slb_size = 32, + }, +}; + /* PPC970/G5 */ static struct cpu_info cpu_970_info = { @@ -52,6 +66,7 @@ static struct pvr_info host_pvr_info[] = { { 0x, 0x0f03, &cpu_power7_info }, { 0x, 0x003f, &cpu_power7_info }, { 0x, 0x004a, &cpu_power7_info }, + { 0x, 0x004b, &cpu_power8_info }, { 0x, 0x0039, &cpu_970_info }, { 0x, 0x003c, &cpu_970_info }, { 0x, 0x0044, &cpu_970_info }, -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/06/2013 08:36 AM, Glauber Costa wrote: On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. I may be missing something, but how do we know how much time has passed? That is why I had the timer in there. I will go look again at the code but I thought the data was collected as ticks and passed at random times. The ticks are also accumulating so we are looking at the difference in the count between reads. Everything bigger the entitlement is steal time. I agree provided I know the amount of total time that the steal time was accumulated. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/4] Expand the steal time msr to also contain the consigned time.
On 02/06/2013 03:14 PM, Rik van Riel wrote: On 02/05/2013 04:49 PM, Michael Wolf wrote: Expand the steal time msr to also contain the consigned time. Signed-off-by: Michael Wolf --- arch/x86/include/asm/paravirt.h |4 ++-- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/kvm.c |7 ++- kernel/sched/core.c | 10 +- kernel/sched/cputime.c|2 +- 5 files changed, 15 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 5edd174..9b753ea 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -196,9 +196,9 @@ struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; -static inline u64 paravirt_steal_clock(int cpu) +static inline void paravirt_steal_clock(int cpu, u64 *steal) { -return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu); +PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal); } This may be a stupid question, but what happens if a KVM guest with this change, runs on a kernel that still has the old steal time interface? What happens if the host has the new steal time interface, but the guest uses the old interface? Will both cases continue to work as expected with your patch series? If so, could you document (in the source code) why things continue to work? I will test the scenarios you suggest and will report back the results. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] Add the code to send the consigned time from the host to the guest
On 02/06/2013 03:18 PM, Rik van Riel wrote: On 02/05/2013 04:49 PM, Michael Wolf wrote: Change the paravirt calls that retrieve the steal-time information from the host. Add to it getting the consigned value as well as the steal time. Signed-off-by: Michael Wolf diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 06fdbd9..55d617f 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -42,9 +42,10 @@ struct kvm_steal_time { __u64 steal; +__u64 consigned; __u32 version; __u32 flags; -__u32 pad[12]; +__u32 pad[10]; }; The function kvm_register_steal_time passes the address of such a structure to the host kernel, which then does something with it. Could running a guest with the above patch, on top of a host with the old code, result in the values for "version" and "flags" being written into "consigned"? yes, good point. Could that result in confusing the guest kernel to no end, and generally breaking things? Ok I will move the consigned field to be after the flags. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On 02/07/2013 02:46 AM, Glauber Costa wrote: On 02/06/2013 10:07 PM, Michael Wolf wrote: On 02/06/2013 08:36 AM, Glauber Costa wrote: On 02/06/2013 01:49 AM, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. Sorry: What is the business of a timer in here? Whenever we read steal time, we know how much time has passed and with that information we can know the entitlement for the period. This breaks if we suspend, but we know that we suspended, so this is not a problem. I may be missing something, but how do we know how much time has passed? That is why I had the timer in there. I will go look again at the code but I thought the data was collected as ticks and passed at random times. The ticks are also accumulating so we are looking at the difference in the count between reads. They can be collected at random times, but you can of course record the time in which it happened. ok. Let me add a previous_read field and take out the timer. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Transparent Huge Pages
Hello. I'm trying to understand how to use transparent huge pages (currently in x86). Before I used "explicit" huge pages alot (mostly about hugetlbfs), but it looked like THP should be easier so I gave it a try. This tiny program: - cut - #include #include #include #include #include #include #include int main(int argc, char **argv) { void *ptr; size_t len = argv[1] ? atoi(argv[1]) : 1024*1024*1024; /* no error checking! */ posix_memalign(&ptr, 2048*1024, len); madvise(ptr, len, MADV_HUGEPAGE); memset(ptr, 0, len); usleep(500); /* let khugepagesd do its work */ system("grep ^AnonHugePages: /proc/meminfo"); return 0; } - cut - which just tries to allocate some amount of RAM (1Gb by default) aligned to 2Mb, uses madvise(HUGEPAGE) on it, and checks /proc/meminfo for AnonHugePages. The problem is: I've never seen any value for AnonHugePages larger than about 16Mb. Usually it is around 10Mb or 8Mb, no matter how large the requested memory size is, including the default 1Gb. The question, obviously, is: why so small? My system (which is a few years old now) has 6Gb of RAM, it uses AMD Athlon II X2 260 CPU, and is running 3.2 kernel. Original question comes from grounds of of QEMU, which is supposed to use THP for guest memory, but it also does not use more than these ~10Mb, when allocating 1Gb to the guest. Thanks! /mjt -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: WARNING: at drivers/tty/tty_buffer.c:476 (tty is NULL)
On Wed, Jan 30, 2013 at 01:33:57PM -0500, Peter Hurley wrote: > On Sat, 2013-01-19 at 22:00 +0100, Jiri Slaby wrote: > > On 01/18/2013 10:07 PM, Greg Kroah-Hartman wrote: > > > Jiri, was there anything on the mailing list that I missed that should > > > have resolved this issue? I thought it was being worked on, but I can't > > > seem to find any resolution at the moment. > > > > Somebody had a patchset and promised to repost IIRC. I forgot his name > > though. > > > > /me back from digging in the mail history. > > > > Peter Hurley is the name. > > > > Peter, what happened with your patches in the end, please? > > The tty subsystem is very resilient to fixing :) > > The thing I'm working on right now -- which hopefully is the last issue > with the line discipline logic -- occurs with parallel __tty_hangup() > and tty_release(). > > At the moment, I'm trying to narrow the conditions when this happens. If it helps I seem to be able to reproduce this in just a few seconds by running trinity and hitting Ctrl-C to quit the watchdog. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] irq_dispose_mapping after irq request failure
On Mon, Feb 11, 2013 at 07:31:00AM +0200, Baruch Siach wrote: > Hi lkml, Hi Baruch, > The drivers/edac/mpc85xx_edac.c driver contains the following (abbreviated) > code snippet it its .probe: You dropped an important detail which is the preceeding line: pdata->irq = irq_of_parse_and_map(op->dev.of_node, 0); > res = devm_request_irq(&op->dev, pdata->irq, > mpc85xx_pci_isr, IRQF_DISABLED, > "[EDAC] PCI err", pci); > if (res < 0) { > irq_dispose_mapping(pdata->irq); > goto err2; > } > > Now, since the requested irq is already in use, and IRQF_SHARED is not set, > devm_request_irq errors() out, which is OK. Less OK is the > irq_dispose_mapping() call, which gives me this: > > EDAC PCI1: Giving out device to module 'MPC85xx_edac' controller > 'mpc85xx_pci_err': DEV 'ffe0a000.pcie' (INTERRUPT) > genirq: Flags mismatch irq 16. 0020 ([EDAC] PCI err) vs. 0020 ([EDAC] > PCI err) The hint here is to notice which other irq you're clashing with ^^ ie. yourself. Which is odd, that is the root of the problem. The badness you're getting from irq_dispose_mapping() is caused because you're disposing of that mapping which is currently still in use, by the same interrupt. That is caused by a "feature" in the irq mapping code, where if you ask to map an already mapped hwirq, it will give you back the same virq. So in your case when you called irq_of_parse_and_map() it noticed that someone had already mapped that hwirq, and gave you back an existing (in use) virq. > mpc85xx_pci_err_probe: Unable to requiest irq 16 for MPC85xx PCI err ^ While you're there, can you fix the typo :) > So, is irq_dispose_mapping() the right thing to do when irq request fails? It's the right thing to do to undo the effect of irq_create_mapping(), or in your case irq_of_parse_and_map(). It just falls down in this case, because you're inadvertently disposing of something that's still in use. > A simple grep shows that irq_dispose_mapping() calls are mostly limited to > powerpc code. Is there a reason for that? That's because the irq domain code began life as powerpc specific code. It's now become generic and will start to appear in more places. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: WARNING: at drivers/tty/tty_buffer.c:476 (tty is NULL)
On Mon, Feb 11, 2013 at 09:42:30AM -0500, Peter Hurley wrote: > Hi Michael, > > On Mon, 2013-02-11 at 13:44 +1100, Michael Ellerman wrote: > > On Wed, Jan 30, 2013 at 01:33:57PM -0500, Peter Hurley wrote: > > > On Sat, 2013-01-19 at 22:00 +0100, Jiri Slaby wrote: > > > > On 01/18/2013 10:07 PM, Greg Kroah-Hartman wrote: > > > > > Jiri, was there anything on the mailing list that I missed that should > > > > > have resolved this issue? I thought it was being worked on, but I > > > > > can't > > > > > seem to find any resolution at the moment. > > > > > > > > Somebody had a patchset and promised to repost IIRC. I forgot his name > > > > though. > > > > > > > > /me back from digging in the mail history. > > > > > > > > Peter Hurley is the name. > > > > > > > > Peter, what happened with your patches in the end, please? > > > > > > The tty subsystem is very resilient to fixing :) > > > > > > The thing I'm working on right now -- which hopefully is the last issue > > > with the line discipline logic -- occurs with parallel __tty_hangup() > > > and tty_release(). > > > > > > At the moment, I'm trying to narrow the conditions when this happens. > > > > If it helps I seem to be able to reproduce this in just a few seconds by >^^ > you get this WARNING from a parallel __tty_hangup? > > Or do you mean simply that you get this WARNING? Sorry I just mean I'm seeing the warning, I have no idea how. > > running trinity and hitting Ctrl-C to quit the watchdog. > > Can you reproduce after using the following patch series? > [PATCH v3 00/23] ldisc fixes What are they against? I tried Linus' tree and linux-next but they didn't apply cleanly against either. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG] irq_dispose_mapping after irq request failure
On Tue, Feb 12, 2013 at 11:51:13AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2013-02-11 at 20:52 +, Grant Likely wrote: > > Really the irq mappings should be using reference counting. The existing > > code is naive on this count and just releases the irq on the first call > > to irq_dispose_mapping(). I've not gotten around to fixing that. Anyone > > want to take that task on? > > Is this the best approach ? > > The original idea was that there was no point disposing of mappings in most > cases and keeping the mapping around would provide a bit of stability of > interrupt numbers which might come in handy for debugging etc... > > The few cases where disposing of a mapping might be useful is if the > underlying > physical interrupts completely disappear, as in a cascaded controller gets > removed or that sort of thing, which is a very rare case... And even then... That may have been the intent, but we forgot to tell driver writers, ourselves included. > Could you just make irq_dispose_mapping() check if the irq desc is > active and fail/WARN/BUG if it is ? I don't see the point of adding a > refcount, > that feels overkill. I don't think you can, "active" is not well defined. Other code may have done nothing other than create the mapping and remembered the virq, which will break if you destroy the mapping. Or? I agree refcounting is not fun. It'll end up with the same mess as of_node_get/put() where practically every 2nd piece of code leaks references. I guess we can't go the other way, and say that mapping the same hwirq twice is an error. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: WARNING: at drivers/tty/tty_buffer.c:476 (tty is NULL)
On Mon, Feb 11, 2013 at 09:53:58PM -0500, Peter Hurley wrote: > On Tue, 2013-02-12 at 13:00 +1100, Michael Ellerman wrote: > > > Can you reproduce after using the following patch series? > > > [PATCH v3 00/23] ldisc fixes > > > > What are they against? I tried Linus' tree and linux-next but they > > didn't apply cleanly against either. > > The series was generated against next-20130204. 13/23 doesn't apply > cleanly at next-20130211 because of changes since. Rebases fine > though :) Should have tried harder against next, git-am is very picky. So running next-20130211, 20 runs of trinity followed by ctrl-c, I saw the warning 12 times. Back trace is basically always: Call Trace: [c0027a1efb20] [c0467b74] .flush_to_ldisc+0x244/0x250 (unreliable) [c0027a1efbd0] [c009ea7c] .process_one_work+0x1bc/0x4f0 [c0027a1efc70] [c009f2e0] .worker_thread+0x180/0x4b0 [c0027a1efd30] [c00a6c3c] .kthread+0xec/0x100 [c0027a1efe30] [c0009f64] .ret_from_kernel_thread+0x64/0x80 Running next-20130211 + "ldisc fixes", 20 runs of trinity followed by ctrl-c, I saw the warning _0_ times. Insert standard comment about testing only proving the presence of bugs not their absence :) So it looks promising. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/02/2013 11:23 AM, Alex Shi wrote: [snip] > > [patch v3 1/8] Revert "sched: Introduce temporary FAIR_GROUP_SCHED > [patch v3 2/8] sched: set initial value of runnable avg for new > [patch v3 3/8] sched: only count runnable avg on cfs_rq's nr_running > [patch v3 4/8] sched: update cpu load after task_tick. > [patch v3 5/8] sched: compute runnable load avg in cpu_load and > [patch v3 6/8] sched: consider runnable load average in move_tasks > [patch v3 7/8] sched: consider runnable load average in > [patch v3 8/8] sched: use instant load for burst wake up I've tested the patch set on 12 cpu X86 box with 3.9.0-rc2, and pgbench show regression on high-end this time. | db_size | clients | tps | | tps | +-+-+---+ +---+ | 22 MB | 1 | 10662 | | 10446 | | 22 MB | 2 | 21483 | | 20887 | | 22 MB | 4 | 42046 | | 41266 | | 22 MB | 8 | 55807 | | 51987 | | 22 MB | 12 | 50768 | | 50974 | | 22 MB | 16 | 49880 | | 49510 | | 22 MB | 24 | 45904 | | 42398 | | 22 MB | 32 | 43420 | | 40995 | | 7484 MB | 1 | 7965 | | 7376 | | 7484 MB | 2 | 19354 | | 19149 | | 7484 MB | 4 | 37552 | | 37458 | | 7484 MB | 8 | 48655 | | 46618 | | 7484 MB | 12 | 45778 | | 45756 | | 7484 MB | 16 | 45659 | | 44911 | | 7484 MB | 24 | 42192 | | 37185 | -11.87% | 7484 MB | 32 | 36385 | | 34447 | | 15 GB | 1 | 7677 | | 7359 | | 15 GB | 2 | 19227 | | 19049 | | 15 GB | 4 | 37335 | | 36947 | | 15 GB | 8 | 48130 | | 46898 | | 15 GB | 12 | 45393 | | 43986 | | 15 GB | 16 | 45110 | | 45719 | | 15 GB | 24 | 41415 | | 36813 | -11.11% | 15 GB | 32 | 35988 | | 34025 | The reason may caused by wake_affine()'s higher overhead, and pgbench is really sensitive to this stuff... Regards, Michael Wang > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/02/2013 04:34 PM, Mike Galbraith wrote: [snip] >> The reason may caused by wake_affine()'s higher overhead, and pgbench is >> really sensitive to this stuff... > > For grins, you could try running the whole thing SCHED_BATCH. (/me sees > singing/dancing red herring whenever wake_affine() and pgbench appear in > the same sentence;) I saw the patch touched the wake_affine(), just interested on what will happen ;-) The patch changed the overhead of wake_affine(), and also influence it's result, I used to think the later one may do some help to the pgbench... Regards, Michael Wang > > -Mike > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/02/2013 04:35 PM, Alex Shi wrote: > On 04/02/2013 03:23 PM, Michael Wang wrote: [snip] >> >> The reason may caused by wake_affine()'s higher overhead, and pgbench is >> really sensitive to this stuff... > > Thanks for testing. Could you like to remove the last patch and test it > again? I want to know if the last patch has effect on pgbench. Amazing, without the last one, pgbench show very good improvement, higher than 10ms throttle, lower than 100ms throttle, I need confirm this with a night-through testing. I will look into those patches in detail later, I think it addressed part of the wake_affine() issue (make the decision more accurately), that's nice ;-) Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/Resend 2/2] arm: mach-omap2: prevent UART console idle on suspend while using "no_console_suspend"
fter the device is built from dt. > > arch/arm/mach-omap2/omap_device.c |7 ++- > arch/arm/mach-omap2/serial.c |4 +++- > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/arch/arm/mach-omap2/omap_device.c > b/arch/arm/mach-omap2/omap_device.c > index e065daa..f4ebf9f 100644 > --- a/arch/arm/mach-omap2/omap_device.c > +++ b/arch/arm/mach-omap2/omap_device.c > @@ -96,6 +96,9 @@ > #define USE_WAKEUP_LAT0 > #define IGNORE_WAKEUP_LAT1 > > +extern u8 no_console_suspend; > +extern char console_uart_name[]; > + > static int omap_early_device_register(struct platform_device *pdev); > > static struct omap_device_pm_latency omap_default_latency[] = { > @@ -372,7 +375,9 @@ static int omap_device_build_from_dt(struct > platform_device *pdev) > r->name = dev_name(&pdev->dev); > } > > -if (of_get_property(node, "ti,no_idle_on_suspend", NULL)) > +if (no_console_suspend && !strcmp(oh->name, console_uart_name)) > +omap_device_disable_idle_on_suspend(pdev); > +else if (of_get_property(node, "ti,no_idle_on_suspend", NULL)) > omap_device_disable_idle_on_suspend(pdev); Why do not use some flags instead of external variable? > > pdev->dev.pm_domain = &omap_device_pm_domain; > diff --git a/arch/arm/mach-omap2/serial.c b/arch/arm/mach-omap2/serial.c > index 037e691..f841ab5 100644 > --- a/arch/arm/mach-omap2/serial.c > +++ b/arch/arm/mach-omap2/serial.c > @@ -63,8 +63,9 @@ struct omap_uart_state { > static LIST_HEAD(uart_list); > static u8 num_uarts; > static u8 console_uart_id = -1; > -static u8 no_console_suspend; > static u8 uart_debug; > +u8 no_console_suspend; > +char console_uart_name[MAX_UART_HWMOD_NAME_LEN]; > > #define DEFAULT_RXDMA_POLLRATE1/* RX DMA polling rate (us) */ > #define DEFAULT_RXDMA_BUFSIZE4096/* RX DMA buffer size */ > @@ -199,6 +200,7 @@ static int __init omap_serial_early_init(void) > "%s%d", OMAP_SERIAL_NAME, uart->num); > > if (cmdline_find_option(uart_name)) { > +strcpy(console_uart_name, oh_name); > console_uart_id = uart->num; > > if (console_loglevel >= 10) { Michael -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH/Resend 2/2] arm: mach-omap2: prevent UART console idle on suspend while using "no_console_suspend"
Hi On 02/04/13 12:39, Sourav Poddar wrote: > Hi, > On Tuesday 02 April 2013 03:36 PM, Michael Trimarchi wrote: >> Hi >> >> On 02/04/13 11:50, Sourav Poddar wrote: >>> Hi Kevin, >>> On Wednesday 20 March 2013 05:36 PM, Sourav Poddar wrote: >>>> Realised the list to whom the patch was send got dropped. Ccing them all.. >>>> On Wednesday 20 March 2013 05:18 PM, Sourav Poddar wrote: >>>>> Hi Kevin, >>>>> On Tuesday 19 March 2013 12:24 AM, Kevin Hilman wrote: >>>>>> Sourav Poddar writes: >>>>>> >>>>>>> With dt boot, uart wakeup after suspend is non functional on omap4/5 >>>>>>> while using >>>>>>> "no_console_suspend" in the bootargs. With "no_console_suspend" used, >>>>>>> od->flags >>>>>>> should be ORed with "OMAP_DEVICE_NO_IDLE_ON_SUSPEND", thereby not >>>>>>> allowing the console >>>>>>> to idle in the suspend path. For non-dt case, this was taken care by >>>>>>> platform data. >>>>>>> >>>>>>> Tested on omap5430evm, omap4430sdp. >>>>>>> >>>>>>> Cc: Santosh Shilimkar >>>>>>> Cc: Felipe Balbi >>>>>>> Cc: Rajendra nayak >>>>>>> Signed-off-by: Sourav Poddar >>>>>> This patch creates a dependency between omap_device (generic, >>>>>> device-independent code) and a specific driver (UART.) >>>>>> >>>>>> If you need to do something like this that's DT boot specific, then >>>>>> we probably need some late initcall in serial.c to handle this. It does >>>>>> not belong in omap_device. >>>>>> >>>>> The following function "omap_device_disable_idle_on_suspend(pdev)" should >>>>> only >>>>> be called once the omap device has been build, which in the case of >>>>> device tree is >>>>> done in omap_device.c file. Moreover, the above call should be executed >>>>> conditionally >>>>> and should depend on the following two parameter. >>>>> >>>>> [1] a. Whether "no_console_suspend" is set and >>>>> b. the device build is a console uart. >>>>> >>>>> When I look closely into the serial.c file, I realised that >>>>> "core_initcall(omap_serial_early_init)" gets called irrespective >>>>> of dt/non dt boot and will take care of most of the stuff(checking whether >>>>> "no_console_suspend" is used and which uart is used as a console uart) >>>>> which the >>>>> $subject patch is proposing. >>>>> >>>>> But the problem is that we need to exchange the parsed information >>>>> from serial.c to the omap_device file for the condtional execution of >>>>> "omap_device_disable_idle_on_suspend" >>>>> >>>>> In this case, >>>>> from "serial.c" we need >>>>> 1. no_console_suspend = true >>>>> 2. strcpy(console_name, oh_name), where oh_name corresponds to the >>>>> console uart. >>>>> >>>>> then in "omap_device.c" do >>>>> if (no_console_suspend&& !strcmp(oh->name, console_name)) >>>>> omap_device_disable_idle_on_suspend(pdev); >>>>> >>>>> Please correct if I am understanding it incorrectly. >>>>> >>>>> If the above understanding looks good to you, is there a way we can make >>>>> this >>>>> exchange of information happen between serial.c and omap_device.c file? >>> Any input on this? >>> As I explained earlier, that there is a need to parse information in >>> serial.c and use that in >>> omap_device.c only after the device is build. >>> >>> Below is the patch (inlined) which further explains my point. The patch is >>> "just for the >>> idea" I am trying to express. >>> I have used extern variables to exchange information between serial.c and >>> omap_device.c. >>> Is there is a better way, we can do this "information exchange" without >>> using extern variables? >>> >>> >>> - >>> From: Sourav Po
Re: [RFC 4/4] mm: Enhance per process reclaim
Minchan, On Mon, Mar 25, 2013 at 7:21 AM, Minchan Kim wrote: > > Some pages could be shared by several processes. (ex, libc) > In case of that, it's too bad to reclaim them from the beginnig. > > This patch causes VM to keep them on memory until last task > try to reclaim them so shared pages will be reclaimed only if > all of task has gone swapping out. > > This feature doesn't handle non-linear mapping on ramfs because > it's very time-consuming and doesn't make sure of reclaiming and > not common. Against what tree does this patch apply? I've tries various trees, including MMOTM of 26 March, and encounter this error: CC mm/ksm.o mm/ksm.c: In function ‘try_to_unmap_ksm’: mm/ksm.c:1970:32: error: ‘vma’ undeclared (first use in this function) mm/ksm.c:1970:32: note: each undeclared identifier is reported only once for each function it appears in make[1]: *** [mm/ksm.o] Error 1 make: *** [mm] Error 2 Cheers, Michael > Signed-off-by: Sangseok Lee > Signed-off-by: Minchan Kim > --- > fs/proc/task_mmu.c | 2 +- > include/linux/ksm.h | 6 -- > include/linux/rmap.h | 8 +--- > mm/ksm.c | 9 +++- > mm/memory-failure.c | 2 +- > mm/migrate.c | 6 -- > mm/rmap.c| 58 > +--- > mm/vmscan.c | 14 +++-- > 8 files changed, 77 insertions(+), 28 deletions(-) > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index c3713a4..7f6aaf5 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -1154,7 +1154,7 @@ cont: > break; > } > pte_unmap_unlock(pte - 1, ptl); > - reclaim_pages_from_list(&page_list); > + reclaim_pages_from_list(&page_list, vma); > if (addr != end) > goto cont; > > diff --git a/include/linux/ksm.h b/include/linux/ksm.h > index 45c9b6a..d8e556b 100644 > --- a/include/linux/ksm.h > +++ b/include/linux/ksm.h > @@ -75,7 +75,8 @@ struct page *ksm_might_need_to_copy(struct page *page, > > int page_referenced_ksm(struct page *page, > struct mem_cgroup *memcg, unsigned long *vm_flags); > -int try_to_unmap_ksm(struct page *page, enum ttu_flags flags); > +int try_to_unmap_ksm(struct page *page, > + enum ttu_flags flags, struct vm_area_struct *vma); > int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *, > struct vm_area_struct *, unsigned long, void *), void *arg); > void ksm_migrate_page(struct page *newpage, struct page *oldpage); > @@ -115,7 +116,8 @@ static inline int page_referenced_ksm(struct page *page, > return 0; > } > > -static inline int try_to_unmap_ksm(struct page *page, enum ttu_flags flags) > +static inline int try_to_unmap_ksm(struct page *page, > + enum ttu_flags flags, struct vm_area_struct > *target_vma) > { > return 0; > } > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index a24e34e..6c7d030 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -12,7 +12,8 @@ > > extern int isolate_lru_page(struct page *page); > extern void putback_lru_page(struct page *page); > -extern unsigned long reclaim_pages_from_list(struct list_head *page_list); > +extern unsigned long reclaim_pages_from_list(struct list_head *page_list, > +struct vm_area_struct *vma); > > /* > * The anon_vma heads a list of private "related" vmas, to scan if > @@ -192,7 +193,8 @@ int page_referenced_one(struct page *, struct > vm_area_struct *, > > #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK) > > -int try_to_unmap(struct page *, enum ttu_flags flags); > +int try_to_unmap(struct page *, enum ttu_flags flags, > + struct vm_area_struct *vma); > int try_to_unmap_one(struct page *, struct vm_area_struct *, > unsigned long address, enum ttu_flags flags); > > @@ -259,7 +261,7 @@ static inline int page_referenced(struct page *page, int > is_locked, > return 0; > } > > -#define try_to_unmap(page, refs) SWAP_FAIL > +#define try_to_unmap(page, refs, vma) SWAP_FAIL > > static inline int page_mkclean(struct page *page) > { > diff --git a/mm/ksm.c b/mm/ksm.c > index 7f629e4..1a90d13 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1949,7 +1949,8 @@ out: > return referenced; > } > > -int try_to_unmap_ksm(struct page *page, enum ttu_flags flags) > +int try_to_unmap_ksm(struct page *page, enum ttu_flags flags, > + struct vm_area_struct *target_vma) > { >
Re: [PATCHv2] arm: mach-omap2: prevent UART console idle on suspend while using "no_console_suspend"
Hi On 02/04/13 15:28, Sourav Poddar wrote: > With dt boot, uart wakeup after suspend is non functional while using > "no_console_suspend" in the bootargs. With "no_console_suspend" used, > od->flags > should be ORed with "OMAP_DEVICE_NO_IDLE_ON_SUSPEND", thereby not allowing > the console > to idle in the suspend path. > > Tested on omap5430evm, omap4430sdp. > > Cc: Santosh Shilimkar > Cc: Felipe Balbi > Cc: Rajendra nayak > Signed-off-by: Sourav Poddar > --- > v1->v2 > These patches were sent before as a series[1], but realised > "core_initcall(omap_serial_early_init)" in serial.c get executed > irrespective of dt or non dt boot and it will do most of the stuff > for us. > As suggested by Kevin Hilman in the previous version, this patch will > also prevent creating dependency between omap_device > (generic device-independent code) and a specific driver (UART). > > [1]: http://lkml.org/lkml/2013/3/18/294 > > arch/arm/mach-omap2/omap_device.c |5 +++-- > arch/arm/mach-omap2/omap_hwmod.h |5 + > arch/arm/mach-omap2/serial.c |4 +++- > 3 files changed, 11 insertions(+), 3 deletions(-) > > diff --git a/arch/arm/mach-omap2/omap_device.c > b/arch/arm/mach-omap2/omap_device.c > index 381be7a..89be64d 100644 > --- a/arch/arm/mach-omap2/omap_device.c > +++ b/arch/arm/mach-omap2/omap_device.c > @@ -170,8 +170,9 @@ static int omap_device_build_from_dt(struct > platform_device *pdev) > r->name = dev_name(&pdev->dev); > } > > - if (of_get_property(node, "ti,no_idle_on_suspend", NULL)) > - omap_device_disable_idle_on_suspend(pdev); > + if (oh->flags & HWMOD_DISABLE_IDLE_ON_SUSPEND || > + of_get_property(node, "ti,no_idle_on_suspend", NULL)) > + omap_device_disable_idle_on_suspend(pdev); > > pdev->dev.pm_domain = &omap_device_pm_domain; > > diff --git a/arch/arm/mach-omap2/omap_hwmod.h > b/arch/arm/mach-omap2/omap_hwmod.h > index d43d9b6..50e6145 100644 > --- a/arch/arm/mach-omap2/omap_hwmod.h > +++ b/arch/arm/mach-omap2/omap_hwmod.h > @@ -459,6 +459,10 @@ struct omap_hwmod_omap4_prcm { > * correctly, or this is being abused to deal with some PM latency > * issues -- but we're currently suffering from a shortage of > * folks who are able to track these issues down properly. > + * HWMOD_DISABLE_IDLE_ON_SUSPEND: don't idle this module on suspend. This is > + * needed for uart controller, which requires its clock not to be cut > + * during suspend while using "no_console_suspend" in bootargs with > + * device tree boot. > */ > #define HWMOD_SWSUP_SIDLE(1 << 0) > #define HWMOD_SWSUP_MSTANDBY (1 << 1) > @@ -471,6 +475,7 @@ struct omap_hwmod_omap4_prcm { > #define HWMOD_16BIT_REG (1 << 8) > #define HWMOD_EXT_OPT_MAIN_CLK (1 << 9) > #define HWMOD_BLOCK_WFI (1 << 10) > +#define HWMOD_DISABLE_IDLE_ON_SUSPEND(1 << 12) Just a comment more. Why 12 and not 11? Michael > > /* > * omap_hwmod._int_flags definitions > diff --git a/arch/arm/mach-omap2/serial.c b/arch/arm/mach-omap2/serial.c > index 8396b5b..adbafbd 100644 > --- a/arch/arm/mach-omap2/serial.c > +++ b/arch/arm/mach-omap2/serial.c > @@ -236,8 +236,10 @@ static int __init omap_serial_early_init(void) > uart_name, uart->num); > } > > - if (cmdline_find_option("no_console_suspend")) > + if (cmdline_find_option("no_console_suspend")) { > no_console_suspend = true; > + oh->flags |= HWMOD_DISABLE_IDLE_ON_SUSPEND; > + } > > /* >* omap-uart can be used for earlyprintk logs > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/02/2013 04:35 PM, Alex Shi wrote: [snip] >> >> The reason may caused by wake_affine()'s higher overhead, and pgbench is >> really sensitive to this stuff... > > Thanks for testing. Could you like to remove the last patch and test it > again? I want to know if the last patch has effect on pgbench. Done, here the results of pgbench without the last patch on my box: | db_size | clients | tps | | tps | +-+-+---+ +---+ | 22 MB | 1 | 10662 | | 10679 | | 22 MB | 2 | 21483 | | 21471 | | 22 MB | 4 | 42046 | | 41957 | | 22 MB | 8 | 55807 | | 55684 | | 22 MB | 12 | 50768 | | 52074 | | 22 MB | 16 | 49880 | | 52879 | | 22 MB | 24 | 45904 | | 53406 | | 22 MB | 32 | 43420 | | 54088 | +24.57% | 7484 MB | 1 | 7965 | | 7725 | | 7484 MB | 2 | 19354 | | 19405 | | 7484 MB | 4 | 37552 | | 37246 | | 7484 MB | 8 | 48655 | | 50613 | | 7484 MB | 12 | 45778 | | 47639 | | 7484 MB | 16 | 45659 | | 48707 | | 7484 MB | 24 | 42192 | | 46469 | | 7484 MB | 32 | 36385 | | 46346 | +27.38% | 15 GB | 1 | 7677 | | 7727 | | 15 GB | 2 | 19227 | | 19199 | | 15 GB | 4 | 37335 | | 37372 | | 15 GB | 8 | 48130 | | 50333 | | 15 GB | 12 | 45393 | | 47590 | | 15 GB | 16 | 45110 | | 48091 | | 15 GB | 24 | 41415 | | 47415 | | 15 GB | 32 | 35988 | | 45749 | +27.12% Very nice improvement, I'd like to test it with the wake-affine throttle patch later, let's see what will happen ;-) Any idea on why the last one caused the regression? Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/03/2013 10:56 AM, Alex Shi wrote: > On 04/03/2013 10:46 AM, Michael Wang wrote: >> | 15 GB | 16 | 45110 | | 48091 | >> | 15 GB | 24 | 41415 | | 47415 | >> | 15 GB | 32 | 35988 | | 45749 |+27.12% >> >> Very nice improvement, I'd like to test it with the wake-affine throttle >> patch later, let's see what will happen ;-) >> >> Any idea on why the last one caused the regression? > > you can change the burst threshold: sysctl_sched_migration_cost, to see > what's happen with different value. create a similar knob and tune it. > + > + if (cpu_rq(this_cpu)->avg_idle < sysctl_sched_migration_cost) > + burst_this = 1; > + if (cpu_rq(prev_cpu)->avg_idle < sysctl_sched_migration_cost) > + burst_prev = 1; > + > > This changing the rate of adopt cpu_rq(cpu)->load.weight, correct? So if rq is busy, cpu_rq(cpu)->load.weight is capable enough to stand for the load status of rq? what's the really idea here? > BTW, what's the job thread behaviour of pgbench, guess it has lots of > wakeup. what's the work and sleep ratio of pgbench? I won't do the summary unless I reviewed it's code :) what I know is, it's a database benchmark, with several process operating database, see below one for details: pgbench is a simple program for running benchmark tests on PostgreSQL. It runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second). By default, pgbench tests a scenario that is loosely based on TPC-B, involving five SELECT, UPDATE, and INSERT commands per transaction. However, it is easy to test other cases by writing your own transaction script files. Regards, Michael Wang > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/03/2013 12:28 PM, Alex Shi wrote: [snip] > > but the patch may cause some unfairness if this/prev cpu are not burst at > same time. So could like try the following patch? I will try it later, some doubt below :) [snip] > + > + if (cpu_rq(this_cpu)->avg_idle < sysctl_sched_migration_cost || > + cpu_rq(prev_cpu)->avg_idle < sysctl_sched_migration_cost) > + burst= 1; > + > + /* use instant load for bursty waking up */ > + if (!burst) { > + load = source_load(prev_cpu, idx); > + this_load = target_load(this_cpu, idx); > + } else { > + load = cpu_rq(prev_cpu)->load.weight; > + this_load = cpu_rq(this_cpu)->load.weight; Ok, my understanding is, we want pull if 'prev_cpu' is burst, and don't want pull if 'this_cpu' is burst, correct? And we do this by guess the load higher or lower, is that right? And I think target_load() is capable enough to choose the higher load, if 'cpu_rq(cpu)->load.weight' is really higher, the results will be the same. So what about this: /* prefer higher load if burst */ load = burst_prev ? target_load(prev_cpu, idx) : source_load(prev_cpu, idx); this_load = target_load(this_cpu, idx); Regards, Michael Wang > + } > > /* >* If sync wakeup then subtract the (maximum possible) > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v3 0/8] sched: use runnable avg in load balance
On 04/03/2013 01:38 PM, Michael Wang wrote: > On 04/03/2013 12:28 PM, Alex Shi wrote: > [snip] >> >> but the patch may cause some unfairness if this/prev cpu are not burst at >> same time. So could like try the following patch? > > I will try it later, some doubt below :) > > [snip] >> + >> +if (cpu_rq(this_cpu)->avg_idle < sysctl_sched_migration_cost || >> +cpu_rq(prev_cpu)->avg_idle < sysctl_sched_migration_cost) >> +burst= 1; >> + >> +/* use instant load for bursty waking up */ >> +if (!burst) { >> +load = source_load(prev_cpu, idx); >> +this_load = target_load(this_cpu, idx); >> +} else { >> +load = cpu_rq(prev_cpu)->load.weight; >> +this_load = cpu_rq(this_cpu)->load.weight; > > Ok, my understanding is, we want pull if 'prev_cpu' is burst, and don't > want pull if 'this_cpu' is burst, correct? > > And we do this by guess the load higher or lower, is that right? > > And I think target_load() is capable enough to choose the higher load, > if 'cpu_rq(cpu)->load.weight' is really higher, the results will be the > same. > > So what about this: > > /* prefer higher load if burst */ > load = burst_prev ? And this check could also be: load = burst_prev && !burst_this ? if we don't prefer the pull when this_cpu also bursted. Regards, Michael Wang > target_load(prev_cpu, idx) : source_load(prev_cpu, idx); > > this_load = target_load(this_cpu, idx); > > Regards, > Michael Wang > >> +} >> >> /* >> * If sync wakeup then subtract the (maximum possible) >> > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/