Re: [PATCH] gpio: Do not accept gpio chip additions before gpiolib initialization

2016-03-30 Thread Alexandre Courbot
On Wed, Mar 30, 2016 at 6:16 PM, Guenter Roeck  wrote:
> On 03/30/2016 01:37 AM, Alexandre Courbot wrote:
>>
>> On Wed, Mar 30, 2016 at 3:20 AM, Guenter Roeck  wrote:
>>>
>>> Since commit ff2b13592299 ("gpio: make the gpiochip a real device"),
>>> attempts to add a gpio chip prior to gpiolib initialization cause the
>>> system to crash. Dump a warning to the console and return an error
>>> if the situation is encountered.
>>
>>
>> Mmm I see the problem but this could seriously delay the availability
>> of some GPIOs that are useful for early system boot.
>>
>> I have not followed the GPIO device patches as closely as I should
>> have, but shouldn't you be able to register a GPIO chip without
>> immediately presenting it to user-space, for internal kernel needs? If
>> gpiolib is not initialized, then device-related operations would be
>> skipped, and gpiolib_dev_init() could then parse the list of
>> registered chips and fix them up when it gets called.
>>
>> Again, I'm speaking without real knowledge here, but that pattern
>> seems more resilient to me.
>>
> You are absolutely right, but my knowledge of gpiolib is not good enough
> to make that change. See this as a band-aid; it is better than just
> crashing.

Actually, the following may be simpler:

Why not add a check in gpiochip_add_data() that directly calls
gpiolib_dev_init() if required? Then gpiolib_dev_init() could also
check whether it has already been called in that context and become a
no-op when it is later called from core_initcall. Is there
anything that would prevent this from being a viable fix?
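A rough sketch of what that could look like (illustrative only; the real gpiolib
symbols and the __init section handling need more care than shown here):

/* Illustrative sketch of the idea only -- not the actual gpiolib code. */
static bool gpiolib_initialized;

static int gpiolib_dev_init(void)
{
	int ret;

	if (gpiolib_initialized)
		return 0;	/* later call from core_initcall becomes a no-op */

	ret = bus_register(&gpio_bus_type);
	if (ret)
		return ret;

	ret = alloc_chrdev_region(&gpio_devt, 0, GPIO_DEV_MAX, "gpiochip");
	if (!ret)
		gpiolib_initialized = true;
	return ret;
}
core_initcall(gpiolib_dev_init);

int gpiochip_add_data(struct gpio_chip *chip, void *data)
{
	int ret = gpiolib_dev_init();	/* early callers trigger init themselves */

	if (ret)
		return ret;
	/* ... existing chip registration ... */
	return 0;
}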


Re: [PATCH] net: mvneta: explicitly disable BM on 64bit platform

2016-03-30 Thread Jisheng Zhang
Hi Gregory,

On Wed, 30 Mar 2016 17:11:41 +0200 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On Wed., March 30 2016, Jisheng Zhang  wrote:
> 
> > The mvneta BM can't work on 64-bit platforms, as the BM hardware expects
> > the buffer's virtual address to be placed in the first four bytes of the
> > mapped buffer, but obviously a virtual address on a 64-bit platform can't
> > be stored in 4 bytes. So we have to explicitly disable the BM on 64-bit
> > platforms.  
> 
> Actually mvneta is used on the Armada 3700, which is a 64-bit platform.
> It is true that the driver needs some changes to use the BM on 64 bits, but
> we don't have to disable it.
> 
> Here is the 64-bit part of the patch we currently have on the hardware
> prototype. We have more things which are really related to the way the
> mvneta is connected to the Armada 3700 SoC. This code was not ready for

Thanks for sharing.

I think we could commit the easy parts first, for example the hardcoded
cacheline size, using either that piece of your diff or my version:

http://lists.infradead.org/pipermail/linux-arm-kernel/2016-March/418513.html

> mainline but I prefer to share it now instead of having the HWBM blindly

I have looked through the diff; it is for the driver itself on 64-bit platforms,
and it doesn't touch the BM. The BM itself needs to be disabled for 64-bit; I'm
not sure the BM could work on 64-bit even with your diff. Per my understanding,
the BM can't work on 64-bit. Let's look at this piece of
mvneta_bm_construct():

*(u32 *)buf = (u32)buf;

Am I misunderstanding?

Thanks,
Jisheng
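To make the concern concrete, here is a small stand-alone sketch (not driver
code): on a 64-bit system the cast keeps only the low 32 bits of the virtual
address, so the value written into the buffer can no longer recover the
original buffer address.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-alone demonstration of why "*(u32 *)buf = (u32)buf;" is lossy on
 * 64-bit: only the low 32 bits of the virtual address survive the store. */
int main(void)
{
	void *buf = malloc(256);

	/* mimics what mvneta_bm_construct() does with the buffer address */
	*(uint32_t *)buf = (uint32_t)(uintptr_t)buf;

	printf("real address:     %p\n", buf);
	printf("stored (32 bits): 0x%08x\n", *(uint32_t *)buf);

	free(buf);
	return 0;
}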

> disabled for 64-bit platforms:
> 
> --- a/drivers/net/ethernet/marvell/Kconfig
> +++ b/drivers/net/ethernet/marvell/Kconfig
> @@ -55,7 +55,7 @@ config MVNETA_BM_ENABLE
>  
>  config MVNETA
>   tristate "Marvell Armada 370/38x/XP network interface support"
> - depends on PLAT_ORION
> + depends on ARCH_MVEBU || COMPILE_TEST
>   select MVMDIO
>   select FIXED_PHY
>   ---help---
> diff --git a/drivers/net/ethernet/marvell/mvneta.c 
> b/drivers/net/ethernet/marvell/mvneta.c
> index 577f7ca7deba..6929ad112b64 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -260,7 +260,7 @@
>  
>  #define MVNETA_VLAN_TAG_LEN 4
>  
> -#define MVNETA_CPU_D_CACHE_LINE_SIZE32
> +#define MVNETA_CPU_D_CACHE_LINE_SIZEcache_line_size()
>  #define MVNETA_TX_CSUM_DEF_SIZE  1600
>  #define MVNETA_TX_CSUM_MAX_SIZE  9800
>  #define MVNETA_ACC_MODE_EXT1 1
> @@ -297,6 +297,12 @@
>  /* descriptor aligned size */
>  #define MVNETA_DESC_ALIGNED_SIZE 32
>  
> +/* Number of bytes to be taken into account by HW when putting incoming data
> + * to the buffers. It is needed in case NET_SKB_PAD exceeds maximum packet
> + * offset supported in MVNETA_RXQ_CONFIG_REG(q) registers.
> + */
> +#define MVNETA_RX_PKT_OFFSET_CORRECTION  64
> +
>  #define MVNETA_RX_PKT_SIZE(mtu) \
>   ALIGN((mtu) + MVNETA_MH_SIZE + MVNETA_VLAN_TAG_LEN + \
> ETH_HLEN + ETH_FCS_LEN,\
> @@ -417,6 +423,10 @@ struct mvneta_port {
>   u64 ethtool_stats[ARRAY_SIZE(mvneta_statistics)];
>  
>   u32 indir[MVNETA_RSS_LU_TABLE_SIZE];
> +#ifdef CONFIG_64BIT
> + u64 data_high;
> +#endif
> + u16 rx_offset_correction;
>  };
>  
>  /* The mvneta_tx_desc and mvneta_rx_desc structures describe the
> @@ -961,7 +971,9 @@ static int mvneta_bm_port_init(struct platform_device 
> *pdev,
>  struct mvneta_port *pp)
>  {
>   struct device_node *dn = pdev->dev.of_node;
> - u32 long_pool_id, short_pool_id, wsize;
> + u32 long_pool_id, short_pool_id;
> +#ifndef CONFIG_64BIT
> + u32 wsize;
>   u8 target, attr;
>   int err;
>  
> @@ -985,7 +997,7 @@ static int mvneta_bm_port_init(struct platform_device 
> *pdev,
>   netdev_info(pp->dev, "missing long pool id\n");
>   return -EINVAL;
>   }
> -
> +#endif
>   /* Create port's long pool depending on mtu */
>   pp->pool_long = mvneta_bm_pool_use(pp->bm_priv, long_pool_id,
>  MVNETA_BM_LONG, pp->id,
> @@ -1790,6 +1802,10 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>   if (!data)
>   return -ENOMEM;
>  
> +#ifdef CONFIG_64BIT
> + if (unlikely(pp->data_high != ((u64)data & 0x)))
> + return -ENOMEM;
> +#endif
>   phys_addr = dma_map_single(pp->dev->dev.parent, data,
>  MVNETA_RX_BUF_SIZE(pp->pkt_size),
>  DMA_FROM_DEVICE);
> @@ -1798,7 +1814,8 @@ static int mvneta_rx_refill(struct mvneta_port *pp,
>   return -ENOMEM;
>   }
>  
> - mvneta_rx_desc_fill(rx_desc, phys_addr, (u32)data);
> + phys_addr += pp->rx_offset_correction;
> + mvneta_rx_desc_fill(rx_desc, phys_addr, (uintptr_t)data);
>   return 0;
>  }
>  
> @@ -1860,8 +1877,16 @@ static void mvneta_rxq_drop_pkts(struct 

Re: zram: per-cpu compression streams

2016-03-30 Thread Minchan Kim
Hello Sergey,

On Thu, Mar 31, 2016 at 10:26:26AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/31/16 07:12), Minchan Kim wrote:
> [..]
> > > I used a bit different script. no `buffer_compress_percentage' option,
> > > because it provides "a mix of random data and zeroes"
> > 
> > Normally, zram's compression ratio is 3 or 2, so I used it.
> > Hmm, isn't that a more realistic practical use case?
> 
> this option guarantees that the data supplied to zram will have
> the requested compression ratio? hm, but we never do that in real
> life; zram sees random data.

I agree it's hard to create such random data with a benchmark.
One option is to share a swap dump from a real product, for example
Android or webOS, and feed it to the benchmark. But as you know, that
cannot cover every workload either. So, for an easy test, I wanted
to generate data with a representative compression ratio, and fio
provides an option for that via buffer_compress_percentage.
That would be better than feeding random data, which could add
a lot of noise to each test cycle.

> 
> > If we don't use buffer_compress_percentage, what's the content in the 
> > buffer?
> 
> that's a good question. I quickly looked into the fio source code,
> we need to use "buffer_pattern=str" option, I think. so the buffers
> will be filled with the same data.
> 
> I don't mind to have buffer_compress_percentage as a separate test (set
> as a local test option), but I think that using common buffer pattern
> adds more confidence when we compare test results.

If we both use the same "buffer_compress_percentage=something", the
results are good to compare. The benefit of buffer_compress_percentage is
that we can easily change the compression ratio in zram testing and run
various tests to see how the compression ratio or speed affects the system.

>  
> [..]
> > > hm, but I guess it's not enough; fio probably will have different
> > > data (well, only if we didn't ask it to zero-fill the buffers) for
> > > different tests, causing different zram->zsmalloc behaviour. need
> > > to check it.
> [..]
> > > #jobs4   
> > > READ:  8720.4MB/s  7301.7MB/s  7896.2MB/s
> > > READ:  7510.3MB/s  6690.1MB/s  6456.2MB/s
> > > WRITE: 2211.6MB/s  1930.8MB/s  2713.9MB/s
> > > WRITE: 2002.2MB/s  1629.8MB/s  2227.7MB/s
> > 
> > Your case is a 40% win. It's huge, nice!
> > I tested with your guideline (i.e., no buffer_compress_percentage,
> > scramble_buffers=0) but still see only a 10% improvement on my machine.
> > Hmm,,,
> > 
> > How about if you test my fio job.file in your machine?
> > Still, it's 40% win?
> 
> I'll retest with new config.
> 
> > Also, I want to test again in your exactly same configuration.
> > Could you tell me zram environment(ie, disksize, compression
> > algorithm) and share me your job.file of fio?
> 
> sure.

I tested with the parameters you suggested.
On my side, the win is better than in my previous test, but it seems
your test runs very quickly. IOW, the file size is small and loops is just 1.
Please test with filesize=500m and loops=10 or 20.
That should make your test more stable; the improvement is 10~20% on my side.
Let's discuss further once our test results are consistent.

Thanks.
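For reference, a job file adjusted along those lines might start like this (an
illustrative sketch based on the template quoted below, with the larger size
and loop count substituted; not the exact file either side ran):

[global]
bs=4k
ioengine=sync
direct=1
size=500m
numjobs=__JOBS__
group_reporting
filename=/dev/zram0
loops=10
buffer_pattern=0xbadc0ffee
scramble_buffers=0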

> 
> 3G, lzo
> 
> 
> --- my fio-template is
> 
> [global]
> bs=4k
> ioengine=sync
> direct=1
> size=__SIZE__
> numjobs=__JOBS__
> group_reporting
> filename=/dev/zram0
> loops=1
> buffer_pattern=0xbadc0ffee
> scramble_buffers=0
> 
> [seq-read]
> rw=read
> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> [seq-write]
> rw=write
> stonewall
> 
> [rand-write]
> rw=randwrite
> stonewall
> 
> [mixed-seq]
> rw=rw
> stonewall
> 
> [mixed-rand]
> rw=randrw
> stonewall
> 
> 
> #separate test with
> #buffer_compress_percentage=50
> 
> 
> 
> --- my create-zram script is as follows.
> 
> 
> #!/bin/sh
> 
> rmmod zram
> modprobe zram
> 
> if [ -e /sys/block/zram0/initstate ]; then
> initdone=`cat /sys/block/zram0/initstate`
> if [ $initdone = 1 ]; then
> echo "init done"
> exit 1
> fi
> fi
> 
> echo 8 > /sys/block/zram0/max_comp_streams
> 
> echo lzo > /sys/block/zram0/comp_algorithm
> cat /sys/block/zram0/comp_algorithm
> 
> cat /sys/block/zram0/max_comp_streams
> echo $1 > /sys/block/zram0/disksize
> 
> 
> 
> 
> 
> --- and I use it as
> 
> 
> #!/bin/sh
> 
> DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
> FREE_SPACE=$(($DEVICE_SZ / 10))
> LOG=/tmp/fio-zram-test
> LOG_SUFFIX=$1
> 
> function reset_zram
> {
> rmmod zram
> }
> 
> function create_zram
> {
> ./create-zram $DEVICE_SZ
> }
> 
> function main
> {
> local j
> local i
> 
> if [ "z$LOG_SUFFIX" = "z" ]; then
> LOG_SUFFIX="UNSET"
> fi
> 
> LOG=$LOG-$LOG_SUFFIX
> 
> for i in {1..10}; do
> reset_zram
> create_zram
> 
> cat fio-test-template | sed s/__JOBS__/$i/ | sed 
> s/__SIZE__/$((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M/ > fio-test
> 

[PATCH net-next 3/6] macvtap: socket rx busy polling support

2016-03-30 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/net/macvtap.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 95394ed..1891aff 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
@@ -369,6 +370,7 @@ static rx_handler_result_t macvtap_handle_frame(struct 
sk_buff **pskb)
goto drop;
 
if (!segs) {
+   sk_mark_napi_id(&q->sk, skb);
skb_queue_tail(&q->sk.sk_receive_queue, skb);
goto wake_up;
}
@@ -378,6 +380,7 @@ static rx_handler_result_t macvtap_handle_frame(struct 
sk_buff **pskb)
struct sk_buff *nskb = segs->next;
 
segs->next = NULL;
+   sk_mark_napi_id(&q->sk, segs);
skb_queue_tail(&q->sk.sk_receive_queue, segs);
segs = nskb;
}
@@ -391,6 +394,7 @@ static rx_handler_result_t macvtap_handle_frame(struct 
sk_buff **pskb)
!(features & NETIF_F_CSUM_MASK) &&
skb_checksum_help(skb))
goto drop;
+   sk_mark_napi_id(&q->sk, skb);
skb_queue_tail(&q->sk.sk_receive_queue, skb);
}
 
-- 
2.5.0



[PATCH net-next 6/6] vhost_net: net device rx busy polling support

2016-03-30 Thread Jason Wang
This patch lets vhost_net try rx busy polling of the underlying net device
when busy polling is enabled. Tests show some improvement on TCP_RR:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +4%/   +3%/   +3%/   +3%/  +22%
1/50/   +2%/   +2%/   +2%/   +2%/0%
1/   100/   +1%/0%/   +1%/   +1%/   -1%
1/   200/   +2%/   +1%/   +2%/   +2%/0%
   64/ 1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/50/0%/0%/0%/0%/   -1%
   64/   100/   +1%/0%/   +1%/   +1%/0%
   64/   200/0%/0%/   +2%/   +2%/0%
  256/ 1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/50/   +3%/   +3%/   +3%/   +3%/0%
  256/   100/   +1%/   +1%/   +2%/   +2%/0%
  256/   200/0%/0%/   +1%/   +1%/   +1%
 1024/ 1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/0%/0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +1%/   -5%/   +1%/   +1%/0%
1/50/   +1%/0%/   +1%/   +1%/   -1%
1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
1/   200/0%/0%/0%/0%/   -1%
   64/ 1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/0%/0%/0%/   -1%
   64/   200/   -1%/   -1%/0%/0%/0%
  256/ 1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/0%/0%/0%/0%
 1024/ 1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/50/0%/0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +5%/   +2%/   +5%/   +5%/0%
1/50/   +2%/   +2%/   +3%/   +3%/  -20%
1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/ 1/0%/   +4%/0%/0%/  +18%
   64/50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/ 1/0%/   -3%/0%/0%/   +1%
  256/50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/ 1/0%/   -3%/0%/0%/   -6%
 1024/50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f744eeb..7350f6c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -27,6 +27,7 @@
 #include 
 
 #include 
+#include 
 
 #include "vhost.h"
 
@@ -307,15 +308,24 @@ static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
unsigned int *out_num, unsigned int *in_num)
 {
unsigned long uninitialized_var(endtime);
+   struct socket *sock = vq->private_data;
+   struct sock *sk = sock->sk;
+   struct napi_struct *napi;
int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
 
if (r == vq->num && vq->busyloop_timeout) {
preempt_disable();
+   rcu_read_lock();
+   napi = napi_by_id(sk->sk_napi_id);
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(vq->dev, endtime) &&
-  vhost_vq_avail_empty(vq->dev, vq))
+  vhost_vq_avail_empty(vq->dev, vq)) {
+   if (napi)
+   sk_busy_loop_once(sk, napi);
cpu_relax_lowlatency();
+   }
+   rcu_read_unlock();
preempt_enable();
r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
out_num, in_num, NULL, NULL);
@@ -476,6 +486,7 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned long uninitialized_var(endtime);
+   struct napi_struct *napi;
int len = peek_head_len(sk);
 
if (!len && vq->busyloop_timeout) {
@@ -484,13 +495,20 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
vhost_disable_notify(&net->dev, vq);
 
preempt_disable();
+   rcu_read_lock();
+
+   napi = napi_by_id(sk->sk_napi_id);
endtime = busy_clock() + vq->busyloop_timeout;
 
while 

You normally contact at most three thousand; where are the other hundred-plus thousand customers worldwide?

2016-03-30 Thread Andy-Search*Mailer*Inquiry*Order
Hello:
Are you still using the Ali platform to develop foreign-trade customers?
   Still using trade shows to promote your company and products?
 You are out of date!!!
 Developing foreign-trade customers is hard these days; are you also looking for good channels beyond trade shows and B2B?
 Your industry has well over a hundred thousand customers worldwide, yet you normally contact at most three thousand; would you like to reach the rest as well?
 Add QQ 1286754208 for a demonstration of proactive customer development; the sooner you use it, the sooner you benefit, and nearly ten thousand companies are already using it ahead of you!!
 Recommended by the Guangdong Chamber of Commerce, the leading brand in proactive customer development; nearly ten thousand companies are benefiting. You may not be using it, but you cannot afford not to know about it.


[PATCH net-next 5/6] net: export napi_by_id()

2016-03-30 Thread Jason Wang
This patch exports napi_by_id() which will be used by vhost_net socket
busy polling.

Signed-off-by: Jason Wang 
---
 include/net/busy_poll.h | 1 +
 net/core/dev.c  | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index e765e23..dc9c76d 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -74,6 +74,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
 int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
+struct napi_struct *napi_by_id(unsigned int napi_id);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index a2f0c46..b98d210 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4888,7 +4888,7 @@ void napi_complete_done(struct napi_struct *n, int 
work_done)
 EXPORT_SYMBOL(napi_complete_done);
 
 /* must be called under rcu_read_lock(), as we dont take a reference */
-static struct napi_struct *napi_by_id(unsigned int napi_id)
+struct napi_struct *napi_by_id(unsigned int napi_id)
 {
unsigned int hash = napi_id % HASH_SIZE(napi_hash);
struct napi_struct *napi;
@@ -4899,6 +4899,7 @@ static struct napi_struct *napi_by_id(unsigned int 
napi_id)
 
return NULL;
 }
+EXPORT_SYMBOL(napi_by_id);
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
-- 
2.5.0



[PATCH net-next 0/6] net device rx busy polling support in vhost_net

2016-03-30 Thread Jason Wang
Hi all:

This series tries to add net device rx busy polling support to
vhost_net. This is done through:

- adding socket rx busy polling support for tun/macvtap by marking
  napi_id.
- having vhost_net find the net device through napi_id and do
  busy polling if possible, as sketched below.
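In code terms, the two halves fit together roughly like this (a condensed
sketch assembled from the patches in this series, not a literal excerpt):

/* Producer side (tun/macvtap): remember which NAPI instance fed the socket. */
sk_mark_napi_id(tfile->socket.sk, skb);                 /* patches 2/6 and 3/6 */
skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);

/* Consumer side (vhost_net): look that NAPI instance up and poll it while
 * the virtqueue stays empty, instead of just spinning. */
rcu_read_lock();
napi = napi_by_id(sk->sk_napi_id);                      /* patch 5/6 */
while (vhost_can_busy_poll(vq->dev, endtime) &&
       vhost_vq_avail_empty(vq->dev, vq)) {
	if (napi)
		sk_busy_loop_once(sk, napi);            /* patch 4/6 */
	cpu_relax_lowlatency();
}
rcu_read_unlock();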

TCP_RR tests on two mlx4s show some improvements:

smp=1 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +4%/   +3%/   +3%/   +3%/  +22%
1/50/   +2%/   +2%/   +2%/   +2%/0%
1/   100/   +1%/0%/   +1%/   +1%/   -1%
1/   200/   +2%/   +1%/   +2%/   +2%/0%
   64/ 1/   +1%/   +3%/   +1%/   +1%/   +1%
   64/50/0%/0%/0%/0%/   -1%
   64/   100/   +1%/0%/   +1%/   +1%/0%
   64/   200/0%/0%/   +2%/   +2%/0%
  256/ 1/   +2%/   +2%/   +2%/   +2%/   +2%
  256/50/   +3%/   +3%/   +3%/   +3%/0%
  256/   100/   +1%/   +1%/   +2%/   +2%/0%
  256/   200/0%/0%/   +1%/   +1%/   +1%
 1024/ 1/   +2%/   +2%/   +2%/   +2%/   +2%
 1024/50/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   100/   +1%/   +1%/0%/0%/   -1%
 1024/   200/   +2%/   +1%/   +2%/   +2%/0%

smp=8 queue=1
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +1%/   -5%/   +1%/   +1%/0%
1/50/   +1%/0%/   +1%/   +1%/   -1%
1/   100/   -1%/   -1%/   -2%/   -2%/   -4%
1/   200/0%/0%/0%/0%/   -1%
   64/ 1/   -2%/  -10%/   -2%/   -2%/   -2%
   64/50/   -1%/   -1%/   -1%/   -1%/   -2%
   64/   100/   -1%/0%/0%/0%/   -1%
   64/   200/   -1%/   -1%/0%/0%/0%
  256/ 1/   +7%/  +25%/   +7%/   +7%/   +7%
  256/50/   +2%/   +2%/   +2%/   +2%/   -1%
  256/   100/   -1%/   -1%/   -1%/   -1%/   -3%
  256/   200/   +1%/0%/0%/0%/0%
 1024/ 1/   +5%/  +15%/   +5%/   +5%/   +4%
 1024/50/0%/0%/   -1%/   -1%/   -1%
 1024/   100/   -1%/   -1%/   -1%/   -1%/   -2%
 1024/   200/   -1%/0%/   -1%/   -1%/   -1%

smp=8 queue=8
size/session/+thu%/+normalize%/+tpkts%/+rpkts%/+ioexits%/
1/ 1/   +5%/   +2%/   +5%/   +5%/0%
1/50/   +2%/   +2%/   +3%/   +3%/  -20%
1/   100/   +5%/   +5%/   +5%/   +5%/  -13%
1/   200/   +8%/   +8%/   +6%/   +6%/  -12%
   64/ 1/0%/   +4%/0%/0%/  +18%
   64/50/   +6%/   +5%/   +5%/   +5%/   -7%
   64/   100/   +4%/   +4%/   +5%/   +5%/  -12%
   64/   200/   +5%/   +5%/   +5%/   +5%/  -12%
  256/ 1/0%/   -3%/0%/0%/   +1%
  256/50/   +3%/   +3%/   +3%/   +3%/   -2%
  256/   100/   +6%/   +5%/   +5%/   +5%/  -11%
  256/   200/   +4%/   +4%/   +4%/   +4%/  -13%
 1024/ 1/0%/   -3%/0%/0%/   -6%
 1024/50/   +1%/   +1%/   +1%/   +1%/  -10%
 1024/   100/   +4%/   +4%/   +5%/   +5%/  -11%
 1024/   200/   +4%/   +5%/   +4%/   +4%/  -12%

Thanks

Jason Wang (6):
  net: skbuff: don't use union for napi_id and sender_cpu
  tuntap: socket rx busy polling support
  macvtap: socket rx busy polling support
  net: core: factor out core busy polling logic to sk_busy_loop_once()
  net: export napi_by_id()
  vhost_net: net device rx busy polling support

 drivers/net/macvtap.c   |  4 
 drivers/net/tun.c   |  3 ++-
 drivers/vhost/net.c | 22 --
 include/linux/skbuff.h  | 10 
 include/net/busy_poll.h |  8 +++
 net/core/dev.c  | 62 -
 6 files changed, 75 insertions(+), 34 deletions(-)

-- 
2.5.0



[PATCH net-next 4/6] net: core: factor out core busy polling logic to sk_busy_loop_once()

2016-03-30 Thread Jason Wang
This patch factors out the core busy polling logic into
sk_busy_loop_once() so that it can be reused by other modules.

Signed-off-by: Jason Wang 
---
 include/net/busy_poll.h |  7 ++
 net/core/dev.c  | 59 -
 2 files changed, 41 insertions(+), 25 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 2fbeb13..e765e23 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -73,6 +73,7 @@ static inline bool busy_loop_timeout(unsigned long end_time)
 }
 
 bool sk_busy_loop(struct sock *sk, int nonblock);
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi);
 
 /* used in the NIC receive handler to mark the skb */
 static inline void skb_mark_napi_id(struct sk_buff *skb,
@@ -117,6 +118,12 @@ static inline bool busy_loop_timeout(unsigned long 
end_time)
return true;
 }
 
+static inline int sk_busy_loop_once(struct napi_struct *napi,
+   int (*busy_poll)(struct napi_struct *dev))
+{
+   return 0;
+}
+
 static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 {
return false;
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..a2f0c46 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4902,10 +4902,42 @@ static struct napi_struct *napi_by_id(unsigned int 
napi_id)
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 #define BUSY_POLL_BUDGET 8
+int sk_busy_loop_once(struct sock *sk, struct napi_struct *napi)
+{
+   int (*busy_poll)(struct napi_struct *dev);
+   int rc = 0;
+
+   /* Note: ndo_busy_poll method is optional in linux-4.5 */
+   busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+
+   local_bh_disable();
+   if (busy_poll) {
+   rc = busy_poll(napi);
+   } else if (napi_schedule_prep(napi)) {
+   void *have = netpoll_poll_lock(napi);
+
+   if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+   rc = napi->poll(napi, BUSY_POLL_BUDGET);
+   trace_napi_poll(napi);
+   if (rc == BUSY_POLL_BUDGET) {
+   napi_complete_done(napi, rc);
+   napi_schedule(napi);
+   }
+   }
+   netpoll_poll_unlock(have);
+   }
+   if (rc > 0)
+   NET_ADD_STATS_BH(sock_net(sk),
+LINUX_MIB_BUSYPOLLRXPACKETS, rc);
+   local_bh_enable();
+
+   return rc;
+}
+EXPORT_SYMBOL(sk_busy_loop_once);
+
 bool sk_busy_loop(struct sock *sk, int nonblock)
 {
unsigned long end_time = !nonblock ? sk_busy_loop_end_time(sk) : 0;
-   int (*busy_poll)(struct napi_struct *dev);
struct napi_struct *napi;
int rc = false;
 
@@ -4915,31 +4947,8 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
if (!napi)
goto out;
 
-   /* Note: ndo_busy_poll method is optional in linux-4.5 */
-   busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
-
do {
-   rc = 0;
-   local_bh_disable();
-   if (busy_poll) {
-   rc = busy_poll(napi);
-   } else if (napi_schedule_prep(napi)) {
-   void *have = netpoll_poll_lock(napi);
-
-   if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
-   rc = napi->poll(napi, BUSY_POLL_BUDGET);
-   trace_napi_poll(napi);
-   if (rc == BUSY_POLL_BUDGET) {
-   napi_complete_done(napi, rc);
-   napi_schedule(napi);
-   }
-   }
-   netpoll_poll_unlock(have);
-   }
-   if (rc > 0)
-   NET_ADD_STATS_BH(sock_net(sk),
-LINUX_MIB_BUSYPOLLRXPACKETS, rc);
-   local_bh_enable();
+   rc = sk_busy_loop_once(sk, napi);
 
if (rc == LL_FLUSH_FAILED)
break; /* permanent failure */
-- 
2.5.0



[PATCH net-next 1/6] net: skbuff: don't use union for napi_id and sender_cpu

2016-03-30 Thread Jason Wang
We use a union for napi_id and sender_cpu. This is OK for most
cases, except when we want to support busy polling for tun, which needs
napi_id to be stored and passed to the socket during tun_net_xmit(). In
this case, napi_id was overridden with sender_cpu before tun_net_xmit()
was called if XPS was enabled. Fix this by not using a union for napi_id
and sender_cpu.

Signed-off-by: Jason Wang 
---
 include/linux/skbuff.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 15d0df9..8aee891 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -743,11 +743,11 @@ struct sk_buff {
__u32   hash;
__be16  vlan_proto;
__u16   vlan_tci;
-#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
-   union {
-   unsigned intnapi_id;
-   unsigned intsender_cpu;
-   };
+#if defined(CONFIG_NET_RX_BUSY_POLL)
+   unsigned intnapi_id;
+#endif
+#if defined(CONFIG_XPS)
+   unsigned intsender_cpu;
 #endif
union {
 #ifdef CONFIG_NETWORK_SECMARK
-- 
2.5.0



[PATCH net-next 2/6] tuntap: socket rx busy polling support

2016-03-30 Thread Jason Wang
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afdf950..950faf5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -69,6 +69,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -871,6 +872,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
nf_reset(skb);
 
+   sk_mark_napi_id(tfile->socket.sk, skb);
/* Enqueue packet */
skb_queue_tail(&tfile->socket.sk->sk_receive_queue, skb);
 
@@ -878,7 +880,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
if (tfile->flags & TUN_FASYNC)
kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
tfile->socket.sk->sk_data_ready(tfile->socket.sk);
-
rcu_read_unlock();
return NETDEV_TX_OK;
 
-- 
2.5.0



Re: [PATCH] tpm: remove redundant code from self-test functions

2016-03-30 Thread Jason Gunthorpe
On Wed, Mar 30, 2016 at 04:20:45PM +0300, Jarkko Sakkinen wrote:
  
> - rc = be32_to_cpu(cmd.header.out.return_code);
>   if (rc == TPM_ERR_DISABLED || rc == TPM_ERR_DEACTIVATED) {

This line is the entire reason it is open-coded. I see it being
removed, but I don't see how the functionality is maintained?

Jason
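For context, the check in question treats a disabled or deactivated TPM as a
non-fatal outcome of the continue-self-test command, so whatever helper
replaces the open-coded path would still need something along these lines (a
hypothetical sketch, not the actual driver code):

rc = be32_to_cpu(cmd.header.out.return_code);
if (rc == TPM_ERR_DISABLED || rc == TPM_ERR_DEACTIVATED) {
	/* the TPM cannot run the self test, but that is not a hard error */
	rc = 0;
}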



[PATCH] .mailmap: Add Christophe Ricard

2016-03-30 Thread Christophe Ricard
Different computers had different settings in the mail client. Some
contributions appear as Christophe Ricard, others as Christophe RICARD.

Signed-off-by: Christophe Ricard 
---
 .mailmap | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.mailmap b/.mailmap
index 7e6c533..90c0aef 100644
--- a/.mailmap
+++ b/.mailmap
@@ -33,6 +33,7 @@ Björn Steinbrink 
 Brian Avery 
 Brian King 
 Christoph Hellwig 
+Christophe Ricard 
 Corey Minyard 
 Damian Hobson-Garcia 
 David Brownell 
-- 
2.5.0



Re: [PATCH v8 0/4] Introduce usb charger framework to deal with the usb gadget power negotiation

2016-03-30 Thread Baolin Wang
On 30 March 2016 at 19:24, Felipe Balbi  wrote:
>>> >> >> >
>>> >> >> > Seems you don't want to guarantee charger type detection is done
>>> >> >> > before gadget connection(pullup DP), right?
>>> >> >> > I see you call usb_charger_detect_type() in each gadget usb
>>> >> >> > state
>>> >> >> changes.
>>> >> >>
>>> >> >> I am not sure I get your point correctly, please correct me if I
>>> >> >> misunderstand you.
>>> >> >> We need to check the charger type at every event comes from the
>>> >> >> usb gadget state changes or the extcon device state changes, which
>>> >> >> means a new charger plugin or pullup.
>>> >> >>
>>> >> >
>>> >> > According to usb charger spec, my understanding is you can't do
>>> >> > real charger detection procedure *after* gadget _connection_(pullup
>>> >> > DP), also I don't
>>> >>
>>> >> Why can not? Charger detection is usually from PMIC.
>>> >
>>> > Charger detection process will impact DP/DM line state, see usb
>>> > charger spec v1.2 for detail detection process, section 4.6.3 says:
>>> >
>>> > "A PD is allowed to *disconnect* and repeat the charger detection
>>> > process multiple times while attached. The PD is required to wait for
>>> > a time of at least TCP_VDM_EN max between disconnecting and restarting
>>> > the charger detection process."
>>> >
>>> > As Peter mentioned, the charger detection should happen between VBUS
>>> > detection and gadget pull up DP for first plug in case. So when
>>> > gadget connect (pullup DP), you should already know the charger type.
>>>
>>> Make sense. In our company's solution, charger detection can be done by
>>> hardware from PMIC at first, then it will not affect the DP/DM line when
>>> gadget starts to enumeration.
>>
>> I see, charger type detection is done automatically by PMIC when VBUS
>> is detected in your case, you just assume the process is complete
>
> assuming this finishes before gadget starts is a bad idea. It would've
> been much more robust to delay usb_gadget_connect() until we KNOW
> charger detection has completed.

It is a hardware action to detect the charger type quickly. The
'usb_charger_detect_type()' function actually *gets* the charger type
rather than *detecting* it. Maybe I need to change the function name
to 'usb_charger_get_type()'.

If some UDC drivers want to detect the charger type in the
'gadget->ops->get_charger_type()' callback, they should take care to
do so only at the right gadget state, so the DP/DM line state is not
disturbed. Thanks.

>
> --
> balbi



-- 
Baolin.wang
Best Regards


Re: [PATCH v2] tty/serial/8250: fix RS485 half-duplex RX

2016-03-30 Thread Yegor Yefremov
On Thu, Mar 24, 2016 at 9:03 AM,   wrote:
> From: Yegor Yefremov 
>
> When in half-duplex mode RX will be disabled before TX, but not
> enabled after deactivating transmitter. This patch enables
> UART_IER_RLSI and UART_IER_RDI interrupts after TX is over.
>
> Cc: Matwey V. Kornilov 
> Signed-off-by: Yegor Yefremov 
> Fixes: e490c9144cfa ("tty: Add software emulated RS485 support for 8250")

Transferring from Acked-by from v1:

Acked-by: Matwey V. Kornilov 

> ---
> Changes:
> v2: change subject and add 'Fixes' tag
>
>  drivers/tty/serial/8250/8250_port.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/tty/serial/8250/8250_port.c 
> b/drivers/tty/serial/8250/8250_port.c
> index e213da0..00ad263 100644
> --- a/drivers/tty/serial/8250/8250_port.c
> +++ b/drivers/tty/serial/8250/8250_port.c
> @@ -1403,9 +1403,18 @@ static void __do_stop_tx_rs485(struct uart_8250_port 
> *p)
> /*
>  * Empty the RX FIFO, we are not interested in anything
>  * received during the half-duplex transmission.
> +* Enable previously disabled RX interrupts.
>  */
> -   if (!(p->port.rs485.flags & SER_RS485_RX_DURING_TX))
> +   if (!(p->port.rs485.flags & SER_RS485_RX_DURING_TX)) {
> serial8250_clear_fifos(p);
> +
> +   serial8250_rpm_get(p);
> +
> +   p->ier |= UART_IER_RLSI | UART_IER_RDI;
> +   serial_port_out(&p->port, UART_IER, p->ier);
> +
> +   serial8250_rpm_put(p);
> +   }
>  }
>
>  static void serial8250_em485_handle_stop_tx(unsigned long arg)
> --
> 2.1.4
>


Re: [PATCH v8 0/4] Introduce usb charger framework to deal with the usb gadget power negotation

2016-03-30 Thread Baolin Wang
On 30 March 2016 at 18:58, Jun Li  wrote:
>> >> >> > Seems you don't want to guarantee charger type detection is done
>> >> >> > before gadget connection(pullup DP), right?
>> >> >> > I see you call usb_charger_detect_type() in each gadget usb
>> >> >> > state
>> >> >> changes.
>> >> >>
>> >> >> I am not sure I get your point correctly, please correct me if I
>> >> >> misunderstand you.
>> >> >> We need to check the charger type at every event comes from the
>> >> >> usb gadget state changes or the extcon device state changes, which
>> >> >> means a new charger plugin or pullup.
>> >> >>
>> >> >
>> >> > According to usb charger spec, my understanding is you can't do
>> >> > real charger detection procedure *after* gadget _connection_(pullup
>> >> > DP), also I don't
>> >>
>> >> Why can not? Charger detection is usually from PMIC.
>> >
>> > Charger detection process will impact DP/DM line state, see usb
>> > charger spec v1.2 for detail detection process, section 4.6.3 says:
>> >
>> > "A PD is allowed to *disconnect* and repeat the charger detection
>> > process multiple times while attached. The PD is required to wait for
>> > a time of at least TCP_VDM_EN max between disconnecting and restarting
>> > the charger detection process."
>> >
>> > As Peter mentioned, the charger detection should happen between VBUS
>> > detection and gadget pull up DP for first plug in case. So when
>> > gadget connect (pullup DP), you should already know the charger type.
>>
>> Make sense. In our company's solution, charger detection can be done by
>> hardware from PMIC at first, then it will not affect the DP/DM line when
>> gadget starts to enumeration.
>
> I see, charger type detection is done automatically by PMIC when VBUS is
> detected in your case, you just assume the process is complete before SW
> do gadget connect. To make the framework common, you may do one time charger 
> type check when vbus is on, and save it to avoid repeat charger type check.

OK. I'll add a check in the 'usb_charger_detect_type()' function to
see whether the charger type has already been set.
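
Roughly like the sketch below -- just to show the idea; the struct
layout and enum values here are illustrative, not the actual code in
the series:

enum usb_charger_type { UNKNOWN_TYPE, SDP_TYPE, DCP_TYPE, CDP_TYPE, ACA_TYPE };

struct usb_charger {
	enum usb_charger_type type;	/* cached type, latched once per VBUS session */
	enum usb_charger_type (*get_hw_type)(struct usb_charger *uchger);
};

static enum usb_charger_type usb_charger_detect_type(struct usb_charger *uchger)
{
	/*
	 * If the PMIC already latched a type for this VBUS session,
	 * return the cached value instead of touching the hardware
	 * again, so DP/DM are not disturbed after the gadget pulls up.
	 */
	if (uchger->type != UNKNOWN_TYPE)
		return uchger->type;

	uchger->type = uchger->get_hw_type(uchger);
	return uchger->type;
}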

-- 
Baolin.wang
Best Regards


Re: [PATCH V2 3/3] sched/deadline: Tracepoints for deadline scheduler

2016-03-30 Thread Juri Lelli
Hi,

On 29/03/16 15:25, Steven Rostedt wrote:
> On Tue, 29 Mar 2016 16:12:38 -0300
> Daniel Bristot de Oliveira  wrote:
> 
> > On 03/29/2016 02:13 PM, Steven Rostedt wrote:
> > >>   -0 [007] d..3 78377.688969: sched_switch: 
> > >> prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> 
> > >> next_comm=b next_pid=18973 next_prio=-1  
> > >> >b-18973 [007] d..3 78377.688979: sched_deadline_block: 
> > >> > now=78377.688976271 deadline=78377.718945137 remaining_runtime=9968866
> > >> >b-18973 [007] d..3 78377.688981: sched_switch: 
> > >> > prev_comm=b prev_pid=18973 prev_prio=-1 prev_state=S ==> 
> > >> > next_comm=swapper/7 next_pid=0 next_prio=120  
> > > Why did it go to sleep? The above is still not very useful. What do you
> > > mean "blocking on a system call"?  
> > 
> > A task can go can to sleep in a blocking system call, like waiting
> > a network packet, or any other external event.
> 
> Note, waiting for a network packet or some other external event is a
> userspace call. A schedule out in 'S' state means exactly that. But
> I hate the term "blocked" because that is more like waiting for
> something else to finish (like blocked on a lock). In which case, if
> that did happen, the state would be "D" not "S".
> 
> "S" is basically "sleeping" and it gets woken up by some other event. A
> slight difference to the term "blocked".
> 
> > 
> > The "block state" is a possible state of a task running in the deadline
> > scheduler. It means that a task voluntarily left the processor, not
> > by calling sched_yield(), but by blocking (or sleeping) waiting another
> > event.
> > 
> > This state is described in the Figure 2 of the article "Deadline
> > scheduling in the Linux kernel", available at:
> > http://onlinelibrary.wiley.com/doi/10.1002/spe.2335/abstract
> 
> Bah, confusing terminology.
> 

Mmm, a bit of overloading yes. Should be consistent with RT literature
terminology, though (I hope :-/).

> > 
> > The block state affects the replenishment of the task, and that is why
> > it is different of both yeild and throttle. If the task blocks and is
> > awakened prior to the deadline, the task's runtime will not be
> > replenished. On the other hand it will. For example:
> > 

Not entirely true. We can have a replenishment even if the task wakes up
before the deadline, if it happens to wake up after the 0-lag point
(after which its runtime can't be recycled if we don't want to affect
others' guarantees). Anyway, this doesn't affect the discussion, I only
wanted to point out that the fact that a replenishment happened might be
useful information to get.
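
For reference, the wakeup-time test is dl_entity_overflow() in
kernel/sched/deadline.c; ignoring the scaling the real code does to
avoid 64-bit overflow, it boils down to something like this sketch:

/*
 * Replenish on wakeup iff the leftover runtime can no longer be spent
 * at the reserved bandwidth (dl_runtime/dl_period) before the current
 * deadline, i.e. the task woke up past the 0-lag point
 *
 *	t0 = deadline - remaining_runtime * dl_period / dl_runtime
 */
static int wakeup_needs_replenish(unsigned long long now,
				  unsigned long long deadline,
				  unsigned long long remaining_runtime,
				  unsigned long long dl_runtime,
				  unsigned long long dl_period)
{
	if (now >= deadline)
		return 1;
	return remaining_runtime * dl_period > (deadline - now) * dl_runtime;
}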

> > Blocking, and then waking up after the deadline:
> >b-5152  [007] d..3  3983.376428: sched_deadline_replenish: 
> > comm=b pid=5152 now=3983.376425148 deadline=3983.406425148 runtime=1000
> >b-5152  [007] d..3  3983.376515: sched_deadline_block: 
> > now=3983.376511101 deadline=3983.406425148 remaining_runtime=9909566
> >b-5152  [007] d..3  3983.376529: sched_switch: prev_comm=b 
> > prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 
> > next_prio=120
> > 
> >   -0 [007] d.h4  3983.476592: sched_deadline_replenish: 
> > comm=b pid=5152 now=3983.476589573 deadline=3983.506589573 runtime=1000
> >   -0 [007] dNh4  3983.476596: sched_wakeup: comm=b 
> > pid=5152 prio=-1 target_cpu=007
> >   -0 [007] d..3  3983.476648: sched_switch: 
> > prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b 
> > next_pid=5152 next_prio=-1
> >b-5152  [007] d..3  3983.476660: sched_deadline_block: 
> > now=3983.476656613 deadline=3983.506589573 remaining_runtime=9932960
> >b-5152  [007] d..3  3983.476663: sched_switch: prev_comm=b 
> > prev_pid=5152 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 
> > next_prio=120
> > 
> > 
> > Blocking, and then waking up before the deadline:
> >b-5139  [007] d..3  3964.148290: sched_deadline_replenish: 
> > comm=b pid=5139 now=3964.148285227 deadline=3964.178285227 runtime=1000
> >b-5139  [007] d..3  3964.148396: sched_deadline_block: 
> > now=3964.148385308 deadline=3964.178285227 remaining_runtime=9895243
> >b-5139  [007] d..3  3964.148400: sched_switch: prev_comm=b 
> > prev_pid=5139 prev_prio=-1 prev_state=S ==> next_comm=swapper/7 next_pid=0 
> > next_prio=120
> > 
> >   -0 [007] dNh5  3964.148411: sched_wakeup: comm=b 
> > pid=5139 prio=-1 target_cpu=007
> >   -0 [007] d..3  3964.148419: sched_switch: 
> > prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=b 
> > next_pid=5139 next_prio=-1
> >b-5139  [007] d..3  3964.148427: sched_deadline_block: 
> > now=3964.148426022 deadline=3964.178285227 remaining_runtime=9878164
> >b-5139  [007] d..3  3964.148429: sched_switch: 

Re: [PATCH v16 22/23] tracing: Add hist trigger 'log2' modifier

2016-03-30 Thread Daniel Wagner
Hi Namhyung,

On 03/29/2016 05:17 PM, Namhyung Kim wrote:
> On Tue, Mar 29, 2016 at 12:01:40PM +0200, Daniel Wagner wrote:
>> cat /sys/kernel/debug/tracing/events/test/latency_complete/hist
>> # event histogram
>> #
>> # trigger info: 
>> hist:keys=latency.log2:vals=hitcount:sort=latency.log2:size=2048 [active]
>> #
>> #
> 
> Maybe we want to skip printing those flags for sort keys..
> What about this?

Yes, something like this makes sense to me. As I said, it is only a
minor detail and maybe just not worth the extra code.

Anyway, the hist patches do really work nicely. I hope they get merged soon.

cheers,
daniel


Re: [PATCH 0/3] idle, Honor Hardware Disabled States

2016-03-30 Thread Len Brown
> Len,
>
> Your patch does
>
> +   skl_cstates[5].disabled = 1;/* C8-SKL */
> +   skl_cstates[6].disabled = 1;/* C9-SKL */
>
> and I don't think that is correct for SKY-H.

For https://bugzilla.kernel.org/show_bug.cgi?id=109081
it is correct.

> Your patch does not take into account that the states are explicitly disabled
> in MSR_NHM_SNB_PKG_CST_CFG_CTL.  That is the problem here and what you've done
> is simply hammered a disable into those states.

ENOPARSE.
Are we talking about the failure in
https://bugzilla.kernel.org/show_bug.cgi?id=109081
or a different problem?

>
> Additionally, your patch does not show the user the correct state information:
>
> [root@dhcp40-125 ~]# egrep ^ 
> /sys/devices/system/cpu/cpu0/cpuidle/state?/disable
> /sys/devices/system/cpu/cpu0/cpuidle/state0/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state2/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state4/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state5/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state6/disable:1:0
> /sys/devices/system/cpu/cpu0/cpuidle/state7/disable:1:0 << should be 1
> /sys/devices/system/cpu/cpu0/cpuidle/state8/disable:1:0 << should be 1

the 'disable' attribute you see in sysfs is not
struct cpuidle_state.disabled,
it is
struct cpuidle_state_usage.disable
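
Roughly, from include/linux/cpuidle.h (other members elided):

struct cpuidle_state {
	/* ... */
	bool disabled;			/* driver-set: state disabled on all CPUs */
};

struct cpuidle_state_usage {
	unsigned long long disable;	/* per-cpu knob behind .../cpuidle/stateN/disable */
	/* ... */
};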

> The fix is to honour the settings in MSR_NHM_SNB_PKG_CST_CFG_CTL.  I cannot 
> say
> for certain that ALL SKY-H are impacted (you are admittedly in better position
> to say so or not).  I can say that on the 2 systems tested here the
> MSR_NHM_SNB_PKG_CST_CFG_CTL do have the appropriate disable value set.
>
> /me could be missing some important info  -- again, perhaps there are some
> SKY-H's out there that do not have states disabled in
> MSR_NHM_SNB_PKG_CST_CFG_CTL, and that's why I've proposed rebasing on top of
> your change.

Do you see this debug message when you run current upstream on this hardware?

/* if state marked as disabled, skip it */
if (cpuidle_state_table[cstate].disabled != 0) {
pr_debug(PREFIX "state %s is disabled",
cpuidle_state_table[cstate].name);
continue;
}


If no, then my patch is not disabling C8/C9 on your system.

Also, if it were, the code above causes the states to not appear
at all in sysfs, because they are not registered.

Re: MSR_NHM_SNB_PKG_CST_CFG_CTL

if PC10 is disabled there, then functionally, it doesn't matter what we do,
which is why my patch does nothing when PC10 is disabled.

In such a scenario, pc10 presence in sysfs (and cpufreq)
is cosmetic.  The hardware knows what to do.

Do you think that cosmetic issue is worth dealing with?
Note that the decoding of that MSR changes with every CPU,
so to get it right (like turbostat does), we'd need a table.
Also, it would be useful only for states which are purely PC states,
i.e. we can't disable CC7 just because PC7 is disabled, etc.
So you could remove PC8, PC9, PC10 from sysfs on SKL
when they are disabled, but that is all.
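
For illustration only, the naive decode would be something like the
sketch below (kernel context); the field width and the meaning of each
encoding are model-specific -- exactly why turbostat carries per-model
tables -- so treat the mask as an assumption, not a definitive decode:

static unsigned int pkg_cstate_limit(void)
{
	unsigned long long msr;

	rdmsrl(MSR_NHM_SNB_PKG_CST_CFG_CTL, msr);	/* MSR 0xE2 */
	return msr & 0xf;	/* assumed: low bits hold the deepest allowed PC-state */
}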

thanks,
Len Brown, Intel Open Source Technology Center


Re: [PATCH v4 6/9] ARM: dts: Add initial gpio setting of MMC2 device for exynos3250-monk

2016-03-30 Thread Krzysztof Kozlowski
On 31.03.2016 11:48, Chanwoo Choi wrote:
> This patch adds initial pin configuration of MMC2 device on exynos3250-monk
> board because the MMC2 gpio pin (gpk2[0-6]) are NC (not connected) state.
> 
> Suggested-by: Krzysztof Kozlowski 
> Signed-off-by: Chanwoo Choi 
> Reviewed-by: Krzysztof Kozlowski 
> ---
>  arch/arm/boot/dts/exynos3250-monk.dts | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

Thanks, applied this one for v4.7 with adjusted title.

I'll wait with other patches for Sylwester's branch/tag.

Best regards,
Krzysztof




Re: [PATCH v4 1/9] ARM: dts: Add initial pin configuration for exynos3250-rinato

2016-03-30 Thread Krzysztof Kozlowski
On 31.03.2016 11:47, Chanwoo Choi wrote:
> This patch adds initial pin configuration using pinctrl subsystem
> to reduce leakage power-consumption of gpio pins in normal state.
> All pins included in this patch are NC (not connected) pin.
> 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Signed-off-by: Chanwoo Choi 
> Reviewed-by: Krzysztof Kozlowski 
> ---
>  arch/arm/boot/dts/exynos3250-pinctrl.dtsi | 38 +
>  arch/arm/boot/dts/exynos3250-rinato.dts   | 71 
> ++-
>  2 files changed, 107 insertions(+), 2 deletions(-)

Thanks, applied this one for v4.7 with adjusted title.

Best regards,
Krzysztof



[git pull] vfs fix

2016-03-30 Thread Al Viro
The following changes since commit f55532a0c0b8bb6148f4e07853b876ef73bc69ca:

  Linux 4.6-rc1 (2016-03-26 16:03:24 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus

for you to fetch changes up to 7500c38ac3258815f86f41744a538850c3221b23:

  fix the braino in "namei: massage lookup_slow() to be usable by 
lookup_one_len_unlocked()" (2016-03-31 00:23:05 -0400)


Al Viro (1):
  fix the braino in "namei: massage lookup_slow() to be usable by 
lookup_one_len_unlocked()"

 fs/namei.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)


Re: [PATCH] ACPICA: Remove unnecessary "\n" from an ACPI_INFO boot message

2016-03-30 Thread Joe Perches
On Wed, 2016-03-30 at 22:11 -0300, Daniel Bristot de Oliveira wrote:
> On 03/29/2016 04:09 PM, Moore, Robert wrote:
> > Actually, I did in fact put that there to break up the output after the 
> > tables are loaded. Is this a problem?
> Well, I do not believe that there is a real problem on it.
> 
> On the other hand, it does not seem to be common to have blank lines in
> the kernel log, and as there is no info about from where the blank line
> comes from, it does not even seem to be connected to the previous
> message. So although my patch is about "cosmetics", IMHO it is worth as
> pattern or best practices.

FWIW: I agree with Daniel.



[PATCH 3/9] x86 tsc_msr: Update comments, expand definitions

2016-03-30 Thread Len Brown
From: Len Brown 

Syntax only, no functional change.

Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 36 ++--
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index d460ef1..3a866bc 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -1,14 +1,5 @@
 /*
- * tsc_msr.c - MSR based TSC calibration on Intel Atom SoC platforms.
- *
- * TSC in Intel Atom SoC runs at a constant rate which can be figured
- * by this formula:
- * <maximum core-clock to bus-clock ratio> * <maximum resolved frequency>
- * See Intel 64 and IA-32 System Programming Guid section 16.12 and 30.11.5
- * for details.
- * Especially some Intel Atom SoCs don't have PIT(i8254) or HPET, so MSR
- * based calibration is the only option.
- *
+ * tsc_msr.c - TSC frequency enumeration via MSR
  *
  * Copyright (C) 2013 Intel Corporation
  * Author: Bin Gao 
@@ -22,17 +13,10 @@
 #include 
 #include 
 
-/* CPU reference clock frequency: in KHz */
-#define FREQ_83    83200
-#define FREQ_100   99840
-#define FREQ_133   133200
-#define FREQ_166   166400
-
 #define MAX_NUM_FREQS  8
 
 /*
- * According to Intel 64 and IA-32 System Programming Guide,
- * if MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be
+ * If MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be
  * read in MSR_PLATFORM_ID[12:8], otherwise in MSR_PERF_STAT[44:40].
  * Unfortunately some Intel Atom SoCs aren't quite compliant to this,
  * so we need manually differentiate SoC families. This is what the
@@ -47,15 +31,15 @@ struct freq_desc {
 
 static struct freq_desc freq_desc_tables[] = {
/* PNW */
-   { 6, 0x27, 0, { 0, 0, 0, 0, 0, FREQ_100, 0, FREQ_83 } },
+   { 6, 0x27, 0, { 0, 0, 0, 0, 0, 99840, 0, 83200 } },
/* CLV+ */
-   { 6, 0x35, 0, { 0, FREQ_133, 0, 0, 0, FREQ_100, 0, FREQ_83 } },
-   /* TNG */
-   { 6, 0x4a, 1, { 0, FREQ_100, FREQ_133, 0, 0, 0, 0, 0 } },
-   /* VLV2 */
-   { 6, 0x37, 1, { FREQ_83, FREQ_100, FREQ_133, FREQ_166, 0, 0, 0, 0 } },
-   /* ANN */
-   { 6, 0x5a, 1, { FREQ_83, FREQ_100, FREQ_133, FREQ_100, 0, 0, 0, 0 } },
+   { 6, 0x35, 0, { 0, 133200, 0, 0, 0, 99840, 0, 83200 } },
+   /* TNG - Intel Atom processor Z3400 series */
+   { 6, 0x4a, 1, { 0, 99840, 133200, 0, 0, 0, 0, 0 } },
+   /* VLV2 - Intel Atom processor E3000, Z3600, Z3700 series */
+   { 6, 0x37, 1, { 83200, 99840, 133200, 166400, 0, 0, 0, 0 } },
+   /* ANN - Intel Atom processor Z3500 series */
+   { 6, 0x5a, 1, { 83200, 99840, 133200, 99840, 0, 0, 0, 0 } },
 };
 
 static int match_cpu(u8 family, u8 model)
-- 
2.8.0.rc4.16.g56331f8



Re: [PART1 RFC v3 10/12] svm: Do not expose x2APIC when enable AVIC

2016-03-30 Thread Suravee Suthikulpanit

Hi Radim,

On 03/19/2016 03:59 AM, Radim Krčmář wrote:

2016-03-18 01:09-0500, Suravee Suthikulpanit:

From: Suravee Suthikulpanit 

Since AVIC only virtualizes xAPIC hardware for the guest, we need to:
 * Intercept APIC BAR msr accesses to disable x2APIC
 * Intercept CPUID access to not advertise x2APIC support
 * Hide x2APIC support when checking via KVM ioctl

Signed-off-by: Suravee Suthikulpanit 
---
  arch/x86/kvm/svm.c | 49 ++---
  1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6303147..ba84d57 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -209,6 +209,7 @@ static const struct svm_direct_access_msrs {
{ .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
{ .index = MSR_IA32_LASTINTFROMIP,  .always = false },
{ .index = MSR_IA32_LASTINTTOIP,.always = false },
+   { .index = MSR_IA32_APICBASE,   .always = false },
{ .index = MSR_INVALID, .always = false },
  };

@@ -853,6 +854,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)

set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
}
+
+   if (svm_vcpu_avic_enabled(svm))
+   set_msr_interception(msrpm, MSR_IA32_APICBASE, 1, 1);


AVIC really won't exit on writes to MSR_IA32_APICBASE otherwise?


Actually, I got confused about this part. This should not be needed.




@@ -3308,6 +3312,18 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
msr_info->data = 0x1E;
}
break;
+   case MSR_IA32_APICBASE:
+   if (svm_vcpu_avic_enabled(svm)) {
+   /* Note:
+* For AVIC, we need to disable X2APIC
+* and enable XAPIC
+*/
+   kvm_get_msr_common(vcpu, msr_info);
+   msr_info->data &= ~X2APIC_ENABLE;
+   msr_info->data |= XAPIC_ENABLE;
+   break;


No.  This won't make the guest switch to xAPIC.
x2APIC can only be enabled if CPUID has that flag and it's impossible to
toggle that CPUID flag it during runtime.


This is also not needed since we already disable the x2APIC in the CPUID 
below.



+   }
+   /* Follow through if not AVIC */
default:
return kvm_get_msr_common(vcpu, msr_info);
}
@@ -3436,6 +3452,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr)
case MSR_VM_IGNNE:
vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", 
ecx, data);
break;
+   case MSR_IA32_APICBASE:
+   if (svm_vcpu_avic_enabled(svm))
+   avic_update_vapic_bar(to_svm(vcpu), data);


There is no connection to x2APIC, please do it in a different patch.


Right. I'll move this.




+   /* Follow through */
default:
return kvm_set_msr_common(vcpu, msr);
}
@@ -4554,11 +4574,26 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)

/* Update nrips enabled cache */
svm->nrips_enabled = !!guest_cpuid_has_nrips(&svm->vcpu);
+
+   /* Do not support X2APIC when enable AVIC */
+   if (svm_vcpu_avic_enabled(svm)) {
+   int i;
+
+   for (i = 0 ; i < vcpu->arch.cpuid_nent ; i++) {
+   if (vcpu->arch.cpuid_entries[i].function == 1)


Please use kvm_find_cpuid_entry for the search.


+   vcpu->arch.cpuid_entries[i].ecx &= ~(1 << 21);


and X86_FEATURE_X2APIC (or something with X2APIC in name) for the bit.

The code will become so obvious that the comment can be removed. :)


Good point. I can only find example of using (X86_FEATURE_X2APIC % 32) 
== 21.



+   }
+   }
  }

  static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
  {
switch (func) {
+   case 0x0001:
+   /* Do not support X2APIC when enable AVIC */
+   if (avic)
+   entry->ecx &= ~(1 << 21);


I think this might be the right place for the code you have in
svm_cpuid_update.


Right. I'll also make the change to use (X86_FEATURE_X2APIC % 32).
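
Something along these lines (untested sketch):

	struct kvm_cpuid_entry2 *best;

	best = kvm_find_cpuid_entry(vcpu, 1, 0);
	if (best)
		/* X86_FEATURE_X2APIC encodes word+bit; the low 5 bits give the bit index */
		best->ecx &= ~(1u << (X86_FEATURE_X2APIC & 31));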


Btw. how does x2APIC behave under AVIC?
We definitely shouldn't recommend/expose x2APIC with AVIC as AVIC
doesn't accelerate x2APIC guest-facing interface,


Access to offset 0x400+ would generate #VMEXIT no accel fault 
read/write.  So, we will need to handle and emulate this in the host.



but the MSR interface is going to exit and host-side interrupt

> delivery will probably still work, so I don't see
> a huge problem with it.

Agree that it will still work. However, in such case, the guest code 
would likely default to using x2APIC interface, which will not be 

[PATCH 1/9] x86 tsc_msr: Identify Intel-specific code

2016-03-30 Thread Len Brown
From: Len Brown 

try_msr_calibrate_tsc() is currently Intel-specific,
and should not execute on any other vendor's parts.

Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 92ae6ac..c16e35b 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -86,6 +86,9 @@ unsigned long try_msr_calibrate_tsc(void)
unsigned long res;
int cpu_index;
 
+   if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+   return 0;
+
cpu_index = match_cpu(boot_cpu_data.x86, boot_cpu_data.x86_model);
if (cpu_index < 0)
return 0;
-- 
2.8.0.rc4.16.g56331f8



[PATCH 5/9] x86 tsc_msr: Add Airmont reference clock values

2016-03-30 Thread Len Brown
From: Len Brown 

per the Intel 64 and IA-32 Architecture Software Developer's Manual...

Add the reference clock for Intel Atom Processors
Based on the Airmont Microarchitecture.

Reported-by: Stephane Gasparini 
Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 19f2a9a..59c371e 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -13,7 +13,7 @@
 #include 
 #include 
 
-#define MAX_NUM_FREQS  8
+#define MAX_NUM_FREQS  9
 
 /*
  * If MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be
@@ -40,6 +40,9 @@ static struct freq_desc freq_desc_tables[] = {
{ 6, 0x37, 1, { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 } },
/* ANN - Intel Atom processor Z3500 series */
{ 6, 0x5a, 1, { 83300, 100000, 133300, 100000, 0, 0, 0, 0 } },
+   /* AMT - Intel Atom processor X7-Z8000 and X5-Z8000 series */
+   { 6, 0x4c, 1, { 83300, 100000, 133300, 116700,
+   80000, 93300, 90000, 88900, 87500 } },
 };
 
 static int match_cpu(u8 family, u8 model)
-- 
2.8.0.rc4.16.g56331f8



[PATCH 2/9] x86 tsc_msr: Remove debugging messages

2016-03-30 Thread Len Brown
From: Len Brown 

Debugging messages are not necessary after all of the
possible hardware failures that never occur.
Instead, this code can simply return 0.

This code also doesn't need to print in the success case.
tsc_init() already prints the TSC frequency,
and apic=debug is available if anybody really is
interested in printing the LAPIC frequency.

Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index c16e35b..d460ef1 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -76,9 +76,10 @@ static int match_cpu(u8 family, u8 model)
(freq_desc_tables[cpu_index].freqs[freq_id])
 
 /*
- * Do MSR calibration only for known/supported CPUs.
+ * MSR-based CPU/TSC frequency discovery for certain CPUs.
  *
- * Returns the calibration value or 0 if MSR calibration failed.
+ * Set global "lapic_timer_frequency" to bus_clock_cycles/jiffy
+ * Return processor base frequency in KHz, or 0 on failure.
  */
 unsigned long try_msr_calibrate_tsc(void)
 {
@@ -100,31 +101,17 @@ unsigned long try_msr_calibrate_tsc(void)
rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
ratio = (hi >> 8) & 0x1f;
}
-   pr_info("Maximum core-clock to bus-clock ratio: 0x%x\n", ratio);
-
-   if (!ratio)
-   goto fail;
 
/* Get FSB FREQ ID */
rdmsr(MSR_FSB_FREQ, lo, hi);
freq_id = lo & 0x7;
freq = id_to_freq(cpu_index, freq_id);
-   pr_info("Resolved frequency ID: %u, frequency: %u KHz\n",
-   freq_id, freq);
-   if (!freq)
-   goto fail;
 
/* TSC frequency = maximum resolved freq * maximum resolved bus ratio */
res = freq * ratio;
-   pr_info("TSC runs at %lu KHz\n", res);
 
 #ifdef CONFIG_X86_LOCAL_APIC
lapic_timer_frequency = (freq * 1000) / HZ;
-   pr_info("lapic_timer_frequency = %d\n", lapic_timer_frequency);
 #endif
return res;
-
-fail:
-   pr_warn("Fast TSC calibration using MSR failed\n");
-   return 0;
 }
-- 
2.8.0.rc4.16.g56331f8



[PATCH 7/9] x86 tsc_msr: Remove irqoff around MSR-based TSC enumeration

2016-03-30 Thread Len Brown
From: Len Brown 

Remove the irqoff/irqon around MSR-based TSC enumeration,
as it is not necessary.

Also rename: try_msr_calibrate_tsc() to cpu_khz_from_msr(),
as that better describes what the routine does.

Signed-off-by: Len Brown 
---
 arch/x86/include/asm/tsc.h | 3 +--
 arch/x86/kernel/tsc.c  | 5 +
 arch/x86/kernel/tsc_msr.c  | 2 +-
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 174c421..d634f2a 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -52,7 +52,6 @@ extern int notsc_setup(char *);
 extern void tsc_save_sched_clock_state(void);
 extern void tsc_restore_sched_clock_state(void);
 
-/* MSR based TSC calibration for Intel Atom SoC platforms */
-unsigned long try_msr_calibrate_tsc(void);
+unsigned long cpu_khz_from_msr(void);
 
 #endif /* _ASM_X86_TSC_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index c9c4c7c..0ffb57f 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -674,10 +674,7 @@ unsigned long native_calibrate_tsc(void)
unsigned long flags, latch, ms, fast_calibrate;
int hpet = is_hpet_enabled(), i, loopmin;
 
-   /* Calibrate TSC using MSR for Intel Atom SoCs */
-   local_irq_save(flags);
-   fast_calibrate = try_msr_calibrate_tsc();
-   local_irq_restore(flags);
+   fast_calibrate = cpu_khz_from_msr();
if (fast_calibrate)
return fast_calibrate;
 
diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index c8ea977..b0839c5 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -68,7 +68,7 @@ static int match_cpu(u8 family, u8 model)
  * Set global "lapic_timer_frequency" to bus_clock_cycles/jiffy
  * Return processor base frequency in KHz, or 0 on failure.
  */
-unsigned long try_msr_calibrate_tsc(void)
+unsigned long cpu_khz_from_msr(void)
 {
u32 lo, hi, ratio, freq_id, freq;
unsigned long res;
-- 
2.8.0.rc4.16.g56331f8



[PATCH 6/9] x86 tsc_msr: Extend to include Intel Core Architecture

2016-03-30 Thread Len Brown
From: Len Brown 

tsc_msr is used to quickly and reliably
enumerate the CPU/TSC frequencies at boot time
for the Intel Atom Architecture.

Extend tsc_msr to include recent Intel Core Architecture.

As this code discovers BCLK, it also sets lapic_timer_frequency,
which allows LAPIC timer calibration to be skipped,
though it is already skipped on systems with a TSC deadline timer.

Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 49 +++
 1 file changed, 41 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 59c371e..c8ea977 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -77,23 +77,56 @@ unsigned long try_msr_calibrate_tsc(void)
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
return 0;
 
+   /*
+* 100 MHz BCLK Core Architecture -- before SKL.
+* De-rate 100Mhz by about 0.25% to account
+* for the average effect of spread-spectrum clocking.
+*/
+   switch (boot_cpu_data.x86_model) {
+
+   case 0x2A:  /* SNB */
+   case 0x3A:  /* IVB */
+   freq = 99773;
+   goto get_ratio;
+   case 0x2D:  /* SNB Xeon */
+   case 0x3E:  /* IVB Xeon */
+   freq = 99760;
+   goto get_ratio;
+   case 0x3C:  /* HSW */
+   case 0x3F:  /* HSW */
+   case 0x45:  /* HSW */
+   case 0x46:  /* HSW */
+   case 0x3D:  /* BDW */
+   case 0x47:  /* BDW */
+   case 0x4F:  /* BDX */
+   case 0x56:  /* BDX-DE */
+   freq = 99769;
+   goto get_ratio;
+   }
+
+   /*
+* Atom Architecture
+*/
cpu_index = match_cpu(boot_cpu_data.x86, boot_cpu_data.x86_model);
if (cpu_index < 0)
return 0;
 
-   if (freq_desc_tables[cpu_index].msr_plat) {
-   rdmsr(MSR_PLATFORM_INFO, lo, hi);
-   ratio = (lo >> 8) & 0x1f;
-   } else {
-   rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
-   ratio = (hi >> 8) & 0x1f;
-   }
-
/* Get FSB FREQ ID */
rdmsr(MSR_FSB_FREQ, lo, hi);
freq_id = lo & 0x7;
freq = id_to_freq(cpu_index, freq_id);
 
+   if (!freq_desc_tables[cpu_index].msr_plat) {
+   rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
+   ratio = (hi >> 8) & 0x1f;
+   goto done;
+   }
+
+get_ratio:
+   rdmsr(MSR_PLATFORM_INFO, lo, hi);
+   ratio = (lo >> 8) & 0x1f;
+
+done:
/* TSC frequency = maximum resolved freq * maximum resolved bus ratio */
res = freq * ratio;
 
-- 
2.8.0.rc4.16.g56331f8
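
A worked example of the Core-architecture path added above: the de-rated
99773 kHz BCLK (SNB/IVB) multiplied by the maximum non-turbo ratio from
MSR_PLATFORM_INFO[15:8]. The ratio of 34 is an assumed example, not something
taken from the patch:

#include <stdio.h>

int main(void)
{
	unsigned long bclk_khz = 99773;	/* SNB/IVB BCLK, de-rated for SSC */
	unsigned int ratio = 34;	/* assumed MSR_PLATFORM_INFO[15:8] value */
	unsigned long res = bclk_khz * ratio;

	/* TSC frequency = maximum resolved freq * maximum resolved bus ratio */
	printf("TSC = %lu kHz (~%.2f GHz)\n", res, res / 1e6);
	return 0;
}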



[PATCH 4/9] x86 tsc_msr: Correct Silvermont reference clock values

2016-03-30 Thread Len Brown
From: Len Brown 

Atom processors use a 19.2 MHz crystal oscillator.

Early processors generate 100 MHz via 19.2 MHz * 26 / 5 = 99.84 MHz.

Later processors generate 100 MHz via 19.2 MHz * 125 / 24 = 100 MHz.

Update the Silvermont-based tables accordingly,
matching the Software Developers Manual.

Also, correct a 166 MHz entry that should have been 116 MHz,
and add a missing 80 MHz entry.

Reported-by: Stephane Gasparini 
Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc_msr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
index 3a866bc..19f2a9a 100644
--- a/arch/x86/kernel/tsc_msr.c
+++ b/arch/x86/kernel/tsc_msr.c
@@ -35,11 +35,11 @@ static struct freq_desc freq_desc_tables[] = {
/* CLV+ */
{ 6, 0x35, 0, { 0, 133200, 0, 0, 0, 99840, 0, 83200 } },
/* TNG - Intel Atom processor Z3400 series */
-   { 6, 0x4a, 1, { 0, 99840, 133200, 0, 0, 0, 0, 0 } },
+   { 6, 0x4a, 1, { 0, 100000, 133300, 0, 0, 0, 0, 0 } },
/* VLV2 - Intel Atom processor E3000, Z3600, Z3700 series */
-   { 6, 0x37, 1, { 83200, 99840, 133200, 166400, 0, 0, 0, 0 } },
+   { 6, 0x37, 1, { 83300, 100000, 133300, 116700, 80000, 0, 0, 0 } },
/* ANN - Intel Atom processor Z3500 series */
-   { 6, 0x5a, 1, { 83200, 99840, 133200, 99840, 0, 0, 0, 0 } },
+   { 6, 0x5a, 1, { 83300, 100000, 133300, 100000, 0, 0, 0, 0 } },
 };
 
 static int match_cpu(u8 family, u8 model)
-- 
2.8.0.rc4.16.g56331f8
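
The crystal arithmetic quoted above can be reproduced in two lines, which
makes the 99.84 MHz vs 100 MHz distinction easy to verify:

#include <stdio.h>

int main(void)
{
	double xtal = 19.2;	/* MHz crystal oscillator */

	printf("early parts: %.2f MHz\n", xtal * 26 / 5);	/* 99.84 */
	printf("later parts: %.2f MHz\n", xtal * 125 / 24);	/* 100.00 */
	return 0;
}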



[PATCH 9/9] x86 tsc: enumerate BXT tsc_khz via CPUID

2016-03-30 Thread Len Brown
From: Bin Gao 

Hard code the BXT crystal clock (aka ART - Always Running Timer)
to 19.200 MHz, and use CPUID leaf 0x15 to determine the BXT TSC frequency.

Use tsc_khz to sanity check BXT cpu_khz,
which can be erroneous in some configurations.

Signed-off-by: Bin Gao 
[lenb: simplified]
Signed-off-by: Len Brown 
---
 arch/x86/kernel/tsc.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index ca41c30..64dc998 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -693,7 +693,11 @@ unsigned long native_calibrate_tsc(void)
switch (boot_cpu_data.x86_model) {
case 0x4E:  /* SKL */
case 0x5E:  /* SKL */
-   crystal_khz = 24000;/* 24 MHz */
+   crystal_khz = 24000;/* 24.0 MHz */
+   break;
+   case 0x5C:  /* BXT */
+   crystal_khz = 19200;/* 19.2 MHz */
+   break;
}
}
 
@@ -891,8 +895,12 @@ int recalibrate_cpu_khz(void)
if (cpu_has_tsc) {
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
+
if (tsc_khz == 0)
tsc_khz = cpu_khz;
+   else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
+   cpu_khz = tsc_khz;
+
cpu_data(0).loops_per_jiffy =
cpufreq_scale(cpu_data(0).loops_per_jiffy,
cpu_khz_old, cpu_khz);
@@ -1305,8 +1313,16 @@ void __init tsc_init(void)
 
cpu_khz = x86_platform.calibrate_cpu();
tsc_khz = x86_platform.calibrate_tsc();
+
+   /*
+* Trust non-zero tsc_khz as authorative,
+* and use it to sanity check cpu_khz,
+* which will be off if system timer is off.
+*/
if (tsc_khz == 0)
tsc_khz = cpu_khz;
+   else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
+   cpu_khz = tsc_khz;
 
if (!tsc_khz) {
mark_tsc_unstable("could not calculate TSC khz");
-- 
2.8.0.rc4.16.g56331f8
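
The new abs(cpu_khz - tsc_khz) * 10 > tsc_khz test accepts the calibrated
cpu_khz only while it is within 10% of a non-zero tsc_khz; otherwise tsc_khz
wins. A standalone sketch of that check, with made-up frequencies:

#include <stdio.h>

static unsigned long sanity_check_cpu_khz(unsigned long cpu_khz,
					  unsigned long tsc_khz)
{
	unsigned long diff;

	if (tsc_khz == 0)
		return cpu_khz;
	diff = cpu_khz > tsc_khz ? cpu_khz - tsc_khz : tsc_khz - cpu_khz;
	if (diff * 10 > tsc_khz)
		return tsc_khz;	/* cpu_khz off by more than 10%: override it */
	return cpu_khz;
}

int main(void)
{
	/* timer calibration says 1.7 GHz, CPUID says 1.9 GHz -> overridden */
	printf("%lu\n", sanity_check_cpu_khz(1700000, 1900000));
	/* within 2% -> cpu_khz kept as-is */
	printf("%lu\n", sanity_check_cpu_khz(1862000, 1900000));
	return 0;
}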



[PATCH 0/9] x86: TSC calibration update

2016-03-30 Thread Len Brown
cpu_khz and tsc_khz initialization can be unreliable and expensive.
They are initialized in tsc_init()/native_calibrate_tsc(), which prints:

pr_info("Detected %lu.%03lu MHz processor\n", cpu_khz...)

native_calibrate_cpu() first tries quick_pit_calibrate(),
which can take over 50.0M cycles to succeed,
or as few as 2.4M cycles to fail.

On failure, pit_calibrate_tsc() is attempted, which can succeed
in as few as 20M cycles, but may consume over 240M cycles
before it fails.

By comparison, on many processors, tsc frequency can be discovered by
table and MSR or CPUID in under 0.002M cycles.

Subsequently tsc_refine_calibration_work() checks our work,
but it takes under 0.004M cycles.

pr_info("Refined TSC clocksource calibration: %lu.%03lu MHz\n", tsc_khz...)

Finally, CPU and TSC frequency are not guaranteed to be identical,
and this series allows cpu_khz and tsc_khz to differ
within a few percent.

cheers,
Len Brown, Intel Open Source Technology Center

this patch set is also available via git:

git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git x86


Bin Gao (1):
  x86 tsc: enumerate BXT tsc_khz via CPUID

Len Brown (8):
  x86 tsc_msr: Identify Intel-specific code
  x86 tsc_msr: Remove debugging messages
  x86 tsc_msr: Update comments, expand definitions
  x86 tsc_msr: Correct Silvermont reference clock values
  x86 tsc_msr: Add Airmont reference clock values
  x86 tsc_msr: Extend to include Intel Core Architecture
  x86 tsc_msr: Remove irqoff around MSR-based TSC enumeration
  x86 tsc: enumerate SKL cpu_khz and tsc_khz via CPUID

 arch/x86/include/asm/tsc.h  |   4 +-
 arch/x86/include/asm/x86_init.h |   4 +-
 arch/x86/kernel/tsc.c   |  96 ++
 arch/x86/kernel/tsc_msr.c   | 112 ++--
 arch/x86/kernel/x86_init.c  |   1 +
 5 files changed, 152 insertions(+), 65 deletions(-)
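
To put the cycle counts above into wall-clock terms (assuming, purely for
illustration, a 2 GHz clock; the cover letter does not state one):

#include <stdio.h>

int main(void)
{
	double ghz = 2.0;	/* assumed clock, for illustration only */
	double cycles[] = { 50e6, 2.4e6, 240e6, 0.002e6, 0.004e6 };
	const char *what[] = {
		"quick_pit_calibrate() success",
		"quick_pit_calibrate() failure",
		"pit_calibrate_tsc() worst case",
		"MSR/CPUID enumeration",
		"tsc_refine_calibration_work()",
	};
	int i;

	for (i = 0; i < 5; i++)
		printf("%-32s ~%9.4f ms\n", what[i],
		       cycles[i] / (ghz * 1e9) * 1e3);
	return 0;
}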



[PATCH 8/9] x86 tsc: enumerate SKL cpu_khz and tsc_khz via CPUID

2016-03-30 Thread Len Brown
From: Len Brown 

Skylake CPU base-frequency and TSC frequency may differ
by up to 2%.

Enumerate CPU and TSC frequencies separately, allowing
cpu_khz and tsc_khz to differ.

The existing CPU frequency calibration mechanism is unchanged.
However, CPUID extensions are preferred, when available.

CPUID.0x16 is preferred over MSR and timer calibration
for CPU frequency discovery.

CPUID.0x15 takes precedence over CPU-frequency
for TSC frequency discovery.

Signed-off-by: Len Brown 
---
 arch/x86/include/asm/tsc.h  |  1 +
 arch/x86/include/asm/x86_init.h |  4 ++-
 arch/x86/kernel/tsc.c   | 75 +
 arch/x86/kernel/x86_init.c  |  1 +
 4 files changed, 73 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index d634f2a..c279502 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -36,6 +36,7 @@ extern void mark_tsc_unstable(char *reason);
 extern int unsynchronized_tsc(void);
 extern int check_tsc_unstable(void);
 extern int check_tsc_disabled(void);
+extern unsigned long native_calibrate_cpu(void);
 extern unsigned long native_calibrate_tsc(void);
 extern unsigned long long native_sched_clock_from_tsc(u64 tsc);
 
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 1ae89a2..2e5c84d 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -143,7 +143,8 @@ struct timespec;
 
 /**
  * struct x86_platform_ops - platform specific runtime functions
- * @calibrate_tsc: calibrate TSC
+ * @calibrate_cpu: calibrate CPU
+ * @calibrate_tsc: calibrate TSC, if different from CPU
  * @get_wallclock: get time from HW clock like RTC etc.
  * @set_wallclock: set time back to HW clock
  * @is_untracked_pat_range exclude from PAT logic
@@ -154,6 +155,7 @@ struct timespec;
  * @apic_post_init:adjust apic if neeeded
  */
 struct x86_platform_ops {
+   unsigned long (*calibrate_cpu)(void);
unsigned long (*calibrate_tsc)(void);
void (*get_wallclock)(struct timespec *ts);
int (*set_wallclock)(const struct timespec *ts);
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 0ffb57f..ca41c30 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -239,7 +239,7 @@ static inline unsigned long long cycles_2_ns(unsigned long 
long cyc)
return ns;
 }
 
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+static void set_cyc2ns_scale(unsigned long khz, int cpu)
 {
unsigned long long tsc_now, ns_now;
struct cyc2ns_data *data;
@@ -248,7 +248,7 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
local_irq_save(flags);
sched_clock_idle_sleep_event();
 
-   if (!cpu_khz)
+   if (!khz)
goto done;
 
data = cyc2ns_write_begin(cpu);
@@ -261,7 +261,7 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
 * time function is continuous; see the comment near struct
 * cyc2ns_data.
 */
-   clocks_calc_mult_shift(&data->cyc2ns_mul, &data->cyc2ns_shift, cpu_khz,
+   clocks_calc_mult_shift(&data->cyc2ns_mul, &data->cyc2ns_shift, khz,
   NSEC_PER_MSEC, 0);
 
/*
@@ -665,15 +665,72 @@ success:
 }
 
 /**
- * native_calibrate_tsc - calibrate the tsc on boot
+ * native_calibrate_tsc
+ * Determine TSC frequency via CPUID, else return 0.
  */
 unsigned long native_calibrate_tsc(void)
 {
+   unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
+   unsigned int crystal_khz;
+
+   if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+   return 0;
+
+   if (boot_cpu_data.cpuid_level < 0x15)
+   return 0;
+
+   eax_denominator = ebx_numerator = ecx_hz = edx = 0;
+
+   /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+   cpuid(0x15, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
+
+   if (ebx_numerator == 0 || eax_denominator == 0)
+   return 0;
+
+   crystal_khz = ecx_hz / 1000;
+
+   if (crystal_khz == 0) {
+   switch (boot_cpu_data.x86_model) {
+   case 0x4E:  /* SKL */
+   case 0x5E:  /* SKL */
+   crystal_khz = 24000;/* 24 MHz */
+   }
+   }
+
+   return crystal_khz * ebx_numerator / eax_denominator;
+}
+
+static unsigned long cpu_khz_from_cpuid(void)
+{
+   unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
+
+   if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+   return 0;
+
+   if (boot_cpu_data.cpuid_level < 0x16)
+   return 0;
+
+   eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
+
+   cpuid(0x16, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
+
+   return eax_base_mhz * 1000;
+}
+
+/**
+ * native_calibrate_cpu - calibrate the cpu on boot
+ */
+unsigned long native_calibrate_cpu(void)
+{
u64 
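
The arithmetic the two new helpers perform is TSC kHz = crystal kHz *
CPUID.15H:EBX / CPUID.15H:EAX (falling back to a hard-coded 24 MHz crystal on
SKL when ECX is zero), and cpu_khz = CPUID.16H:EAX * 1000. A standalone sketch
fed with example register values rather than a live CPUID; the SKL-like
numbers in main() are only illustrative:

#include <stdio.h>

/* CPUID.15H: TSC Hz = crystal Hz * EBX / EAX; ECX (crystal Hz) may be 0 */
static unsigned long tsc_khz_from_cpuid_15h(unsigned int eax_denominator,
					    unsigned int ebx_numerator,
					    unsigned int ecx_hz,
					    unsigned int fallback_crystal_khz)
{
	unsigned int crystal_khz = ecx_hz / 1000;

	if (!eax_denominator || !ebx_numerator)
		return 0;
	if (!crystal_khz)
		crystal_khz = fallback_crystal_khz;	/* e.g. 24000 on SKL */
	return (unsigned long)crystal_khz * ebx_numerator / eax_denominator;
}

/* CPUID.16H: EAX reports the processor base frequency directly, in MHz */
static unsigned long cpu_khz_from_cpuid_16h(unsigned int eax_base_mhz)
{
	return (unsigned long)eax_base_mhz * 1000;
}

int main(void)
{
	/* illustrative values: 24 MHz crystal * 83/1, 2.0 GHz base frequency */
	printf("tsc_khz = %lu\n", tsc_khz_from_cpuid_15h(1, 83, 0, 24000));
	printf("cpu_khz = %lu\n", cpu_khz_from_cpuid_16h(2000));
	return 0;
}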

RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout

2016-03-30 Thread Rajesh Bhagat


> -Original Message-
> From: Mathias Nyman [mailto:mathias.ny...@linux.intel.com]
> Sent: Tuesday, March 29, 2016 10:51 PM
> To: Rajesh Bhagat 
> Cc: gre...@linuxfoundation.org; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Sriram Dash 
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 28.03.2016 09:13, Rajesh Bhagat wrote:
> >
> >
> >> -Original Message-
> >> From: Mathias Nyman [mailto:mathias.ny...@linux.intel.com]
> >> Sent: Wednesday, March 23, 2016 7:52 PM
> >> To: Rajesh Bhagat 
> >> Cc: gre...@linuxfoundation.org; linux-...@vger.kernel.org; linux-
> >> ker...@vger.kernel.org; Sriram Dash 
> >> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation
> >> due to XHCI commmand timeout
> >>
> >> On 23.03.2016 05:53, Rajesh Bhagat wrote:
> >>
> > IMO, The assumption that "xhci_abort_cmd_ring would always
> > generate an event and handle_cmd_completion would be called" will
> > not be always be true if HW
>  is in bad state.
> >
> > Please share your opinion.
> >
> 
>  writing the CA (command abort) bit in CRCR (command ring control
>  register)  will stop the command ring, and CRR (command ring
>  running) will be set
> >> to 0 by xHC.
>  xhci_abort_cmd_ring() polls this bit up to 5 seconds.
>  If it's not 0 then the driver considers the command abort as failed.
> 
>  The scenario you're thinking of is that xHC would still react to CA
>  bit set, it would stop the command ring and set CRR 0, but not send
>  a command
> >> completion event.
> 
>  Have you tried adding some debug to handle_cmd_completion() and see
>  if you receive any event after command abortion?
> 
> >>>
> >>> Yes. We have added debug prints at first line of
> >>> handle_cmd_completion, and we are not getting those prints. The last
> >>> print messages that we get are as below from xhci_alloc_dev while
> >>> resume
> >>> operation:
> >>>
> >>> xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> >>> Abort command ring
> >>>
> >>> May be somehow, USB controller is in bad state and not responding to
> >>> the
> >> commands.
> >>>
> >>> Please suggest how XHCI driver can handle such situations.
> >>>
> >>
> >> Restart the command timeout timer when writing the command abort bit.
> >> If we get the abort event the timer is deleted.
> >>
> >> Otherwise if the timeout triggers a second time we end up calling
> >> xhci_handle_command_timeout() with a stopped ring, This will call
> >> xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op,
> >> restart the command ring, and finally when the no-op completes it
> >> should call the missing completion.
> >>
> >> If command ring doesn't start then additional code could be added to
> >> xhci_handle_command_timeout() that clears the command ring if it is
> >> called a second time (=current command is already in abort state and
> >> command ring is stopped when entering xhci_handle_command_timeout)
> >>
> >> There might be some details missing, I'm not able to test any of
> >> this, but try something like this:
> >>
> >> diff --git a/drivers/usb/host/xhci-ring.c
> >> b/drivers/usb/host/xhci-ring.c index 3e1d24c..576819e 100644
> >> --- a/drivers/usb/host/xhci-ring.c
> >> +++ b/drivers/usb/host/xhci-ring.c
> >> @@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >>   xhci_halt(xhci);
> >>   return -ESHUTDOWN;
> >>   }
> >> -
> >> +   /* writing the CMD_RING_ABORT bit should create a command 
> >> completion
> >> +* event, add a command completion timeout for it as well
> >> +*/
> >> +   mod_timer(&xhci->cmd_timer, jiffies +
> >> + XHCI_CMD_DEFAULT_TIMEOUT);
> >>   return 0;
> >>}
> >
> > Hello Mathias,
> >
> > Thanks for the patch.
> >
> > After application of above patch, I'm getting following prints constantly:
> >
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> > ...
> > xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
> > Abort command ring xhci-hcd xhci-hcd.0.auto: Command timeout on
> > stopped ring xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000
> > to no-op xhci-hcd xhci-hcd.0.auto: // Ding dong!
> >
> > As expected, xhci_handle_command_timeout is called again and next time
> > ring state is __not__ CMD_RING_STATE_RUNNING, Hence
> > xhci_handle_stopped_cmd_ring is called which turn all the aborted
> > commands to no-ops and again makes the ring state as
> CMD_RING_STATE_RUNNING, and rings the door bell.
> >
> > But again in this case, no response from USB controller and
> > xhci_alloc_dev is still waiting for wait_for_completion.
> >
> 
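
To summarize the direction of the discussion in code-shaped form: none of the
names below are xhci symbols; it is only a driver-agnostic sketch of the
two-stage timeout idea (the first expiry aborts the ring and re-arms the
timer; a second expiry with no abort completion gives up instead of retrying
forever):

#include <stdio.h>

enum ring_state { RING_RUNNING, RING_ABORT_PENDING, RING_DEAD };

struct toy_ring {
	enum ring_state state;
};

/* Called on each timer expiry; a completion event would delete the timer. */
static void command_timeout(struct toy_ring *ring)
{
	switch (ring->state) {
	case RING_RUNNING:
		ring->state = RING_ABORT_PENDING;
		printf("first timeout: abort ring, re-arm timer\n");
		break;
	case RING_ABORT_PENDING:
		ring->state = RING_DEAD;
		printf("second timeout: no abort completion, give up\n");
		break;
	case RING_DEAD:
		break;
	}
}

int main(void)
{
	struct toy_ring ring = { RING_RUNNING };

	command_timeout(&ring);	/* the command never completed */
	command_timeout(&ring);	/* the abort never completed either */
	return 0;
}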

Re: [RFCv7 PATCH 00/10] sched: scheduler-driven CPU frequency selection

2016-03-30 Thread Yuyang Du
Hi Steve,

On Wed, Mar 30, 2016 at 06:35:23PM -0700, Steve Muckle wrote:
> This series was dropped in favor of Rafael's schedutil. But on the
> chance that you're still curious about the test setup used to quantify
> the series I'll explain below.
 
I will catch up and learn both.

> These results are meant to show how the governors perform across varying
> workload intensities and periodicities. Higher overhead (OH) numbers
> indicate that the completion times of each period of the workload were
> closer to what they would be when run at fmin (100% overhead would be as
> slow as fmin, 0% overhead would be as fast as fmax). And as described
> above, overruns (OR) indicate that the governor was not responsive
> enough to finish the work in each period of the workload.
> 
> These are just performance metrics so they only tell half the story.
> Power is not factored in at all.
> 
> This provides a quick sanity check that the governor under test (in this
> case, the now defunct schedfreq, or sched for short) performs similarly
> to two of the most commonly used governors, ondemand and interactive, in
> steady state periodic workloads. In the data above sched looks good for
> the most part with the second test case being the biggest exception.
 
Yes, it is indeed a quick sanity check.

Thanks,
Yuyang
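
One plausible way to formalize the OH metric as described above (this is a
reading of the description, not necessarily the exact formula the test
harness uses):

#include <stdio.h>

/* 0% when a period completes as fast as at fmax, 100% when as slow as fmin */
static double overhead_pct(double t_measured, double t_fmax, double t_fmin)
{
	if (t_fmin <= t_fmax)
		return 0.0;	/* degenerate case: no frequency range */
	return 100.0 * (t_measured - t_fmax) / (t_fmin - t_fmax);
}

int main(void)
{
	/* a period taking 6 ms when fmax would need 5 ms and fmin 15 ms */
	printf("OH = %.1f%%\n", overhead_pct(6.0, 5.0, 15.0));
	return 0;
}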


[PATCH RESEND v2 1/6] sched/fair: Generalize the load/util averages resolution definition

2016-03-30 Thread Yuyang Du
Integer metrics need fixed point arithmetic. In sched/fair, a few
metrics, e.g., weight, load, load_avg, util_avg, freq, and capacity,
may have different fixed point ranges, which makes their update and
usage error-prone.

In order to avoid errors relating to the fixed point range, we
define a basic fixed point range, and then formalize all metrics
based on that basic range.

The basic range is 1024 or (1 << 10). Further, one can recursively
apply the basic range to have larger range.

Pointed out by Ben Segall, weight (visible to user, e.g., NICE-0 has
1024) and load (e.g., NICE_0_LOAD) have independent ranges, but they
must be well calibrated.

Signed-off-by: Yuyang Du 
---
 include/linux/sched.h | 16 +---
 kernel/sched/fair.c   |  4 
 kernel/sched/sched.h  | 15 ++-
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c617ea1..54784d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -934,9 +934,19 @@ enum cpu_idle_type {
 };
 
 /*
+ * Integer metrics need fixed point arithmetic, e.g., sched/fair
+ * has a few: load, load_avg, util_avg, freq, and capacity.
+ *
+ * We define a basic fixed point arithmetic range, and then formalize
+ * all these metrics based on that basic range.
+ */
+# define SCHED_FIXEDPOINT_SHIFT10
+# define SCHED_FIXEDPOINT_SCALE(1L << SCHED_FIXEDPOINT_SHIFT)
+
+/*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT   10
+#define SCHED_CAPACITY_SHIFT   SCHED_FIXEDPOINT_SHIFT
 #define SCHED_CAPACITY_SCALE   (1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1202,8 +1212,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of 
time
+ * that a sched_entity is running on a CPU, in the range 
[0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 303d639..1d3fc01 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2609,10 +2609,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT 
!= 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e6d4a3f..15a89ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,18 +54,23 @@ static inline void update_cpu_load_active(struct rq 
*this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
under light load  */
-# define SCHED_LOAD_RESOLUTION 10
-# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
+# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
+# define scale_load_down(w)((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION 0
+# define SCHED_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w) (w)
 # define scale_load_down(w)(w)
 #endif
 
-#define SCHED_LOAD_SHIFT   (10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE   (1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOADSCHED_LOAD_SCALE
 #define NICE_0_SHIFT   SCHED_LOAD_SHIFT
 
-- 
2.1.4
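
A small standalone illustration of the convention this patch formalizes:
SCHED_FIXEDPOINT_SHIFT is 10, and in the (currently disabled) increased-range
configuration the user-visible weight and the internal load differ by exactly
one such shift, so scale_load(sched_prio_to_weight[20]) must equal
NICE_0_LOAD. This is a sketch of that relationship, not kernel code:

#include <stdio.h>

#define SCHED_FIXEDPOINT_SHIFT	10
#define SCHED_FIXEDPOINT_SCALE	(1L << SCHED_FIXEDPOINT_SHIFT)

/* increased-range variant: user-visible weight <-> internal load */
#define scale_load(w)		((w) << SCHED_FIXEDPOINT_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_FIXEDPOINT_SHIFT)

int main(void)
{
	long nice0_weight = 1024;			/* sched_prio_to_weight[20] */
	long nice0_load = scale_load(nice0_weight);	/* plays the role of NICE_0_LOAD */

	printf("weight %ld -> load %ld -> weight %ld\n",
	       nice0_weight, nice0_load, scale_load_down(nice0_load));
	return 0;
}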



[PATCH RESEND v2 4/6] sched/fair: Remove scale_load_down() for load_avg

2016-03-30 Thread Yuyang Du
Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
down of load does not make much sense, because load_avg is primarily THE
load and on top of that, we take runnable time into account.

We therefore remove scale_load_down() for load_avg. But we need to
carefully consider the overflow risk if load has higher range
(2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
to us is on 64bit kernel with increased load range. In that case,
the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
entities with the highest load (=88761*1024) always runnable on one
single cfs_rq, which may be an issue, but should be fine. Even if this
occurs at the end of day, on the condition where it occurs, the
load average will not be useful anyway.

Signed-off-by: Yuyang Du 
[update calculate_imbalance]
Signed-off-by: Vincent Guittot 
---
 include/linux/sched.h | 19 ++-
 kernel/sched/fair.c   | 19 +--
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index db3c6e1..8df6d69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1213,7 +1213,7 @@ struct load_weight {
  *
  * [load_avg definition]
  *
- * load_avg = runnable% * scale_load_down(load)
+ * load_avg = runnable% * load
  *
  * where runnable% is the time ratio that a sched_entity is runnable.
  * For cfs_rq, it is the aggregated such load_avg of all runnable and
@@ -1221,7 +1221,7 @@ struct load_weight {
  *
  * load_avg may also take frequency scaling into account:
  *
- * load_avg = runnable% * scale_load_down(load) * freq%
+ * load_avg = runnable% * load * freq%
  *
  * where freq% is the CPU frequency normalize to the highest frequency
  *
@@ -1247,9 +1247,18 @@ struct load_weight {
  *
  * [Overflow issue]
  *
- * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
- * with the highest load (=88761) always runnable on a single cfs_rq, we
- * should not overflow as the number already hits PID_MAX_LIMIT.
+ * On 64bit kernel:
+ *
+ * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the
+ * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with
+ * the highest load (=88761) always runnable on a cfs_rq, we should
+ * not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT),
+ * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks
+ * with the highest load (=88761*1024) always runnable on ONE cfs_rq,
+ * we should be fine. Even if the overflow occurs at the end of day,
+ * at the time the load_avg won't be useful anyway in that situation.
  *
  * For all other cases (including 32bit kernel), struct load_weight's
  * weight will overflow first before we do, because:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf835b5..da6642f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 * will definitely be update (after enqueue).
 */
sa->period_contrib = 1023;
-   sa->load_avg = scale_load_down(se->load.weight);
+   sa->load_avg = se->load.weight;
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
sa->util_avg = SCHED_CAPACITY_SCALE;
sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
@@ -2837,7 +2837,7 @@ static inline int update_cfs_rq_load_avg(u64 now, struct 
cfs_rq *cfs_rq)
}
 
decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
-   scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, 
cfs_rq);
+   cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq);
 
 #ifndef CONFIG_64BIT
smp_wmb();
@@ -2858,8 +2858,7 @@ static inline void update_load_avg(struct sched_entity 
*se, int update_tg)
 * Track task load average for carrying it to new CPU after migrated, 
and
 * track group sched_entity load average for task_h_load calc in 
migration
 */
-   __update_load_avg(now, cpu, >avg,
- se->on_rq * scale_load_down(se->load.weight),
+   __update_load_avg(now, cpu, >avg, se->on_rq * se->load.weight,
  cfs_rq->curr == se, NULL);
 
if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
@@ -2896,7 +2895,7 @@ skip_aging:
 static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity 
*se)
 {
__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
- >avg, se->on_rq * 
scale_load_down(se->load.weight),
+ >avg, se->on_rq * se->load.weight,
  cfs_rq->curr == se, NULL);
 
cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - 
se->avg.load_avg, 0);
@@ -2916,7 +2915,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
migrated = !sa->last_update_time;
if (!migrated) {
   

[PATCH RESEND v2 5/6] sched/fair: Rename scale_load() and scale_load_down()

2016-03-30 Thread Yuyang Du
Rename scale_load() and scale_load_down() to user_to_kernel_load()
and kernel_to_user_load() respectively, so that the names convey
what they are really about.
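
A minimal stand-alone sketch (not from the patch) of the conversion the new
names describe, using the increased-range variant of the macros and assuming
SCHED_FIXEDPOINT_SHIFT is 10:

    #include <assert.h>

    #define SCHED_FIXEDPOINT_SHIFT   10
    #define user_to_kernel_load(w)   ((w) << SCHED_FIXEDPOINT_SHIFT)
    #define kernel_to_user_load(w)   ((w) >> SCHED_FIXEDPOINT_SHIFT)

    int main(void)
    {
        /* The user-visible nice-0 weight (1024) becomes a kernel load
         * of 1048576 and converts back losslessly. */
        assert(user_to_kernel_load(1024) == (1024 << 10));
        assert(kernel_to_user_load(user_to_kernel_load(1024)) == 1024);
        return 0;
    }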

Signed-off-by: Yuyang Du 
[update calculate_imbalance]
Signed-off-by: Vincent Guittot 

Signed-off-by: Yuyang Du 
---
 kernel/sched/core.c  |  8 
 kernel/sched/fair.c  | 14 --
 kernel/sched/sched.h | 16 
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b21e7a..81c876e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -677,12 +677,12 @@ static void set_load_weight(struct task_struct *p)
 * SCHED_IDLE tasks get minimal weight:
 */
if (idle_policy(p->policy)) {
-   load->weight = scale_load(WEIGHT_IDLEPRIO);
+   load->weight = user_to_kernel_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
return;
}
 
-   load->weight = scale_load(sched_prio_to_weight[prio]);
+   load->weight = user_to_kernel_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
 }
 
@@ -8110,7 +8110,7 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 static int cpu_shares_write_u64(struct cgroup_subsys_state *css,
struct cftype *cftype, u64 shareval)
 {
-   return sched_group_set_shares(css_tg(css), scale_load(shareval));
+   return sched_group_set_shares(css_tg(css), 
user_to_kernel_load(shareval));
 }
 
 static u64 cpu_shares_read_u64(struct cgroup_subsys_state *css,
@@ -8118,7 +8118,7 @@ static u64 cpu_shares_read_u64(struct cgroup_subsys_state 
*css,
 {
struct task_group *tg = css_tg(css);
 
-   return (u64) scale_load_down(tg->shares);
+   return (u64) kernel_to_user_load(tg->shares);
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da6642f..bcf1027 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
if (likely(lw->inv_weight))
return;
 
-   w = scale_load_down(lw->weight);
+   w = kernel_to_user_load(lw->weight);
 
if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
lw->inv_weight = 1;
@@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
  */
 static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct 
load_weight *lw)
 {
-   u64 fact = scale_load_down(weight);
+   u64 fact = kernel_to_user_load(weight);
int shift = WMULT_SHIFT;
 
__update_inv_weight(lw);
@@ -6875,10 +6875,11 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
 */
if (busiest->group_type == group_overloaded &&
local->group_type   == group_overloaded) {
+   unsigned long min_cpu_load =
+   kernel_to_user_load(NICE_0_LOAD) * 
busiest->group_capacity;
load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD;
-   if (load_above_capacity > scale_load(busiest->group_capacity))
-   load_above_capacity -=
-   scale_load(busiest->group_capacity);
+   if (load_above_capacity > min_cpu_load)
+   load_above_capacity -= min_cpu_load;
else
load_above_capacity = ~0UL;
}
@@ -8432,7 +8433,8 @@ int sched_group_set_shares(struct task_group *tg, 
unsigned long shares)
if (!tg->se[0])
return -EINVAL;
 
-   shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
+   shares = clamp(shares, user_to_kernel_load(MIN_SHARES),
+  user_to_kernel_load(MAX_SHARES));
 
mutex_lock(_mutex);
if (tg->shares == shares)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 94ba652..ebe16e3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,22 +55,22 @@ static inline void update_cpu_load_active(struct rq 
*this_rq) { }
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
under light load  */
 # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
-# define scale_load_down(w)((w) >> SCHED_FIXEDPOINT_SHIFT)
+# define user_to_kernel_load(w)((w) << SCHED_FIXEDPOINT_SHIFT)
+# define kernel_to_user_load(w)((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
 # define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
-# define scale_load(w) (w)
-# define scale_load_down(w)(w)
+# define user_to_kernel_load(w)(w)
+# define kernel_to_user_load(w)(w)
 #endif
 
 /*
  * Task weight (visible to user) and its load (invisible to user) 

[PATCH RESEND v2 6/6] sched/fair: Remove unconditionally inactive code

2016-03-30 Thread Yuyang Du
The increased load resolution (fixed point arithmetic range) is
unconditionally deactivated with #if 0, so it is effectively broken.

But the increased load range is still used in some places (e.g., at Google),
so we keep this feature. The reconciliation is that we define
CONFIG_CFS_INCREASE_LOAD_RANGE, which depends on FAIR_GROUP_SCHED,
64BIT, and BROKEN.

Suggested-by: Ingo Molnar 
Signed-off-by: Yuyang Du 
---
 init/Kconfig | 16 +++
 kernel/sched/sched.h | 55 +---
 2 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 2232080..d072c09 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1025,6 +1025,22 @@ config CFS_BANDWIDTH
  restriction.
  See tip/Documentation/scheduler/sched-bwc.txt for more information.
 
+config CFS_INCREASE_LOAD_RANGE
+   bool "Increase kernel load range"
+   depends on 64BIT && BROKEN
+   default n
+   help
+ Increase resolution of nice-level calculations for 64-bit 
architectures.
+ The extra resolution improves shares distribution and load balancing 
of
+ low-weight task groups (eg. nice +19 on an autogroup), deeper 
taskgroup
+ hierarchies, especially on larger systems. This is not a user-visible 
change
+ and does not change the user-interface for setting shares/weights.
+ We increase resolution only if we have enough bits to allow this 
increased
+ resolution (i.e. BITS_PER_LONG > 32). The costs for increasing 
resolution
+ when BITS_PER_LONG <= 32 are pretty high and the returns do not 
justify the
+ increased costs.
+ Currently broken: it increases power usage under light load.
+
 config RT_GROUP_SCHED
bool "Group scheduling for SCHED_RR/FIFO"
depends on CGROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ebe16e3..1bb0d69 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -42,39 +42,6 @@ static inline void update_cpu_load_active(struct rq 
*this_rq) { }
 #define NS_TO_JIFFIES(TIME)((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
 /*
- * Increase resolution of nice-level calculations for 64-bit architectures.
- * The extra resolution improves shares distribution and load balancing of
- * low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
- * hierarchies, especially on larger systems. This is not a user-visible change
- * and does not change the user-interface for setting shares/weights.
- *
- * We increase resolution only if we have enough bits to allow this increased
- * resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
- * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
- * increased costs.
- */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
under light load  */
-# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)((w) << SCHED_FIXEDPOINT_SHIFT)
-# define kernel_to_user_load(w)((w) >> SCHED_FIXEDPOINT_SHIFT)
-#else
-# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
-# define user_to_kernel_load(w)(w)
-# define kernel_to_user_load(w)(w)
-#endif
-
-/*
- * Task weight (visible to user) and its load (invisible to user) have
- * independent resolution, but they should be well calibrated. We use
- * user_to_kernel_load() and kernel_to_user_load(w) to convert between
- * them. The following must be true:
- *
- * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == 
NICE_0_LOAD
- * kernel_to_user_load(NICE_0_LOAD) == 
sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
- */
-#define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT)
-
-/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -1150,6 +1117,28 @@ extern const int sched_prio_to_weight[40];
 extern const u32 sched_prio_to_wmult[40];
 
 /*
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent ranges, but they should be well calibrated. We use
+ * user_to_kernel_load() and kernel_to_user_load(w) to convert between
+ * them.
+ *
+ * The following must also be true:
+ * user_to_kernel_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == 
NICE_0_LOAD
+ * kernel_to_user_load(NICE_0_LOAD) == 
sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]
+ */
+#ifdef CONFIG_CFS_INCREASE_LOAD_RANGE
+#define NICE_0_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w) (w << SCHED_FIXEDPOINT_SHIFT)
+#define kernel_to_user_load(w) (w >> SCHED_FIXEDPOINT_SHIFT)
+#else
+#define NICE_0_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT)
+#define user_to_kernel_load(w) (w)
+#define kernel_to_user_load(w) (w)
+#endif
+
+#define NICE_0_LOAD(1UL << NICE_0_LOAD_SHIFT)
+
+/*
  * {de,en}queue flags:
  *
  * 

[PATCH RESEND v2 3/6] sched/fair: Add introduction to the sched load avg metrics

2016-03-30 Thread Yuyang Du
These sched metrics have become complex enough. We therefore document
them at their definition.
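
As a concrete illustration of the definitions documented by this patch
(the numbers below are examples, not from the patch):

    /*
     * A nice-0 task (load == NICE_0_LOAD == 1024) that is runnable 50% of
     * the time contributes:
     *         load_avg = runnable% * load = 0.5 * 1024 = 512
     *
     * A task running 25% of the time on a CPU at full frequency and
     * capacity contributes:
     *         util_avg = running% * SCHED_CAPACITY_SCALE = 0.25 * 1024 = 256
     */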

Signed-off-by: Yuyang Du 
---
 include/linux/sched.h | 60 +--
 1 file changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 54784d0..db3c6e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1208,18 +1208,56 @@ struct load_weight {
 };
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors frequency scaling into the amount of time that a
- * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
- * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu capacity scaling into the amount of 
time
- * that a sched_entity is running on a CPU, in the range 
[0..SCHED_CAPACITY_SCALE].
- * For cfs_rq, it is the aggregated such times of all runnable and
+ * The load_avg/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg() in kernel/sched/fair.c).
+ *
+ * [load_avg definition]
+ *
+ * load_avg = runnable% * scale_load_down(load)
+ *
+ * where runnable% is the time ratio that a sched_entity is runnable.
+ * For cfs_rq, it is the aggregated such load_avg of all runnable and
  * blocked sched_entities.
- * The 64 bit load_sum can:
- * 1) for cfs_rq, afford 4353082796 (=2^64/47742/88761) entities with
- * the highest weight (=88761) always runnable, we should not overflow
- * 2) for entity, support any load.weight always runnable
+ *
+ * load_avg may also take frequency scaling into account:
+ *
+ * load_avg = runnable% * scale_load_down(load) * freq%
+ *
+ * where freq% is the CPU frequency normalize to the highest frequency
+ *
+ * [util_avg definition]
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE
+ *
+ * where running% is the time ratio that a sched_entity is running on
+ * a CPU. For cfs_rq, it is the aggregated such util_avg of all runnable
+ * and blocked sched_entities.
+ *
+ * util_avg may also factor frequency scaling and CPU capacity scaling:
+ *
+ * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
+ *
+ * where freq% is the same as above, and capacity% is the CPU capacity
+ * normalized to the greatest capacity (due to uarch differences, etc).
+ *
+ * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
+ * themselves are in the range of [0, 1]. To do fixed point arithmetic,
+ * we therefore scale them to as large range as necessary. This is for
+ * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ *
+ * [Overflow issue]
+ *
+ * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities
+ * with the highest load (=88761) always runnable on a single cfs_rq, we
+ * should not overflow as the number already hits PID_MAX_LIMIT.
+ *
+ * For all other cases (including 32bit kernel), struct load_weight's
+ * weight will overflow first before we do, because:
+ *
+ *Max(load_avg) <= Max(load.weight)
+ *
+ * Then, it is the load_weight's responsibility to consider overflow
+ * issues.
  */
 struct sched_avg {
u64 last_update_time, load_sum;
-- 
2.1.4



[PATCH RESEND v2 0/6] sched/fair: Clean up sched metric definitions

2016-03-30 Thread Yuyang Du
Hi Peter,

This patch series was left over from last year, so I am resending it. Would
you please give it a look?

The previous version is at http://thread.gmane.org/gmane.linux.kernel/2068513

This series cleans up the sched metrics, changes include:
(1) Define SCHED_FIXEDPOINT_SHIFT for all fixed point arithmetic scaling.
(2) Get rid of the confusing scaling factors SCHED_LOAD_SHIFT and
SCHED_LOAD_SCALE, and thus leave only NICE_0_LOAD (for load) and
SCHED_CAPACITY_SCALE (for util); see the sketch after this list.
(3) Consistently use SCHED_CAPACITY_SCALE for util-related code.
(4) Add more detailed introduction to the sched metrics.
(5) Get rid of unnecessary extra scaling up and down for load.
(6) Rename the mappings between priority (user) and load (kernel).
(7) Remove/replace inactive code.
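
For readers new to these scaling factors, here is a minimal sketch of what
items (1)-(3) boil down to after the series (illustrative values only,
assumed from the patches; not part of the series itself):

    /* One fixed-point shift for everything (patch 1/6) */
    #define SCHED_FIXEDPOINT_SHIFT  10

    /* Load keeps its own name: NICE_0_LOAD == 1024 with the default range
     * (patch 2/6), and util consistently uses SCHED_CAPACITY_SCALE == 1024
     * (patch 3/6), so the util_avg of a fully busy CPU saturates at 1024. */
    #define NICE_0_LOAD             (1L << SCHED_FIXEDPOINT_SHIFT)
    #define SCHED_CAPACITY_SCALE    (1L << SCHED_FIXEDPOINT_SHIFT)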

So, except for (5), we did not change any logic. At Ingo's request, I checked
the disassembly of kernel/sched/built-in.o before vs. after the patches. From
the very first patch to the last, there are a bunch of "offset" changes, all
following this pattern:

 60e3:  eb 21   jmp6106 
-60e5:  be db 02 00 00  mov$0x2db,%esi
+60e5:  be e0 02 00 00  mov$0x2e0,%esi

I have no idea what changed; my guess would be that the code layout shifted a bit.

Anyway, thanks a lot to Ben, Morten, Dietmar, Vincent, and others who provided
valuable comments.

v2 changes:
- Rename SCHED_RESOLUTION_SHIFT to SCHED_FIXEDPOINT_SHIFT, thanks to Peter
- Fix bugs in calculate_imbalance(), thanks to Vincent
- Fix "#if 0" for increased kernel load, suggested by Ingo

Thanks,
Yuyang

Yuyang Du (6):
  sched/fair: Generalize the load/util averages resolution definition
  sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE
  sched/fair: Add introduction to the sched load avg metrics
  sched/fair: Remove scale_load_down() for load_avg
  sched/fair: Rename scale_load() and scale_load_down()
  sched/fair: Remove unconditionally inactive code

 include/linux/sched.h | 81 +++
 init/Kconfig  | 16 ++
 kernel/sched/core.c   |  8 ++---
 kernel/sched/fair.c   | 33 ++---
 kernel/sched/sched.h  | 52 +++--
 5 files changed, 127 insertions(+), 63 deletions(-)

-- 
2.1.4



[PATCH RESEND v2 2/6] sched/fair: Remove SCHED_LOAD_SHIFT and SCHED_LOAD_SCALE

2016-03-30 Thread Yuyang Du
After cleaning up the sched metrics, these two ambiguity-causing definitions
are no longer needed. Use NICE_0_LOAD_SHIFT and NICE_0_LOAD instead (the
names clearly state what they are).

Suggested-by: Ben Segall 
Signed-off-by: Yuyang Du 
---
 kernel/sched/fair.c  |  4 ++--
 kernel/sched/sched.h | 22 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d3fc01..bf835b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_runnable_average(struct sched_entity *se)
sa->period_contrib = 1023;
sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
-   sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
+   sa->util_avg = SCHED_CAPACITY_SCALE;
sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
/* when this task enqueue'ed, it will contribute to its cfs_rq's 
load_avg */
 }
@@ -6877,7 +6877,7 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
if (busiest->group_type == group_overloaded &&
local->group_type   == group_overloaded) {
load_above_capacity = busiest->sum_nr_running *
-   SCHED_LOAD_SCALE;
+ scale_load_down(NICE_0_LOAD);
if (load_above_capacity > busiest->group_capacity)
load_above_capacity -= busiest->group_capacity;
else
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15a89ee..94ba652 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -54,25 +54,25 @@ static inline void update_cpu_load_active(struct rq 
*this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
under light load  */
-# define SCHED_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT + 
SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w) ((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)((w) >> SCHED_FIXEDPOINT_SHIFT)
 #else
-# define SCHED_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT)
+# define NICE_0_LOAD_SHIFT (SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w) (w)
 # define scale_load_down(w)(w)
 #endif
 
-#define SCHED_LOAD_SCALE   (1L << SCHED_LOAD_SHIFT)
-
 /*
- * NICE_0's weight (visible to user) and its load (invisible to user) have
- * independent ranges, but they should be well calibrated. We use scale_load()
- * and scale_load_down(w) to convert between them, the following must be true:
- * scale_load(sched_prio_to_weight[20]) == NICE_0_LOAD
+ * Task weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use
+ * scale_load() and scale_load_down(w) to convert between them. The
+ * following must be true:
+ *
+ *  scale_load(sched_prio_to_weight[USER_PRIO(NICE_TO_PRIO(0))]) == NICE_0_LOAD
+ *
  */
-#define NICE_0_LOADSCHED_LOAD_SCALE
-#define NICE_0_SHIFT   SCHED_LOAD_SHIFT
+#define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT)
 
 /*
  * Single value that decides SCHED_DEADLINE internal math precision.
@@ -859,7 +859,7 @@ DECLARE_PER_CPU(struct sched_domain *, sd_asym);
 struct sched_group_capacity {
atomic_t ref;
/*
-* CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
+* CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 * for a single CPU.
 */
unsigned int capacity;
-- 
2.1.4



Re: [PATCH] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-03-30 Thread Xunlei Pang
Hi Bao,

On 2016/03/31 at 10:52, Baoquan He wrote:
> On 03/31/16 at 10:43am, Minfei Huang wrote:
>> On 03/30/16 at 08:30pm, Baoquan He wrote:
>>> Hi Xunlei,
>>>
>>> I have two questions.
>>>
>>> One is: do we still need Minfei's patch if this patch is applied, since
>>> you have completely deleted crash_map/unmap_reserved_pages in
>>> kernel/kexec.c?
>> I think it is necessary to apply my bug-fixing patch first, before
>> applying this, since other maintainers can backport my bug-fixing patch
>> to fix the issue for stable kernels.
> This is why I previously said you two need to get together to discuss how
> to fix this issue and post. Two questions: 1st, Xunlei is doing a cleanup
> but leaves the map/unmap there even though they are doing the same thing
> in a different way; 2nd, your bug fix patch combined with his clean up. It
> looks like a total mess to reviewers and maintainers. So now I will leave
> these to other people interested in reviewing, because I personally don't
> like it, but I don't object to it strongly since I don't like always
> arguing by typing.
>

Thanks for your comments, and I'm fine with your concern.

There is a "historical" reason, we didn't expect these patches back then,
they were coming out gradually due to some discussion in the mailinglist.

It would be clear if these patches were reordered as follows:
Minfei's patchset:
[Patch01]   kexec: make a pair of map/unmap reserved pages in error path
[Patch02]   kexec: do a cleanup for function kexec_load

Then my patchset:
[Patch01]   kexec: introduce a protection mechanism for the crashkernel 
reserved memory
[Patch02]   s390/kexec: Consolidate crash_map/unmap_reserved_pages() and 
arch_kexec_protect(unprotect)_crashkres()
[Patch03(x86_64)]  kexec: provide arch_kexec_protect(unprotect)_crashkres()

I don't know if it is possible to reorder that since they are already in 
"linux-next", ask Andrew for help :-)

Regards,
Xunlei

>> Thanks
>> Minfei
>>
>>> On 03/30/16 at 07:47pm, Xunlei Pang wrote:
 Commit 3f625002581b ("kexec: introduce a protection mechanism
 for the crashkernel reserved memory") is a similar mechanism
 for protecting the crash kernel reserved memory to previous
 crash_map/unmap_reserved_pages() implementation, the new one
 is more generic in name and cleaner in code (besides, some
 arch may not be allowed to unmap the pgtable).

 Therefore, this patch consolidates them, and uses the new
 arch_kexec_protect(unprotect)_crashkres() to replace former
 crash_map/unmap_reserved_pages() which by now has been only
 used by S390.

 The consolidation work needs the crash memory to be mapped
 initially, so get rid of S390 crash kernel memblock removal
 in reserve_crashkernel(). Once kdump kernel is loaded, the
 new arch_kexec_protect_crashkres() implemented for S390 will
 actually unmap the pgtable like before.

 The patch also fixed a S390 crash_shrink_memory() bad page warning
 in passing due to not using memblock_reserve():
   BUG: Bad page state in process bash  pfn:7e400
   page:03d101f9 count:0 mapcount:1 mapping: (null) index:0x0
   flags: 0x0()
   page dumped because: nonzero mapcount
   Modules linked in: ghash_s390 prng aes_s390 des_s390 des_generic
   CPU: 0 PID: 1558 Comm: bash Not tainted 4.6.0-rc1-next-20160327 #1
73007a58 73007ae8 0002 
73007b88 73007b00 73007b00 0022cf4e
00a579b8 007b0dd6 00791a8c
000b
73007b48 73007ae8  
070003d10001 00112f20 73007ae8 73007b48
   Call Trace:
   ([<00112e0c>] show_trace+0x5c/0x78)
   ([<00112ed4>] show_stack+0x6c/0xe8)
   ([<003f28dc>] dump_stack+0x84/0xb8)
   ([<00235454>] bad_page+0xec/0x158)
   ([<002357a4>] free_pages_prepare+0x2e4/0x308)
   ([<002383a2>] free_hot_cold_page+0x42/0x198)
   ([<001c45e0>] crash_free_reserved_phys_range+0x60/0x88)
   ([<001c49b0>] crash_shrink_memory+0xb8/0x1a0)
   ([<0015bcae>] kexec_crash_size_store+0x46/0x60)
   ([<0033d326>] kernfs_fop_write+0x136/0x180)
   ([<002b253c>] __vfs_write+0x3c/0x100)
   ([<002b35ce>] vfs_write+0x8e/0x190)
   ([<002b4ca0>] SyS_write+0x60/0xd0)
   ([<0063067c>] system_call+0x244/0x264)

 Cc: Michael Holzheu 
 Signed-off-by: Xunlei Pang 
 ---
 Tested kexec/kdump on S390x

  arch/s390/kernel/machine_kexec.c | 86 
 ++--
  arch/s390/kernel/setup.c |  7 ++--
  include/linux/kexec.h|  2 -
  kernel/kexec.c   | 12 --
  kernel/kexec_core.c  | 11 +
  5 files changed, 

Re: [PATCH v2] gpio: pca953x: Use correct u16 value for register word write

2016-03-30 Thread Phil Reid

On 30/03/2016 2:49 PM, Yong Li wrote:

The current implementation only uses the first byte in val; the
second byte is always 0. Change it to use cpu_to_le16 to write
both bytes into the register.
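
To illustrate what goes wrong with the old cast, a sketch assuming a
little-endian host and a hypothetical two-byte buffer (not from the patch):

    /*
     * u8 val[2] = { 0xAA, 0xBB };   -- the two output-port bytes
     *
     * (u16)*val                              == 0x00AA  -- only val[0] kept
     * cpu_to_le16(get_unaligned((u16 *)val)) == 0xBBAA  -- both bytes
     *
     * i2c_smbus_write_word_data() sends the low byte (0xAA) first, then the
     * high byte (0xBB), so both ports of the expander get written.
     */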

Signed-off-by: Yong Li 


Reviewed-by: Phil Reid 

---
  drivers/gpio/gpio-pca953x.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpio/gpio-pca953x.c b/drivers/gpio/gpio-pca953x.c
index d0d3065..e66084c 100644
--- a/drivers/gpio/gpio-pca953x.c
+++ b/drivers/gpio/gpio-pca953x.c
@@ -18,6 +18,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 

@@ -159,7 +160,7 @@ static int pca953x_write_regs(struct pca953x_chip *chip, 
int reg, u8 *val)
switch (chip->chip_type) {
case PCA953X_TYPE:
ret = i2c_smbus_write_word_data(chip->client,
-   reg << 1, (u16) *val);
+   reg << 1, cpu_to_le16(get_unaligned((u16 *)val)));
break;
case PCA957X_TYPE:
ret = i2c_smbus_write_byte_data(chip->client, reg << 1,




--
Regards
Phil Reid



Re: Process denture tools

2016-03-30 Thread Frank
Dear Sir

Hello!

Our company is a supplier to Guangzhou Honda and China FAW. We have been
providing quality products and services to them.

Our company's main products:

Graphite machining end mill
The denture machining tool
Zirconia processing tool
PCD tools
CNC inserts
Milling toolholder
Boron carbide nozzles
Mould parts
Guide bush and collet

Dental titanium screws; this product's quality is certified to ISO 13485 by
the German TÜV.

We frequently introduce new products; please keep an eye on
www.jch-tools.com.

We have been providing our customers with high-quality products and service.
We continuously improve our processing technology and reduce production
costs in order to maximize our customers' profits.

Our products are exported around the world and receive good customer
reviews.

Our service likewise receives good customer reviews.

If you are interested, you can test some samples to confirm our product
quality. I believe we will have a pleasant cooperation in the near future.


Best regards!

Frank

SHENZHEN KALEAD TOOLS CO.,LTD | Shenzhen Surmount Tools Co., Ltd 
 TEL: 86-0755-27261985 
 FAX: 86-0755-27261895
 Email: s...@jch-tools.com 
 Web: www.jch-tools.com 


[PATCH v4] mmc: Provide tracepoints for request processing

2016-03-30 Thread Baolin Wang
This patch provides tracepoints for the lifecycle of an MMC request, from
start to completion, to help with performance analysis of the MMC subsystem.

Changes since v3:
 - Add "retries" and "re-tune state" in the trace print.
 - Move trace_mmc_request_start() to __mmc_start_request() function to avoid
 missing valuable information about which command/request is being sent.
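
For reference, once this is applied the new events appear under the standard
tracefs layout and can be enabled like any other trace event, e.g. (assuming
tracefs is mounted at /sys/kernel/tracing):

    echo 1 > /sys/kernel/tracing/events/mmc/mmc_request_start/enable
    echo 1 > /sys/kernel/tracing/events/mmc/mmc_request_done/enable
    cat /sys/kernel/tracing/trace_pipe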

Signed-off-by: Baolin Wang 
---
 drivers/mmc/core/core.c|7 ++
 include/trace/events/mmc.h |  182 
 2 files changed, 189 insertions(+)
 create mode 100644 include/trace/events/mmc.h

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index f95d41f..98ff0f9 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -36,6 +36,9 @@
 #include 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "core.h"
 #include "bus.h"
 #include "host.h"
@@ -140,6 +143,8 @@ void mmc_request_done(struct mmc_host *host, struct 
mmc_request *mrq)
cmd->retries = 0;
}
 
+   trace_mmc_request_done(host, mrq);
+
if (err && cmd->retries && !mmc_card_removed(host->card)) {
/*
 * Request starter must handle retries - see
@@ -215,6 +220,8 @@ static void __mmc_start_request(struct mmc_host *host, 
struct mmc_request *mrq)
}
}
 
+   trace_mmc_request_start(host, mrq);
+
host->ops->request(host, mrq);
 }
 
diff --git a/include/trace/events/mmc.h b/include/trace/events/mmc.h
new file mode 100644
index 000..a72f9b9
--- /dev/null
+++ b/include/trace/events/mmc.h
@@ -0,0 +1,182 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmc
+
+#if !defined(_TRACE_MMC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMC_H
+
+#include 
+#include 
+#include 
+#include 
+
+TRACE_EVENT(mmc_request_start,
+
+   TP_PROTO(struct mmc_host *host, struct mmc_request *mrq),
+
+   TP_ARGS(host, mrq),
+
+   TP_STRUCT__entry(
+   __field(u32,cmd_opcode)
+   __field(u32,cmd_arg)
+   __field(unsigned int,   cmd_flags)
+   __field(unsigned int,   cmd_retries)
+   __field(u32,stop_opcode)
+   __field(u32,stop_arg)
+   __field(unsigned int,   stop_flags)
+   __field(unsigned int,   stop_retries)
+   __field(u32,sbc_opcode)
+   __field(u32,sbc_arg)
+   __field(unsigned int,   sbc_flags)
+   __field(unsigned int,   sbc_retries)
+   __field(unsigned int,   blocks)
+   __field(unsigned int,   blksz)
+   __field(unsigned int,   data_flags)
+   __field(unsigned int,   can_retune)
+   __field(unsigned int,   doing_retune)
+   __field(unsigned int,   retune_now)
+   __field(int,need_retune)
+   __field(int,hold_retune)
+   __field(unsigned int,   retune_period)
+   __field(struct mmc_request *,   mrq)
+   __string(name,  mmc_hostname(host))
+   ),
+
+   TP_fast_assign(
+   __entry->cmd_opcode = mrq->cmd->opcode;
+   __entry->cmd_arg = mrq->cmd->arg;
+   __entry->cmd_flags = mrq->cmd->flags;
+   __entry->cmd_retries = mrq->cmd->retries;
+   __entry->stop_opcode = mrq->stop ? mrq->stop->opcode : 0;
+   __entry->stop_arg = mrq->stop ? mrq->stop->arg : 0;
+   __entry->stop_flags = mrq->stop ? mrq->stop->flags : 0;
+   __entry->stop_retries = mrq->stop ? mrq->stop->retries : 0;
+   __entry->sbc_opcode = mrq->sbc ? mrq->sbc->opcode : 0;
+   __entry->sbc_arg = mrq->sbc ? mrq->sbc->arg : 0;
+   __entry->sbc_flags = mrq->sbc ? mrq->sbc->flags : 0;
+   __entry->sbc_retries = mrq->sbc ? mrq->sbc->retries : 0;
+   __entry->blksz = mrq->data ? mrq->data->blksz : 0;
+   __entry->blocks = mrq->data ? mrq->data->blocks : 0;
+   __entry->data_flags = mrq->data ? mrq->data->flags : 0;
+   __entry->can_retune = host->can_retune;
+   __entry->doing_retune = host->doing_retune;
+   __entry->retune_now = host->retune_now;
+   __entry->need_retune = host->need_retune;
+   __entry->hold_retune = host->hold_retune;
+   __entry->retune_period = host->retune_period;
+   __assign_str(name, mmc_hostname(host));
+   __entry->mrq = mrq;
+   ),
+
+   TP_printk("%s: start struct mmc_request[%p]: "
+ "cmd_opcode=%u cmd_arg=0x%x cmd_flags=0x%x cmd_retries=%u 

[PATCH v4] mmc: Provide tracepoints for request processing

2016-03-30 Thread Baolin Wang
This patch provides some tracepoints for the lifecycle of an MMC request, from
start to completion, to help with performance analysis of the MMC subsystem.

Changes since v3:
 - Add "retries" and "re-tune state" in the trace print.
 - Move trace_mmc_request_start() to __mmc_start_request() function to avoid
 missing valuable information about which command/request is being sent.

Signed-off-by: Baolin Wang 
---
 drivers/mmc/core/core.c|7 ++
 include/trace/events/mmc.h |  182 
 2 files changed, 189 insertions(+)
 create mode 100644 include/trace/events/mmc.h

diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index f95d41f..98ff0f9 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -36,6 +36,9 @@
 #include 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #include "core.h"
 #include "bus.h"
 #include "host.h"
@@ -140,6 +143,8 @@ void mmc_request_done(struct mmc_host *host, struct mmc_request *mrq)
cmd->retries = 0;
}
 
+   trace_mmc_request_done(host, mrq);
+
if (err && cmd->retries && !mmc_card_removed(host->card)) {
/*
 * Request starter must handle retries - see
@@ -215,6 +220,8 @@ static void __mmc_start_request(struct mmc_host *host, struct mmc_request *mrq)
}
}
 
+   trace_mmc_request_start(host, mrq);
+
host->ops->request(host, mrq);
 }
 
diff --git a/include/trace/events/mmc.h b/include/trace/events/mmc.h
new file mode 100644
index 000..a72f9b9
--- /dev/null
+++ b/include/trace/events/mmc.h
@@ -0,0 +1,182 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmc
+
+#if !defined(_TRACE_MMC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMC_H
+
+#include 
+#include 
+#include 
+#include 
+
+TRACE_EVENT(mmc_request_start,
+
+   TP_PROTO(struct mmc_host *host, struct mmc_request *mrq),
+
+   TP_ARGS(host, mrq),
+
+   TP_STRUCT__entry(
+   __field(u32,cmd_opcode)
+   __field(u32,cmd_arg)
+   __field(unsigned int,   cmd_flags)
+   __field(unsigned int,   cmd_retries)
+   __field(u32,stop_opcode)
+   __field(u32,stop_arg)
+   __field(unsigned int,   stop_flags)
+   __field(unsigned int,   stop_retries)
+   __field(u32,sbc_opcode)
+   __field(u32,sbc_arg)
+   __field(unsigned int,   sbc_flags)
+   __field(unsigned int,   sbc_retries)
+   __field(unsigned int,   blocks)
+   __field(unsigned int,   blksz)
+   __field(unsigned int,   data_flags)
+   __field(unsigned int,   can_retune)
+   __field(unsigned int,   doing_retune)
+   __field(unsigned int,   retune_now)
+   __field(int,need_retune)
+   __field(int,hold_retune)
+   __field(unsigned int,   retune_period)
+   __field(struct mmc_request *,   mrq)
+   __string(name,  mmc_hostname(host))
+   ),
+
+   TP_fast_assign(
+   __entry->cmd_opcode = mrq->cmd->opcode;
+   __entry->cmd_arg = mrq->cmd->arg;
+   __entry->cmd_flags = mrq->cmd->flags;
+   __entry->cmd_retries = mrq->cmd->retries;
+   __entry->stop_opcode = mrq->stop ? mrq->stop->opcode : 0;
+   __entry->stop_arg = mrq->stop ? mrq->stop->arg : 0;
+   __entry->stop_flags = mrq->stop ? mrq->stop->flags : 0;
+   __entry->stop_retries = mrq->stop ? mrq->stop->retries : 0;
+   __entry->sbc_opcode = mrq->sbc ? mrq->sbc->opcode : 0;
+   __entry->sbc_arg = mrq->sbc ? mrq->sbc->arg : 0;
+   __entry->sbc_flags = mrq->sbc ? mrq->sbc->flags : 0;
+   __entry->sbc_retries = mrq->sbc ? mrq->sbc->retries : 0;
+   __entry->blksz = mrq->data ? mrq->data->blksz : 0;
+   __entry->blocks = mrq->data ? mrq->data->blocks : 0;
+   __entry->data_flags = mrq->data ? mrq->data->flags : 0;
+   __entry->can_retune = host->can_retune;
+   __entry->doing_retune = host->doing_retune;
+   __entry->retune_now = host->retune_now;
+   __entry->need_retune = host->need_retune;
+   __entry->hold_retune = host->hold_retune;
+   __entry->retune_period = host->retune_period;
+   __assign_str(name, mmc_hostname(host));
+   __entry->mrq = mrq;
+   ),
+
+   TP_printk("%s: start struct mmc_request[%p]: "
+ "cmd_opcode=%u cmd_arg=0x%x cmd_flags=0x%x cmd_retries=%u "
+ 
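
The quoted diff above breaks off before the completion-side event. For
reference only, a matching mmc_request_done event would normally mirror
mmc_request_start and record the error status and response of the finished
request. The sketch below illustrates that pattern; the exact fields shown
(cmd_err, cmd_resp0, data_err) are an assumption, not necessarily what v4 of
the patch defines:

	/*
	 * Illustrative sketch only -- not the patch's actual definition.
	 * Assumes it lives in the same include/trace/events/mmc.h header,
	 * which already pulls in linux/mmc/core.h and linux/mmc/host.h.
	 */
	TRACE_EVENT(mmc_request_done,

		TP_PROTO(struct mmc_host *host, struct mmc_request *mrq),

		TP_ARGS(host, mrq),

		TP_STRUCT__entry(
			__field(u32,			cmd_opcode)
			__field(int,			cmd_err)
			__field(u32,			cmd_resp0)
			__field(int,			data_err)
			__field(struct mmc_request *,	mrq)
			__string(name,			mmc_hostname(host))
		),

		TP_fast_assign(
			__entry->cmd_opcode = mrq->cmd->opcode;
			__entry->cmd_err = mrq->cmd->error;
			__entry->cmd_resp0 = mrq->cmd->resp[0];
			__entry->data_err = mrq->data ? mrq->data->error : 0;
			__entry->mrq = mrq;
			__assign_str(name, mmc_hostname(host));
		),

		TP_printk("%s: end struct mmc_request[%p]: cmd_opcode=%u "
			  "cmd_err=%d cmd_resp0=0x%x data_err=%d",
			  __get_str(name), __entry->mrq, __entry->cmd_opcode,
			  __entry->cmd_err, __entry->cmd_resp0,
			  __entry->data_err)
	);

Once the header is compiled in and tracing is enabled, the events appear under
/sys/kernel/debug/tracing/events/mmc/ and can be switched on individually like
any other tracepoint.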

[PATCH v2] spi: rockchip: fix warning of static check

2016-03-30 Thread Shawn Lin
Use dma_request_chan() instead of dma_request_slave_channel(); this way we
can check for -EPROBE_DEFER without triggering a static-checker warning.

Reported-by: Dan Carpenter 
Cc: Doug Anderson 
Cc: Dan Carpenter 
Signed-off-by: Shawn Lin 

---

Changes in v2:
- use dma_request_chan and replace IS_ERR_OR_NULL()
  with IS_ERR
- do the same for rx

 drivers/spi/spi-rockchip.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/spi/spi-rockchip.c b/drivers/spi/spi-rockchip.c
index b71c1ae..6c6c001 100644
--- a/drivers/spi/spi-rockchip.c
+++ b/drivers/spi/spi-rockchip.c
@@ -730,23 +730,27 @@ static int rockchip_spi_probe(struct platform_device *pdev)
master->transfer_one = rockchip_spi_transfer_one;
master->handle_err = rockchip_spi_handle_err;
 
-   rs->dma_tx.ch = dma_request_slave_channel(rs->dev, "tx");
-   if (IS_ERR_OR_NULL(rs->dma_tx.ch)) {
+   rs->dma_tx.ch = dma_request_chan(rs->dev, "tx");
+   if (IS_ERR(rs->dma_tx.ch)) {
/* Check tx to see if we need defer probing driver */
if (PTR_ERR(rs->dma_tx.ch) == -EPROBE_DEFER) {
ret = -EPROBE_DEFER;
goto err_get_fifo_len;
}
dev_warn(rs->dev, "Failed to request TX DMA channel\n");
+   rs->dma_tx.ch = NULL;
}
 
-   rs->dma_rx.ch = dma_request_slave_channel(rs->dev, "rx");
-   if (!rs->dma_rx.ch) {
-   if (rs->dma_tx.ch) {
+   rs->dma_rx.ch = dma_request_chan(rs->dev, "rx");
+   if (IS_ERR(rs->dma_rx.ch)) {
+   if (PTR_ERR(rs->dma_rx.ch) == -EPROBE_DEFER) {
dma_release_channel(rs->dma_tx.ch);
rs->dma_tx.ch = NULL;
+   ret = -EPROBE_DEFER;
+   goto err_get_fifo_len;
}
dev_warn(rs->dev, "Failed to request RX DMA channel\n");
+   rs->dma_rx.ch = NULL;
}
 
if (rs->dma_tx.ch && rs->dma_rx.ch) {
-- 
2.3.7
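
For context, dma_request_chan() reports failure with ERR_PTR() values and
never returns NULL, which is why the IS_ERR() checks above are the right
form. The same "optional DMA channel, but honour probe deferral" pattern
could be factored into a small helper; the sketch below is only an
illustration of that idea (the helper name is made up and is not part of
the driver):

	#include <linux/device.h>
	#include <linux/dmaengine.h>
	#include <linux/err.h>
	#include <linux/errno.h>

	/*
	 * Illustrative helper, not in the driver: request an optional DMA
	 * channel.  Returns a valid channel, NULL to fall back to PIO, or
	 * ERR_PTR(-EPROBE_DEFER) which the caller must propagate.
	 */
	static struct dma_chan *rockchip_spi_request_dma(struct device *dev,
							 const char *name)
	{
		struct dma_chan *ch = dma_request_chan(dev, name);

		if (IS_ERR(ch)) {
			if (PTR_ERR(ch) == -EPROBE_DEFER)
				return ch;	/* propagate deferral */
			dev_warn(dev, "Failed to request %s DMA channel\n",
				 name);
			return NULL;		/* PIO fallback */
		}

		return ch;
	}

With such a helper, probe() would only need to release the already acquired
TX channel and return -EPROBE_DEFER when the RX request defers, which is
essentially what the hunk above open-codes.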





Re: [PATCH v7 1/5] staging/android: add num_fences field to struct sync_file_info

2016-03-30 Thread Greg Kroah-Hartman
On Wed, Mar 30, 2016 at 11:53:38PM -0300, Gustavo Padovan wrote:
> Hi Greg,
> 
> 2016-03-30 Greg Kroah-Hartman :
> 
> > On Thu, Mar 03, 2016 at 04:40:42PM -0300, Gustavo Padovan wrote:
> > > From: Gustavo Padovan 
> > 
> > 
> > 
> > Gustavo, can you resend both series of your android patches so I know I
> > have the latest ones to work with?  Please also collect the acks that
> > people have provided so far.
> 
> I have resent it already. The latest patches in this series are v10;
> they contain the acks
> 
> https://lkml.org/lkml/2016/3/18/298

Ok, I'll review those, thanks.

greg k-h

