Re: [PATCH v7 RESEND 0/4] mailbox: hisilicon: add Hi6220 mailbox driver

2016-02-26 Thread Wei Xu
Hi Leo and Jassi,

On 26/02/2016 19:40, Jassi Brar wrote:
> On Mon, Feb 15, 2016 at 7:20 PM, Leo Yan  wrote:
>> Hi6220 mailbox supports up to 32 channels. Each channel is unidirectional
>> with a maximum message size of 8 words. I/O is performed using register
>> access (there is no DMA) and the cell raises an interrupt when messages
>> are received.
>>
>> This patch series implements the Hi6220 mailbox driver. It registers
>> two channels with the framework for communication with the MCU: one tx
>> channel and one rx channel. The mailbox driver is currently used to send
>> messages to the MCU to control dynamic voltage and frequency scaling for
>> the CPU, GPU and DDR.
>>
>> Changes from v6:
>> * Fix to use lowercase for hexadecimal value in DT binding document
>>
>> Changes from v5:
>> * Switch the client driver to three mailbox specifier cells, and add an
>>   xlate callback in the mailbox driver to support these specifiers
>> * Refine the document for the property "hi6220,mbox-tx-noirq"
>>
>> Changes from v4:
>> * According to Jassi's suggestion, using DT binding to register channels
>> * Change to use operating-points-v2 to register operating points
>>
>> Changes from v3:
>> * The patch series for enabling idle state for Hi6220 has reserved memory
>>   regions, so this series will not include it anymore
>> * Refined the mailbox driver according to Jassi's suggestion:
>>   removed the kfifo from the mailbox driver;
>>   removed the spinlock around ipc register accesses, because every channel
>>   has its own dedicated bit in the ipc register and readl/writel already
>>   imply memory barriers, so no spinlock is needed to protect these accesses
>> * Once the mailbox driver is ready, patch 4 can enable the CPU's OPPs and
>>   the stub clock driver; the CPUFreq driver can then be enabled for CPU
>>   frequency scaling
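The locking note above (per-channel dedicated bits plus readl/writel barriers) can be illustrated with a small userspace model. This is a hedged sketch of the pattern, not the driver's actual code; the register names and write-1-to-set/clear semantics are assumptions:

```c
#include <stdint.h>

/* Simulated ipc "register" with set/clear ports. In the model below the
 * helpers are plain C read-modify-writes, but real set/clear hardware
 * registers apply the mask as a single store: each channel only ever
 * touches its own dedicated bit, so there is no read-modify-write cycle
 * for a spinlock to protect. */
static uint32_t ipc_state;

static void ipc_set(uint32_t mask)   { ipc_state |= mask;  } /* models writel(mask, SET_REG) */
static void ipc_clear(uint32_t mask) { ipc_state &= ~mask; } /* models writel(mask, CLR_REG) */

/* Kick (raise) or ack (clear) one channel's dedicated bit. */
static void mbox_kick_channel(unsigned int ch) { ipc_set(1u << ch); }
static void mbox_ack_channel(unsigned int ch)  { ipc_clear(1u << ch); }

static uint32_t mbox_state(void) { return ipc_state; }
```

Two CPUs kicking *different* channels never disturb each other's bits, which is the property the changelog relies on.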
>>
>> Changes from v2:
>> * Get rid of unused memory regions from memory node in DT, and don't use
>>   reserved-memory node according to Mark and Leif's suggestion; Haojian also
>>   has updated UEFI for efi memory info
>>
>> Changes from v1:
>> * Correct lock usage for SMP scenario
>>
>> Changes from RFC:
>> * According to Jassi's review, remove the abstract common driver layer
>>   entirely and commit only the driver dedicated to Hi6220
>> * According to Paul Bolle's review, fix typos in Kconfig, remove the
>>   unnecessary dependency on OF, and make minor fixes to the mailbox driver
>> * Refine the dts nodes a little
>>
>>
>> Leo Yan (4):
>>   dt-bindings: mailbox: Document Hi6220 mailbox driver
>>   mailbox: Hi6220: add mailbox driver
>>   arm64: dts: add mailbox node for Hi6220
>>   arm64: dts: add Hi6220's stub clock node
>>
>>  .../bindings/mailbox/hisilicon,hi6220-mailbox.txt  |  74 
>>  arch/arm64/boot/dts/hisilicon/hi6220.dtsi  |  64 
>>  drivers/mailbox/Kconfig|   8 +
>>  drivers/mailbox/Makefile   |   2 +
>>  drivers/mailbox/hi6220-mailbox.c   | 395 +
>>
> Applied 1 & 2 to mailbox-for-next.  Patch-3&4 should go via asoc tree.

Applied 3 and 4 to the hisilicon soc tree.
Thanks!

Best Regards,
Wei

> 
> Thanks
> 



Re: [PATCH 0/2][GIT PULL] Timekeeping updates to tip/timers/core for 4.6

2016-02-26 Thread Thomas Gleixner
On Fri, 26 Feb 2016, John Stultz wrote:
>   clocksource: introduce clocksource_freq2mult()
>   jiffies: use CLOCKSOURCE_MASK instead of constant

Bah. You again forgot to make the first letter of the sentence upper case.

Hint: There is the concept of scripts, which can automate that :)

Thanks,

tglx
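For reference, the mult/shift conversion that a helper like `clocksource_freq2mult()` performs can be sketched as follows. This is a hedged reconstruction of the idea from the standard clocksource math, not necessarily the exact kernel function:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Compute the fixed-point multiplier such that
 *     ns = (cycles * mult) >> shift
 * for a counter running at 'freq' Hz, rounded to nearest. */
static uint32_t freq2mult(uint32_t freq, uint32_t shift)
{
    uint64_t tmp = NSEC_PER_SEC << shift;
    tmp += freq / 2;              /* round to nearest instead of down */
    return (uint32_t)(tmp / freq);
}

/* Convert a cycle count to nanoseconds with the precomputed mult/shift. */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (cycles * mult) >> shift;
}
```

For a 1 MHz counter with shift = 20, one million cycles convert back to exactly one second, which is the round-trip property the mult/shift pair is chosen for.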



Re: [PATCH 0/2][GIT PULL] Timekeeping updates to tip/timers/core for 4.6

2016-02-26 Thread Thomas Gleixner
John,

On Fri, 26 Feb 2016, John Stultz wrote:

> Hey Thomas, Ingo,
>   Here's my somewhat truncated queue for 4.6. I was hoping to
> get the cross-timestamp patchset from Christopher sent along,
> but he's got some last minute changes to address feedback from
> Andy, so I'm holding off.
> 
> If the response is good for that last change, I may try to send
> another set with those changes next week, but we're cutting it
> fairly close to -rc6, so I'll check with you before doing so.

If it's just the small fixup to address review comments, we still should take
it.
 
> So.. two tiny cleanup fixes is all for now.
> 
> Boring is good, right?

:)

Thanks,

tglx





Re: [patch 20/20] rcu: Make CPU_DYING_IDLE an explicit call

2016-02-26 Thread Thomas Gleixner
On Fri, 26 Feb 2016, Paul E. McKenney wrote:
> > > --- a/kernel/cpu.c
> > > +++ b/kernel/cpu.c
> > > @@ -762,6 +762,7 @@ void cpuhp_report_idle_dead(void)
> > >   BUG_ON(st->state != CPUHP_AP_OFFLINE);
> > >   st->state = CPUHP_AP_IDLE_DEAD;
> > >   complete(&st->done);
> > 
> > What prevents the other CPU from killing this CPU at this point, so
> > that this CPU does not tell RCU that it is dead?
> >
> > I agree that the odds should be low, but there are all manner of things
> > that might delay a CPU for just a little bit too long...
> > 
> > Or am I missing something subtle here?

No. The reason I moved the rcu call past the complete() is that otherwise
complete() complains about rcu being dead already. Hmm, but you are right: in
theory the other side could allow physical removal before this CPU actually
told rcu that it's gone.

> Just in case I am not missing anything...
> 
> One approach is to go back to the spinning, but to do rcu_report_dead()
> just before kicking the other CPU.  This would also fix some issues with
> use of RCU of the offline path, so would definitely be better than my
> earlier approach of notifying RCU from within the idle loop.
> 
> This assumes that all the offline paths have been consolidated into
> this path.  (Yes, I was too lazy and cowardly to consolidate them all
> last I touched this code, but perhaps that has happened elsewise?)

The question is whether the rcu dead notification has to happen
instantaneously and needs to be done on the dead cpu. If we can avoid both,
then there is a very simple solution.
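Paul's suggested ordering — report to RCU first, then let the controlling CPU proceed with the teardown — can be contrasted with the current ordering in a toy sequential model. This is an illustration of the ordering argument only, not kernel code; all names here are made up:

```c
/* Model the race: may the controlling CPU tear this CPU down while RCU
 * still believes it is alive? */
static int rcu_knows_dead;
static int torn_down_while_rcu_alive;

static void rcu_report_dead_model(void)
{
    rcu_knows_dead = 1;            /* stands in for rcu_report_dead() */
}

static void complete_and_allow_teardown(void)
{
    /* After the completion fires, the controlling CPU may kill us at
     * any moment; record whether RCU had been told by then. */
    if (!rcu_knows_dead)
        torn_down_while_rcu_alive = 1;
}

static int run_proposed_order(void)   /* report first, then complete */
{
    rcu_knows_dead = torn_down_while_rcu_alive = 0;
    rcu_report_dead_model();
    complete_and_allow_teardown();
    return torn_down_while_rcu_alive;
}

static int run_current_order(void)    /* complete first, report after */
{
    rcu_knows_dead = torn_down_while_rcu_alive = 0;
    complete_and_allow_teardown();
    rcu_report_dead_model();
    return torn_down_while_rcu_alive;
}
```

The proposed order closes the window by construction; the current order leaves it open, which is exactly the race Paul describes.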

Thanks,

tglx



Re: [PATCH v3 22/22] sound/usb: Use Media Controller API to share media resources

2016-02-26 Thread Takashi Iwai
On Sat, 27 Feb 2016 03:55:39 +0100,
Shuah Khan wrote:
> 
> On 02/26/2016 01:50 PM, Takashi Iwai wrote:
> > On Fri, 26 Feb 2016 21:08:43 +0100,
> > Shuah Khan wrote:
> >>
> >> On 02/26/2016 12:55 PM, Takashi Iwai wrote:
> >>> On Fri, 12 Feb 2016 00:41:38 +0100,
> >>> Shuah Khan wrote:
> 
>  Change ALSA driver to use Media Controller API to
>  share media resources with DVB and V4L2 drivers
>  on a AU0828 media device. Media Controller specific
>  initialization is done after sound card is registered.
>  ALSA creates Media interface and entity function graph
>  nodes for Control, Mixer, PCM Playback, and PCM Capture
>  devices.
> 
>  snd_usb_hw_params() will call Media Controller enable
>  source handler interface to request the media resource.
>  If resource request is granted, it will release it from
>  snd_usb_hw_free(). If resource is busy, -EBUSY is returned.
> 
>  Media specific cleanup is done in usb_audio_disconnect().
> 
>  Signed-off-by: Shuah Khan 
>  ---
>   sound/usb/Kconfig|   4 +
>   sound/usb/Makefile   |   2 +
>   sound/usb/card.c |  14 +++
>   sound/usb/card.h |   3 +
>   sound/usb/media.c| 318 +++
>   sound/usb/media.h|  72 +++
>   sound/usb/mixer.h|   3 +
>   sound/usb/pcm.c  |  28 -
>   sound/usb/quirks-table.h |   1 +
>   sound/usb/stream.c   |   2 +
>   sound/usb/usbaudio.h |   6 +
>   11 files changed, 448 insertions(+), 5 deletions(-)
>   create mode 100644 sound/usb/media.c
>   create mode 100644 sound/usb/media.h
> 
>  diff --git a/sound/usb/Kconfig b/sound/usb/Kconfig
>  index a452ad7..ba117f5 100644
>  --- a/sound/usb/Kconfig
>  +++ b/sound/usb/Kconfig
>  @@ -15,6 +15,7 @@ config SND_USB_AUDIO
>   select SND_RAWMIDI
>   select SND_PCM
>   select BITREVERSE
>  +select SND_USB_AUDIO_USE_MEDIA_CONTROLLER if MEDIA_CONTROLLER && MEDIA_SUPPORT
> >>>
> >>> Looking at the media Kconfig again, this would be broken if
> >>> MEDIA_SUPPORT=m and SND_USB_AUDIO=y.  The ugly workaround is something
> >>> like:
> >>>   select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
> >>>   if MEDIA_CONTROLLER && (MEDIA_SUPPORT=y || MEDIA_SUPPORT=SND)
> >>
> >> My current config is MEDIA_SUPPORT=m and SND_USB_AUDIO=y
> >> It is working and I didn't see any issues so far.
> > 
> > Hmm, how can that be?  In drivers/media/Makefile:
> > 
> > ifeq ($(CONFIG_MEDIA_CONTROLLER),y)
> >   obj-$(CONFIG_MEDIA_SUPPORT) += media.o
> > endif
> > 
> > So it's a module.  Meanwhile you have reference from usb-audio driver
> > that is built-in kernel.  How is the symbol resolved?
> 
> Sorry my mistake. I misspoke. My config had:
> CONFIG_MEDIA_SUPPORT=m
> CONFIG_MEDIA_CONTROLLER=y
> CONFIG_SND_USB_AUDIO=m
> 
> The following doesn't work as you pointed out.
> 
> CONFIG_MEDIA_SUPPORT=m
> CONFIG_MEDIA_CONTROLLER=y
> CONFIG_SND_USB_AUDIO=y
> 
> okay here is what will work for all of the possible
> combinations of CONFIG_MEDIA_SUPPORT and CONFIG_SND_USB_AUDIO
> 
> select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
>    if MEDIA_CONTROLLER && ((MEDIA_SUPPORT=y) || (MEDIA_SUPPORT=m && SND_USB_AUDIO=m))
> 
> The above will cover the cases when
> 
> 1. CONFIG_MEDIA_SUPPORT and CONFIG_SND_USB_AUDIO are
>both modules
>CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected
> 
> 2. CONFIG_MEDIA_SUPPORT=y and CONFIG_SND_USB_AUDIO=m
>CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected
> 
> 3. CONFIG_MEDIA_SUPPORT=y and CONFIG_SND_USB_AUDIO=y
>CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected
> 
> 4. CONFIG_MEDIA_SUPPORT=m and CONFIG_SND_USB_AUDIO=y
>This is when we don't want
>CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER selected
> 
> I verified all of the above combinations to make sure
> the logic works.
> 
> If you think of a better way to do this please let me
> know. I will go ahead and send patch v4 with the above
> change and you can decide if that is acceptable.

I'm not 100% sure whether CONFIG_SND_USB_AUDIO=m can be put there as
conditional inside CONFIG_SND_USB_AUDIO definition.  Maybe a safer
form would be like:

config SND_USB_AUDIO_USE_MEDIA_CONTROLLER
bool
default y
depends on SND_USB_AUDIO
depends on MEDIA_CONTROLLER
depends on (MEDIA_SUPPORT=y || MEDIA_SUPPORT=SND_USB_AUDIO)

and drop select from SND_USB_AUDIO.
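The tristate combinations being debated can be checked mechanically. A small C model of Shuah's proposed select condition (the enum encoding is mine, not Kconfig's):

```c
/* Kconfig tristate values: not set, module, built-in. */
enum tristate { TS_N, TS_M, TS_Y };

/* Shuah's proposed condition:
 *   MEDIA_CONTROLLER && (MEDIA_SUPPORT=y ||
 *                        (MEDIA_SUPPORT=m && SND_USB_AUDIO=m)) */
static int selects_mc(int media_controller,
                      enum tristate media_support,
                      enum tristate snd_usb_audio)
{
    return media_controller &&
           (media_support == TS_Y ||
            (media_support == TS_M && snd_usb_audio == TS_M));
}
```

Evaluating all four cases from the thread confirms that only the broken combination (MEDIA_SUPPORT=m with SND_USB_AUDIO=y, where a built-in driver would reference module symbols) is excluded.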


> >>> Other than that, it looks more or less OK to me.
> >>> The way how media_stream_init() gets called is a bit worrisome, but it
> >>> should work practically.  Another concern is about the disconnection.
> >>> Can all function calls in media_device_delete() be safe even if it's
> >>> called while the application still opens the MC device?
> >>
> >> Right. I have been looking into device removal path when


Re: [patch 10/20] cpu/hotplug: Make target state writeable

2016-02-26 Thread Thomas Gleixner
Rafael,

On Sat, 27 Feb 2016, Rafael J. Wysocki wrote:
> On Friday, February 26, 2016 06:43:32 PM Thomas Gleixner wrote:
> > Make it possible to write a target state to the per cpu state file, so we
> > can switch between states.
> 
> One thing that potentially may be problematic here is that any kind of
> "offline" operations needs to be carried out under device_hotplug_lock,
> because there are cases in which devices (including CPUs) are taken
> offline in groups and if one offline fails, the whole operation has to
> be rolled back.
>
> So if you put a CPU into one of the intermediate states manually and
> something like the above happens in parallel with it, they may not
> play well together IMO.

I don't see how that is related. device_hotplug_lock is completely independent
of cpu hotplug today, unless I'm missing some magic connection here.

Physical CPU hotplug is a different story, but that's about bringing the cpus
into the system or taking them out. Sure, if you want to take one or more cpus
physically out, you have to bring them offline first. If you plug them in then
it's not necessarily related to actually bringing them online. That's a
different set of operations.

We surely need to look into that aspect, but I don't see a reason why e.g. a
device hotplug operation should be in any way related to the intermediate
state of a particular cpu. If that's the case, then there is something really
wrong.

I'm aware that we have a gazillion of silly assumptions all over the place and
some of them are wrong today and just do not explode in our face simply
because it's extremely hard to trigger. That's one reason why we need to go
through all the cpu notifier related sites and inspect them deeply.

Thanks,

tglx



Re: [PATCH v2 3/3] scsi: allow scsi devices to use direct complete

2016-02-26 Thread Mika Westerberg
On Wed, Feb 24, 2016 at 04:22:28PM -0800, Derek Basehore wrote:
> This allows scsi devices to remain runtime suspended for system
> suspend. Since runtime suspend is stricter than system suspend
> callbacks, this is just returning a positive number for the prepare
> callback.

AFAICT SCSI layer already leaves devices runtime suspended during system
suspend (see scsi_bus_suspend_common()). What's the benefit using
direct_complete over the current implementation?
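For context on the mechanism under discussion: when a device's ->prepare() callback returns a positive value and the device is already runtime suspended, the PM core's direct_complete path may leave it suspended and skip the remaining suspend/resume callbacks. A rough model of that decision (heavily simplified; the real logic lives in the PM core, and the struct here is illustrative):

```c
struct pm_dev {
    int runtime_suspended;  /* already runtime suspended? */
    int prepare_ret;        /* what ->prepare() returned */
    int suspend_called;     /* did the full suspend path run? */
    int direct_complete;    /* was the device left as-is? */
};

/* Simplified decision the PM core makes for one device during
 * system suspend: positive prepare + runtime suspended => skip. */
static void system_suspend_one(struct pm_dev *d)
{
    if (d->prepare_ret > 0 && d->runtime_suspended) {
        d->direct_complete = 1;   /* leave it runtime suspended */
        return;
    }
    d->suspend_called = 1;        /* run the regular suspend callbacks */
}
```

The one-line patch changes scsi_bus_prepare() from returning 0 to returning 1, i.e. it opts SCSI devices into the skip branch above.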

> Signed-off-by: Derek Basehore 
> Reviewed-by: Eric Caruso 
> ---
>  drivers/scsi/scsi_pm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
> index b44c1bb..7af76ad 100644
> --- a/drivers/scsi/scsi_pm.c
> +++ b/drivers/scsi/scsi_pm.c
> @@ -178,7 +178,7 @@ static int scsi_bus_prepare(struct device *dev)
>   /* Wait until async scanning is finished */
>   scsi_complete_async_scans();
>   }
> - return 0;
> + return 1;
>  }
>  
>  static int scsi_bus_suspend(struct device *dev)
> -- 
> 2.7.0.rc3.207.g0ac5344



Re: [PATCH v3 1/2] gpio: designware: switch device node to fwnode

2016-02-26 Thread Jiang Qiu
On 2016/2/25 21:43, Andy Shevchenko wrote:
> On Thu, Feb 25, 2016 at 1:58 PM, Jiang Qiu  wrote:
>> On 2016/2/24 21:46, Andy Shevchenko wrote:
>>> On Wed, Feb 24, 2016 at 2:33 PM, qiujiang  wrote:
> 
>>>  - why do you use fwnode_*() instead of device_property_*() calls?
>>> What prevents us to move to device property API directly?
>> Yes, it looks more reasonable to use device_property. However,
>> device_get_child_node_count() was used here to find each child node. That
>> API outputs the fwnode_handle for each child node directly, whereas the
>> device property APIs need a 'dev' pointer instead. Actually, the effects of
>> the fwnode_*() and device_*() calls are the same, so I used the fwnode_*()
>> APIs here.
> 
> Right, looks okay then.
> 
 -   node = dev->of_node;
 -   if (!IS_ENABLED(CONFIG_OF_GPIO) || !node)
 +   if (!IS_ENABLED(CONFIG_OF_GPIO) || !(dev->of_node))
 return ERR_PTR(-ENODEV);
>>>
>>> So, since you converted to fwnode, do you still need this check?
>>>
>> Although this patch converted the device node to fwnode, only the DT
>> binding is supported here; patch 2, which adds ACPI support, will remove
>> this check.
> 
> Yes, but like I said below device_get_child_node_count() will take
> care of that, will it?
Right, device_get_child_node_count() will take care of it, so this should be removed.
> 

 -   nports = of_get_child_count(node);
 +   nports = device_get_child_node_count(dev);
 if (nports == 0)
 return ERR_PTR(-ENODEV);
>>>
>>> ...I think this one will fail if it does not find any child.
>> This one fails? Yes, it will return a failure.
>> I am not very clear here.
> 
> See above.
Here, device_get_child_node_count() returns zero if there are no children,
so I think this will work OK, won't it?
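The behaviour being agreed on here reduces to: a zero child count makes the probe bail out with -ENODEV, exactly as the old of_get_child_count() path did. A trivial model of that check (the constant and helper name are stand-ins, not the driver's code):

```c
#define MODEL_ENODEV 19   /* local stand-in for the kernel's ENODEV */

/* Mirrors the quoted probe logic: zero ports means no usable child
 * nodes, so the probe returns an error instead of continuing. */
static int check_nports(unsigned int nports)
{
    if (nports == 0)
        return -MODEL_ENODEV;
    return (int)nports;   /* otherwise proceed with that many ports */
}
```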
> 





Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya


Yes, I did, but it had no effect.

What I want to ask is: why is David's patch not used?

Thanks.



Cong Wang said, at 2/27/2016 2:29 PM:
> On Fri, Feb 26, 2016 at 8:40 PM, zhao ya  wrote:
>> From: Zhao Ya 
>> Date: Sat, 27 Feb 2016 10:06:44 +0800
>> Subject: [PATCH] IPIP tunnel performance improvement
>>
>> Bypass the per-packet neighbour creation logic when using a pointopoint
>> or loopback device.
>>
>> Recently, in our tests, we met a performance problem: when a large number
>> of packets with different target IP addresses go through an ipip tunnel,
>> PPS decreases sharply.
>>
>> The output of perf top is as follows; __write_lock_failed is first:
>>   - 5.89% [kernel]  [k] __write_lock_failed
>>  -__write_lock_failed
>>  -_raw_write_lock_bh
>>  -__neigh_create
>>  -ip_finish_output
>>  -ip_output
>>  -ip_local_out
>>
>> The neighbour subsystem creates a neighbour object for each target when
>> using a pointopoint device. When massive numbers of packets with different
>> target IP addresses are transmitted through a pointopoint device, they hit
>> a bottleneck at write_lock_bh(&tbl->lock), because each one creates a
>> neighbour object and then inserts it into the hash table at the same time.
>>
>> This patch corrects it: only one or a small number of neighbour objects
>> will be created when massive amounts of packets with different target IP
>> addresses go through an ipip tunnel.
>>
>> As a result, performance will be improved.
> 
> Well, you just basically revert another bug fix:
> 
> commit 0bb4087cbec0ef74fd416789d6aad67957063057
> Author: David S. Miller 
> Date:   Fri Jul 20 16:00:53 2012 -0700
> 
> ipv4: Fix neigh lookup keying over loopback/point-to-point devices.
> 
> We were using a special key "0" for all loopback and point-to-point
> device neigh lookups under ipv4, but we wouldn't use that special
> key for the neigh creation.
> 
> So basically we'd make a new neigh at each and every lookup :-)
> 
> This special case to use only one neigh for these device types
> is of dubious value, so just remove it entirely.
> 
> Reported-by: Eric Dumazet 
> Signed-off-by: David S. Miller 
> 
> which would bring the neigh entries counting problem back...
> 
> Did you try to tune the neigh gc parameters for your case?
> 
> Thanks.
> 



[PATCH v4 1/3] input: cygnus-update touchscreen dt node document

2016-02-26 Thread Raveendra Padasalagi
In the Cygnus SOC, the touchscreen controller registers are shared
with the ADC and the flex timer. Using readl/writel could lead to a
race condition, so the touchscreen driver is enhanced to access
registers through the syscon framework APIs, which provide mutually
exclusive access. In addition, the existing touchscreen node name
"tsc" is renamed to "touchscreen".

Hence the touchscreen dt node bindings document is updated to reflect
the above changes in the touchscreen driver.

Signed-off-by: Raveendra Padasalagi 
Reviewed-by: Ray Jui 
Reviewed-by: Scott Branden 
---
 .../input/touchscreen/brcm,iproc-touchscreen.txt| 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git 
a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
 
b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
index 34e3382..25541f3 100644
--- 
a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
+++ 
b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
@@ -2,11 +2,17 @@
 
 Required properties:
 - compatible: must be "brcm,iproc-touchscreen"
-- reg: physical base address of the controller and length of memory mapped
-  region.
+- ts_syscon: handler of syscon node defining physical base
+  address of the controller and length of memory mapped region.
+  If this property is selected please make sure MFD_SYSCON config
+  is enabled in the defconfig file.
 - clocks:  The clock provided by the SOC to driver the tsc
 - clock-name:  name for the clock
 - interrupts: The touchscreen controller's interrupt
+- address-cells: Specify the number of u32 entries needed in child nodes.
+  Should set to 1.
+- size-cells: Specify number of u32 entries needed to specify child nodes size
+  in reg property. Should set to 1.
 
 Optional properties:
 - scanning_period: Time between scans. Each step is 1024 us.  Valid 1-256.
@@ -53,13 +59,18 @@ Optional properties:
 - touchscreen-inverted-x: X axis is inverted (boolean)
 - touchscreen-inverted-y: Y axis is inverted (boolean)
 
-Example:
+Example: An example of touchscreen node
 
-   touchscreen: tsc@0x180A6000 {
+   ts_adc_syscon: ts_adc_syscon@180a6000 {
+   compatible = "brcm,iproc-ts-adc-syscon","syscon";
+   reg = <0x180a6000 0xc30>;
+   };
+
+   touchscreen: touchscreen@180A6000 {
compatible = "brcm,iproc-touchscreen";
#address-cells = <1>;
#size-cells = <1>;
-   reg = <0x180A6000 0x40>;
+   ts_syscon = <&ts_adc_syscon>;
clocks = <_clk>;
clock-names = "tsc_clk";
interrupts = ;
-- 
1.9.1




[PATCH v4 2/3] input: syscon support in bcm_iproc_tsc driver

2016-02-26 Thread Raveendra Padasalagi
In the Cygnus SOC, the touchscreen controller registers are shared
with the ADC and the flex timer. Using readl/writel could lead to a
race condition, so the touchscreen driver is enhanced to access
registers through the syscon framework APIs, which provide mutually
exclusive access.

Signed-off-by: Raveendra Padasalagi 
Reviewed-by: Ray Jui 
Reviewed-by: Scott Branden 
---
 drivers/input/touchscreen/bcm_iproc_tsc.c | 79 +--
 1 file changed, 44 insertions(+), 35 deletions(-)

diff --git a/drivers/input/touchscreen/bcm_iproc_tsc.c 
b/drivers/input/touchscreen/bcm_iproc_tsc.c
index ae460a5c..9957587 100644
--- a/drivers/input/touchscreen/bcm_iproc_tsc.c
+++ b/drivers/input/touchscreen/bcm_iproc_tsc.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #define IPROC_TS_NAME "iproc-ts"
 
@@ -88,7 +90,11 @@
 #define TS_WIRE_MODE_BITBIT(1)
 
 #define dbg_reg(dev, priv, reg) \
-   dev_dbg(dev, "%20s= 0x%08x\n", #reg, readl((priv)->regs + reg))
+do { \
+   u32 val; \
+   regmap_read(priv->regmap, reg, &val); \
+   dev_dbg(dev, "%20s= 0x%08x\n", #reg, val); \
+} while (0)
 
 struct tsc_param {
/* Each step is 1024 us.  Valid 1-256 */
@@ -141,7 +147,7 @@ struct iproc_ts_priv {
struct platform_device *pdev;
struct input_dev *idev;
 
-   void __iomem *regs;
+   struct regmap *regmap;
struct clk *tsc_clk;
 
int  pen_status;
@@ -196,22 +202,22 @@ static irqreturn_t iproc_touchscreen_interrupt(int irq, 
void *data)
int i;
bool needs_sync = false;
 
-   intr_status = readl(priv->regs + INTERRUPT_STATUS);
-   intr_status &= TS_PEN_INTR_MASK | TS_FIFO_INTR_MASK;
+   regmap_read(priv->regmap, INTERRUPT_STATUS, &intr_status);
+   intr_status &= (TS_PEN_INTR_MASK | TS_FIFO_INTR_MASK);
if (intr_status == 0)
return IRQ_NONE;
 
/* Clear all interrupt status bits, write-1-clear */
-   writel(intr_status, priv->regs + INTERRUPT_STATUS);
-
+   regmap_write(priv->regmap, INTERRUPT_STATUS, intr_status);
/* Pen up/down */
if (intr_status & TS_PEN_INTR_MASK) {
-   if (readl(priv->regs + CONTROLLER_STATUS) & TS_PEN_DOWN)
+   regmap_read(priv->regmap, CONTROLLER_STATUS, &priv->pen_status);
+   if (priv->pen_status & TS_PEN_DOWN)
priv->pen_status = PEN_DOWN_STATUS;
else
priv->pen_status = PEN_UP_STATUS;
 
-   input_report_key(priv->idev, BTN_TOUCH, priv->pen_status);
+   input_report_key(priv->idev, BTN_TOUCH, priv->pen_status);
needs_sync = true;
 
dev_dbg(&priv->pdev->dev,
@@ -221,7 +227,7 @@ static irqreturn_t iproc_touchscreen_interrupt(int irq, 
void *data)
/* coordinates in FIFO exceed the theshold */
if (intr_status & TS_FIFO_INTR_MASK) {
for (i = 0; i < priv->cfg_params.fifo_threshold; i++) {
-   raw_coordinate = readl(priv->regs + FIFO_DATA);
+   regmap_read(priv->regmap, FIFO_DATA, &raw_coordinate);
if (raw_coordinate == INVALID_COORD)
continue;
 
@@ -239,7 +245,7 @@ static irqreturn_t iproc_touchscreen_interrupt(int irq, 
void *data)
x = (x >> 4) & 0x0FFF;
y = (y >> 4) & 0x0FFF;
 
-   /* adjust x y according to lcd tsc mount angle */
+   /* Adjust x y according to LCD tsc mount angle. */
if (priv->cfg_params.invert_x)
x = priv->cfg_params.max_x - x;
 
@@ -262,9 +268,10 @@ static irqreturn_t iproc_touchscreen_interrupt(int irq, 
void *data)
 
 static int iproc_ts_start(struct input_dev *idev)
 {
-   struct iproc_ts_priv *priv = input_get_drvdata(idev);
u32 val;
+   u32 mask;
int error;
+   struct iproc_ts_priv *priv = input_get_drvdata(idev);
 
/* Enable clock */
error = clk_prepare_enable(priv->tsc_clk);
@@ -279,9 +286,10 @@ static int iproc_ts_start(struct input_dev *idev)
 *  FIFO reaches the int_th value, and pen event(up/down)
 */
val = TS_PEN_INTR_MASK | TS_FIFO_INTR_MASK;
-   writel(val, priv->regs + INTERRUPT_MASK);
+   regmap_update_bits(priv->regmap, INTERRUPT_MASK, val, val);
 
-   writel(priv->cfg_params.fifo_threshold, priv->regs + INTERRUPT_THRES);
+   val = priv->cfg_params.fifo_threshold;
+   regmap_write(priv->regmap, INTERRUPT_THRES, val);
 
/* Initialize control reg1 */
val = 0;
@@ -289,26 +297,23 @@ static int iproc_ts_start(struct input_dev *idev)
val |= priv->cfg_params.debounce_timeout << DEBOUNCE_TIMEOUT_SHIFT;
val |= priv->cfg_params.settling_timeout << SETTLING_TIMEOUT_SHIFT;
val |= priv->cfg_params.touch_timeout << TOUCH_TIMEOUT_SHIFT;
-   writel(val, priv->regs + REGCTL1);
+   regmap_write(priv->regmap, REGCTL1, val);

[PATCH v4 3/3] ARM: dts: use syscon in cygnus touchscreen dt node

2016-02-26 Thread Raveendra Padasalagi
In the Cygnus SOC, the touchscreen controller registers are shared
with the ADC and the flex timer. Using readl/writel could lead to a
race condition, so the touchscreen driver is enhanced to support
syscon-based register access, which provides mutually exclusive
access.

This patch enables syscon support in the touchscreen driver by adding
the necessary properties to the touchscreen dt node and, in addition,
renames the existing "tsc" touchscreen node to "touchscreen".

Signed-off-by: Raveendra Padasalagi 
Reviewed-by: Ray Jui 
Reviewed-by: Scott Branden 
---
 arch/arm/boot/dts/bcm-cygnus.dtsi | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/bcm-cygnus.dtsi 
b/arch/arm/boot/dts/bcm-cygnus.dtsi
index 3878793..b42fe55 100644
--- a/arch/arm/boot/dts/bcm-cygnus.dtsi
+++ b/arch/arm/boot/dts/bcm-cygnus.dtsi
@@ -351,9 +351,16 @@
< 142 10 1>;
};
 
-   touchscreen: tsc@180a6000 {
+   ts_adc_syscon: ts_adc_syscon@180a6000 {
+   compatible = "brcm,iproc-ts-adc-syscon", "syscon";
+   reg = <0x180a6000 0xc30>;
+   };
+
+   touchscreen: touchscreen@180a6000 {
compatible = "brcm,iproc-touchscreen";
-   reg = <0x180a6000 0x40>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ts_syscon = <&ts_adc_syscon>;
clocks = <&asiu_clks BCM_CYGNUS_ASIU_ADC_CLK>;
clock-names = "tsc_clk";
interrupts = ;
-- 
1.9.1






[PATCH v4 0/3] Syscon support for iProc touchscreen driver

2016-02-26 Thread Raveendra Padasalagi
This patchset is based on the v4.5-rc3 tag and is tested on the
Broadcom Cygnus SoC.

The patches can be fetched from iproc-tsc-v4 branch of
https://github.com/Broadcom/arm64-linux.git

Changes since v3:
 - Renamed touchscreen node "tsc" to "touchscreen" in dt binding document
 - Added support for syscon-only register access in the touch screen
   driver and removed "regs"-based register access. Updated the dt binding
   document to reflect the changes.
 - Removed "brcm,iproc-touchscreen-syscon" compatible string support in
   dt binding document and touchscreen driver implementation.

Changes since v2:
 - Omitted '0x' in "tsc node" definition in dt documentation file
 - Omitted '0x' in "ts_adc_syscon" definition in dt documentation file
 - Added "brcm,iproc-ts-adc-syscon" compatible string in "ts_adc_syscon"
   node. Updated dt documentation file to reflect this change.

Changes since v1:
 - Enhanced touchscreen driver to handle syscon based register access if
   "brcm,iproc-touchscreen-syscon" compatible string is provided in dt
 - Normal register access is handled through readl and writel API's if
   "brcm,iproc-touchscreen" compatible string is provided.
 - Updated touchscreen dt node document to reflect the new changes.
 - Updated change logs in each patchset to reflect the new changes

Raveendra Padasalagi (3):
  input: cygnus-update touchscreen dt node document
  input: syscon support in bcm_iproc_tsc driver
  ARM: dts: use syscon in cygnus touchscreen dt node

 .../input/touchscreen/brcm,iproc-touchscreen.txt   | 21 --
 arch/arm/boot/dts/bcm-cygnus.dtsi  | 11 ++-
 drivers/input/touchscreen/bcm_iproc_tsc.c  | 79 --
 3 files changed, 69 insertions(+), 42 deletions(-)

-- 
1.9.1





Re: [PATCH v4 1/1] ideapad-laptop: Add ideapad Y700 (15) to the no_hw_rfkill DMI list

2016-02-26 Thread John Dahlstrom

On Fri, 26 Feb 2016, Darren Hart wrote:

On Mon, Feb 22, 2016 at 10:29:48PM -0600, John Dahlstrom wrote:

[...]

Thank you for that information. One commit is sufficient to apply the patch
to all kernel versions without fuzz:

4fa9dab ideapad_laptop: Lenovo G50-30 fix rfkill reports wireless blocked


This does not yield correct results for me on 3.17 (or 3.17.8). Do you get
different results from the following?


[...]

Note that the DMI match blocks were stuffed in the middle of the
ideapad_acpi_add() function instead of the no_hw_rfkill list.


The v3, v4, and v5 patch submissions had been corrupted by spurious
whitespace inserted by my email client. I have sent a corrected version,
[PATCH v6 1/1] ideapad-laptop: Add ideapad Y700 (15) to the no_hw_rfkill DMI 
list

After submission, I verified the successful application of the patch
as received by the mailing list.

I also ran git clone, "git checkout -b 3.17 v3.17", and
"git cherry-pick 4fa9dab", and then applied the patch successfully.

Kind regards,

John



Re: [PATCH v2 1/3] input: cygnus-update touchscreen dt node document

2016-02-26 Thread Raveendra Padasalagi
Thanks Scott and Ray for the inputs. I will implement syscon only
register access and send out the changes in patch set - v4.

Regards,
Raveendra

On Tue, Feb 23, 2016 at 1:18 AM, Ray Jui  wrote:
>
>
> On 2/22/2016 11:41 AM, Scott Branden wrote:
>>
>> My comments below
>>
>> On 16-02-22 11:36 AM, Dmitry Torokhov wrote:
>>>
>>> On Fri, Feb 19, 2016 at 11:43:50AM +0530, Raveendra Padasalagi wrote:

 On Thu, Feb 18, 2016 at 8:06 PM, Rob Herring  wrote:
>
> On Wed, Feb 17, 2016 at 03:13:44PM +0530, Raveendra Padasalagi wrote:
>>
>> In Cygnus SOC touch screen controller registers are shared
>> with ADC and flex timer. Using readl/writel could lead to
>> race condition. So touch screen driver is enhanced to support
>>
>> 1. If touchscreen register's are not shared. Register access
>> is handled through readl/writel if "brcm,iproc-touchscreen"
>> compatible is provided in touchscreen dt node. This will help
>> for future SOC's if comes with dedicated touchscreen IP register's.
>>
>> 2. If touchscreen register's are shared with other IP's, register
>> access is handled through syscon framework API's to take care of
>> mutually exclusive access. This feature can be enabled by selecting
>> "brcm,iproc-touchscreen-syscon" compatible string in the touchscreen
>> dt node.
>>
>> Hence touchscreen dt node bindings document is updated to take care
>> of above changes in the touchscreen driver.
>>
>> Signed-off-by: Raveendra Padasalagi
>> 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>> ---
>>   .../input/touchscreen/brcm,iproc-touchscreen.txt   | 57
>> +++---
>>   1 file changed, 51 insertions(+), 6 deletions(-)
>>
>> diff --git
>>
>> a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> index 34e3382..f530c25 100644
>> ---
>>
>> a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> +++
>>
>> b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> @@ -1,12 +1,30 @@
>>   * Broadcom's IPROC Touchscreen Controller
>>
>>   Required properties:
>> -- compatible: must be "brcm,iproc-touchscreen"
>> -- reg: physical base address of the controller and length of
>> memory mapped
>> -  region.
>> +- compatible: should be one of
>> +"brcm,iproc-touchscreen"
>> +"brcm,iproc-touchscreen-syscon"
>
>
> More specific and this is not how you do syscon. Either the block is or
> isn't. You can't have it both ways.


 The existing driver has support for reg; if we modify it now to support
 only syscon, then this driver will not work if someone wishes to use a
 previous kernel version's dt, and vice versa. Basically this breaks dt
 compatibility. Is that ok?
>>>
>>>
>>> But the issue is that the driver does not actually work correctly with
>>> direct register access on those systems, since the registers are
>>> actually shared with other components. I am not quite sure if it is OK
>>> to break DT binding in this case...
>>
>>
>> The driver does work correctly with direct register access on those
>> systems because the other components using those registers are not
>> active in those systems - so syscon is not needed in those cases.
>>
>> I'm ok with not maintaining backwards compatibility though and always
>> using syscon.  There are no deployed systems using older versions of the
>> upstreamed kernel.
>>>
>>>
>>> Thanks.
>>>
>>
>> Regards,
>> Scott
>
>
> The iproc touchscreen is currently activated in the "bcm9hmidc.dtsi" that
> represents the optional daughter card installed on reference boards
> bcm958300k and bcm958305k. While not maintaining backwards compatibility
> *might not* be a serious issue, it would be nice if we can at least make
> sure the driver change and DT are merged into the same kernel version so
> they stay in sync.
>
> Going forward, if we are only going to support syscon based implementation,
> the existing compatible string "brcm,iproc-touchscreen" is preferred over
> "brcm,iproc-touchscreen-syscon".
>
> Thanks,
>
> Ray


Re: [PATCH v2 1/3] input: cygnus-update touchscreen dt node document

2016-02-26 Thread Raveendra Padasalagi
Thanks Scott and Ray for the inputs. I will implement syscon only
register access and send out the changes in patch set - v4.

Regards,
Raveendra

On Tue, Feb 23, 2016 at 1:18 AM, Ray Jui  wrote:
>
>
> On 2/22/2016 11:41 AM, Scott Branden wrote:
>>
>> My comments below
>>
>> On 16-02-22 11:36 AM, Dmitry Torokhov wrote:
>>>
>>> On Fri, Feb 19, 2016 at 11:43:50AM +0530, Raveendra Padasalagi wrote:

 On Thu, Feb 18, 2016 at 8:06 PM, Rob Herring  wrote:
>
> On Wed, Feb 17, 2016 at 03:13:44PM +0530, Raveendra Padasalagi wrote:
>>
>> In Cygnus SOC touch screen controller registers are shared
>> with ADC and flex timer. Using readl/writel could lead to
>> race condition. So touch screen driver is enhanced to support
>>
>> 1. If touchscreen register's are not shared. Register access
>> is handled through readl/writel if "brcm,iproc-touchscreen"
>> compatible is provided in touchscreen dt node. This will help
>> for future SOC's if comes with dedicated touchscreen IP register's.
>>
>> 2. If touchscreen register's are shared with other IP's, register
>> access is handled through syscon framework API's to take care of
>> mutually exclusive access. This feature can be enabled by selecting
>> "brcm,iproc-touchscreen-syscon" compatible string in the touchscreen
>> dt node.
>>
>> Hence touchscreen dt node bindings document is updated to take care
>> of above changes in the touchscreen driver.
>>
>> Signed-off-by: Raveendra Padasalagi
>> 
>> Reviewed-by: Ray Jui 
>> Reviewed-by: Scott Branden 
>> ---
>>   .../input/touchscreen/brcm,iproc-touchscreen.txt   | 57
>> +++---
>>   1 file changed, 51 insertions(+), 6 deletions(-)
>>
>> diff --git
>>
>> a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> index 34e3382..f530c25 100644
>> ---
>>
>> a/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> +++
>>
>> b/Documentation/devicetree/bindings/input/touchscreen/brcm,iproc-touchscreen.txt
>>
>> @@ -1,12 +1,30 @@
>>   * Broadcom's IPROC Touchscreen Controller
>>
>>   Required properties:
>> -- compatible: must be "brcm,iproc-touchscreen"
>> -- reg: physical base address of the controller and length of
>> memory mapped
>> -  region.
>> +- compatible: should be one of
>> +"brcm,iproc-touchscreen"
>> +"brcm,iproc-touchscreen-syscon"
>
>
> More specific and this is not how you do syscon. Either the block is or
> isn't. You can't have it both ways.


 The existing driver supports "reg"-based access. If we modify it now to
 support only syscon, the driver will not work for anyone using a device
 tree from a previous kernel version, and vice versa. Basically this
 breaks DT compatibility. Is that ok?
>>>
>>>
>>> But the issue is that the driver does not actually work correctly with
>>> direct register access on those systems, since the registers are
>>> actually shared with other components. I am not quite sure if it is OK
>>> to break DT binding in this case...
>>
>>
>> The driver does work correctly with direct register access on those
>> systems because the other components using those registers are not
>> active in those systems - so syscon is not needed in those cases.
>>
>> I'm ok with not maintaining backwards compatibility though and always
>> using syscon.  There are no deployed systems using older versions of the
>> upstreamed kernel.
>>>
>>>
>>> Thanks.
>>>
>>
>> Regards,
>> Scott
>
>
> The iproc touchscreen is currently activated in the "bcm9hmidc.dtsi" that
> represents the optional daughter card installed on reference boards
> bcm958300k and bcm958305k. While not maintaining backwards compatibility
> *might not* be a serious issue, it would be nice if we can at least make
> sure the driver change and DT are merged into the same kernel version so
> they stay in sync.
>
> Going forward, if we are only going to support syscon based implementation,
> the existing compatible string "brcm,iproc-touchscreen" is preferred over
> "brcm,iproc-touchscreen-syscon".
>
> Thanks,
>
> Ray


Re: [RFC][PATCH v2 3/3] mm/zsmalloc: increase ZS_MAX_PAGES_PER_ZSPAGE

2016-02-26 Thread Sergey Senozhatsky
Hello Minchan,

sorry for very long reply.

On (02/24/16 01:05), Minchan Kim wrote:
[..]
> > And the thing is -- quite huge internal class fragmentation. These are the 
> > 'normal'
> > classes, not affected by ORDER modification in any way:
> > 
> >  class  size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage compact
> >    107  1744           1           23           196       76         84                3      51
> >    111  1808           0            0            63       63         28                4       0
> >    126  2048           0          160           568      408        284                1      80
> >    144  2336          52          620          8631     5747       4932                4    1648
> >    151  2448         123          406         10090     8736       6054                3     810
> >    168  2720           0          512         15738    14926      10492                2     540
> >    190  3072           0            2           136      130        102                3       3
> > 
> > 
> > so I've been thinking about using some sort of watermarks (well, zsmalloc
> > is an allocator after all, allocators love watermarks :-)). we can't defeat
> > this fragmentation, we never know in advance which of the pages will be
> > modified or the size class those pages will land in after compression. but
> > we know stats for every class -- zs_can_compact(), obj_allocated/obj_used,
> > etc. so we can start class compaction if we detect that internal
> > fragmentation is too high (e.g. 30+% of class pages can be compacted).
> 
> AFAIR, we discussed that when I introduced compaction.
> Namely, per-class compaction.
> I love it and just wanted to do it after the soft landing of compaction.
> So, it's a good time to introduce it. ;-)

ah, yeah, indeed. I vaguely recall this. my first 'auto-compaction' submission
had this "compact every class in zs_free()", which was subject to a 10+%
performance penalty on some of the tests. but with watermarks this will be
less dramatic, I think.

> > 
> > on the other hand, we always can wait for the shrinker to come in and do 
> > the job for us,
> > but that can take some time.
> 
> Sure, with the feature, we can remove shrinker itself, I think.
> > 
> > what's your opinion on this?
> 
> I will be very happy.

good, I'll take a look later, to avoid any conflicts with your re-work.

[..]
> > does it look to you good enough to be committed on its own (off the series)?
> 
> I think it's good to have. Firstly, I thought we can get the information
> by existing stats with simple math on userspace but changed my mind
> because we could change the implementation sometime so such simple math
> might not be perfect in future and even, we can expose it easily so yes,
> let's do it.

thanks! submitted.

-ss



Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread Cong Wang
On Fri, Feb 26, 2016 at 8:40 PM, zhao ya  wrote:
> From: Zhao Ya 
> Date: Sat, 27 Feb 2016 10:06:44 +0800
> Subject: [PATCH] IPIP tunnel performance improvement
>
> Bypass the per-packet neighbour creation logic when using a
> pointopoint or loopback device.
>
> Recently, in our tests, we met a performance problem: with a large
> number of packets with different target IP addresses going through an
> ipip tunnel, PPS decreases sharply.
>
> The output of perf top is as follows; __write_lock_failed is at the top:
>   - 5.89% [kernel]  [k] __write_lock_failed
>        - __write_lock_failed
>        - _raw_write_lock_bh
>        - __neigh_create
>        - ip_finish_output
>        - ip_output
>        - ip_local_out
>
> The neighbour subsystem creates a neighbour object for each target
> when using a pointopoint device. When massive numbers of packets with
> different target IP addresses are transmitted through a pointopoint
> device, they hit a bottleneck at write_lock_bh(&tbl->lock) after
> creating the neighbour object and then inserting it into a hash table
> at the same time.
>
> This patch corrects it. Only one or a small number of neighbour objects
> will be created when massive numbers of packets with different target IP
> addresses go through the ipip tunnel.
>
> As a result, performance will be improved.

Well, you just basically revert another bug fix:

commit 0bb4087cbec0ef74fd416789d6aad67957063057
Author: David S. Miller 
Date:   Fri Jul 20 16:00:53 2012 -0700

ipv4: Fix neigh lookup keying over loopback/point-to-point devices.

We were using a special key "0" for all loopback and point-to-point
device neigh lookups under ipv4, but we wouldn't use that special
key for the neigh creation.

So basically we'd make a new neigh at each and every lookup :-)

This special case to use only one neigh for these device types
is of dubious value, so just remove it entirely.

Reported-by: Eric Dumazet 
Signed-off-by: David S. Miller 

which would bring the neigh entries counting problem back...

Did you try to tune the neigh gc parameters for your case?

Thanks.
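For context, the "neigh gc parameters" Cong refers to are the per-table garbage-collection sysctls of the neighbour subsystem. A tuning sketch follows; the values are purely illustrative, not recommendations for any particular workload:

```shell
# Raise the neighbour-cache thresholds so that a large number of
# point-to-point destinations does not constantly trigger gc and
# entry re-creation:
#   gc_thresh1: entries below this count are never gc'ed
#   gc_thresh2: soft limit, gc becomes aggressive above it
#   gc_thresh3: hard limit on the number of entries
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sysctl -w net.ipv4.neigh.default.gc_thresh3=16384
# how often (in seconds) the periodic gc runs
sysctl -w net.ipv4.neigh.default.gc_interval=60
```

Raising the thresholds keeps more neigh entries cached, so fewer packets pay the __neigh_create + write_lock_bh cost shown in the perf profile above.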



Re: [PATCH v4 1/5] getcpu_cache system call: cache CPU number of running thread

2016-02-26 Thread H. Peter Anvin

On 02/26/16 16:40, Mathieu Desnoyers wrote:
>> I think it would be a good idea to make this a general pointer for the kernel to
>> be able to write per thread state to user space, which obviously can't be done
>> with the vDSO.
>>
>> This means the libc per thread startup should query the kernel for the size of
>> this structure and allocate thread local data accordingly.  We can then grow
>> this structure if needed without making the ABI even more complex.
>>
>> This is more than a system call: this is an entirely new way for userspace to
>> interact with the kernel.  Therefore we should make it a general facility.
>
> I'm really glad to see I'm not the only one seeing potential for
> genericity here. :-) This is exactly what I had in mind
> last year when proposing the thread_local_abi() system call:
> a generic way to register an extensible per-thread data structure
> so the kernel can communicate with user-space and vice-versa.
>
> Rather than having the libc query the kernel for size of the structure,
> I would recommend that libc tells the kernel the size of the thread-local
> ABI structure it supports. The idea here is that both the kernel and libc
> need to know about the fields in that structure to allow a two-way
> interaction. Fields known only by either the kernel or userspace
> are useless for a given thread anyway. This way, libc could statically
> define the structure.

Big fat NOPE there.  Why?  Because it means that EVERY interaction with
this memory, no matter how critical, needs to be conditionalized.
Furthermore, userspace != libc.  Applications or higher-layer libraries
might have more information than the running libc about additional
fields, but with your proposal libc would gate them.

As far as the kernel providing the size in the structure (alone) -- I
*really* hope you can see what is wrong with that!!  That doesn't mean
we can't provide it in the structure as well, and that too might avoid
the skipped libc problem.

> I would be tempted to also add "features" flags, so both user-space
> and the kernel could tell each other what they support: user-space
> would announce the set of features it supports, and it could also
> query the kernel for the set of supported features. One simple approach
> would be to use a uint64_t as type for those feature flags, and
> reserve the last bit for extending to future flags if we ever have
> more than 64.
>
> Thoughts ?

It doesn't seem like it would hurt, although the size of the flags field
could end up being an issue.


-hpa




[PATCH] mm/zsmalloc: add compact column to pool stat

2016-02-26 Thread Sergey Senozhatsky
Add a new column to pool stats, which will tell us class' zs_can_compact()
number, so it will be easier to analyze zsmalloc fragmentation.

At the moment, we have only numbers of FULL and ALMOST_EMPTY classes, but
they don't tell us how badly the class is fragmented internally.

The new /sys/kernel/debug/zsmalloc/zramX/classes output looks as follows:

 class  size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage compact
[..]
    12   224           0            2           146        5          8                4       4
    13   240           0            0             0        0          0                1       0
    14   256           1           13          1840     1672        115                1      10
    15   272           0            0             0        0          0                1       0
[..]
    49   816           0            3           745      735        149                1       2
    51   848           3            4           361      306         76                4       8
    52   864          12           14           378      268         81                3      21
    54   896           1           12           117       57         26                2      12
    57   944           0            0             0        0          0                3       0
[..]
 Total                26          131         12709    10994       1071                     134

For example, from this particular output we can easily conclude that class-896
is heavily fragmented -- it occupies 26 pages, 12 can be freed by compaction.

Signed-off-by: Sergey Senozhatsky 
---
 mm/zsmalloc.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 43e4cbc..046d364 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -494,6 +494,8 @@ static void __exit zs_stat_exit(void)
debugfs_remove_recursive(zs_stat_root);
 }
 
+static unsigned long zs_can_compact(struct size_class *class);
+
 static int zs_stats_size_show(struct seq_file *s, void *v)
 {
int i;
@@ -501,14 +503,15 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
struct size_class *class;
int objs_per_zspage;
unsigned long class_almost_full, class_almost_empty;
-   unsigned long obj_allocated, obj_used, pages_used;
+   unsigned long obj_allocated, obj_used, pages_used, compact;
unsigned long total_class_almost_full = 0, total_class_almost_empty = 0;
unsigned long total_objs = 0, total_used_objs = 0, total_pages = 0;
+   unsigned long total_compact = 0;
 
-   seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s\n",
+   seq_printf(s, " %5s %5s %11s %12s %13s %10s %10s %16s %7s\n",
"class", "size", "almost_full", "almost_empty",
"obj_allocated", "obj_used", "pages_used",
-   "pages_per_zspage");
+   "pages_per_zspage", "compact");
 
for (i = 0; i < zs_size_classes; i++) {
class = pool->size_class[i];
@@ -521,6 +524,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
class_almost_empty = zs_stat_get(class, CLASS_ALMOST_EMPTY);
obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
obj_used = zs_stat_get(class, OBJ_USED);
+   compact = zs_can_compact(class);
spin_unlock(&class->lock);
 
objs_per_zspage = get_maxobj_per_zspage(class->size,
@@ -528,23 +532,25 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
pages_used = obj_allocated / objs_per_zspage *
class->pages_per_zspage;
 
-   seq_printf(s, " %5u %5u %11lu %12lu %13lu %10lu %10lu %16d\n",
+   seq_printf(s, " %5u %5u %11lu %12lu %13lu"
+   " %10lu %10lu %16d %7lu\n",
i, class->size, class_almost_full, class_almost_empty,
obj_allocated, obj_used, pages_used,
-   class->pages_per_zspage);
+   class->pages_per_zspage, compact);
 
total_class_almost_full += class_almost_full;
total_class_almost_empty += class_almost_empty;
total_objs += obj_allocated;
total_used_objs += obj_used;
total_pages += pages_used;
+   total_compact += compact;
}
 
seq_puts(s, "\n");
-   seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu\n",
+   seq_printf(s, " %5s %5s %11lu %12lu %13lu %10lu %10lu %16s %7lu\n",
"Total", "", total_class_almost_full,
total_class_almost_empty, total_objs,
-   total_used_objs, total_pages);
+   total_used_objs, total_pages, "", total_compact);
 
return 0;
 }
-- 
2.7.1



[PATCH v6 1/1] ideapad-laptop: Add ideapad Y700 (15) to the no_hw_rfkill DMI list

2016-02-26 Thread John Dahlstrom

Some Lenovo ideapad models lack a physical rfkill switch.
On Lenovo models ideapad Y700 Touch-15ISK and ideapad Y700-15ISK,
ideapad-laptop would wrongly report all radios as blocked by
hardware which caused wireless network connections to fail.

Add these models without an rfkill switch to the no_hw_rfkill list.

Signed-off-by: John Dahlstrom 
Cc:  # 3.17.x-: 4fa9dab: ideapad_laptop: Lenovo G50-30 
fix rfkill reports wireless blocked
---
  Test configuration
   Hardware: Lenovo ideapad Y700 Touch-15ISK
   Kernel version: 4.4.2

  Patch changelog
   v2 split patch between Touch and non-Touch devices
   v3 undo patch split and limit summary to 72 characters
   v4 include comprehensive list of patchable kernels in Cc
   v5 include commit ID of the prerequisite patch in Cc
   v6 3.17.x- in Cc; fix whitespace corruption from email client

  drivers/platform/x86/ideapad-laptop.c |   14 ++
  1 file changed, 14 insertions(+)

--- a/drivers/platform/x86/ideapad-laptop.c	2016-02-14 15:05:20.0 -0600
+++ b/drivers/platform/x86/ideapad-laptop.c	2016-02-16 03:54:48.484423725 -0600
@@ -864,4 +864,18 @@ static const struct dmi_system_id no_hw_rfkill_list[] = {
 			DMI_MATCH(DMI_PRODUCT_VERSION, "Lenovo G50-30"),
 		},
+	},
+	{
+		.ident = "Lenovo ideapad Y700-15ISK",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+			DMI_MATCH(DMI_PRODUCT_VERSION, "Lenovo ideapad Y700-15ISK"),
+		},
+	},
+	{
+		.ident = "Lenovo ideapad Y700 Touch-15ISK",
+		.matches = {
+			DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+			DMI_MATCH(DMI_PRODUCT_VERSION, "Lenovo ideapad Y700 Touch-15ISK"),
+		},
 	},
 	{
{



[net-next][PATCH 03/13] MAINTAINERS: update RDS entry

2016-02-26 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 355e1c8..9d79bea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9081,10 +9081,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-26 Thread Santosh Shilimkar
RDS iWARP support code has become stale and non-testable. As
indicated earlier, I am dropping support for it.

If new iWARP user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
[net-next][PATCH 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

2016-02-26 Thread Santosh Shilimkar
SO_TIMESTAMP generates a time stamp for each incoming RDS message.
A user application can enable it via the SO_TIMESTAMP setsockopt() at
SOL_SOCKET level. The CMSG data of cmsg type SCM_TIMESTAMP contains the
time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1



[net-next][PATCH 03/13] MAINTAINERS: update RDS entry

2016-02-26 Thread Santosh Shilimkar
Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 355e1c8..9d79bea 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9081,10 +9081,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: net...@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1



[net-next][PATCH 01/13] RDS: Drop stale iWARP RDMA transport

2016-02-26 Thread Santosh Shilimkar
RDS iWarp support code has become stale and non-testable. As
indicated earlier, I am dropping the support for it.

If new iWarp user(s) show up in the future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWarp needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &rds_iw_transport;
-   else
+   if (cm_id->device->node_type == 


[net-next][PATCH 05/13] RDS: IB: Re-organise ibmr code

2016-02-26 Thread Santosh Shilimkar
No functional changes. This is in preparation for adding
fastreg memory registration support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exit:
-   rds_ib_fmr_exit();
+out_mr_exit:
+   rds_ib_mr_exit();
 out:
return ret;
 }
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 09cd8e3..c88cb22 100644
--- 


[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-26 Thread Santosh Shilimkar
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR or FMR, or both FMR and FRMR. In case both MR types are
supported, FMR is used by default. Using the module parameter
'prefer_frmr', the user can choose the preferred MR method for RDS.
Of course the module parameter has no effect if the HCA supports only
FMR or only FRMR.

The default MR mode is still kept as FMR, in contrast to what everyone
else is doing. The default will be changed to FRMR once RDS performance
with FRMR is comparable with FMR. Work is in progress for the same.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 14 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..68c94b0 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -47,6 +47,7 @@
 unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
 unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
+bool prefer_frmr;
 
 module_param(rds_ib_mr_1m_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR 
supported");
 
 /*
  * we have a clumsy combination of RCU and a rwsem protecting this list
@@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr &&
+(!rds_ibdev->has_fmr || prefer_frmr));
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-26 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in a couple of tests but
was going unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   >c_laddr, >c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-26 Thread Santosh Shilimkar
The series is generated against net-next but also applies cleanly against
Linus's tip. The diffstat looks a bit scary since almost ~4K lines of code
are getting removed.

Brief summary of the series:

- Drop the stale iWARP support:
RDS iWarp support code has become stale and non-testable for
some time.  As discussed and agreed earlier on the list [1], I am dropping
its support for good. If new iWarp user(s) show up in the future,
the plan is to adapt the existing IB RDMA with a special sink case.
- RDS gets SO_TIMESTAMP support
- Long due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards new FastReg Memory registration (FRMR)
- Lastly the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have
some patches in progress to address that. But they are not ready for 4.6
so I left them out of this series.

I am also keeping an eye on the new CQ API adaptations like other ULPs are
doing, and will try to adapt RDS for the same, most likely in the 4.7
timeframe.

The entire patchset is available in the git tree below:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds

Feedback/comments welcome!

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  51 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 492 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1068 insertions(+), 5033 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c


Regards,
Santosh

[1] http://www.spinics.net/lists/linux-rdma/msg30769.html

-- 
1.9.1



[net-next][PATCH 10/13] RDS: IB: add mr reused stats

2016-02-26 Thread Santosh Shilimkar
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 20ff191..00e9064 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1



[net-next][PATCH 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

2016-02-26 Thread Santosh Shilimkar
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both FMR and FRMR. In case both MR types are
supported, FMR is used by default. Using the module parameter
'prefer_frmr', the user can choose the preferred MR method for RDS. Of
course the module parameter has no effect if the HCA supports only FMR
or only FRMR.

The default MR mode is still kept as FMR, in line with what everyone
else is doing. The default will be changed to FRMR once RDS performance
with FRMR is comparable with FMR; work on that is in progress.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c| 14 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 19 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..68c94b0 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -47,6 +47,7 @@
 unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
 unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
+bool prefer_frmr;
 
 module_param(rds_ib_mr_1m_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
@@ -54,6 +55,8 @@ module_param(rds_ib_mr_8k_pool_size, int, 0444);
 MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
+module_param(prefer_frmr, bool, 0444);
+MODULE_PARM_DESC(prefer_frmr, "Preferred MR method if both FMR and FRMR 
supported");
 
 /*
  * we have a clumsy combination of RCU and a rwsem protecting this list
@@ -140,6 +143,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr &&
+(!rds_ibdev->has_fmr || prefer_frmr));
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +188,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1



[net-next][PATCH 09/13] RDS: IB: handle the RDMA CM time wait event

2016-02-26 Thread Santosh Shilimkar
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this event occurred in a couple of tests but
went unnoticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1



[net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-26 Thread Santosh Shilimkar
The series is generated against net-next but also applies cleanly against
Linus's tip. The diffstat looks a bit scary since almost 4K lines of code
are being removed.

Brief summary of the series:

- Drop the stale iWARP support:
The RDS iWARP support code has been stale and untestable for
some time.  As discussed and agreed earlier on the list [1], I am
dropping it for good. If new iWARP users show up in the future,
the plan is to adapt the existing IB RDMA code with a special sink case.
- RDS gets SO_TIMESTAMP support
- The long-overdue RDS MAINTAINERS entry gets updated
- Some RDS IB code refactoring towards the new Fastreg Memory Registration (FRMR) mode
- Lastly, initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as with FMR, and I
have some patches in progress to address that. They are not ready for
4.6, so I left them out of this series.

I am also keeping an eye on the new CQ API adaptations the other ULPs
are doing, and will try to adapt RDS to them, most likely in the 4.7
timeframe.

The entire patch set is available in the git tree below:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds

Feedback/comments welcome!

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  51 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 492 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1068 insertions(+), 5033 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c


Regards,
Santosh

[1] http://www.spinics.net/lists/linux-rdma/msg30769.html

-- 
1.9.1



[net-next][PATCH 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

2016-02-26 Thread Santosh Shilimkar
From: Avinash Repaka 

Fastreg MR (FRMR) is another method with which one can
register memory with the HCA. Some of the newer HCAs support only the
fastreg MR mode, so we need to add support for it to RDS to keep RDS
functional on them.

Some of the older HCAs support both FMR and FRMR modes. To try out
FRMR on older HCAs, one can use the module parameter 'prefer_frmr'.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
RDS IB RDMA performance with FRMR is not yet as good as with FMR, and I
have some patches in progress to address that. They are not ready for
4.6, so I left them out of this series.

 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..a86de13
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
+
+   ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,

[net-next][PATCH 12/13] RDS: IB: allocate extra space on queues for FRMR support

2016-02-26 Thread Santosh Shilimkar
Fastreg MR (FRMR) memory registration and invalidation make use
of the work request and completion queues. This patch allocates
extra queue space for these operations.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1



[net-next][PATCH 08/13] RDS: IB: add connection info to ibmr

2016-02-26 Thread Santosh Shilimkar
Preparatory patch for FRMR support. From the connection info,
we can retrieve the cm_id, which contains the QP handle needed for
work request posting.

We also need to drop the RDS connection on QP error states,
where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1



[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file

2016-02-26 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 105 ++---
 3 files changed, 133 insertions(+), 104 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = >u.fmr;
+   *unpinned += ibmr->sg_len;
+   __rds_ib_teardown_mr(ibmr);
+   if (freed < goal ||
+   ibmr->remap_count >= ibmr->pool->fmr_attr.max_maps) {
+   if (ibmr->pool->pool_type 

[net-next][PATCH 07/13] RDS: IB: move FMR code to its own file

2016-02-26 Thread Santosh Shilimkar
No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 105 ++---
 3 files changed, 133 insertions(+), 104 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   __rds_ib_teardown_mr(ibmr);
+   if (freed < goal ||
+   ibmr->remap_count >= ibmr->pool->fmr_attr.max_maps) {
+   if (ibmr->pool->pool_type == RDS_IB_MR_8K_POOL)
+
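The allocation policy this patch factors out into rds_ib_try_reuse_ibmr() — reuse a clean MR if one exists, otherwise allocate a fresh one while under the pool cap, flushing dirty MRs and retrying a bounded number of times before giving up with -EAGAIN — can be sketched in plain C. This is a simplified userspace model, not the kernel code; `struct pool` and `try_get_mr()` are illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the MR allocation policy: reuse, allocate under
 * the limit, or flush-and-retry a bounded number of times. */
struct pool {
    int item_count;  /* MRs currently allocated */
    int max_items;   /* driver-imposed FMR limit */
    int clean;       /* clean, immediately reusable MRs */
    int dirty;       /* dirty MRs that a flush would reclaim */
};

/* Returns 1 if the caller may proceed (reused or allocated a fresh MR),
 * 0 when the pool is depleted (the kernel returns ERR_PTR(-EAGAIN)). */
static int try_get_mr(struct pool *p)
{
    for (int iter = 0; ; iter++) {
        if (p->clean > 0) {                      /* rds_ib_reuse_mr() hit */
            p->clean--;
            return 1;
        }
        if (p->item_count + 1 <= p->max_items) { /* room for a fresh MR */
            p->item_count++;
            return 1;
        }
        if (iter >= 2)                           /* pool depleted: give up */
            return 0;
        p->clean += p->dirty;                    /* flush: dirty -> reusable */
        p->dirty = 0;
    }
}
```

The bounded retry keeps the send path from stalling indefinitely while still honoring the driver's FMR limit.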

[net-next][PATCH 06/13] RDS: IB: create struct rds_ib_fmr

2016-02-26 Thread Santosh Shilimkar
Keep the FMR-related fields in their own struct. The fastreg MR structure
will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+   fmr = &ibmr->u.fmr;
__rds_ib_teardown_mr(ibmr);
if (nfreed < free_goal ||
ibmr->remap_count >= pool->fmr_attr.max_maps) {
@@ 
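The layout change above — method-specific fields pulled into `struct rds_ib_fmr` and embedded in a union inside `struct rds_ib_mr` — can be sketched standalone. The types below are userspace stand-ins for the kernel's, and `rds_ib_frmr` is a hypothetical placeholder for the future fastreg member the commit message mentions:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the IB core's FMR handle. */
struct ib_fmr { int rkey; };

/* Method-specific state for FMR-based registration. */
struct rds_ib_fmr {
    struct ib_fmr      *fmr;
    unsigned long long *dma;
};

/* Illustrative future fastreg variant (not in this patch). */
struct rds_ib_frmr { int placeholder; };

/* Common MR state stays at the top level; the union holds whichever
 * registration method's state this MR uses. */
struct rds_ib_mr {
    int sg_len;
    union {
        struct rds_ib_fmr  fmr;
        struct rds_ib_frmr frmr;
    } u;
};
```

The union keeps `sizeof(struct rds_ib_mr)` at the size of the largest variant instead of the sum of all of them, and call sites pick a variant via `&ibmr->u.fmr`.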

[net-next][PATCH 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

2016-02-26 Thread Santosh Shilimkar
This helps to combine asynchronous fastreg MR completion handler
with send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_ring);
 
-   completed = rds_ib_ring_completed(&ic->i_send_ring,
- (wc->wr_id & ~RDS_IB_SEND_OP),
- oldest);
+   completed = rds_ib_ring_completed(&ic->i_send_ring, wc->wr_id, oldest);
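Before this change, one poll routine served both completion queues, and send work-request ids carried a tag bit (RDS_IB_SEND_OP, bit 63) so the handler could tell send completions from receive completions; with dedicated per-CQ handlers the tag becomes unnecessary. A minimal model of the retired tagging scheme:

```c
#include <assert.h>
#include <stdint.h>

/* Bit 63 of the 64-bit work-request id marked send operations, leaving
 * the low bits free to carry the send-ring index. */
#define RDS_IB_SEND_OP (1ULL << 63)

/* Tag a send ring index before posting the work request. */
static uint64_t tag_send(uint64_t ring_index)
{
    return ring_index | RDS_IB_SEND_OP;
}

/* On completion: was this a send or a receive? */
static int is_send(uint64_t wr_id)
{
    return (wr_id & RDS_IB_SEND_OP) != 0;
}

/* Recover the ring index by masking the tag back off. */
static uint64_t ring_index(uint64_t wr_id)
{
    return wr_id & ~RDS_IB_SEND_OP;
}
```

Once send and receive completions arrive on separate CQs with their own tasklets, the CQ itself identifies the operation type, so `wr_id` can carry the bare ring index again.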


Re: [PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya

BTW, before the 3.5 kernel, the source code contained this logic.
In 2.6.32, for example, the arp_bind_neighbour() function has the
following logic:

__be32 nexthop = ((struct rtable *)dst)->rt_gateway;
if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
nexthop = 0;
n = __neigh_lookup_errno(
...

zhao ya said, at 2/27/2016 12:40 PM:
> From: Zhao Ya 
> Date: Sat, 27 Feb 2016 10:06:44 +0800
> Subject: [PATCH] IPIP tunnel performance improvement
> 
> Bypass per-packet neighbour creation when using a point-to-point or
> loopback device.
> 
> Recently, in our tests, we met a performance problem: when a large
> number of packets with different target IP addresses go through an
> ipip tunnel, PPS decreases sharply.
> 
> The output of perf top is as follows; __write_lock_failed is first:
>   - 5.89% [kernel][k] __write_lock_failed
>- __write_lock_failed
>- _raw_write_lock_bh
>- __neigh_create
>- ip_finish_output
>- ip_output
>- ip_local_out
> 
> The neighbour subsystem creates a neighbour object for each target
> when using a point-to-point device. When massive numbers of packets
> with different target IP addresses are transmitted through such a
> device, they hit a bottleneck at write_lock_bh(&tbl->lock) after
> creating the neighbour object and then inserting it into a hash
> table at the same time.
> 
> This patch corrects it: only one (or a few) neighbour objects will
> be created when massive numbers of packets with different target IP
> addresses go through an ipip tunnel.
> 
> As a result, performance is improved.
> 
> 
> Signed-off-by: Zhao Ya 
> Signed-off-by: Zhaoya 
> ---
>  net/ipv4/ip_output.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 64878ef..d7c0594 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
>  
>   rcu_read_lock_bh();
>   nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
> + if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
> + nexthop = 0;
>   neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
>   if (unlikely(!neigh))
>   neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
> 
> 


[PATCH] IPIP tunnel performance improvement

2016-02-26 Thread zhao ya
From: Zhao Ya 
Date: Sat, 27 Feb 2016 10:06:44 +0800
Subject: [PATCH] IPIP tunnel performance improvement

Bypass per-packet neighbour creation when using a point-to-point or
loopback device.

Recently, in our tests, we met a performance problem: when a large
number of packets with different target IP addresses go through an
ipip tunnel, PPS decreases sharply.

The output of perf top is as follows; __write_lock_failed is first:
  - 5.89% [kernel]  [k] __write_lock_failed
   - __write_lock_failed
   - _raw_write_lock_bh
   - __neigh_create
   - ip_finish_output
   - ip_output
   - ip_local_out

The neighbour subsystem creates a neighbour object for each target when
using a point-to-point device. When massive numbers of packets with
different target IP addresses are transmitted through such a device,
they hit a bottleneck at write_lock_bh(&tbl->lock) after creating the
neighbour object and then inserting it into a hash table at the same
time.

This patch corrects it: only one (or a few) neighbour objects will be
created when massive numbers of packets with different target IP
addresses go through an ipip tunnel.

As a result, performance is improved.


Signed-off-by: Zhao Ya 
Signed-off-by: Zhaoya 
---
 net/ipv4/ip_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 64878ef..d7c0594 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -202,6 +202,8 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
 
rcu_read_lock_bh();
nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
+   if (dev->flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
+   nexthop = 0;
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
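The fix works because the IPv4 neighbour cache is keyed by next-hop address: on a point-to-point or loopback device, where every packet goes to the same peer anyway, all destinations can share a single cache entry by forcing the key to 0. A minimal model of that key choice (the flag values mirror the kernel's, but `neigh_key()` itself is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Device flag values as defined in the kernel's if.h. */
#define IFF_LOOPBACK    0x8
#define IFF_POINTOPOINT 0x10

/* Choose the neighbour-cache key for an outgoing packet.  On broadcast
 * media each next hop needs its own entry (link-layer address differs
 * per hop); on point-to-point or loopback devices there is only one
 * peer, so a single shared entry keyed by 0 suffices. */
static uint32_t neigh_key(unsigned int dev_flags, uint32_t daddr)
{
    if (dev_flags & (IFF_LOOPBACK | IFF_POINTOPOINT))
        return 0;   /* one shared neighbour for all destinations */
    return daddr;   /* one neighbour per next-hop address */
}
```

Collapsing the key means the hot path takes the lockless lookup (`__ipv4_neigh_lookup_noref`) instead of repeatedly entering `__neigh_create()` and contending on the table's write lock.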




RE: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in private resource tree

2016-02-26 Thread Jake Oshins
> -Original Message-
> From: KY Srinivasan
> Sent: Friday, February 26, 2016 5:09 PM
> To: Jake Oshins ; linux-...@vger.kernel.org;
> gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> vkuzn...@redhat.com; Haiyang Zhang ; Hadden
> Hoppert 
> Cc: Jake Oshins 
> Subject: RE: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in
> private resource tree
> 
> > -Original Message-
> > From: ja...@microsoft.com [mailto:ja...@microsoft.com]
> > Sent: Wednesday, February 24, 2016 1:24 PM
> > To: linux-...@vger.kernel.org; gre...@linuxfoundation.org; KY Srinivasan
> > ; linux-kernel@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> > vkuzn...@redhat.com; Haiyang Zhang ;
> Hadden
> > Hoppert 
> > Cc: Jake Oshins 
> > Subject: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in
> private
> > resource tree
> >
> > From: Jake Oshins 
> >
> > This patch changes vmbus_allocate_mmio() and vmbus_free_mmio() so
> > that when child paravirtual devices allocate memory-mapped I/O
> > space, they allocate it privately from a resource tree pointed
> > at by hyperv_mmio and also by the public resource tree
> > iomem_resource.  This allows the region to be marked as "busy"
> > in the private tree, but a "bridge window" in the public tree,
> > guaranteeing that no two bridge windows will overlap each other
> > but while also allowing the PCI device children of the bridge
> > windows to overlap that window.
> >
> > One might conclude that this belongs in the pnp layer, rather
> > than in this driver.  Rafael Wysocki, the maintainer of the
> > pnp layer, has previously asked that we not modify the pnp layer
> > as it is considered deprecated.  This patch is thus essentially
> > a workaround.
> >
> > Signed-off-by: Jake Oshins 
> > ---
> >  drivers/hv/vmbus_drv.c | 22 +-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index b090548..2a7eb3f 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1169,7 +1169,7 @@ int vmbus_allocate_mmio(struct resource
> **new,
> > struct hv_device *device_obj,
> > resource_size_t size, resource_size_t align,
> > bool fb_overlap_ok)
> >  {
> > -   struct resource *iter;
> > +   struct resource *iter, *shadow;
> > resource_size_t range_min, range_max, start, local_min, local_max;
> > const char *dev_n = dev_name(&device_obj->device);
> > u32 fb_end = screen_info.lfb_base + (screen_info.lfb_size << 1);
> > @@ -1211,12 +1211,22 @@ int vmbus_allocate_mmio(struct resource
> > **new, struct hv_device *device_obj,
> >
> > start = (local_min + align - 1) & ~(align - 1);
> > for (; start + size - 1 <= local_max; start += align) {
> > +   shadow = __request_region(iter, start,
> > + size,
> > + NULL,
> > + IORESOURCE_BUSY);
> > +   if (!shadow)
> > +   continue;
> > +
> > *new =
> > request_mem_region_exclusive(start, size,
> > dev_n);
> > if (*new) {
> > +   shadow->name = (char*)*new;
> 
> Why are you not correctly setting the name field in the shadow structure?
> 
> Regards,
> 
> K. Y

Nothing looks at the name fields in the shadow resource tree.  So it seemed 
like that pointer could point to anything.  I figured by making it point to the 
resource claim from the iomem_resource that might be useful in debugging 
someday.  If you'd rather see something different here, it doesn't make much 
difference to me.

Thanks for the review,
Jake Oshins


RE: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in private resource tree

2016-02-26 Thread Jake Oshins
> -Original Message-
> From: KY Srinivasan
> Sent: Friday, February 26, 2016 5:09 PM
> To: Jake Oshins ; linux-...@vger.kernel.org;
> gre...@linuxfoundation.org; linux-kernel@vger.kernel.org;
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> vkuzn...@redhat.com; Haiyang Zhang ; Hadden
> Hoppert 
> Cc: Jake Oshins 
> Subject: RE: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in
> private resource tree
> 
> > -Original Message-
> > From: ja...@microsoft.com [mailto:ja...@microsoft.com]
> > Sent: Wednesday, February 24, 2016 1:24 PM
> > To: linux-...@vger.kernel.org; gre...@linuxfoundation.org; KY Srinivasan
> > ; linux-kernel@vger.kernel.org;
> > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> > vkuzn...@redhat.com; Haiyang Zhang ;
> Hadden
> > Hoppert 
> > Cc: Jake Oshins 
> > Subject: [PATCH 5/5] hv: Track allocations of children of hv_vmbus in
> private
> > resource tree
> >
> > From: Jake Oshins 
> >
> > This patch changes vmbus_allocate_mmio() and vmbus_free_mmio() so
> > that when child paravirtual devices allocate memory-mapped I/O
> > space, they allocate it privately from a resource tree pointed
> > at by hyperv_mmio and also by the public resource tree
> > iomem_resource.  This allows the region to be marked as "busy"
> > in the private tree, but a "bridge window" in the public tree,
> > guaranteeing that no two bridge windows will overlap each other,
> > while also allowing the PCI device children of the bridge
> > windows to overlap that window.
> >
> > One might conclude that this belongs in the pnp layer, rather
> > than in this driver.  Rafael Wysocki, the maintainer of the
> > pnp layer, has previously asked that we not modify the pnp layer
> > as it is considered deprecated.  This patch is thus essentially
> > a workaround.
> >
> > Signed-off-by: Jake Oshins 
> > ---
> >  drivers/hv/vmbus_drv.c | 22 +-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index b090548..2a7eb3f 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -1169,7 +1169,7 @@ int vmbus_allocate_mmio(struct resource
> **new,
> > struct hv_device *device_obj,
> > resource_size_t size, resource_size_t align,
> > bool fb_overlap_ok)
> >  {
> > -   struct resource *iter;
> > +   struct resource *iter, *shadow;
> > resource_size_t range_min, range_max, start, local_min, local_max;
> > const char *dev_n = dev_name(&device_obj->device);
> > u32 fb_end = screen_info.lfb_base + (screen_info.lfb_size << 1);
> > @@ -1211,12 +1211,22 @@ int vmbus_allocate_mmio(struct resource
> > **new, struct hv_device *device_obj,
> >
> > start = (local_min + align - 1) & ~(align - 1);
> > for (; start + size - 1 <= local_max; start += align) {
> > +   shadow = __request_region(iter, start,
> > + size,
> > + NULL,
> > + IORESOURCE_BUSY);
> > +   if (!shadow)
> > +   continue;
> > +
> > *new =
> > request_mem_region_exclusive(start, size,
> > dev_n);
> > if (*new) {
> > +   shadow->name = (char*)*new;
> 
> Why are you not correctly setting the name field in the shadow structure?
> 
> Regards,
> 
> K. Y

Nothing looks at the name fields in the shadow resource tree, so it seemed
like that pointer could point to anything. I figured that making it point to
the resource claim from iomem_resource might be useful in debugging someday.
If you'd rather see something different here, it doesn't make much
difference to me.

Thanks for the review,
Jake Oshins


Re: [RFC/RFT][PATCH v4 1/2] cpufreq: New governor using utilization data from the scheduler

2016-02-26 Thread Steve Muckle
On 02/25/2016 01:14 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
> 
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
> 
> The new governor is very simple.  It is almost as simple as it
> can be and remain reasonably functional.
> 
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter, but instead of computing the
> load as the "non-idle CPU time" to "total CPU time" ratio, it takes
> the utilization data provided by CFS as input.  More specifically,
> it represents "load" as the util/max ratio, where util and max
> are the utilization and CPU capacity coming from CFS.
> 
> All of the computations are carried out in the utilization update
> handlers provided by the new governor.  One of those handlers is
> used for cpufreq policies shared between multiple CPUs and the other
> one is for policies with one CPU only (and therefore it doesn't need
> to use any extra synchronization means).  The only operation carried
> out by the new governor's ->gov_dbs_timer callback, sugov_set_freq(),
> is a __cpufreq_driver_target() call to trigger a frequency update (to
> a value already computed beforehand in one of the utilization update
> handlers).  This means that, at least for some cpufreq drivers that
> can update CPU frequency by doing simple register writes, it should
> be possible to set the frequency in the utilization update handlers
> too in which case all of the governor's activity would take place in
> the scheduler paths invoking cpufreq_update_util() without the need
> to run anything in process context.
> 
> Currently, the governor treats all of the RT and DL tasks as
> "unknown utilization" and sets the frequency to the allowed
> maximum when updated from the RT or DL sched classes.  That
> heavy-handed approach should be replaced with something more
> specifically targeted at RT and DL tasks.
> 
> To some extent it relies on the common governor code in
> cpufreq_governor.c and it uses that code in a somewhat unusual
> way (different from what the "ondemand" and "conservative"
> governors do), so some small and rather unintrusive changes
> have to be made in that code and the other governors to support it.
> 
> However, after making it possible to set the CPU frequency from
> the utilization update handlers, that new governor's interactions
> with the common code might be limited to the initialization, cleanup
> and handling of sysfs attributes (currently only one attribute,
> sampling_rate, is supported in addition to the standard policy
> attributes handled by the cpufreq core).

We'll still need a slow path for platforms that don't support fast
transitions, right?

Or is this referring to plans to add an RT thread for slowpath changes
instead of using the dbs stuff :) .

> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> There was no v3 of this patch, but patch [2/2] had a v3 in the meantime.
> 
> Changes from v2:
> - Avoid requesting the same frequency that was requested last time for
>   the given policy.
> 
> Changes from v1:
> - Use policy->min and policy->max as min/max frequency in computations.
> 
> ---
>  drivers/cpufreq/Kconfig|   15 +
>  drivers/cpufreq/Makefile   |1 
>  drivers/cpufreq/cpufreq_conservative.c |3 
>  drivers/cpufreq/cpufreq_governor.c |   21 +-
>  drivers/cpufreq/cpufreq_governor.h |2 
>  drivers/cpufreq/cpufreq_ondemand.c |3 
>  drivers/cpufreq/cpufreq_schedutil.c|  253 
> +
>  7 files changed, 288 insertions(+), 10 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.h
> ===
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.h
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.h
> @@ -164,7 +164,7 @@ struct dbs_governor {
>   void (*free)(struct policy_dbs_info *policy_dbs);
>   int (*init)(struct dbs_data *dbs_data, bool notify);
>   void (*exit)(struct dbs_data *dbs_data, bool notify);
> - void (*start)(struct cpufreq_policy *policy);
> + bool (*start)(struct cpufreq_policy *policy);
>  };
>  
>  static inline struct dbs_governor *dbs_governor_of(struct cpufreq_policy 
> *policy)
> Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
> ===
> --- 

Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

2016-02-26 Thread Steve Muckle
On 02/26/2016 06:39 PM, Rafael J. Wysocki wrote:
>>> One thing I personally like in the RCU-based approach is its universality.  
>>> The
>>> callbacks may be installed by different entities in a uniform way: 
>>> intel_pstate
>>> can do that, the old governors can do that, my experimental schedutil code 
>>> can
>>> do that and your code could have done that too in principle.  And this is 
>>> very
>>> nice, because it is a common underlying mechanism that can be used by 
>>> everybody
>>> regardless of their particular implementations on the other side.
>>>
>>> Why would I want to use something different, then?
>>
>> I've got nothing against a callback registration mechanism. As you
>> mentioned in another mail it could itself use static keys, enabling the
>> static key when a callback is registered and disabling it otherwise to
>> avoid calling into cpufreq_update_util().
> 
> But then it would only make a difference if cpufreq_update_util() was not
> used at all (ie. no callbacks installed for any policies by anyone).  The
> only reason why it may matter is that the total number of systems using
> the performance governor is quite large AFAICS and they would benefit from
> that.

I'd think that's a benefit worth preserving, but I guess that's Peter
and Ingo's call.

> 
...
 +/*
 + * Capacity margin added to CFS and RT capacity requests to provide
 + * some head room if task utilization further increases.
 + */
>>>
>>> OK, where does this number come from?
>>
>> Someone's posterior :) .
>>
>> This really should be a tunable IMO, but there's a fairly strong
>> anti-tunable sentiment, so it's been left hard-coded in an attempt to
>> provide something that "just works."
> 
> Ouch.
> 
>> At the least I can add a comment saying that the 20% idle headroom
>> requirement was an off the cuff estimate and that at this time, we don't
>> have significant data to suggest it's the best number.
> 
> Well, in this area, every number has to be justified.  Otherwise we end
> up with things that sort of work, but nobody actually understands why.

It's just a starting point. There's a lot of profiling and tuning that
has yet to happen. We figured there were larger design issues to discuss
prior to spending a lot of time tweaking the headroom value.

> 
> [cut]
> 
>>>
 +
 +static int cpufreq_sched_thread(void *data)
 +{
>>>
>>> Now, what really is the advantage of having those extra threads vs using
>>> workqueues?
>>>
>>> I guess the underlying concern is that RT tasks may stall workqueues 
>>> indefinitely
>>> in theory and then the frequency won't be updated, but there's much more 
>>> kernel
>>> stuff run from workqueues and if that is starved, you won't get very far 
>>> anyway.
>>>
>>> If you take special measures to prevent frequency change requests from being
>>> stalled by RT tasks, question is why are they so special?  Aren't there any
>>> other kernel activities that also should be protected from that and may be
>>> more important than CPU frequency changes?
>>
>> I think updating the CPU frequency during periods of heavy RT/DL load is
>> one of the most (if not the most) important things. I can't speak for
>> other system activities that may get blocked, but there's an opportunity
>> to protect CPU frequency changes here, and it seems worth taking to me.
> 
> So do it in a general way for everybody and not just for one governor
> that you happen to be working on.
> 
> That said I'm unconvinced about the approach still.
> 
> Having more RT threads in a system that already is under RT pressure seems 
> like
> a recipe for trouble.  Moreover, it's likely that those new RT threads will
> disturb the system's normal operation somehow even without the RT pressure and
> have you investigated that?

Sorry, I'm not sure what you mean by "disturb normal operation."

Generally speaking, increasing the capacity of a heavily loaded system
seems to me to be something that should run urgently, so that the system
can potentially get itself out of trouble and meet the workload's needs.

> Also having them per policy may be overkill and
> binding them to policy CPUs only is not necessary.
> 
> Overall, it looks like a dynamic pool of threads that may run on every CPU
> might be a better approach, but that would almost duplicate the workqueues
> subsystem, so is it really worth it?
> 
> And is the problem actually visible in practice?  I have no record of any 
> reports
> mentioning it, although theoretically it's been there forever, so had it been
> real, someone would have noticed it and complained about it IMO.

While I don't have a test case drawn up to provide, it seems like it'd be
easy to create one. More importantly, the interactive governor in Android
uses this same kind of model, starting a frequency change thread and
making it RT. Android is particularly sensitive to latency in frequency
response. So that's likely one big reason why you're not hearing about
this issue - some folks have already 


Re: [PATCH v12 4/5] arm64, numa: Add NUMA support for arm64 platforms.

2016-02-26 Thread Ganapatrao Kulkarni
On Sat, Feb 27, 2016 at 1:21 AM, David Daney  wrote:
> On 02/26/2016 10:53 AM, Will Deacon wrote:
> [...]
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> new file mode 100644
>>> index 000..604e886
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -0,0 +1,403 @@
>
> [...]
>>>
>>> +
>>> +static int numa_off;
>>> +static int numa_distance_cnt;
>>> +static u8 *numa_distance;
>>> +
>>> +static __init int numa_parse_early_param(char *opt)
>>> +{
>>> +   if (!opt)
>>> +   return -EINVAL;
>>> +   if (!strncmp(opt, "off", 3)) {
>>> +   pr_info("%s\n", "NUMA turned off");
>>> +   numa_off = 1;
>>> +   }
>>> +   return 0;
>>> +}
>>> +early_param("numa", numa_parse_early_param);
>>
>>
>> Curious, but when is this option actually useful?
>>
>
> Good point.  I will remove that bit, it was used as an aid in debugging
> while bringing up the patch set.

This is handy when debugging new platforms.
This boot argument forces the system to boot as a single-node dummy
system, adding all resources to node 0.
>
>
>
>>> +
>>> +cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>>> +EXPORT_SYMBOL(node_to_cpumask_map);
>>> +
>>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>>> +
>>> +/*
>>> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
>>> + */
>>> +const struct cpumask *cpumask_of_node(int node)
>>> +{
>>> +   if (WARN_ON(node >= nr_node_ids))
>>> +   return cpu_none_mask;
>>> +
>>> +   if (WARN_ON(node_to_cpumask_map[node] == NULL))
>>> +   return cpu_online_mask;
>>> +
>>> +   return node_to_cpumask_map[node];
>>> +}
>>> +EXPORT_SYMBOL(cpumask_of_node);
>>> +
>>> +#endif
>>> +
>>> +static void map_cpu_to_node(unsigned int cpu, int nid)
>>> +{
>>> +   set_cpu_numa_node(cpu, nid);
>>> +   if (nid >= 0)
>>> +   cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>>> +}
>>> +
>>> +static void unmap_cpu_to_node(unsigned int cpu)
>>> +{
>>> +   int nid = cpu_to_node(cpu);
>>> +
>>> +   if (nid >= 0)
>>> +   cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
>>> +   set_cpu_numa_node(cpu, NUMA_NO_NODE);
>>> +}
>>
>>
>> How do you end up with negative nids this late in the game?
>>
>
> It might be possible with some of the hot plugging code.  It is a little
> paranoia programming.
>
> If you really don't like it, we can remove it.
>
>>> +
>>> +void numa_clear_node(unsigned int cpu)
>>> +{
>>> +   unmap_cpu_to_node(cpu);
>>
>>
>> Why don't you just inline this function?
>
>
> Good point, I will do that.
>
> [...]
>>>
>>> +int __init numa_add_memblk(int nid, u64 start, u64 size)
>>> +{
>>> +   int ret;
>>> +
>>> +   ret = memblock_set_node(start, size, &memblock.memory, nid);
>>> +   if (ret < 0) {
>>> +   pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on
>>> node %d\n",
>>> +   start, (start + size - 1), nid);
>>> +   return ret;
>>> +   }
>>> +
>>> +   node_set(nid, numa_nodes_parsed);
>>> +   pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
>>> +   start, (start + size - 1), nid);
>>> +   return ret;
>>> +}
>>> +EXPORT_SYMBOL(numa_add_memblk);
>>
>>
>> But this is marked __init... (and you've done this elsewhere in the patch
>> too).
>
>
> I will fix these.
>
>
>>
>> Will
>>
>

Ganapat
>
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH] printk/nmi: restore printk_func in nmi_panic

2016-02-26 Thread Sergey Senozhatsky
On (02/27/16 12:09), Sergey Senozhatsky wrote:
> On (02/27/16 11:19), Sergey Senozhatsky wrote:
> [..]
> > > I think about a compromise. We should try to get the messages
> > > out only when kdump is not enabled.
> > 
> > can we zap_locks() if we are on 
> > nmi_panic()->panic()->console_flush_on_panic() path?
> > console_flush_on_panic() is happening after we send out smp_send_stop().
> 
> can something like this do the trick?

hm, no. it can't.

I forgot to move printk_nmi_exit() from nmi_panic() to panic(). so
it should have been:

panic()
...
printk_nmi_exit()
console_flush_on_panic()
__zap_locks()
printk_nmi_flush()
console_unlock()

but this __zap_locks() can _in theory_ race with irq_work->printk_nmi_flush().
so we need something more than this...

-ss


Re: [PATCH V2 11/12] net-next: mediatek: add Kconfig and Makefile

2016-02-26 Thread kbuild test robot
Hi John,

[auto build test ERROR on net/master]
[also build test ERROR on v4.5-rc5 next-20160226]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improving the system]

url:
https://github.com/0day-ci/linux/commits/John-Crispin/net-next-mediatek-add-ethernet-driver/20160226-223245
config: arm64-allmodconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm64 

All error/warnings (new ones prefixed by >>):

   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_init_fq_dma':
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:771:22: warning: passing 
>> argument 3 of 'dma_alloc_coherent' from incompatible pointer type
 eth->scratch_ring = dma_alloc_coherent(eth->dev,
 ^
   In file included from drivers/net/ethernet/mediatek/mtk_eth_soc.c:18:0:
   include/linux/dma-mapping.h:396:21: note: expected 'dma_addr_t *' but 
argument is of type 'unsigned int *'
static inline void *dma_alloc_coherent(struct device *dev, size_t size,
^
   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_probe':
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:2059:2: warning: ignoring return 
>> value of 'device_reset', declared with attribute warn_unused_result 
>> [-Wunused-result]
 device_reset(&pdev->dev);
 ^
--
   drivers/net/ethernet/mediatek/ethtool.c: In function 'mtk_set_settings':
>> drivers/net/ethernet/mediatek/ethtool.c:49:38: error: 'struct phy_device' 
>> has no member named 'addr'
 if (cmd->phy_address != mac->phy_dev->addr) {
 ^
>> drivers/net/ethernet/mediatek/ethtool.c:54:23: error: 'struct mii_bus' has 
>> no member named 'phy_map'
  mac->hw->mii_bus->phy_map[cmd->phy_address]) {
  ^
   drivers/net/ethernet/mediatek/ethtool.c:56:21: error: 'struct mii_bus' has 
no member named 'phy_map'
mac->hw->mii_bus->phy_map[cmd->phy_address];
^

vim +49 drivers/net/ethernet/mediatek/ethtool.c

79b0e682 John Crispin 2016-02-26  43  {
79b0e682 John Crispin 2016-02-26  44struct mtk_mac *mac = netdev_priv(dev);
79b0e682 John Crispin 2016-02-26  45  
79b0e682 John Crispin 2016-02-26  46if (!mac->phy_dev)
79b0e682 John Crispin 2016-02-26  47return -ENODEV;
79b0e682 John Crispin 2016-02-26  48  
79b0e682 John Crispin 2016-02-26 @49if (cmd->phy_address != 
mac->phy_dev->addr) {
79b0e682 John Crispin 2016-02-26  50if 
(mac->hw->phy->phy_node[cmd->phy_address]) {
79b0e682 John Crispin 2016-02-26  51mac->phy_dev = 
mac->hw->phy->phy[cmd->phy_address];
79b0e682 John Crispin 2016-02-26  52mac->phy_flags = 
MTK_PHY_FLAG_PORT;
79b0e682 John Crispin 2016-02-26  53} else if (mac->hw->mii_bus &&
79b0e682 John Crispin 2016-02-26 @54   
mac->hw->mii_bus->phy_map[cmd->phy_address]) {
79b0e682 John Crispin 2016-02-26  55mac->phy_dev =
79b0e682 John Crispin 2016-02-26  56
mac->hw->mii_bus->phy_map[cmd->phy_address];
79b0e682 John Crispin 2016-02-26  57mac->phy_flags = 
MTK_PHY_FLAG_ATTACH;

:: The code at line 49 was first introduced by commit
:: 79b0e682b3b2ed2a983b0263c6b8b3af61fdbf8e net-next: mediatek: add the 
drivers core files

:: TO: John Crispin <blo...@openwrt.org>
:: CC: 0day robot <fengguang...@intel.com>

---
0-DAY kernel test infrastructure    Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


[PATCH 1/2] clocksource: introduce clocksource_freq2mult()

2016-02-26 Thread John Stultz
From: Alexander Kuleshov 

The clocksource_khz2mult() and clocksource_hz2mult() helpers share
similar code which calculates a mult from a given frequency; the two
implementations differ only in the value of the frequency. This patch
introduces the clocksource_freq2mult() helper with a generic
implementation of the mult calculation to prevent code duplication.

Cc: Thomas Gleixner 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Ingo Molnar 
Signed-off-by: Alexander Kuleshov 
Signed-off-by: John Stultz 
---
 include/linux/clocksource.h | 45 +++--
 1 file changed, 19 insertions(+), 26 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 6013021..a307bf6 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -118,6 +118,23 @@ struct clocksource {
 /* simplify initialization of mask field */
 #define CLOCKSOURCE_MASK(bits) (cycle_t)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
 
+static inline u32 clocksource_freq2mult(u32 freq, u32 shift_constant, u64 from)
+{
+   /*  freq = cyc/from
+*  mult/2^shift  = ns/cyc
+*  mult = ns/cyc * 2^shift
+*  mult = from/freq * 2^shift
+*  mult = from * 2^shift / freq
+*  mult = (from<<shift) / freq
+*/
+   u64 tmp = ((u64)from) << shift_constant;
+
+   tmp += freq/2; /* round for do_div */
+   do_div(tmp, freq);
+
+   return (u32)tmp;
+}


Re: [PATCH net-next 7/9] net: dsa: mv88e6xxx: restore VLANTable map control

2016-02-26 Thread Andrew Lunn
On Fri, Feb 26, 2016 at 10:47:38PM +, Kevin Smith wrote:
> Hi Andrew,
> 
> On 02/26/2016 04:35 PM, Andrew Lunn wrote:
> > On Fri, Feb 26, 2016 at 10:12:28PM +, Kevin Smith wrote:
> >> Hi Vivien, Andrew,
> >>
> >> On 02/26/2016 03:37 PM, Vivien Didelot wrote:
> >>> Here, 5 is the CPU port and 6 is a DSA port.
> >>>
> >>> After joining ports 0, 1, 2 in the same bridge, we end up with:
> >>>
> >>> Port  0  1  2  3  4  5  6
> >>> 0   -  *  *  -  -  *  *
> >>> 1   *  -  *  -  -  *  *
> >>> 2   *  *  -  -  -  *  *
> >>> 3   -  -  -  -  -  *  *
> >>> 4   -  -  -  -  -  *  *
> >>> 5   *  *  *  *  *  -  *
> >>> 6   *  *  *  *  *  *  -
> >> The case I am concerned about is if the switch connected over DSA in
> >> this example has a WAN port on it, which can legitimately route to the
> >> CPU on port 5 but should not route to the LAN ports 0, 1, and 2.  Does
> >> this VLAN allow direct communication between the WAN and LAN?  Or is
> >> this prevented by DSA or some other mechanism?
> > A typical WIFI access point with a connection to a cable modem.
> >
> > So in linux you have interfaces like
> >
> > lan0, lan1, lan2, lan3, wan0
> >
> > DSA provides you these interfaces. And by default they are all
> > separated. There is no path between them. You can consider them as
> > being separate physical ethernet cards, just like all other interfaces
> > in linux.
> >
> > What you would typically do is:
> >
> > brctl addbr br0
> > brctl addif br0 lan0
> > brctl addif br0 lan1
> > brctl addif br0 lan2
> > brctl addif br0 lan3
> >
> > to create a bridge between the lan ports. The linux kernel will then
> > push this bridge configuration down into the hardware, so the switch
> > can forward frames between these ports.
> >
> > The wan port is not part of the bridge, so there is no L2 path to the
> > WAN port. You need to do IP routing on the CPU.
> >
> > Linux takes the stance that switch ports interfaces should act just
> > like any other linux interface and you configure them in the normal
> > linux way.
> >
> >  Andrew
> 
> Thanks for the explanation.  I am a bit befuddled by the combination of 
> all the possible configurations of the switch and how they interact with 
> Linux.  :)  I think I understand what is happening now.

You might also be looking at this the wrong way around. It is best to
think of the switch as a hardware accelerator. It offers functions to
the linux network stack to accelerate part of the linux network
stack. We only push out to the hardware functions it is capable of
accelerating. What it cannot accelerate stays in software. Think of it
as a GPU, but for networking...

  Andrew


[PATCH 0/2][GIT PULL] Timekeeping updates to tip/timers/core for 4.6

2016-02-26 Thread John Stultz
Hey Thomas, Ingo,
  Here's my somewhat truncated queue for 4.6. I was hoping to
get the cross-timestamp patchset from Christopher sent along,
but he's got some last minute changes to address feedback from
Andy, so I'm holding off.

If the response is good for that last change, I may try to send
another set with those changes next week, but we're cutting it
fairly close to -rc6, so I'll check with you before doing so.

So.. two tiny cleanup fixes is all for now.

Boring is good, right?

Let me know if you have any thoughts or objections.

thanks
-john


The following changes since commit 18558cae0272f8fd9647e69d3fec1565a7949865:

  Linux 4.5-rc4 (2016-02-14 13:05:20 -0800)

are available in the git repository at:

  https://git.linaro.org/people/john.stultz/linux.git fortglx/4.6/time

for you to fetch changes up to ea23c1da598bfd361af5cd769e292a9667ecb7ab:

  jiffies: use CLOCKSOURCE_MASK instead of constant (2016-02-26 19:05:25 -0800)


Alexander Kuleshov (2):
  clocksource: introduce clocksource_freq2mult()
  jiffies: use CLOCKSOURCE_MASK instead of constant

 include/linux/clocksource.h | 45 +++--
 kernel/time/jiffies.c   |  2 +-
 2 files changed, 20 insertions(+), 27 deletions(-)



Cc: Thomas Gleixner 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Ingo Molnar 
Cc: Christopher Hall 

Alexander Kuleshov (2):
  clocksource: introduce clocksource_freq2mult()
  jiffies: use CLOCKSOURCE_MASK instead of constant

 include/linux/clocksource.h | 45 +++--
 kernel/time/jiffies.c   |  2 +-
 2 files changed, 20 insertions(+), 27 deletions(-)

-- 
1.9.1



[PATCH 2/2] jiffies: use CLOCKSOURCE_MASK instead of constant

2016-02-26 Thread John Stultz
From: Alexander Kuleshov 

The CLOCKSOURCE_MASK(32) macro expands to the same value, but
makes code more readable.

Cc: Thomas Gleixner 
Cc: Prarit Bhargava 
Cc: Richard Cochran 
Cc: Ingo Molnar 
Signed-off-by: Alexander Kuleshov 
Signed-off-by: John Stultz 
---
 kernel/time/jiffies.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/jiffies.c b/kernel/time/jiffies.c
index 347fecf..555e21f 100644
--- a/kernel/time/jiffies.c
+++ b/kernel/time/jiffies.c
@@ -68,7 +68,7 @@ static struct clocksource clocksource_jiffies = {
.name   = "jiffies",
.rating = 1, /* lowest valid rating*/
.read   = jiffies_read,
-   .mask   = 0xffffffff, /*32bits*/
+   .mask   = CLOCKSOURCE_MASK(32),
.mult   = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */
.shift  = JIFFIES_SHIFT,
.max_cycles = 10,
-- 
1.9.1




Re: [PATCH] printk/nmi: restore printk_func in nmi_panic

2016-02-26 Thread Sergey Senozhatsky
On (02/27/16 11:19), Sergey Senozhatsky wrote:
[..]
> > I think about a compromise. We should try to get the messages
> > out only when kdump is not enabled.
> 
> can we zap_locks() if we are on 
> nmi_panic()->panic()->console_flush_on_panic() path?
> console_flush_on_panic() is happening after we send out smp_send_stop().

can something like this do the trick?

8<

From 4186873bb4574b4bbb227e7448d56599849de0bd Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky 
Subject: [PATCH] printk/nmi: restore printk_func in nmi_panic

When the watchdog detects a hardlockup and calls nmi_panic(), `printk_func'
must be restored via a printk_nmi_exit() call, so panic() will be able
to flush the nmi buffer and show the backtrace and panic message. We had
also better explicitly call printk_nmi_flush() from console_flush_on_panic(),
because it may be too late to rely on irq work.

Factor out __zap_locks(), and call it from console_flush_on_panic().
This is normally not needed, because logbuf_lock always comes with
IRQ disable/enable magic, but we can panic() from nmi.

Signed-off-by: Sergey Senozhatsky 
---
 include/linux/kernel.h |  6 --
 kernel/printk/printk.c | 27 ---
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index f4fa2b2..3ee33d5 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -469,10 +469,12 @@ do {  
\
cpu = raw_smp_processor_id();   \
	old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, cpu);   \
\
-   if (old_cpu == PANIC_CPU_INVALID)   \
+   if (old_cpu == PANIC_CPU_INVALID) { \
+   printk_nmi_exit();  \
panic(fmt, ##__VA_ARGS__);  \
-   else if (old_cpu != cpu)\
+   } else if (old_cpu != cpu) {\
nmi_panic_self_stop(regs);  \
+   }   \
 } while (0)
 
 /*
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 9917f69..0a318ed 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1462,6 +1462,15 @@ static void call_console_drivers(int level,
}
 }
 
+static void __zap_locks(void)
+{
+   debug_locks_off();
+   /* If a crash is occurring, make sure we can't deadlock */
+   raw_spin_lock_init(&logbuf_lock);
+   /* And make sure that we print immediately */
+   sema_init(&console_sem, 1);
+}
+
 /*
  * Zap console related locks when oopsing.
  * To leave time for slow consoles to print a full oops,
@@ -1477,11 +1486,7 @@ static void zap_locks(void)
 
oops_timestamp = jiffies;
 
-   debug_locks_off();
-   /* If a crash is occurring, make sure we can't deadlock */
-   raw_spin_lock_init(&logbuf_lock);
-   /* And make sure that we print immediately */
-   sema_init(&console_sem, 1);
+   __zap_locks();
 }
 
 int printk_delay_msec __read_mostly;
@@ -2386,15 +2391,15 @@ void console_unblank(void)
  */
 void console_flush_on_panic(void)
 {
-   /*
-* If someone else is holding the console lock, trylock will fail
-* and may_schedule may be set.  Ignore and proceed to unlock so
-* that messages are flushed out.  As this can be called from any
-* context and we don't want to get preempted while flushing,
-* ensure may_schedule is cleared.
+   __zap_locks();
+
+   /* As this can be called from any context and we don't want
+* to get preempted while flushing, ensure may_schedule is
+* cleared.
 */
console_trylock();
console_may_schedule = 0;
+   printk_nmi_flush();
console_unlock();
 }
 
-- 
2.7.1



Re: [PATCH v3 22/22] sound/usb: Use Media Controller API to share media resources

2016-02-26 Thread Shuah Khan
On 02/26/2016 01:50 PM, Takashi Iwai wrote:
> On Fri, 26 Feb 2016 21:08:43 +0100,
> Shuah Khan wrote:
>>
>> On 02/26/2016 12:55 PM, Takashi Iwai wrote:
>>> On Fri, 12 Feb 2016 00:41:38 +0100,
>>> Shuah Khan wrote:

 Change ALSA driver to use Media Controller API to
 share media resources with DVB and V4L2 drivers
 on a AU0828 media device. Media Controller specific
 initialization is done after sound card is registered.
 ALSA creates Media interface and entity function graph
 nodes for Control, Mixer, PCM Playback, and PCM Capture
 devices.

 snd_usb_hw_params() will call Media Controller enable
 source handler interface to request the media resource.
 If resource request is granted, it will release it from
 snd_usb_hw_free(). If resource is busy, -EBUSY is returned.

 Media specific cleanup is done in usb_audio_disconnect().

 Signed-off-by: Shuah Khan 
 ---
  sound/usb/Kconfig|   4 +
  sound/usb/Makefile   |   2 +
  sound/usb/card.c |  14 +++
  sound/usb/card.h |   3 +
  sound/usb/media.c| 318 
 +++
  sound/usb/media.h|  72 +++
  sound/usb/mixer.h|   3 +
  sound/usb/pcm.c  |  28 -
  sound/usb/quirks-table.h |   1 +
  sound/usb/stream.c   |   2 +
  sound/usb/usbaudio.h |   6 +
  11 files changed, 448 insertions(+), 5 deletions(-)
  create mode 100644 sound/usb/media.c
  create mode 100644 sound/usb/media.h

 diff --git a/sound/usb/Kconfig b/sound/usb/Kconfig
 index a452ad7..ba117f5 100644
 --- a/sound/usb/Kconfig
 +++ b/sound/usb/Kconfig
 @@ -15,6 +15,7 @@ config SND_USB_AUDIO
select SND_RAWMIDI
select SND_PCM
select BITREVERSE
 +  select SND_USB_AUDIO_USE_MEDIA_CONTROLLER if MEDIA_CONTROLLER && 
 MEDIA_SUPPORT
>>>
>>> Looking at the media Kconfig again, this would be broken if
>>> MEDIA_SUPPORT=m and SND_USB_AUDIO=y.  The ugly workaround is something
>>> like:
>>> select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
>>> if MEDIA_CONTROLLER && (MEDIA_SUPPORT=y || MEDIA_SUPPORT=SND)
>>
>> My current config is MEDIA_SUPPORT=m and SND_USB_AUDIO=y
>> It is working and I didn't see any issues so far.
> 
Re: [PATCH v3 22/22] sound/usb: Use Media Controller API to share media resources

2016-02-26 Thread Shuah Khan
On 02/26/2016 01:50 PM, Takashi Iwai wrote:
> On Fri, 26 Feb 2016 21:08:43 +0100,
> Shuah Khan wrote:
>>
>> On 02/26/2016 12:55 PM, Takashi Iwai wrote:
>>> On Fri, 12 Feb 2016 00:41:38 +0100,
>>> Shuah Khan wrote:

 Change ALSA driver to use Media Controller API to
 share media resources with DVB and V4L2 drivers
 on a AU0828 media device. Media Controller specific
 initialization is done after sound card is registered.
 ALSA creates Media interface and entity function graph
 nodes for Control, Mixer, PCM Playback, and PCM Capture
 devices.

 snd_usb_hw_params() will call Media Controller enable
 source handler interface to request the media resource.
 If resource request is granted, it will release it from
 snd_usb_hw_free(). If resource is busy, -EBUSY is returned.

 Media specific cleanup is done in usb_audio_disconnect().

 Signed-off-by: Shuah Khan 
 ---
  sound/usb/Kconfig|   4 +
  sound/usb/Makefile   |   2 +
  sound/usb/card.c |  14 +++
  sound/usb/card.h |   3 +
  sound/usb/media.c| 318 +++
  sound/usb/media.h|  72 +++
  sound/usb/mixer.h|   3 +
  sound/usb/pcm.c  |  28 -
  sound/usb/quirks-table.h |   1 +
  sound/usb/stream.c   |   2 +
  sound/usb/usbaudio.h |   6 +
  11 files changed, 448 insertions(+), 5 deletions(-)
  create mode 100644 sound/usb/media.c
  create mode 100644 sound/usb/media.h

 diff --git a/sound/usb/Kconfig b/sound/usb/Kconfig
 index a452ad7..ba117f5 100644
 --- a/sound/usb/Kconfig
 +++ b/sound/usb/Kconfig
 @@ -15,6 +15,7 @@ config SND_USB_AUDIO
select SND_RAWMIDI
select SND_PCM
select BITREVERSE
 +  select SND_USB_AUDIO_USE_MEDIA_CONTROLLER if MEDIA_CONTROLLER && MEDIA_SUPPORT
>>>
>>> Looking at the media Kconfig again, this would be broken if
>>> MEDIA_SUPPORT=m and SND_USB_AUDIO=y.  The ugly workaround is something
>>> like:
>>> select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
>>> if MEDIA_CONTROLLER && (MEDIA_SUPPORT=y || MEDIA_SUPPORT=SND)
>>
>> My current config is MEDIA_SUPPORT=m and SND_USB_AUDIO=y
>> It is working and I didn't see any issues so far.
> 
> Hmm, how can that be?  In drivers/media/Makefile:
> 
> ifeq ($(CONFIG_MEDIA_CONTROLLER),y)
>   obj-$(CONFIG_MEDIA_SUPPORT) += media.o
> endif
> 
> So it's a module.  Meanwhile you have a reference from the usb-audio driver
> that is built into the kernel.  How is the symbol resolved?

Sorry my mistake. I misspoke. My config had:
CONFIG_MEDIA_SUPPORT=m
CONFIG_MEDIA_CONTROLLER=y
CONFIG_SND_USB_AUDIO=m

The following doesn't work as you pointed out.

CONFIG_MEDIA_SUPPORT=m
CONFIG_MEDIA_CONTROLLER=y
CONFIG_SND_USB_AUDIO=y

Okay, here is what will work for all of the possible
combinations of CONFIG_MEDIA_SUPPORT and CONFIG_SND_USB_AUDIO:

select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
   if MEDIA_CONTROLLER && ((MEDIA_SUPPORT=y) || (MEDIA_SUPPORT=m && SND_USB_AUDIO=m))

The above will cover the cases when

1. CONFIG_MEDIA_SUPPORT and CONFIG_SND_USB_AUDIO are
   both modules
   CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected

2. CONFIG_MEDIA_SUPPORT=y and CONFIG_SND_USB_AUDIO=m
   CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected

3. CONFIG_MEDIA_SUPPORT=y and CONFIG_SND_USB_AUDIO=y
   CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER is selected

4. CONFIG_MEDIA_SUPPORT=m and CONFIG_SND_USB_AUDIO=y
   This is when we don't want
   CONFIG_SND_USB_AUDIO_USE_MEDIA_CONTROLLER selected

I verified all of the above combinations to make sure
the logic works.

If you think of a better way to do this please let me
know. I will go ahead and send patch v4 with the above
change and you can decide if that is acceptable.
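
In context, the whole condition could look like the following Kconfig sketch (hypothetical fragment; only the `select` line comes from this discussion, and the rest of the entry is abbreviated from sound/usb/Kconfig):

```kconfig
config SND_USB_AUDIO
	tristate "USB Audio/MIDI driver"
	select SND_PCM
	select SND_USB_AUDIO_USE_MEDIA_CONTROLLER \
		if MEDIA_CONTROLLER && \
		   (MEDIA_SUPPORT=y || (MEDIA_SUPPORT=m && SND_USB_AUDIO=m))
```

The `MEDIA_SUPPORT=m && SND_USB_AUDIO=m` arm is what excludes case 4 above, where a built-in usb-audio driver would otherwise reference symbols in the media module.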

>>>
>>> Other than that, it looks more or less OK to me.
>>> The way how media_stream_init() gets called is a bit worrisome, but it
>>> should work practically.  Another concern is about the disconnection.
>>> Can all function calls in media_device_delete() be safe even if it's
>>> called while the application still opens the MC device?
>>
>> Right. I have been looking into device removal path when
>> ioctls are active and I can resolve any issues that might
>> surface while an audio app is active when device is removed.
> 
> So, it's 100% safe to call all these media_*() functions while the
> device is being accessed before closing?
> 

There is a known problem with device removal when the
media device file is open and an ioctl is in progress.
This isn't specific to this patch series; it is a
general problem related to media device removal.

I am working on a fix for this problem. As you said
earlier, I can work on fixing issues after the merge.

thanks,
-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shua...@osg.samsung.com 

Re: [PATCH trivial] include/linux/gfp.h: Improve the coding styles

2016-02-26 Thread Theodore Ts'o
On Fri, Feb 26, 2016 at 11:26:02PM +0800, Chen Gang wrote:
> > As for coding style, actually IMHO this patch is even _not_ a coding
> > style, more like a code shuffle, indeed.
> > 
> 
> "80 column limitation" is about coding style; I guess all of us agree
> with it.

No, it's been accepted that checkpatch requiring people to reformat
code to fit within the 80-column limitation was actively harmful, and it
no longer does that.

Worse, it now complains when you split a printf string across lines,
so there were patches that split a string across multiple lines to
make checkpatch shut up.  And now there are patches that join the
string back together.

And if you now start submitting patches to split them up again because
you think the 80 column restriction is so darned important, that would
be even ***more*** code churn.

Which is one of the reasons why some of us aren't terribly happy with
people who start running checkpatch --file on other people's code and
start submitting patches, either through the trivial patch portal or
not.

Mel, as an MM developer, has already NACK'ed the patch, which means
you should not send the patch to **any** upstream maintainer for
inclusion.

- Ted



Re: [RFCv7 PATCH 03/10] sched: scheduler-driven cpu frequency selection

2016-02-26 Thread Rafael J. Wysocki
On Thursday, February 25, 2016 04:34:23 PM Steve Muckle wrote:
> On 02/24/2016 07:55 PM, Rafael J. Wysocki wrote:
> > Hi,

[cut]

> > One thing I personally like in the RCU-based approach is its universality.
> > The callbacks may be installed by different entities in a uniform way:
> > intel_pstate can do that, the old governors can do that, my experimental
> > schedutil code can do that and your code could have done that too in
> > principle.  And this is very nice, because it is a common underlying
> > mechanism that can be used by everybody regardless of their particular
> > implementations on the other side.
> > 
> > Why would I want to use something different, then?
> 
> I've got nothing against a callback registration mechanism. As you
> mentioned in another mail it could itself use static keys, enabling the
> static key when a callback is registered and disabling it otherwise to
> avoid calling into cpufreq_update_util().

But then it would only make a difference if cpufreq_update_util() was not
used at all (ie. no callbacks installed for any policies by anyone).  The
only reason why it may matter is that the total number of systems using
the performance governor is quite large AFAICS and they would benefit from
that.

[cut]

> 
> > 
> >> +
> >> +/*
> >> + * Capacity margin added to CFS and RT capacity requests to provide
> >> + * some head room if task utilization further increases.
> >> + */
> > 
> > OK, where does this number come from?
> 
> Someone's posterior :) .
> 
> This really should be a tunable IMO, but there's a fairly strong
> anti-tunable sentiment, so it's been left hard-coded in an attempt to
> provide something that "just works."

Ouch.

> At the least I can add a comment saying that the 20% idle headroom
> requirement was an off the cuff estimate and that at this time, we don't
> have significant data to suggest it's the best number.

Well, in this area, every number has to be justified.  Otherwise we end
up with things that sort of work, but nobody actually understands why.

[cut]

> > 
> >> +
> >> +static int cpufreq_sched_thread(void *data)
> >> +{
> > 
> > Now, what really is the advantage of having those extra threads vs using
> > workqueues?
> > 
> > I guess the underlying concern is that RT tasks may stall workqueues
> > indefinitely in theory and then the frequency won't be updated, but
> > there's much more kernel stuff run from workqueues and if that is
> > starved, you won't get very far anyway.
> > 
> > If you take special measures to prevent frequency change requests from being
> > stalled by RT tasks, question is why are they so special?  Aren't there any
> > other kernel activities that also should be protected from that and may be
> > more important than CPU frequency changes?
> 
> I think updating the CPU frequency during periods of heavy RT/DL load is
> one of the most (if not the most) important things. I can't speak for
> other system activities that may get blocked, but there's an opportunity
> to protect CPU frequency changes here, and it seems worth taking to me.

So do it in a general way for everybody and not just for one governor
that you happen to be working on.

That said I'm unconvinced about the approach still.

Having more RT threads in a system that already is under RT pressure seems like
a recipe for trouble.  Moreover, it's likely that those new RT threads will
disturb the system's normal operation somehow even without the RT pressure and
have you investigated that?  Also having them per policy may be overkill and
binding them to policy CPUs only is not necessary.

Overall, it looks like a dynamic pool of threads that may run on every CPU
might be a better approach, but that would almost duplicate the workqueues
subsystem, so is it really worth it?

And is the problem actually visible in practice?  I have no record of any
reports mentioning it, although theoretically it's been there forever, so had
it been real, someone would have noticed it and complained about it IMO.

> > 
> > Plus if this really is the problem here, then it also affects the other
> > cpufreq governors, so maybe it should be solved for everybody in some
> > common way?
> 
> Agreed, I'd think a freq change thread that serves frequency change
> requests would be a useful common component. The locking and throttling
> (slowpath_lock, finish_last_request()) are somewhat specific to this
> implementation, but could probably be done generically and maybe even
> used in other governors. If you're okay with it though I'd like to view
> that as a slightly longer term effort, as I think it would get unwieldy
> trying to do that as part of this initial change.

I really am not sure if this is useful at all, so why bother with it initially?

> > 
> ...
> >> +
> >> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> >> +{
> >> +  struct gov_data *gd;
> >> +
> >> +  gd = container_of(irq_work, struct gov_data, irq_work);
> >> +  if (!gd)
> >> 
