date:20121120

Re: fadvise interferes with readahead

2012-11-20 Thread Fengguang Wu

On Wed, Nov 21, 2012 at 03:51:03PM +0800, Jaegeuk Hanse wrote:
> On 11/20/2012 10:58 PM, Fengguang Wu wrote:
> >On Tue, Nov 20, 2012 at 10:34:11AM -0300, Claudio Freire wrote:
> >>On Tue, Nov 20, 2012 at 5:04 AM, Fengguang Wu wrote:
> >>>Yes. The kernel readahead code by design will outperform simple
> >>>fadvise in the case of clustered random reads. Imagine the access
> >>>pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While
> >>>kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on
> >>>the page miss for 2, it will detect the existence of history page 1
> >>>and do readahead properly. For hard disks, it's mainly the number of
> >>>IOs that matters. So even if kernel readahead loses some opportunities
> >>>to do async IO and possibly loads some extra pages that will never be
> >>>used, it still manages to perform much better.
> >>>
> The fix would lay in fadvise, I think. It should update readahead
> tracking structures. Alternatively, one could try to do it in
> do_generic_file_read, updating readahead on !PageUptodate or even on
> page cache hits. I really don't have the expertise or time to go
> modifying, building and testing the supposedly quite simple patch that
> would fix this. It's mostly about the testing, in fact. So if someone
> can comment or try by themselves, I guess it would really benefit
> those relying on fadvise to fix this behavior.
> >>>One possible solution is to try the context readahead at fadvise time
> >>>to check the existence of history pages and do readahead accordingly.
> >>>
> >>>However it will introduce *real interferences* between kernel
> >>>readahead and user prefetching. The original scheme is, once user
> >>>space starts its own informed prefetching, kernel readahead will
> >>>automatically stand out of the way.
> >>I understand that would seem like a reasonable design, but in this
> >>particular case it doesn't seem to be. I propose that in most cases it
> >>doesn't really work well as a design decision, to make fadvise work as
> >>direct I/O. Precisely because fadvise is supposed to be a hint to let
> >>the kernel make better decisions, and not a request to make the kernel
> >>stop making decisions.
> >>
> >>Any interference so introduced wouldn't be any worse than the
> >>interference introduced by readahead over reads. I agree, if fadvise
> >>were to trigger readahead, it could be bad for applications that don't
> >>read what they say they will.
> >Right.
> >
> >>But if cache hits were to simply update
> >>readahead state, it would only mean that read calls behave the same
> >>regardless of fadvise calls. I think that's worth pursuing.
> >Here you are describing an alternative solution that will somehow trap
> >into the readahead code even when, for example, the application is
> >accessing once and again an already cached file?  I'm afraid this will
> >add non-trivial overheads and is less attractive than the "readahead
> >on fadvise" solution.
> 
> Hi Fengguang,
> 
> Page cache sync readahead is only triggered on a cache miss, but if the
> file is already cached, how can readahead be triggered again when the
> application is accessing an already cached file once and again?

The answer is opposite to your expectation: for an already cached
file, kernel readahead code won't be triggered at all, which is good
for avoiding pointless overhead in the common case of repeated
memory-hot accesses.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] gpiolib: fix bug and clarify OF use of ranges

2012-11-20 Thread Linus Walleij

On Wed, Nov 21, 2012 at 8:42 AM, Viresh Kumar  wrote:

> Reviewed-by: Viresh Kumar 
>
> This is what I was asking you earlier: "Doesn't gpiochip_add_pin_range
> have any users?" and you said NO and I didn't cross-check :(

Yes, I forgot that I refactored the OF code to actually use this
function, sorry.

Thanks,
Linus Walleij

Re: [PATCH 000/493] remove CONFIG_HOTPLUG as an option

2012-11-20 Thread Andrew Morton

On Tue, 20 Nov 2012 10:46:11 +0000 Grant Likely wrote:

> On Sat, Nov 17, 2012 at 12:19 AM, Bill Pemberton  wrote:
> > CONFIG_HOTPLUG is no longer an optional setting.  In order to remove
> > it as an option, code paths that check CONFIG_HOTPLUG will be removed
> > along with the attributes __devexit_p, __devexit, __devinitconst, and
> > __devinitdata.
> >
> > I'll save the list from the mailbomb of this huge patchset.  The
> > patches themselves are going to Greg KH for the driver core tree.
> >
> >
> > Bill Pemberton (493):
> [...]
> >  2942 files changed, 11645 insertions(+), 12116 deletions(-)
> 
> So, I've got no problem with the reason for the change and I don't
> even think you need my ack for the bits that I maintain (though you
> have it if you want it). However, this looks like it is going to be
> /painful/. First of all it will touch a huge number of files in the
> tree. Yes the change is trivial, but it will require manual fixups on
> a lot of patches.

Yeah, this is dopey.  Send the script to Linus and ask him to run it
seven seconds before he releases -rc1, when everyone's trees are
empty(ish).  Or send him a single megapatch at that time.

Re: [PATCH] gpiolib: fix bug and clarify OF use of ranges

2012-11-20 Thread Linus Walleij

On Wed, Nov 21, 2012 at 8:37 AM, Linus Walleij wrote:

> From: Linus Walleij 
>
> In commit c905165f5946f56dca195871641bd4e488eca24a
> "gpiolib: let gpiochip_add_pin_range() specify offset"
> I forgot to update the OF use of the function
> gpiochip_add_pin_range().
>
> It turns out that this reveals a weakness in the
> OF range mappings: ranges cannot currently be sparse.
> So put in a comment so we can fix this later.
>
> Signed-off-by: Linus Walleij 

BTW I've squashed this into the original commit above to avoid any
git bisect issues.

Linus Walleij

Re: [PATCH 1/4] bdi: Track users that require stable page writes

2012-11-20 Thread Christoph Hellwig

On Tue, Nov 20, 2012 at 06:00:34PM -0800, Darrick J. Wong wrote:
> This creates a per-backing-device counter that tracks the number of users
> which require pages to be held immutable during writeout.  Eventually it
> will be used to waive wait_for_page_writeback() if nobody requires stable
> pages.

Why are we going down this stupid route again now?  Just let the block
device say it needs stable writes and let the VM deal with it.


Re: [PATCH 4/7] gpiolib: return any error code from range creation

2012-11-20 Thread Linus Walleij

On Tue, Nov 20, 2012 at 6:28 PM, Stephen Warren  wrote:
> On 11/20/2012 07:04 AM, Linus Walleij wrote:
>> From: Linus Walleij 
>>
>> If we try to create a range for a certain GPIO chip and the
>> target pin controller is not yet available it may return
>> a probe deferral error code, so handle this all the way
>> out by checking the error code.
>
> I think patches 3 and 4 need to be squashed together to avoid any "git
> bisect" issues?

OK that's correct I'll fix...

Linus Walleij

Re: Question about idle time in /proc/stat

2012-11-20 Thread Ronny Meeus

On Mon, Nov 19, 2012 at 9:15 PM, Ronny Meeus  wrote:
> Hello
>
> I have created an application that measures the cpuload consumed by
> the tasks within a process.
> For this I use the file /proc/stat and /proc//tasks//stat
>
> The cpuload monitor is a very simple application that just executes in
> a loop a sleep command, followed by retrieving the information about
> all the tasks belonging to the process.
> What I observe is that there is a big difference in the idle time
> available in the /proc/stat file.
>
> This is the Linux version I use:
> Linux version 2.6.36.4 (meeusr@devws156) (gcc version 4.4.3
> (crosstool-NG 1.15.2 - buildroot 2012.05-hga35945e88d23) ) #1 SMP
> PREEMPT Wed Oct 10 08:41:17 CEST 2012
> Please note that this is a kernel that contains patches of FreeScale.
>
> The application is running on a P4040:
> platform: P4080 DS
> model   : fsl,P4080DS
> Memory  : 2016 MB
>
> I do not know whether it is relevant but I'm using a tickless kernel
>
> I created a small test program that contains the essence of the problem:
>
> #include <stdio.h>
> #include <time.h>
> #include <sys/time.h>
>
> /* Sleep for the given number of microseconds. */
> void utime_delay_micro_seconds(unsigned long useconds)
> {
>   struct timespec req;
>   req.tv_sec = useconds / 1000000;
>   req.tv_nsec = (useconds - (req.tv_sec * 1000000)) * 1000;
>   nanosleep(&req, NULL);
> }
>
> int main(void)
> {
>   unsigned long delta, user = 0, niceuser = 0, system = 0, idleload = 0;
>   unsigned long user1, niceuser1, system1, idleload1;
>
>   while (1)
>   {
>     struct timeval tp, tp1;
>     FILE *statfile;
>
>     utime_delay_micro_seconds(1000000);
>
>     gettimeofday(&tp, NULL);
>
>     /* /proc/stat only needs to be read, so open it with "r". */
>     statfile = fopen("/proc/stat", "r");
>     if (statfile == NULL)
>       return 1;
>     fscanf(statfile, "cpu %lu %lu %lu %lu",
>            &user1, &niceuser1, &system1, &idleload1);
>     fclose(statfile);
>
>     gettimeofday(&tp1, NULL);
>     /* Time spent reading /proc/stat, in microseconds. */
>     delta = (tp1.tv_sec - tp.tv_sec) * 1000000 + (tp1.tv_usec - tp.tv_usec);
>
>     printf("%ld:%ld Delta=%lu User=%lu NiceUser=%lu System=%lu Idle=%lu\n",
>            (long)tp.tv_sec, (long)tp.tv_usec, delta,
>            user1 - user, niceuser1 - niceuser, system1 - system,
>            idleload1 - idleload);
>
>     user = user1;
>     niceuser = niceuser1;
>     system = system1;
>     idleload = idleload1;
>   }
>   return 0;
> }
>
> The output is the following (I skipped a few lines since the initial
> measurement is not correct):
>
> 2181820:130777 Delta=889 User=39 NiceUser=0 System=5 Idle=356
> 2181821:131720 Delta=875 User=39 NiceUser=0 System=6 Idle=356
> 2181822:132650 Delta=892 User=40 NiceUser=0 System=5 Idle=356
> 2181823:133598 Delta=874 User=38 NiceUser=0 System=6 Idle=357
> 2181824:134527 Delta=880 User=38 NiceUser=0 System=6 Idle=356
> 2181825:135460 Delta=876 User=39 NiceUser=0 System=5 Idle=356
> 2181826:136390 Delta=882 User=39 NiceUser=0 System=5 Idle=356
> 2181827:137333 Delta=909 User=40 NiceUser=0 System=9 Idle=396
> 2181828:138303 Delta=895 User=39 NiceUser=0 System=5 Idle=256
> 2181829:139255 Delta=893 User=39 NiceUser=0 System=5 Idle=256
> 2181830:140206 Delta=893 User=39 NiceUser=0 System=5 Idle=256
> 2181831:141155 Delta=891 User=38 NiceUser=0 System=6 Idle=256
> 2181832:142103 Delta=869 User=39 NiceUser=0 System=5 Idle=257
> 2181833:143051 Delta=868 User=39 NiceUser=0 System=5 Idle=256
> 2181834:143976 Delta=886 User=38 NiceUser=0 System=6 Idle=256
> 2181835:144919 Delta=890 User=39 NiceUser=0 System=5 Idle=256
> 2181836:145866 Delta=887 User=38 NiceUser=0 System=6 Idle=256
> 2181837:146814 Delta=906 User=41 NiceUser=0 System=8 Idle=1254
> 2181838:147782 Delta=891 User=38 NiceUser=0 System=6 Idle=256
> 2181839:148729 Delta=892 User=39 NiceUser=0 System=5 Idle=256
> 2181840:149678 Delta=891 User=39 NiceUser=0 System=5 Idle=256
> 2181841:150626 Delta=888 User=38 NiceUser=0 System=6 Idle=256
> 2181842:151571 Delta=889 User=39 NiceUser=0 System=5 Idle=257
> 2181843:152515 Delta=885 User=39 NiceUser=0 System=5 Idle=256
> 2181844:153457 Delta=890 User=38 NiceUser=0 System=6 Idle=256
> 2181845:154403 Delta=885 User=39 NiceUser=0 System=5 Idle=256
> 2181846:155343 Delta=886 User=39 NiceUser=0 System=5 Idle=256
> 2181847:156288 Delta=907 User=40 NiceUser=0 System=9 Idle=1253
> 2181848:157257 Delta=891 User=39 NiceUser=0 System=5 Idle=256
> 2181849:158204 Delta=888 User=39 NiceUser=0 System=6 Idle=256
> 2181850:159150 Delta=895 User=38 NiceUser=0 System=6 Idle=256
> 2181851:160102 Delta=871 User=39 NiceUser=0 System=4 Idle=257
> 2181852:161054 Delta=876 User=39 NiceUser=0 System=6 Idle=255
> 2181853:161989 Delta=883 User=39 NiceUser=0 System=5 Idle=257
> 2181854:162932 Delta=891 User=39 NiceUser=0 System=5 Idle=267
> 2181855:163878 Delta=887 User=39 NiceUser=0 System=6 Idle=245
> 2181856:164821 Delta=889 User=39 NiceUser=0 System=5 Idle=256
> 2181857:165769 Delta=910 User=40 NiceUser=0 System=9 Idle=1253
> 2181858:166740 Delta=890 User=39 NiceUser=0 System=5 Idle=256
> 2181859:167686 Delta=884 User=38 NiceUser=0 System=5 Idle=256
> 2181860:168627 Delta=888 User=39 NiceUser=0 System=6 Idle=263
>
>
> The first column is the timestamp returned by gettimeofday. This is
> nicely incrementing 1 second at a time.
> The Delta

Re: kmem accounting netperf data

2012-11-20 Thread Andrew Morton

On Fri, 16 Nov 2012 09:03:52 -0800 Greg Thelen  wrote:

> We ran some netperf comparisons measuring the overhead of enabling
> CONFIG_MEMCG_KMEM with a kmem limit.  Short answer: no regression seen.
> 
> This is a multiple machine (client,server) netperf test.  Both client
> and server machines were running the same kernel with the same
> configuration.
> 
> A baseline run (with CONFIG_MEMCG_KMEM unset) was compared with a full
> featured run (CONFIG_MEMCG_KMEM=y and a kmem limit large enough not to
> put additional pressure on the workload).  We saw no noticeable
> regression running:
> - TCP_CRR efficiency, latency
> - TCP_RR latency, rate
> - TCP_STREAM efficiency, throughput
> - UDP_RR efficiency, latency
> The tests were run with a varying number of concurrent connections
> (between 1 and 200).
> 
> The source came from one of Glauber's branches
> (git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg
> kmemcg-slab):
>   commit 70506dcf756aaafd92f4a34752d6b8d8ff4ed360
>   Author: Glauber Costa 
>   Date:   Thu Aug 16 17:16:21 2012 +0400
> 
>   Add slab-specific documentation about the kmem controller
> 
> It's not the latest source, but I figured the data might still be
> useful.

Let's cc the netdev guys, who will be pleased to hear that we didn't
break their stuff for once ;)

Thanks for testing - it was a concern.

[PATCH] gpiolib: rename pin range arguments

2012-11-20 Thread Linus Walleij

From: Linus Walleij 

To be crystal clear on what the arguments mean in this
function dealing with both GPIO and PIN ranges with confusing
naming, we now have gpio_offset and pin_offset, making it
clear that these are offsets into the specific GPIO
and pin controller respectively. The GPIO chip itself will
of course keep track of the base offset into the global
GPIO number space.

Signed-off-by: Linus Walleij 
---
 drivers/gpio/gpiolib.c     | 19 ++++++++++---------
 include/asm-generic/gpio.h |  4 ++--
 include/linux/gpio.h       |  2 +-
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index 317ff04..26e27c1 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1191,13 +1191,13 @@ EXPORT_SYMBOL_GPL(gpiochip_find);
  * gpiochip_add_pin_range() - add a range for GPIO <-> pin mapping
  * @chip: the gpiochip to add the range for
  * @pinctrl_name: the dev_name() of the pin controller to map to
- * @offset: the start offset in the current gpio_chip number space
- * @pin_base: the start offset in the pin controller number space
+ * @gpio_offset: the start offset in the current gpio_chip number space
+ * @pin_offset: the start offset in the pin controller number space
  * @npins: the number of pins from the offset of each pin space (GPIO and
  * pin controller) to accumulate in this range
  */
 int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
-  unsigned int offset, unsigned int pin_base,
+  unsigned int gpio_offset, unsigned int pin_offset,
   unsigned int npins)
 {
struct gpio_pin_range *pin_range;
@@ -1210,11 +1210,11 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
}
 
/* Use local offset as range ID */
-   pin_range->range.id = offset;
+   pin_range->range.id = gpio_offset;
pin_range->range.gc = chip;
pin_range->range.name = chip->label;
-   pin_range->range.base = chip->base + offset;
-   pin_range->range.pin_base = pin_base;
+   pin_range->range.base = chip->base + gpio_offset;
+   pin_range->range.pin_base = pin_offset;
pin_range->range.npins = npins;
pin_range->pctldev = pinctrl_find_and_add_gpio_range(pinctl_name,
&pin_range->range);
@@ -1224,9 +1224,10 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
kfree(pin_range);
return PTR_ERR(pin_range->pctldev);
}
-   pr_debug("%s: GPIO chip: created GPIO range %d->%d ==> PIN %d->%d\n",
-chip->label, offset, offset + npins - 1,
-pin_base, pin_base + npins - 1);
+   pr_debug("GPIO chip %s: created GPIO range %d->%d ==> %s PIN %d->%d\n",
+chip->label, gpio_offset, gpio_offset + npins - 1,
+pinctl_name,
+pin_offset, pin_offset + npins - 1);
 
list_add_tail(&pin_range->node, &chip->pin_ranges);
 
diff --git a/include/asm-generic/gpio.h b/include/asm-generic/gpio.h
index ec58fdb..9fd3093 100644
--- a/include/asm-generic/gpio.h
+++ b/include/asm-generic/gpio.h
@@ -283,7 +283,7 @@ struct gpio_pin_range {
 };
 
 int gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
-  unsigned int offset, unsigned int pin_base,
+  unsigned int gpio_offset, unsigned int pin_offset,
   unsigned int npins);
 void gpiochip_remove_pin_ranges(struct gpio_chip *chip);
 
@@ -291,7 +291,7 @@ void gpiochip_remove_pin_ranges(struct gpio_chip *chip);
 
 static inline int
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
-  unsigned int offset, unsigned int pin_base,
+  unsigned int gpio_offset, unsigned int pin_offset,
   unsigned int npins)
 {
return 0;
diff --git a/include/linux/gpio.h b/include/linux/gpio.h
index 99861c6..bfe6656 100644
--- a/include/linux/gpio.h
+++ b/include/linux/gpio.h
@@ -233,7 +233,7 @@ static inline int irq_to_gpio(unsigned irq)
 
 static inline int
 gpiochip_add_pin_range(struct gpio_chip *chip, const char *pinctl_name,
-  unsigned int offset, unsigned int pin_base,
+  unsigned int gpio_offset, unsigned int pin_offset,
   unsigned int npins)
 {
WARN_ON(1);
-- 
1.7.11.3


Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse


On 11/20/2012 10:58 PM, Fengguang Wu wrote:

On Tue, Nov 20, 2012 at 10:34:11AM -0300, Claudio Freire wrote:

On Tue, Nov 20, 2012 at 5:04 AM, Fengguang Wu  wrote:

Yes. The kernel readahead code by design will outperform simple
fadvise in the case of clustered random reads. Imagine the access
pattern 1, 3, 2, 6, 4, 9. fadvise will trigger 6 IOs literally. While
kernel readahead will likely trigger 3 IOs for 1, 3, 2-9. Because on
the page miss for 2, it will detect the existence of history page 1
and do readahead properly. For hard disks, it's mainly the number of
IOs that matters. So even if kernel readahead loses some opportunities
to do async IO and possibly loads some extra pages that will never be
used, it still manages to perform much better.


The fix would lay in fadvise, I think. It should update readahead
tracking structures. Alternatively, one could try to do it in
do_generic_file_read, updating readahead on !PageUptodate or even on
page cache hits. I really don't have the expertise or time to go
modifying, building and testing the supposedly quite simple patch that
would fix this. It's mostly about the testing, in fact. So if someone
can comment or try by themselves, I guess it would really benefit
those relying on fadvise to fix this behavior.

One possible solution is to try the context readahead at fadvise time
to check the existence of history pages and do readahead accordingly.

However it will introduce *real interferences* between kernel
readahead and user prefetching. The original scheme is, once user
space starts its own informed prefetching, kernel readahead will
automatically stand out of the way.

I understand that would seem like a reasonable design, but in this
particular case it doesn't seem to be. I propose that in most cases it
doesn't really work well as a design decision, to make fadvise work as
direct I/O. Precisely because fadvise is supposed to be a hint to let
the kernel make better decisions, and not a request to make the kernel
stop making decisions.

Any interference so introduced wouldn't be any worse than the
interference introduced by readahead over reads. I agree, if fadvise
were to trigger readahead, it could be bad for applications that don't
read what they say they will.

Right.


But if cache hits were to simply update
readahead state, it would only mean that read calls behave the same
regardless of fadvise calls. I think that's worth pursuing.

Here you are describing an alternative solution that will somehow trap
into the readahead code even when, for example, the application is
accessing once and again an already cached file?  I'm afraid this will
add non-trivial overheads and is less attractive than the "readahead
on fadvise" solution.


Hi Fengguang,

Page cache sync readahead is only triggered on a cache miss, but if the
file is already cached, how can readahead be triggered again when the
application is accessing an already cached file once and again?


Regards,
Jaegeuk




I ought to try to prepare a patch for this to illustrate my point. Not
sure I'll be able to though.

I'd be glad to materialize the readahead on fadvise proposal, if there
are no obvious negative examples/cases.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org



Re: BUG at kernel/timer.c:1100 when using JFFS2

2012-11-20 Thread Artem Bityutskiy

On Wed, 2012-11-21 at 18:37 +1100, Nathan Williams wrote:
> Hi,
> 
> I've come across a problem when loading a module straight after unmounting a 
> JFFS2 partition.
> I'm using a Geos ADSL router board with an AMD Geode LX/CS5536 chipset and a 
> Hynix H27U1G8F2BTR NAND flash.
> 
> I can reproduce my problem with this shell script:
> 
> #!/bin/sh
> mount -t jffs2 mtd:logs /mnt
> echo "Hello World" > /mnt/file
> umount /mnt
> modprobe r8169

You probably use 3.5? There was a bug which was fixed, try the latest
stable 3.5 version, the fix must be there.

-- 
Best Regards,
Artem Bityutskiy



Re: fadvise interferes with readahead

2012-11-20 Thread Fengguang Wu

On Wed, Nov 21, 2012 at 02:51:41PM +0800, Jaegeuk Hanse wrote:
> On 11/20/2012 11:15 PM, Fengguang Wu wrote:
> >On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote:
> >>On 11/20/2012 04:04 PM, Fengguang Wu wrote:
> >>>Hi Claudio,
> >>>
> >>>Thanks for the detailed problem description!
> >>Hi Fengguang,
> >>
> >>Another question, thanks in advance.
> >>
> >>What's the meaning of interleaved reads? If the first process
> >It's access patterns like
> >
> > 1, 1001, 2, 1002, 3, 1003, ...
> >
> >in which there are two (or more) mixed sequential read streams.
> >
> >>readahead from start ~ start + size - async_size, another process
> >>read start + size - async_size + 1, then what will happen? It seems
> >>that variable hit_readahead_marker is false, and related codes can't
> >>run, where I miss?
> >Yes hit_readahead_marker will be false. However on reading 1002,
> >hit_readahead_marker()/count_history_pages() will find the previous
> >page 1001 already in page cache and trigger context readahead.
> 
> Hi Fengguang,
> 
> Thanks for your explanation, the comment in function
> ondemand_readahead, "Hit a marked page without valid readahead
> state". What's the meaning of "without valid readahead state"?

It normally happens in interleaved (or clustered random) reads. When
there are two read streams for one struct file, the one file_ra_state
won't be able to track state for the two streams. When the readahead
code is triggered for stream A, the file_ra_state may contain the
previous readahead window information for stream B. In this case
stream B's readahead state (ra->start, ra->size etc.) is invalid for
the current stream A that we are working on.

Thanks,
Fengguang

> >>>On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:
> Hi. First of all, I'm not subscribed to this list, so I'd suggest all
> replies copy me personally.
> 
> I have been trying to implement some I/O pipelining in Postgres (ie:
> read the next data page asynchronously while working on the current
> page), and stumbled upon some puzzling behavior involving the
> interaction between fadvise and readahead.
> 
> I'm running kernel 3.0.0 (debian testing), on a single-disk system
> which, though unsuitable for database workloads, is slow enough to let
> me experiment with these read-ahead issues.
> 
> Typical random I/O performance is on the order of between 150 r/s to
> 200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
> Sequential I/O can go up to 60MB/s, though it tends to be around 50.
> 
> Now onto the problem. In order to parallelize I/O with computation,
> I've made postgres fadvise(willneed) the pages it will read next. How
> far ahead is configurable, and I've tested with a number of
> configurations.
> 
> The prefetching logic is aware of the OS and pg-specific cache, so it
> will only fadvise a block once. fadvise calls will stay 1 (or a
> configurable N) real I/O ahead of read calls, and there's no fadvising
> of pages that won't be read eventually, in the same order. I checked
> with strace.
> 
> However, performance when fadvising drops considerably for a specific
> yet common access pattern:
> 
> When a nested loop with two index scans happens, access is random
> locally, but eventually whole ranges of a file get read (in this
> random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
> 4 5 101 298 301". Though random, there are ranges there that can be
> merged in one read-request.
> 
> The kernel seems to do the merge by applying some form of readahead,
> not sure if it's context, ondemand or adaptive readahead on the 3.0.0
> kernel. Anyway, it seems to do readahead, as iostat says:
> 
> Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s
> sda        0.00    4.40  224.20   2.00   4.16   0.03
>
> avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>    37.86      1.91   8.43     8.00    56.80   4.40  99.44
> 
> (notice the avgrq-sz of 37.8)
> 
> With fadvise calls, the thing looks a lot different:
> 
> Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s
> sda        0.00   18.00  226.80   1.00   1.80   0.07
>
> avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>    16.81      4.00  17.52    17.23    82.40   4.39  99.92
> >>>FYI, there is a readahead tracing/stats patchset that can provide far
> >>>more accurate numbers about what's going on with readahead, which will
> >>>help eliminate lots of the guess works here.
> >>>
> >>>https://lwn.net/Articles/472798/
> >>>
> Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
> spot-on with a postgres page (8k). So, fadvise seems to carry out the
> requests verbatim, while read manages to merge at least two of them.
> 
> The random nature

Re: [PATCH v6 1/4] genalloc: add a global pool list, allow to find pools by phys address

2012-11-20 Thread Andrew Morton

On Fri, 16 Nov 2012 11:30:14 +0100 Philipp Zabel  wrote:

> This patch keeps all created pools in a global list and adds two
> functions that allow to retrieve the gen_pool pointer from a known
> physical address and from a device tree node.
>
> ...
>
> +/*
> + * gen_pool_find_by_phys - find a pool by physical start address
> + * @phys: physical address as added with gen_pool_add_virt
> + *
> + * Returns the pool that contains the chunk starting at phys,
> + * or NULL if not found.
> + */
> +struct gen_pool *gen_pool_find_by_phys(phys_addr_t phys)
> +{
> + struct gen_pool *pool, *found = NULL;
> + struct gen_pool_chunk *chunk;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(pool, , next_pool) {
> + list_for_each_entry_rcu(chunk, &pool->chunks, next_chunk) {
> + if (phys == chunk->phys_addr) {
> + found = pool;
> + break;
> + }
> + }
> + }
> + rcu_read_unlock();
> +
> + return found;
> +}
> +EXPORT_SYMBOL_GPL(gen_pool_find_by_phys);

It is rather pointless to use the fancy super-fast RCU locking
around a linear search!  We have various data structures which can be
used to make this search much more efficient.  radix-tree is one, if
the search keys are unique (which is the case here).

Secondly, that whole "phys" concept doesn't need to be in there.  It
would be better to implement a far more general
gen_pool_find_by_key(unsigned long key) and then do the phys->ulong
specialization elsewhere.

Finally the changelog gives no indication *why* you feel the kernel
needs this feature.  What is it for?  What are the use cases?  This is
the most important information for reviewers, hence it should be up
there front and center, in lavish detail.

Because once this is understood:

a) people might be able to suggest alternatives.  Can't do that
   without the required info and

b) people might then be interested in merging the patch into a kernel!

> +#ifdef CONFIG_OF
> +/**
> + * of_get_named_gen_pool - find a pool by phandle property
> + * @np: device node
> + * @propname: property name containing phandle(s)
> + * @index: index into the phandle array
> + *
> + * Returns the pool that contains the chunk starting at the physical
> + * address of the device tree node pointed at by the phandle property,
> + * or NULL if not found.
> + */
> +struct gen_pool *of_get_named_gen_pool(struct device_node *np,
> + const char *propname, int index)
> +{
> + struct device_node *np_pool;
> + struct resource res;
> + int ret;
> +
> + np_pool = of_parse_phandle(np, propname, index);
> + if (!np_pool)
> + return NULL;
> + ret = of_address_to_resource(np_pool, 0, &res);
> + if (ret < 0)
> + return NULL;
> + return gen_pool_find_by_phys((phys_addr_t) res.start);
> +}
> +EXPORT_SYMBOL_GPL(of_get_named_gen_pool);

Seems rather inappropriate that this should be in lib/genpool.c. 
Put it somewhere such as drivers/of/base.c, perhaps.


BUG at kernel/timer.c:1100 when using JFFS2

2012-11-20 Thread Nathan Williams

Hi,

I've come across a problem when loading a module straight after unmounting a 
JFFS2 partition.
I'm using a Geos ADSL router board with an AMD Geode LX/CS5536 chipset and a 
Hynix H27U1G8F2BTR NAND flash.

I can reproduce my problem with this shell script:

#!/bin/sh
mount -t jffs2 mtd:logs /mnt
echo "Hello World" > /mnt/file
umount /mnt
modprobe r8169

After a few seconds I get this panic:

kernel BUG at kernel/timer.c:1100!
invalid opcode:  [#1] 
Modules linked in: r8169 cs553x_nand [last unloaded: r8169]
Pid: 0, comm: swapper Not tainted 3.6.0 #1  
EIP: 0060:[] EFLAGS: 00010082 CPU: 0
EIP is at cascade+0x11e/0x122
EAX: ce809f98 EBX: ce809f98 ECX: cea67938 EDX: 
ESI:  EDI: cea67938 EBP: c138ea80 ESP: ce809f8c
 DS: 007b ES: 007b FS:  GS:  SS: 0068
CR0: 8005003b CR2: b7701d8a CR3: 0dd36000 CR4: 0090 
DR0:  DR1:  DR2:  DR3:  
DR6: 0ff0 DR7: 0400 
Process swapper (pid: 0, ti=ce808000 task=c13244c0 task.ti=c1318000)
Stack:  
 ce809f98 001d  cea67938 cea67938 c138ea80  ce809fc8
 0100 c10287d0 0246 c138f890 c138f690 c138f490 c138f290 ce809fc8
 ce809fc8 0004 0001 0001 0100 c10241ef 000a 0020
Call Trace: 
 [] ? run_timer_softirq+0x134/0x1ac   
 [] ? __do_softirq+0x79/0x11c 
 [] ? irq_enter+0x4c/0x4c 
   
 [] ? irq_exit+0x5b/0x69  
 [] ? do_IRQ+0x34/0x7d
 [] ? common_interrupt+0x29/0x30  
 [] ? default_idle+0x21/0x2d  
 [] ? cpu_idle+0x52/0x54  
 [] ? start_kernel+0x236/0x286
Code: c1 e8 1a 8d 94 c5 10 0e 00 00 e9 7c ff ff ff 8b 44 24 04 83 c4 14 5b 5e 5f
EIP: [] cascade+0x11e/0x122 SS:ESP 0068:ce809f8c  
---[ end trace 9942a8bf288b5a17 ]---
Kernel panic - not syncing: Fatal exception in interrupt

Any ideas on what I should do next?

Regards,
Nathan


Re: [PATCH] gpiolib: fix bug and clarify OF use of ranges

2012-11-20 Thread Viresh Kumar

On 21 November 2012 13:07, Linus Walleij  wrote:
> From: Linus Walleij 
>
> In commit c905165f5946f56dca195871641bd4e488eca24a
> "gpiolib: let gpiochip_add_pin_range() specify offset"
> I forgot to update the OF use of the function
> gpiochip_add_pin_range().
>
> It turns out that this reveals a weakness in the
> OF range mappings: ranges cannot currently be sparse.
> So put in a comment so we can fix this later.
>
> Signed-off-by: Linus Walleij 
> ---
>  drivers/gpio/gpiolib-of.c | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c
> index a40cd84..d542a14 100644
> --- a/drivers/gpio/gpiolib-of.c
> +++ b/drivers/gpio/gpiolib-of.c
> @@ -238,8 +238,20 @@ static void of_gpiochip_add_pin_range(struct gpio_chip 
> *chip)
> if (!pctldev)
> break;
>
> +   /*
> +* This assumes that the n GPIO pins are consecutive in the
> +* GPIO number space, and that the pins are also consecutive
> +* in their local number space. Currently it is not possible
> +* to add different ranges for one and the same GPIO chip,
> +* as the code assumes that we have one consecutive range
> +* on both, mapping 1-to-1.
> +*
> +* TODO: make the OF bindings handle multiple sparse ranges
> +* on the same GPIO chip.
> +*/
> ret = gpiochip_add_pin_range(chip,
>  pinctrl_dev_get_name(pctldev),
> +0, /* offset in gpiochip */
>  pinspec.args[0],
>  pinspec.args[1]);

Reviewed-by: Viresh Kumar 

This is what I was asking you earlier: "Doesn't gpiochip_add_pin_range
have any users?" You said no, and I didn't cross-check :(

[PATCH v3] pwm: Device tree support for PWM polarity.

2012-11-20 Thread Philip, Avinash

Add support for encoding PWM properties in bit-encoded form through the
of_pwm_xlate_with_flags() function. Platforms that require platform-specific
PWM properties have to populate them in the third cell of the pwm-specifier,
and the PWM driver should also set .of_xlate to this function.
Currently the PWM polarity property is encoded in bit position 0 of the
third cell of the pwm-specifier.

Signed-off-by: Philip, Avinash 
---
Changes since v2:
- Move PWM_SPEC_POLARITY to core.c
- Remove dummy function

Changes since v1:
- of_pwm_xlate_with_flags function support added.
- Documentation update

:100644 100644 73ec962... 04b0dc4... M  
Documentation/devicetree/bindings/pwm/pwm.txt
:100644 100644 f5acdaa... 780cb6b... M  drivers/pwm/core.c
:100644 100644 112b314... 6d661f3... M  include/linux/pwm.h
 Documentation/devicetree/bindings/pwm/pwm.txt |   18 +--
 drivers/pwm/core.c|   28 +
 include/linux/pwm.h   |3 ++
 3 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/pwm/pwm.txt 
b/Documentation/devicetree/bindings/pwm/pwm.txt
index 73ec962..04b0dc4 100644
--- a/Documentation/devicetree/bindings/pwm/pwm.txt
+++ b/Documentation/devicetree/bindings/pwm/pwm.txt
@@ -37,10 +37,22 @@ device:
pwm-names = "backlight";
};
 
+Note that in the example above, specifying the "pwm-names" is redundant
+because the name "backlight" would be used as fallback anyway.
+
 pwm-specifier typically encodes the chip-relative PWM number and the PWM
-period in nanoseconds. Note that in the example above, specifying the
-"pwm-names" is redundant because the name "backlight" would be used as
-fallback anyway.
+period in nanoseconds.
+
+Optionally, the pwm-specifier can encode a number of flags in a third cell:
+- bit 0: PWM signal polarity (0: normal polarity, 1: inverse polarity)
+
+Example with optional PWM specifier for inverse polarity
+
+   bl: backlight {
+   pwms = < 0 500 1>;
+   pwm-names = "backlight";
+   };
+
 
 2) PWM controller nodes
 ---
diff --git a/drivers/pwm/core.c b/drivers/pwm/core.c
index f5acdaa..780cb6b 100644
--- a/drivers/pwm/core.c
+++ b/drivers/pwm/core.c
@@ -32,6 +32,9 @@
 
 #define MAX_PWMS 1024
 
+/* flags in the third cell of the DT PWM specifier */
+#define PWM_SPEC_POLARITY  (1 << 0)
+
 static DEFINE_MUTEX(pwm_lookup_lock);
 static LIST_HEAD(pwm_lookup_list);
 static DEFINE_MUTEX(pwm_lock);
@@ -129,6 +132,31 @@ static int pwm_device_request(struct pwm_device *pwm, 
const char *label)
return 0;
 }
 
+struct pwm_device *
+of_pwm_xlate_with_flags(struct pwm_chip *pc, const struct of_phandle_args *args)
+{
+   struct pwm_device *pwm;
+
+   if (pc->of_pwm_n_cells < 3)
+   return ERR_PTR(-EINVAL);
+
+   if (args->args[0] >= pc->npwm)
+   return ERR_PTR(-EINVAL);
+
+   pwm = pwm_request_from_chip(pc, args->args[0], NULL);
+   if (IS_ERR(pwm))
+   return pwm;
+
+   pwm_set_period(pwm, args->args[1]);
+
+   if (args->args[2] & PWM_SPEC_POLARITY)
+   pwm_set_polarity(pwm, PWM_POLARITY_INVERSED);
+   else
+   pwm_set_polarity(pwm, PWM_POLARITY_NORMAL);
+
+   return pwm;
+}
+
 static struct pwm_device *
 of_pwm_simple_xlate(struct pwm_chip *pc, const struct of_phandle_args *args)
 {
diff --git a/include/linux/pwm.h b/include/linux/pwm.h
index 112b314..6d661f3 100644
--- a/include/linux/pwm.h
+++ b/include/linux/pwm.h
@@ -171,6 +171,9 @@ struct pwm_device *pwm_request_from_chip(struct pwm_chip 
*chip,
 unsigned int index,
 const char *label);
 
+struct pwm_device *of_pwm_xlate_with_flags(struct pwm_chip *pc,
+   const struct of_phandle_args *args);
+
 struct pwm_device *pwm_get(struct device *dev, const char *consumer);
 void pwm_put(struct pwm_device *pwm);
 
-- 
1.7.0.4
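
The third-cell flag encoding introduced by the patch above can be tried
outside the kernel. What follows is only a minimal user-space sketch, not
the kernel implementation: struct pwm_spec, enum polarity and
decode_pwm_spec() are hypothetical names made up for illustration. Only the
meaning of bit 0 (0: normal polarity, 1: inverse polarity) and the two
validity checks are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 of the third specifier cell selects the polarity (from the patch). */
#define PWM_SPEC_POLARITY (1u << 0)

/* Hypothetical user-space types, for illustration only. */
enum polarity { POLARITY_NORMAL, POLARITY_INVERSED };

struct pwm_spec {
    uint32_t index;      /* chip-relative PWM number (cell 0) */
    uint32_t period_ns;  /* period in nanoseconds (cell 1) */
    enum polarity pol;   /* decoded from bit 0 of cell 2 */
};

/*
 * Decode a three-cell specifier. Returns 0 on success, -1 if fewer than
 * three cells are present or the index is out of range for the chip.
 */
static int decode_pwm_spec(const uint32_t *cells, size_t ncells,
                           uint32_t npwm, struct pwm_spec *out)
{
    assert(cells != NULL && out != NULL);

    if (ncells < 3 || cells[0] >= npwm)
        return -1;

    out->index = cells[0];
    out->period_ns = cells[1];
    out->pol = (cells[2] & PWM_SPEC_POLARITY) ? POLARITY_INVERSED
                                               : POLARITY_NORMAL;
    return 0;
}
```

Keeping the flags in a separate cell means existing two-cell specifiers stay
valid, and further flag bits can be added later without changing the layout.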


[PATCH] futex: Avoid wake_futex for a PI futex_q

2012-11-20 Thread Darren Hart

Dave Jones reported a bug with futex_lock_pi() that his trinity test
exposed. Sometime between queue_me() and taking the q.lock_ptr, the
lock_ptr became NULL, resulting in a crash.

While futex_wake() is careful to not call wake_futex() on futex_q's with
a pi_state or an rt_waiter (which are either waiting for a
futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
futex_requeue() do not perform the same test.

Update futex_wake_op() and futex_requeue() to test for q.pi_state and
q.rt_waiter and abort with -EINVAL if detected. To ensure any future
breakage is caught, add a WARN() to wake_futex() if the same condition
is true.

This fix has seen 3 hours of testing with "trinity -c futex" on an
x86_64 VM with 4 CPUS.

Signed-off-by: Darren Hart 
Reported-by: Dave Jones 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: John Kacur 
Cc: sta...@vger.kernel.org
---
 kernel/futex.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 3717e7b..5699b21 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -840,6 +840,11 @@ static void wake_futex(struct futex_q *q)
 {
struct task_struct *p = q->task;
 
+   if (q->pi_state || q->rt_waiter) {
+   WARN(1, "%s: refusing to wake PI futex\n", __FUNCTION__);
+   return;
+   }
+
/*
 * We set q->lock_ptr = NULL _before_ we wake up the task. If
 * a non-futex wake up happens on another CPU then the task
@@ -1075,6 +1080,10 @@ retry_private:
 
plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key1)) {
+   if (this->pi_state || this->rt_waiter) {
+   ret = -EINVAL;
+   goto out_unlock;
+   }
wake_futex(this);
if (++ret >= nr_wake)
break;
@@ -1087,6 +1096,10 @@ retry_private:
op_ret = 0;
plist_for_each_entry_safe(this, next, head, list) {
if (match_futex (&this->key, &key2)) {
+   if (this->pi_state || this->rt_waiter) {
+   ret = -EINVAL;
+   goto out_unlock;
+   }
wake_futex(this);
if (++op_ret >= nr_wake2)
break;
@@ -1095,6 +1108,7 @@ retry_private:
ret += op_ret;
}
 
+out_unlock:
double_unlock_hb(hb1, hb2);
 out_put_keys:
put_futex_key(&key2);
@@ -1384,9 +1398,13 @@ retry_private:
/*
 * FUTEX_WAIT_REQEUE_PI and FUTEX_CMP_REQUEUE_PI should always
 * be paired with each other and no other futex ops.
+*
+* We should never be requeueing a futex_q with a pi_state,
+* which is awaiting a futex_unlock_pi().
 */
if ((requeue_pi && !this->rt_waiter) ||
-   (!requeue_pi && this->rt_waiter)) {
+   (!requeue_pi && this->rt_waiter) ||
+   this->pi_state) {
ret = -EINVAL;
break;
}
-- 
1.7.11.7
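
The check added by the patch above can be modelled in user space. This is
a simplified sketch under stated assumptions, not kernel code: struct waiter
and wake_non_pi() are hypothetical stand-ins for struct futex_q and the wake
loops, and locking is left out entirely. It only shows the rule from the
changelog: a waiter with a pi_state or an rt_waiter is never woken by the
plain wake path, and the walk aborts with -EINVAL instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct futex_q (hypothetical). */
struct waiter {
    const void *pi_state;   /* non-NULL: waiting for futex_unlock_pi() */
    const void *rt_waiter;  /* non-NULL: waiting for a PI requeue */
    int woken;              /* set to 1 once the waiter has been woken */
};

/*
 * Wake up to nr_wake waiters. A PI waiter aborts the walk with -EINVAL
 * instead of being woken. Returns the number of waiters woken, or -EINVAL.
 */
static int wake_non_pi(struct waiter *q, size_t n, int nr_wake)
{
    int woken = 0;

    assert(q != NULL || n == 0);

    for (size_t i = 0; i < n && woken < nr_wake; i++) {
        if (q[i].pi_state || q[i].rt_waiter)
            return -EINVAL;
        q[i].woken = 1;
        woken++;
    }
    return woken;
}
```

As in the patch, waiters woken before the PI waiter is reached stay woken;
only the return value changes to the error.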


Re: [PATCH v3 05/12] x86: Merge early_reserve_initrd for 32bit and 64bit

2012-11-20 Thread Pekka Enberg

On Wed, Nov 21, 2012 at 9:16 AM, Yinghai Lu  wrote:
> They are the same, could move them out from head32/64.c to setup.c.
>
> We are using memblock, and it could handle overlapping properly, so
> we don't need to reserve some at first to hold the location, and just
> need to make sure we reserve them before we are using memblock to find
> free mem to use.
>
> Signed-off-by: Yinghai Lu 

Reviewed-by: Pekka Enberg 

Re: [patch] x86, UV: integer wrap bug in uv_hub_ipi_value()

2012-11-20 Thread Dan Carpenter

On Tue, Nov 20, 2012 at 11:07:25AM -0600, Russ Anderson wrote:
> The issue isn't "ulong" vs "unsigned long".  The issue
> is int is 32 bit and long is 64 bit on x86_64.  Your 
> patch is casting the value as an "unsigned long" (64 bit
> on x86_64) into an int (32 bit).  I don't think that
> was your intent.

Wait what?  I only did int => long casts, not the other way around.

It occurred to me to use u64 but this code is only compiled on x86_64
and I wrote my patch to match the surrounding context.

regards,
dan carpenter
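
For readers following the thread, the difference between shifting before
and after widening can be shown with a small sketch. It is not the UV code:
shift_narrow() and shift_wide() are hypothetical helpers, and fixed-width
unsigned types are used so the narrow case wraps in a well-defined way
(the kernel code under discussion uses int and unsigned long).

```c
#include <assert.h>
#include <stdint.h>

/* Shift performed in 32 bits: the high bits are lost before widening. */
static uint64_t shift_narrow(uint32_t id, unsigned int shift)
{
    assert(shift < 32);
    return (uint32_t)(id << shift);
}

/* Widen to 64 bits first, then shift: the full value survives. */
static uint64_t shift_wide(uint32_t id, unsigned int shift)
{
    assert(shift < 32);
    return (uint64_t)id << shift;
}
```

With id = 0x12345 and shift = 16, shift_narrow() yields 0x23450000 while
shift_wide() yields 0x123450000, which is why the cast has to happen before
the shift rather than after it.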

Re: [PATCH] gpiolib: let gpiochip_add_pin_range() specify offset

2012-11-20 Thread Linus Walleij

On Tue, Nov 20, 2012 at 6:24 PM, Stephen Warren  wrote:
> On 11/20/2012 04:45 AM, Linus Walleij wrote:
>> From: Linus Walleij 
>>
>> Like with commit 3c739ad0df5eb41cd7adad879eda6aa09879eb76
>> it is not always enough to specify all the pins of a gpio_chip
>> from offset zero to be added to a pin map range, since the
>> mapping from GPIO to pin controller may not be linear at all,
>> but need to be broken into a few consecutive sub-ranges or
>> 1-pin entries for complicated cases. The ranges may also be
>> sparse.
>>
>> This alters the signature of the function to accept offsets
>> into both the GPIO-chip local pinspace and the pin controller
>> local pinspace.
>
> Reviewed-by: Stephen Warren 
>
> Although perhaps rename the new "offset" parameter to "gpio_base" or
> "gpio_offset", just like the existing "pin_base" rather than
> pin/base/offset?

OK I'll rename it...

I've also made a fat notice that this isn't currently covered
by the OF GPIO range bindings as a follow-on patch.

Thanks,
Linus Walleij

[PATCH] gpiolib: fix bug and clarify OF use of ranges

2012-11-20 Thread Linus Walleij

From: Linus Walleij 

In commit c905165f5946f56dca195871641bd4e488eca24a
"gpiolib: let gpiochip_add_pin_range() specify offset"
I forgot to update the OF use of the function
gpiochip_add_pin_range().

It turns out that this reveals a weakness in the
OF range mappings: ranges cannot currently be sparse.
So put in a comment so we can fix this later.

Signed-off-by: Linus Walleij 
---
 drivers/gpio/gpiolib-of.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpio/gpiolib-of.c b/drivers/gpio/gpiolib-of.c
index a40cd84..d542a14 100644
--- a/drivers/gpio/gpiolib-of.c
+++ b/drivers/gpio/gpiolib-of.c
@@ -238,8 +238,20 @@ static void of_gpiochip_add_pin_range(struct gpio_chip 
*chip)
if (!pctldev)
break;
 
+   /*
+* This assumes that the n GPIO pins are consecutive in the
+* GPIO number space, and that the pins are also consecutive
+* in their local number space. Currently it is not possible
+* to add different ranges for one and the same GPIO chip,
+* as the code assumes that we have one consecutive range
+* on both, mapping 1-to-1.
+*
+* TODO: make the OF bindings handle multiple sparse ranges
+* on the same GPIO chip.
+*/
ret = gpiochip_add_pin_range(chip,
 pinctrl_dev_get_name(pctldev),
+0, /* offset in gpiochip */
 pinspec.args[0],
 pinspec.args[1]);
 
-- 
1.7.11.3
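
The comment added by the patch above says the OF bindings currently assume
one consecutive range starting at GPIO offset 0. To make clear what
"multiple sparse ranges" would mean, here is a minimal user-space sketch.
struct pin_range and gpio_to_pin() are hypothetical and are not the kernel's
pinctrl data structures; the sketch only illustrates a lookup over several
(GPIO offset, pin base, length) triples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range descriptor; not the kernel's pinctrl_gpio_range. */
struct pin_range {
    unsigned int gpio_offset; /* first GPIO, relative to the chip */
    unsigned int pin_base;    /* first pin in the pin controller */
    unsigned int npins;       /* length of the consecutive run */
};

/*
 * Translate a chip-relative GPIO offset to a pin number by scanning the
 * ranges. Returns -1 if the offset falls into a hole between ranges.
 */
static int gpio_to_pin(const struct pin_range *ranges, size_t nranges,
                       unsigned int gpio)
{
    assert(ranges != NULL || nranges == 0);

    for (size_t i = 0; i < nranges; i++) {
        const struct pin_range *r = &ranges[i];

        if (gpio >= r->gpio_offset && gpio - r->gpio_offset < r->npins)
            return (int)(r->pin_base + (gpio - r->gpio_offset));
    }
    return -1;
}
```

With a single range whose gpio_offset is 0, this reduces to the 1-to-1
mapping that the OF code assumes today.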


[tip:x86/urgent] x86-64: Fix ordering of CFI directives and recent ASM_CLAC additions

2012-11-20 Thread tip-bot for Jan Beulich

Commit-ID:  ee4eb87be2c3f69c2c4d9f1c1d98e363a7ad18ab
Gitweb: http://git.kernel.org/tip/ee4eb87be2c3f69c2c4d9f1c1d98e363a7ad18ab
Author: Jan Beulich 
AuthorDate: Fri, 2 Nov 2012 11:18:39 +
Committer:  H. Peter Anvin 
CommitDate: Tue, 20 Nov 2012 22:23:57 -0800

x86-64: Fix ordering of CFI directives and recent ASM_CLAC additions

While these got added in the right place everywhere else, entry_64.S
is the odd one where they ended up before the initial CFI directive(s).
In order to cover the full code ranges, the CFI directive must be
first, though.

Signed-off-by: Jan Beulich 
Link: http://lkml.kernel.org/r/5093ba1f0278000a6...@nat28.tlf.novell.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/entry_64.S | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b51b2c7..1328fe4 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -995,8 +995,8 @@ END(interrupt)
 */
.p2align CONFIG_X86_L1_CACHE_SHIFT
 common_interrupt:
-   ASM_CLAC
XCPT_FRAME
+   ASM_CLAC
addq $-0x80,(%rsp)  /* Adjust vector to [-256,-1] range */
interrupt do_IRQ
/* 0(%rsp): old_rsp-ARGOFFSET */
@@ -1135,8 +1135,8 @@ END(common_interrupt)
  */
 .macro apicinterrupt num sym do_sym
 ENTRY(\sym)
-   ASM_CLAC
INTR_FRAME
+   ASM_CLAC
pushq_cfi $~(\num)
 .Lcommon_\sym:
interrupt \do_sym
@@ -1190,8 +1190,8 @@ apicinterrupt IRQ_WORK_VECTOR \
  */
 .macro zeroentry sym do_sym
 ENTRY(\sym)
-   ASM_CLAC
INTR_FRAME
+   ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
subq $ORIG_RAX-R15, %rsp
@@ -1208,8 +1208,8 @@ END(\sym)
 
 .macro paranoidzeroentry sym do_sym
 ENTRY(\sym)
-   ASM_CLAC
INTR_FRAME
+   ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
subq $ORIG_RAX-R15, %rsp
@@ -1227,8 +1227,8 @@ END(\sym)
 #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
 .macro paranoidzeroentry_ist sym do_sym ist
 ENTRY(\sym)
-   ASM_CLAC
INTR_FRAME
+   ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
subq $ORIG_RAX-R15, %rsp
@@ -1247,8 +1247,8 @@ END(\sym)
 
 .macro errorentry sym do_sym
 ENTRY(\sym)
-   ASM_CLAC
XCPT_FRAME
+   ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
@@ -1266,8 +1266,8 @@ END(\sym)
/* error code is on the stack already */
 .macro paranoiderrorentry sym do_sym
 ENTRY(\sym)
-   ASM_CLAC
XCPT_FRAME
+   ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15

[tip:x86/urgent] x86, microcode, AMD: Add support for family 16h processors

2012-11-20 Thread tip-bot for Boris Ostrovsky

Commit-ID:  36c46ca4f322a7bf89aad5462a3a1f61713edce7
Gitweb: http://git.kernel.org/tip/36c46ca4f322a7bf89aad5462a3a1f61713edce7
Author: Boris Ostrovsky 
AuthorDate: Thu, 15 Nov 2012 13:41:50 -0500
Committer:  H. Peter Anvin 
CommitDate: Tue, 20 Nov 2012 22:23:28 -0800

x86, microcode, AMD: Add support for family 16h processors

Add valid patch size for family 16h processors.

[ hpa: promoting to urgent/stable since it is hw enabling and trivial ]

Signed-off-by: Boris Ostrovsky 
Acked-by: Andreas Herrmann 
Link: 
http://lkml.kernel.org/r/1353004910-2204-1-git-send-email-boris.ostrov...@amd.com
Signed-off-by: H. Peter Anvin 
Cc: 
---
 arch/x86/kernel/microcode_amd.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/microcode_amd.c b/arch/x86/kernel/microcode_amd.c
index b3e67ba..efdec7c 100644
--- a/arch/x86/kernel/microcode_amd.c
+++ b/arch/x86/kernel/microcode_amd.c
@@ -190,6 +190,7 @@ static unsigned int verify_patch_size(int cpu, u32 
patch_size,
 #define F1XH_MPB_MAX_SIZE 2048
 #define F14H_MPB_MAX_SIZE 1824
 #define F15H_MPB_MAX_SIZE 4096
+#define F16H_MPB_MAX_SIZE 3458
 
switch (c->x86) {
case 0x14:
@@ -198,6 +199,9 @@ static unsigned int verify_patch_size(int cpu, u32 
patch_size,
case 0x15:
max_size = F15H_MPB_MAX_SIZE;
break;
+   case 0x16:
+   max_size = F16H_MPB_MAX_SIZE;
+   break;
default:
max_size = F1XH_MPB_MAX_SIZE;
break;

[tip:x86/urgent] x86-32: Export kernel_stack_pointer() for modules

2012-11-20 Thread tip-bot for H. Peter Anvin

Commit-ID:  cb57a2b4cff7edf2a4e32c0163200e9434807e0a
Gitweb: http://git.kernel.org/tip/cb57a2b4cff7edf2a4e32c0163200e9434807e0a
Author: H. Peter Anvin 
AuthorDate: Tue, 20 Nov 2012 22:21:02 -0800
Committer:  H. Peter Anvin 
CommitDate: Tue, 20 Nov 2012 22:23:23 -0800

x86-32: Export kernel_stack_pointer() for modules

Modules, in particular oprofile (and possibly other similar tools)
need kernel_stack_pointer(), so export it using EXPORT_SYMBOL_GPL().

Cc: Yang Wei 
Cc: Robert Richter 
Cc: Jun Zhang 
Cc: 
Link: http://lkml.kernel.org/r/20120912135059.gz8...@erda.amd.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/ptrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 2484e33..5e0596b 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -193,6 +194,7 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
 
return (unsigned long)regs;
 }
+EXPORT_SYMBOL_GPL(kernel_stack_pointer);
 
 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
 {

[tip:x86/urgent] x86-32: Fix invalid stack address while in softirq

2012-11-20 Thread tip-bot for Robert Richter

Commit-ID:  1022623842cb72ee4d0dbf02f6937f38c92c3f41
Gitweb: http://git.kernel.org/tip/1022623842cb72ee4d0dbf02f6937f38c92c3f41
Author: Robert Richter 
AuthorDate: Mon, 3 Sep 2012 20:54:48 +0200
Committer:  H. Peter Anvin 
CommitDate: Tue, 20 Nov 2012 22:23:20 -0800

x86-32: Fix invalid stack address while in softirq

In 32 bit the stack address provided by kernel_stack_pointer() may
point to an invalid range causing NULL pointer access or page faults
while in NMI (see trace below). This happens if called in softirq
context and if the stack is empty. The address at &regs->sp is then
out of range.

Fixing this by checking if regs and &regs->sp are in the same stack
context. Otherwise return the previous stack pointer stored in struct
thread_info. If that address is invalid too, return address of regs.
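
The fallback order described above can be sketched in user space. The patch
text is truncated in this archive, so the sketch follows the description in
the changelog only and is not the kernel function. The THREAD_SIZE value,
stack_base() and pick_stack_pointer() are assumptions made for illustration,
and plain integers stand in for the pointers involved.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE 8192u  /* assumed stack size for this sketch */

/* Base address of the THREAD_SIZE-aligned stack that contains addr. */
static uintptr_t stack_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(THREAD_SIZE - 1);
}

/*
 * regs_addr: address of the saved register frame
 * sp_addr:   address of the sp slot directly above the frame
 * prev_sp:   previous stack pointer saved for this stack, 0 if none
 */
static uintptr_t pick_stack_pointer(uintptr_t regs_addr, uintptr_t sp_addr,
                                    uintptr_t prev_sp)
{
    assert(regs_addr != 0);

    if (stack_base(regs_addr) == stack_base(sp_addr))
        return sp_addr;      /* same stack context: the slot is usable */
    if (prev_sp != 0)
        return prev_sp;      /* empty stack: use the previous stack */
    return regs_addr;        /* last resort: always a non-null address */
}
```

The alignment mask is what detects the empty-stack case: when the register
frame sits at the very top of a stack, the slot above it belongs to the next
THREAD_SIZE-aligned region and the two bases differ.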

 BUG: unable to handle kernel NULL pointer dereference at 000a
 IP: [] print_context_stack+0x6e/0x8d
 *pde = 
 Oops:  [#1] SMP
 Modules linked in:
 Pid: 4434, comm: perl Not tainted 3.6.0-rc3-oprofile-i386-standard-g4411a05 #4 
Hewlett-Packard HP xw9400 Workstation/0A1Ch
 EIP: 0060:[] EFLAGS: 00010093 CPU: 0
 EIP is at print_context_stack+0x6e/0x8d
 EAX: e000 EBX: 000a ECX: f4435f94 EDX: 000a
 ESI: f4435f94 EDI: f4435f94 EBP: f5409ec0 ESP: f5409ea0
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 CR0: 8005003b CR2: 000a CR3: 34ac9000 CR4: 07d0
 DR0:  DR1:  DR2:  DR3: 
 DR6: 0ff0 DR7: 0400
 Process perl (pid: 4434, ti=f5408000 task=f5637850 task.ti=f4434000)
 Stack:
  03e8 e000 1ffc f4e39b00  000a f4435f94 c155198c
  f5409ef0 c1003723 c155198c f5409f04  f5409edc  
  f5409ee8 f4435f94 f5409fc4 0001 f5409f1c c12dce1c  c155198c
 Call Trace:
  [] dump_trace+0x7b/0xa1
  [] x86_backtrace+0x40/0x88
  [] ? oprofile_add_sample+0x56/0x84
  [] oprofile_add_sample+0x75/0x84
  [] op_amd_check_ctrs+0x46/0x260
  [] profile_exceptions_notify+0x23/0x4c
  [] nmi_handle+0x31/0x4a
  [] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [] do_nmi+0xa0/0x2ff
  [] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [] nmi_stack_correct+0x28/0x2d
  [] ? ftrace_define_fields_irq_handler_entry+0x45/0x45
  [] ? do_softirq+0x4b/0x7f
  
  [] irq_exit+0x35/0x5b
  [] smp_apic_timer_interrupt+0x6c/0x7a
  [] apic_timer_interrupt+0x2a/0x30
 Code: 89 fe eb 08 31 c9 8b 45 0c ff 55 ec 83 c3 04 83 7d 10 00 74 0c 3b 5d 10 
73 26 3b 5d e4 73 0c eb 1f 3b 5d f0 76 1a 3b 5d e8 73 15 <8b> 13 89 d0 89 55 e0 
e8 ad 42 03 00 85 c0 8b 55 e0 75 a6 eb cc
 EIP: [] print_context_stack+0x6e/0x8d SS:ESP 0068:f5409ea0
 CR2: 000a
 ---[ end trace 62afee3481b00012 ]---
 Kernel panic - not syncing: Fatal exception in interrupt

V2:
* add comments to kernel_stack_pointer()
* always return a valid stack address by falling back to the address
  of regs

Reported-by: Yang Wei 
Cc: 
Signed-off-by: Robert Richter 
Link: http://lkml.kernel.org/r/20120912135059.gz8...@erda.amd.com
Signed-off-by: H. Peter Anvin 
Cc: Jun Zhang 
---
 arch/x86/include/asm/ptrace.h | 15 ---
 arch/x86/kernel/ptrace.c  | 28 
 2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index dcfde52..19f16eb 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -205,21 +205,14 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 }
 #endif
 
-/*
- * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  The previous stack will be directly underneath the saved
- * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
- *
- * This is valid only for kernel mode traps.
- */
-static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
-{
 #ifdef CONFIG_X86_32
-   return (unsigned long)(&regs->sp);
+extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
 #else
+static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
+{
return regs->sp;
-#endif
 }
+#endif
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index b00b33a..2484e33 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -166,6 +166,34 @@ static inline bool invalid_selector(u16 value)
 
 #define FLAG_MASK  FLAG_MASK_32
 
+/*
+ * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
+ * when it traps.  The previous stack will be directly underneath the saved
+ * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
+ *
+ * Now, if the stack is empty, '&regs->sp' is out of range. In this
+ * case we try to take the previous stack. To always return a non-null
+ * stack pointer we fall back to regs as stack if no previous stack
+ * exists.
+ *
+ * This is valid only for kernel mode traps.
+ */
+unsigned long kernel_stack_pointer(struct

Re: [PATCH] ARM: exynos: add UART3 to DEBUG_LL ports

2012-11-20 Thread Olof Johansson

On Tue, Nov 20, 2012 at 02:48:58PM -0800, Doug Anderson wrote:
> From: Olof Johansson 
> 
> UART3 is used for debugging on exynos5250-snow.
> 
> [dianders: cleaned commit message.]
> 
> Signed-off-by: Olof Johansson 
> Signed-off-by: Doug Anderson 

> 
> ---
>  arch/arm/Kconfig.debug|   11 +++
>  arch/arm/plat-samsung/Kconfig |1 +
>  2 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
> index 33a8930..35ba7dc 100644
> --- a/arch/arm/Kconfig.debug
> +++ b/arch/arm/Kconfig.debug
> @@ -355,6 +355,17 @@ choice
> The uncompressor code port configuration is now handled
> by CONFIG_S3C_LOWLEVEL_UART_PORT.
>  
> + config DEBUG_S3C_UART3
> + depends on PLAT_SAMSUNG


Sorry, the reason I hadn't re-posted this is that Kukjin had proposed
protecting users of platforms with <= 3 UARTs from selecting it. Adding
"depends on ARCH_EXYNOS4 || ARCH_EXYNOS5" should cover that. Can you add
that and repost, please?


-Olof

[PATCH v2 2/2] random: Account for entropy loss due to overwrites

2012-11-20 Thread H. Peter Anvin

From: "H. Peter Anvin" 

When we write entropy into a non-empty pool, we currently don't
account at all for the fact that we will probabilistically overwrite
some of the entropy in that pool.  This means that unless the pool is
fully empty, we are currently *guaranteed* to overestimate the amount
of entropy in the pool!

Assuming Shannon entropy with zero correlations we end up with an
exponentially decaying value of new entropy added:

entropy <- entropy + (pool_size - entropy) *
(1 - exp(-add_entropy/pool_size))

However, calculations involving fractional exponentials are not
practical in the kernel, so apply a piecewise linearization:

  For add_entropy <= pool_size then

  (1 - exp(-add_entropy/pool_size)) >= (add_entropy/pool_size)*0.632...

  ... so we can approximate the exponential with
  add_entropy/(pool_size*2) and still be on the
  safe side by adding at most one pool_size at a time.

In order for the loop not to take arbitrary amounts of time if a bad
ioctl is received, terminate if we are within one bit of full.  This
way the loop is guaranteed to terminate after no more than
log2(poolsize) iterations, no matter what the input value is.  The
vast majority of the time the loop will be executed exactly once.

The piecewise linearization is very conservative, approaching 1/2 of
the usable input value for small inputs, however, our entropy
estimation is pretty weak at best, especially for small values; we
have no handle on correlation; and the Shannon entropy measure (Rényi
entropy of order 1) is not the correct one to use in the first place,
but rather the correct entropy measure is the min-entropy, the Rényi
entropy of infinite order.

As such, this conservatism seems more than justified.  Note, however,
that attempting to add one bit of entropy will never succeed; nor will
two bits unless the pool is completely empty.  These roundoff
artifacts could be improved by using fixed-point arithmetic and adding
some number of fractional entropy bits.
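
The piecewise linearization described above can be tried in user space. The
diff is truncated in this archive, so the following is a minimal sketch
reconstructed from the description only, not the code from the patch.
credit() and its parameters are hypothetical names; the pool size is assumed
to be a power of two given as pool_shift, and whole bits are used throughout.

```c
#include <assert.h>

/*
 * Credit nbits of new entropy to a pool holding `entropy` bits.
 * Each pass adds (pool_size - entropy) * chunk / (2 * pool_size), with
 * chunk capped at one pool_size, and the loop stops once the pool is
 * within one bit of full. Returns the new entropy count.
 */
static int credit(int entropy, int nbits, int pool_shift)
{
    const int pool_size = 1 << pool_shift;

    assert(pool_shift > 0 && pool_shift < 30);
    assert(entropy >= 0 && entropy <= pool_size && nbits >= 0);

    while (nbits > 0 && entropy < pool_size - 1) {
        int chunk = nbits < pool_size ? nbits : pool_size;

        /* Divide by 2 * pool_size with a shift; 64-bit product. */
        entropy += (int)(((long long)(pool_size - entropy) * chunk)
                         >> (pool_shift + 1));
        nbits -= chunk;
    }
    return entropy;
}
```

With a 4096-bit pool (pool_shift = 12), crediting one bit to an empty pool
adds nothing and crediting two bits adds one, which are the roundoff
artifacts mentioned above. A huge request halves the remaining gap on each
pass, so it stops at 4095 after 12 passes.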

[ v2: rely on the previous patch for poolbitshift ]

Signed-off-by: H. Peter Anvin 
Cc: DJ Johnston 
Cc: 
---
 drivers/char/random.c | 56 +++
 1 file changed, 48 insertions(+), 8 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b522338..c5c68cf 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -272,10 +272,12 @@
 /*
  * Configuration information
  */
-#define INPUT_POOL_WORDS 128
-#define OUTPUT_POOL_WORDS 32
-#define SEC_XFER_SIZE 512
-#define EXTRACT_SIZE 10
+#define INPUT_POOL_SHIFT   12
+#define INPUT_POOL_WORDS   (1 << (INPUT_POOL_SHIFT-5))
+#define OUTPUT_POOL_SHIFT  10
+#define OUTPUT_POOL_WORDS  (1 << (OUTPUT_POOL_SHIFT-5))
+#define SEC_XFER_SIZE  512
+#define EXTRACT_SIZE   10
 
 #define LONGS(x) (((x) + sizeof(unsigned long) - 1)/sizeof(unsigned long))
 
@@ -419,7 +421,7 @@ module_param(debug, bool, 0644);
 struct entropy_store;
 struct entropy_store {
/* read-only data: */
-   struct poolinfo *poolinfo;
+   const struct poolinfo *poolinfo;
__u32 *pool;
const char *name;
struct entropy_store *pull;
@@ -581,11 +583,13 @@ static void fast_mix(struct fast_pool *f, const void *in, 
int nbytes)
 }
 
 /*
- * Credit (or debit) the entropy store with n bits of entropy
+ * Credit (or debit) the entropy store with n bits of entropy.
+ * The nbits value is given in units of 2^-16 bits, i.e. 0x10000 == 1 bit.
  */
 static void credit_entropy_bits(struct entropy_store *r, int nbits)
 {
int entropy_count, orig;
+   const int pool_size = r->poolinfo->poolbits;
 
if (!nbits)
return;
@@ -594,12 +598,48 @@ static void credit_entropy_bits(struct entropy_store *r, 
int nbits)
 retry:
entropy_count = orig = ACCESS_ONCE(r->entropy_count);
entropy_count += nbits;
+   if (nbits < 0) {
+   /* Debit. */
+   entropy_count += nbits;
+   } else {
+   /*
+* Credit: we have to account for the possibility of
+* overwriting already present entropy.  Even in the
+* ideal case of pure Shannon entropy, new contributions
+* approach the full value asymptotically:
+*
+* entropy <- entropy + (pool_size - entropy) *
+*  (1 - exp(-add_entropy/pool_size))
+*
+* For add_entropy <= pool_size then
+* (1 - exp(-add_entropy/pool_size)) >=
+*(add_entropy/pool_size)*0.632...
+* so we can approximate the exponential with
+* add_entropy/(pool_size*2) and still be on the
+* safe side by adding at most one pool_size at a time.
+*
+* The use of pool_size-1 in the while statement is to
+* prevent

[PATCH v2 1/2] random: Statically compute poolbitshift, poolbytes, poolbits

2012-11-20 Thread H. Peter Anvin

From: "H. Peter Anvin" 

Use a macro to statically compute poolbitshift (will be used in a
subsequent patch), poolbytes, and poolbits.  On virtually all
architectures the cost of a memory load with an offset is the same as
that of a plain memory load.

It is still possible for this to generate worse code since the C
compiler doesn't know the fixed relationship between these fields, but
that is somewhat unlikely.
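
As a rough userspace illustration of the idea (not the kernel code itself), one macro can emit all of the derived fields so that they cannot drift apart. `ILOG2_POW2`, `struct poolinfo_sketch` and `poolinfo_is_consistent` are hypothetical names made up for this sketch; the patch itself uses the kernel's `ilog2()` and the real `struct poolinfo`, and the tap fields are left out here.

```c
#include <assert.h>

/*
 * Hypothetical constant-expression log2 for the few power-of-two pool
 * sizes that appear in the table; the patch uses ilog2() instead.
 */
#define ILOG2_POW2(x) \
	((x) == 32 ? 5 : (x) == 64 ? 6 : (x) == 128 ? 7 : (x) == 256 ? 8 : \
	 (x) == 512 ? 9 : (x) == 1024 ? 10 : (x) == 2048 ? 11 : -1)

/* Expands to: poolbitshift, poolwords, poolbytes, poolbits. */
#define S(x) ILOG2_POW2(x) + 5, (x), (x) * 4, (x) * 32

struct poolinfo_sketch {
	int poolbitshift, poolwords, poolbytes, poolbits;
};

static const struct poolinfo_sketch poolinfo_sketch_table[] = {
	{ S(128) },	/* 128 words, like the input pool */
	{ S(32) },	/* 32 words, like the output pools */
};

/* Checks the fixed relationships between the statically computed fields. */
static int poolinfo_is_consistent(const struct poolinfo_sketch *p)
{
	return p->poolbits == (1 << p->poolbitshift) &&
	       p->poolbytes == p->poolwords * 4 &&
	       p->poolbits == p->poolbytes * 8;
}
```

For the 128-word entry this yields a bit shift of 12, 512 bytes and 4096 bits, all fixed at compile time.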

Signed-off-by: H. Peter Anvin 
Cc: 
---
 drivers/char/random.c | 39 +++
 1 file changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 85e81ec..b522338 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -309,46 +309,45 @@ static DEFINE_PER_CPU(int, trickle_count);
  * scaled squared error sum) except for the last tap, which is 1 to
  * get the twisting happening as fast as possible.
  */
+
 static struct poolinfo {
-   int poolwords;
+   int poolbitshift, poolwords, poolbytes, poolbits;
+#define S(x) ilog2(x)+5, (x), (x)*4, (x)*32
int tap1, tap2, tap3, tap4, tap5;
 } poolinfo_table[] = {
/* x^128 + x^103 + x^76 + x^51 +x^25 + x + 1 -- 105 */
-   { 128,  103,76, 51, 25, 1 },
+   { S(128),   103,76, 51, 25, 1 },
/* x^32 + x^26 + x^20 + x^14 + x^7 + x + 1 -- 15 */
-   { 32,   26, 20, 14, 7,  1 },
+   { S(32),26, 20, 14, 7,  1 },
 #if 0
/* x^2048 + x^1638 + x^1231 + x^819 + x^411 + x + 1  -- 115 */
-   { 2048, 1638,   1231,   819,411,1 },
+   { S(2048),  1638,   1231,   819,411,1 },
 
/* x^1024 + x^817 + x^615 + x^412 + x^204 + x + 1 -- 290 */
-   { 1024, 817,615,412,204,1 },
+   { S(1024),  817,615,412,204,1 },
 
/* x^1024 + x^819 + x^616 + x^410 + x^207 + x^2 + 1 -- 115 */
-   { 1024, 819,616,410,207,2 },
+   { S(1024),  819,616,410,207,2 },
 
/* x^512 + x^411 + x^308 + x^208 + x^104 + x + 1 -- 225 */
-   { 512,  411,308,208,104,1 },
+   { S(512),   411,308,208,104,1 },
 
/* x^512 + x^409 + x^307 + x^206 + x^102 + x^2 + 1 -- 95 */
-   { 512,  409,307,206,102,2 },
+   { S(512),   409,307,206,102,2 },
/* x^512 + x^409 + x^309 + x^205 + x^103 + x^2 + 1 -- 95 */
-   { 512,  409,309,205,103,2 },
+   { S(512),   409,309,205,103,2 },
 
/* x^256 + x^205 + x^155 + x^101 + x^52 + x + 1 -- 125 */
-   { 256,  205,155,101,52, 1 },
+   { S(256),   205,155,101,52, 1 },
 
/* x^128 + x^103 + x^78 + x^51 + x^27 + x^2 + 1 -- 70 */
-   { 128,  103,78, 51, 27, 2 },
+   { S(128),   103,78, 51, 27, 2 },
 
/* x^64 + x^52 + x^39 + x^26 + x^14 + x + 1 -- 15 */
-   { 64,   52, 39, 26, 14, 1 },
+   { S(64),52, 39, 26, 14, 1 },
 #endif
 };
 
-#define POOLBITS   poolwords*32
-#define POOLBYTES  poolwords*4
-
 /*
  * For the purposes of better mixing, we use the CRC-32 polynomial as
  * well to make a twisted Generalized Feedback Shift Reigster
@@ -599,8 +598,8 @@ retry:
if (entropy_count < 0) {
DEBUG_ENT("negative entropy/overflow\n");
entropy_count = 0;
-   } else if (entropy_count > r->poolinfo->POOLBITS)
-   entropy_count = r->poolinfo->POOLBITS;
+   } else if (entropy_count > r->poolinfo->poolbits)
+   entropy_count = r->poolinfo->poolbits;
    if (cmpxchg(&r->entropy_count, orig, entropy_count) != orig)
        goto retry;
 
@@ -815,7 +814,7 @@ static void xfer_secondary_pool(struct entropy_store *r, size_t nbytes)
__u32   tmp[OUTPUT_POOL_WORDS];
 
if (r->pull && r->entropy_count < nbytes * 8 &&
-   r->entropy_count < r->poolinfo->POOLBITS) {
+   r->entropy_count < r->poolinfo->poolbits) {
/* If we're limited, always leave two wakeup worth's BITS */
int rsvd = r->limit ? 0 : random_read_wakeup_thresh/4;
int bytes = nbytes;
@@ -856,7 +855,7 @@ static size_t account(struct entropy_store *r, size_t nbytes, int min,
    /* Hold lock while accounting */
    spin_lock_irqsave(&r->lock, flags);
 
-   BUG_ON(r->entropy_count > r->poolinfo->POOLBITS);
+   BUG_ON(r->entropy_count > r->poolinfo->poolbits);
DEBUG_ENT("trying to extract %zu bits from %s\n",
  nbytes * 8, r->name);
 
@@ -1100,7 +1099,7 @@ static void init_std_data(struct entropy_store *r)
r->entropy_total = 0;
r->last_data_init = false;
    mix_pool_bytes(r, &now, sizeof(now), NULL);
-   for (i = r->poolinfo->POOLBYTES; i > 0; i -= sizeof(rv)) {
+   for (i =

[PATCH v2 0/2] random: Account for entropy loss due to overwrites

2012-11-20 Thread H. Peter Anvin

From: "H. Peter Anvin" 

When we write entropy into a non-empty pool, we currently don't
account at all for the fact that we will probabilistically overwrite
some of the entropy in that pool.  This means that unless the pool is
fully empty, we are currently *guaranteed* to overestimate the amount
of entropy in the pool!

This version of the patchset avoids manually duplicating information
by using a macro.  This removes *all* dynamic computation of derived
pool information and replaces them with static information: on just
about every architecture accessing pointer+offset is no more expensive
than just plain pointer, and this lets us get the information we
actually need from the start.
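
To make the accounting concrete, here is a minimal userspace sketch of crediting entropy with overwrite loss. It is an illustration, not the patch: it works in whole bits rather than the fractional units the patch uses, the function name `credit_with_overwrite` is made up, and it replaces the exponential 1 - exp(-add/pool_size) with the conservative approximation add/(2*pool_size), crediting at most one pool size per step.

```c
#include <assert.h>

/*
 * Credit `add` bits to a pool that currently holds `entropy` bits out of
 * `pool_size`. Only the still-empty fraction of the pool can gain
 * entropy, so each step adds
 *     (pool_size - entropy) * chunk / (2 * pool_size)
 * which underestimates the exponential form while chunk <= pool_size.
 */
static int credit_with_overwrite(int entropy, int add, int pool_size)
{
	assert(pool_size > 0 && entropy >= 0 && entropy <= pool_size && add >= 0);

	while (add > 0) {
		int chunk = add < pool_size ? add : pool_size;

		/* 64-bit intermediate keeps the product from overflowing. */
		entropy += (int)(((long long)(pool_size - entropy) * chunk) /
				 (2LL * pool_size));
		add -= chunk;
	}
	return entropy;
}
```

With a 4096-bit pool, crediting 4096 bits to an empty pool gives 2048 rather than 4096, and crediting anything to a full pool leaves it unchanged, so the estimate can approach but never exceed the pool size.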
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] ARM: EXYNOS: Add aliases for i2c controller for exynos4

2012-11-20 Thread Olof Johansson

On Tue, Nov 20, 2012 at 02:27:03PM -0800, Doug Anderson wrote:
> This is similar to a recent commit for exynos5250 titled:
>   ARM: EXYNOS: Add aliases for i2c controller
> 
> Adding aliases will be useful to prevent warnings in a future
> change.  See:
>   i2c: s3c2410: Get the i2c bus number from alias id
> 
> Signed-off-by: Doug Anderson 

Acked-by: Olof Johansson 

This can go in independently of the pending comment on the i2c driver change
(that it should be done in the core, which makes sense).


-Olof

[PATCH] of: Have of_device_add call platform_device_add rather than device_add

2012-11-20 Thread Jason Gunthorpe

This gives platform_device_add a chance to call insert_resource
on all of the resources from OF. At a minimum this fills in /proc/iomem
and presumably makes resource tracking and conflict detection work
better.

Signed-off-by: Jason Gunthorpe 
---
 drivers/of/device.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Tested on PPC32 and ARM32 embedded kernels.

diff --git a/drivers/of/device.c b/drivers/of/device.c
index 4c74e4f..a5b67dc 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -62,7 +62,7 @@ int of_device_add(struct platform_device *ofdev)
    if (!ofdev->dev.parent)
        set_dev_node(&ofdev->dev, of_node_to_nid(ofdev->dev.of_node));
 
-   return device_add(&ofdev->dev);
+   return platform_device_add(ofdev);
 }
 
 int of_device_register(struct platform_device *pdev)
-- 
1.7.4.1


Re: [PATCH 0/3] ARM: zynq: ARCH_MULTIPLATFORM support

2012-11-20 Thread Olof Johansson

On Tue, Nov 20, 2012 at 01:30:58PM +0100, Michal Simek wrote:
> Hi Josh, Arnd and Olof,
> 
> 2012/11/19 Josh Cartwright :
> > Michal-
> >
> > Here's an attempt at supporting ARCH_MULTIPLATFORM on Zynq.  I've gotten
> > a multiplatform kernel building and booting on the zc702, although I
> > haven't tried to boot the same image on another non-Zynq board, due to
> > lack of available hardware.
> >
> > It would be super awesome if this set could land in 3.8, but I know
> > we're running out of time there.  I wouldn't be too heartbroken if it
> > didn't make it.
> >
> > This patchset is on top of your arm-next branch and with the
> > debug_ll_init support patch @ arm-soc/devel/debug_ll_init.
> >
> > Patch 1 drops the early TTC mapping.  It is not necessary, since the TTC
> > driver now supports pulling mapping info from the device tree.
> >
> > Patch 2 converts zynq to use the debug_ll_init() infrastructure slated
> > to go into 3.8.
> >
> > Patch 3 is the bulk of the set, moving logic around within
> > mach-zynq/include, and setting up the necessary build magic to get Zynq
> > building w/ CONFIG_ARCH_MULTIPLATFORM.
> >
> 
> I wanted to look at it too today. You were faster!
> I have tested your patches and all works for me.
> I have also added them to my arm-next branch.
> 
> I don't have other ARM boards to test on, but that shouldn't be a big
> problem because others will test it.
> 
> We are outside the merge window, so we should wait for the next one.
> Anyway, Arnd/Olof, if there is any option to get this into v3.8, please
> let me know.

Feel free to post a pull request; if things look clean we can probably pick it
up.


-Olof (playing good cop for once :-)

[PATCH] HID: hidraw: fix nonblock read return EAGAIN after device removed

2012-11-20 Thread founder.fang

In a nonblocking read, the (file->f_flags & O_NONBLOCK) check comes
first in the wait loop and is always true, so the signal_pending and
device-exists checks never get a chance to run. User-mode code
therefore always gets EAGAIN, even after the device has been removed.
Moving the nonblock check after the other two checks fixes this.
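
The effect of the reordering can be modelled in plain userspace C. This is only a sketch of the decision order, not driver code: `struct read_state_sketch` and `empty_buffer_status` are invented for illustration, and `ERESTARTSYS` is defined locally because it is a kernel-internal value that userspace `errno.h` does not provide.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512	/* kernel-internal errno, absent from userspace headers */
#endif

/* Simplified view of one pass through the wait loop with an empty buffer. */
struct read_state_sketch {
	bool nonblock;		/* file was opened with O_NONBLOCK */
	bool signal_pending;	/* a signal is waiting for the task */
	bool device_exists;	/* device has not been removed */
};

/* Returns the error a read would report, or 0 if it should sleep and retry. */
static int empty_buffer_status(const struct read_state_sketch *s)
{
	if (s->signal_pending)
		return -ERESTARTSYS;
	if (!s->device_exists)
		return -EIO;
	if (s->nonblock)
		return -EAGAIN;	/* checked last so that removal is still reported */
	return 0;
}
```

With this order, a nonblocking reader of a removed device sees -EIO instead of polling on -EAGAIN forever.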

Signed-off-by: Founder Fang 

--- hid-git/drivers/hid/hidraw.c.orig   2012-11-21 14:37:04.977106383 +0800
+++ hid-git/drivers/hid/hidraw.c2012-11-21 14:40:35.882152200 +0800
@@ -57,10 +57,6 @@ static ssize_t hidraw_read(struct file *
set_current_state(TASK_INTERRUPTIBLE);

while (list->head == list->tail) {
-   if (file->f_flags & O_NONBLOCK) {
-   ret = -EAGAIN;
-   break;
-   }
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
@@ -69,6 +65,10 @@ static ssize_t hidraw_read(struct file *
ret = -EIO;
break;
}
+   if (file->f_flags & O_NONBLOCK) {
+   ret = -EAGAIN;
+   break;
+   }

        /* allow O_NONBLOCK to work well from other threads */
        mutex_unlock(&list->read_mutex);

[PATCH v3 02/12] x86, boot: Move lldt/ltr out of 64bit code section

2012-11-20 Thread Yinghai Lu

commit 08da5a2ca

    x86_64: Early segment setup for VT

added lldt/ltr to clean up more segments.

That code was put in the code64 section, but it uses a gdt that is only
loaded on the code32 path.

That breaks booting with a 64bit bootloader that does not go through
the code32 path and enters at startup_64 directly, so it has a
different gdt.

Move those lines into code32, after the gdt is loaded.

Signed-off-by: Yinghai Lu 
Cc: Zachary Amsden 
Cc: Matt Fleming 
---
 arch/x86/boot/compressed/head_64.S |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 2c3cee4..375af23 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -154,6 +154,12 @@ ENTRY(startup_32)
    btsl    $_EFER_LME, %eax
    wrmsr
 
+   /* After gdt is loaded */
+   xorl    %eax, %eax
+   lldt    %ax
+   movl    $0x20, %eax
+   ltr     %ax
+
/*
 * Setup for the jump to 64bit mode
 *
@@ -245,9 +251,6 @@ preferred_addr:
    movl    %eax, %ss
    movl    %eax, %fs
    movl    %eax, %gs
-   lldt    %ax
-   movl    $0x20, %eax
-   ltr     %ax
 
/*
 * Compute the decompressed kernel start address.  It is where
-- 
1.7.7


[PATCH v3 08/12] x86, boot: Don't check if cmd_line_ptr is accessible in misc/decompressor()

2012-11-20 Thread Yinghai Lu

At that stage we are already in 32bit protected mode or 64bit mode,
so we do not need to check whether the pointer is below 1M.

This matters when we come from another boot loader (kexec) instead of
the boot/ code path.

Move the accessibility check out of __cmdline_find_option() and into
its real-mode callers, so that misc.c can parse the command line and
print debug output.
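
The split of responsibilities can be sketched in userspace C as follows. The function names and `REAL_MODE_LIMIT` are stand-ins invented for this illustration; the actual parsing is omitted, and only the placement of the two checks is shown.

```c
#include <assert.h>
#include <stdint.h>

/* 1 MiB: the limit applied by the real-mode boot/ code path. */
#define REAL_MODE_LIMIT 0x100000u

/* Stand-in for __cmdline_find_option(): only rejects a missing command line. */
static int inner_find_option_sketch(uint32_t cmdline_ptr)
{
	if (!cmdline_ptr)
		return -1;	/* no command line */
	return 0;		/* parsing omitted in this sketch */
}

/* Stand-in for the boot.h wrapper, which still applies the 1 MiB check. */
static int real_mode_find_option_sketch(uint32_t cmd_line_ptr)
{
	if (cmd_line_ptr >= REAL_MODE_LIMIT)
		return -1;	/* inaccessible from real mode */
	return inner_find_option_sketch(cmd_line_ptr);
}
```

A caller that already runs in protected or long mode can use the inner function directly and accept a command line located above 1 MiB.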

Signed-off-by: Yinghai Lu 
---
 arch/x86/boot/boot.h|   14 --
 arch/x86/boot/cmdline.c |8 
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 18997e5..7fadf80 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -289,12 +289,22 @@ int __cmdline_find_option(u32 cmdline_ptr, const char *option, char *buffer, int
 int __cmdline_find_option_bool(u32 cmdline_ptr, const char *option);
 static inline int cmdline_find_option(const char *option, char *buffer, int bufsize)
 {
-   return __cmdline_find_option(boot_params.hdr.cmd_line_ptr, option, buffer, bufsize);
+   u32 cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+
+   if (cmd_line_ptr >= 0x100000)
+       return -1;  /* inaccessible */
+
+   return __cmdline_find_option(cmd_line_ptr, option, buffer, bufsize);
 }
 
 static inline int cmdline_find_option_bool(const char *option)
 {
-   return __cmdline_find_option_bool(boot_params.hdr.cmd_line_ptr, option);
+   u32 cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+
+   if (cmd_line_ptr >= 0x100000)
+       return -1;  /* inaccessible */
+
+   return __cmdline_find_option_bool(cmd_line_ptr, option);
 }
 
 
diff --git a/arch/x86/boot/cmdline.c b/arch/x86/boot/cmdline.c
index 6b3b6f7..768f00f 100644
--- a/arch/x86/boot/cmdline.c
+++ b/arch/x86/boot/cmdline.c
@@ -41,8 +41,8 @@ int __cmdline_find_option(u32 cmdline_ptr, const char *option, char *buffer, int
        st_bufcpy   /* Copying this to buffer */
    } state = st_wordstart;
 
-   if (!cmdline_ptr || cmdline_ptr >= 0x100000)
-       return -1;  /* No command line, or inaccessible */
+   if (!cmdline_ptr)
+       return -1;  /* No command line */
 
cptr = cmdline_ptr & 0xf;
set_fs(cmdline_ptr >> 4);
@@ -111,8 +111,8 @@ int __cmdline_find_option_bool(u32 cmdline_ptr, const char *option)
        st_wordskip,    /* Miscompare, skip */
    } state = st_wordstart;
 
-   if (!cmdline_ptr || cmdline_ptr >= 0x100000)
-       return -1;  /* No command line, or inaccessible */
+   if (!cmdline_ptr)
+       return -1;  /* No command line */
 
cptr = cmdline_ptr & 0xf;
set_fs(cmdline_ptr >> 4);
-- 
1.7.7


[PATCH v3 01/12] x86, boot: move verify_cpu.S after 0x200

2012-11-20 Thread Yinghai Lu

We are short of space before 0x200, which is the entry point for
startup_64.

And we cannot move startup_64 to another offset, since bootloaders may
depend on it (ABI?).

Instead, move the verify_cpu function further down, which avoids
extra code to jmp back and forth.

Signed-off-by: Yinghai Lu 
Cc: Matt Fleming 
---
 arch/x86/boot/compressed/head_64.S |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
index 2c4b171..2c3cee4 100644
--- a/arch/x86/boot/compressed/head_64.S
+++ b/arch/x86/boot/compressed/head_64.S
@@ -182,8 +182,6 @@ no_longmode:
hlt
jmp 1b
 
-#include "../../kernel/verify_cpu.S"
-
/*
 * Be careful here startup_64 needs to be at a predictable
 * address so I can export it in an ELF header.  Bootloaders
@@ -349,6 +347,9 @@ relocated:
  */
jmp *%rbp
 
+   .code32
+#include "../../kernel/verify_cpu.S"
+
.data
 gdt:
.word   gdt_end - gdt
-- 
1.7.7


[PATCH v3 07/12] x86, boot: add get_cmd_line_ptr()

2012-11-20 Thread Yinghai Lu

Add get_cmd_line_ptr() as a single accessor for the command line
pointer. A later patch will check ext_cmd_line_ptr there as well.
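
One plausible shape for the extended accessor is sketched below, under the assumption that `ext_cmd_line_ptr` carries the upper 32 bits of the address and `cmd_line_ptr` the lower 32 bits. That layout is an assumption of this sketch, not something this patch states, and `combine_cmd_line_ptr` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical combination of the two fields: low 32 bits from
 * cmd_line_ptr, high 32 bits from ext_cmd_line_ptr (assumed layout).
 */
static uint64_t combine_cmd_line_ptr(uint32_t cmd_line_ptr,
				     uint32_t ext_cmd_line_ptr)
{
	uint64_t ptr = cmd_line_ptr;

	ptr |= (uint64_t)ext_cmd_line_ptr << 32;
	return ptr;
}
```

Keeping the lookup in one helper means that such a change would touch a single function instead of every caller.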

Signed-off-by: Yinghai Lu 
---
 arch/x86/boot/compressed/cmdline.c |   10 --
 arch/x86/kernel/head64.c   |   13 +++--
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/x86/boot/compressed/cmdline.c b/arch/x86/boot/compressed/cmdline.c
index 10f6b11..b4c913c 100644
--- a/arch/x86/boot/compressed/cmdline.c
+++ b/arch/x86/boot/compressed/cmdline.c
@@ -13,13 +13,19 @@ static inline char rdfs8(addr_t addr)
return *((char *)(fs + addr));
 }
 #include "../cmdline.c"
+static unsigned long get_cmd_line_ptr(void)
+{
+   unsigned long cmd_line_ptr = real_mode->hdr.cmd_line_ptr;
+
+   return cmd_line_ptr;
+}
 int cmdline_find_option(const char *option, char *buffer, int bufsize)
 {
-   return __cmdline_find_option(real_mode->hdr.cmd_line_ptr, option, buffer, bufsize);
+   return __cmdline_find_option(get_cmd_line_ptr(), option, buffer, bufsize);
 }
 int cmdline_find_option_bool(const char *option)
 {
-   return __cmdline_find_option_bool(real_mode->hdr.cmd_line_ptr, option);
+   return __cmdline_find_option_bool(get_cmd_line_ptr(), option);
 }
 
 #endif
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 00e612a..3ac6cad 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -41,13 +41,22 @@ static void __init clear_bss(void)
   (unsigned long) __bss_stop - (unsigned long) __bss_start);
 }
 
+static unsigned long get_cmd_line_ptr(void)
+{
+   unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+
+   return cmd_line_ptr;
+}
+
 static void __init copy_bootdata(char *real_mode_data)
 {
char * command_line;
+   unsigned long cmd_line_ptr;
 
    memcpy(&boot_params, real_mode_data, sizeof boot_params);
-   if (boot_params.hdr.cmd_line_ptr) {
-   command_line = __va(boot_params.hdr.cmd_line_ptr);
+   cmd_line_ptr = get_cmd_line_ptr();
+   if (cmd_line_ptr) {
+   command_line = __va(cmd_line_ptr);
memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
}
 }
-- 
1.7.7


[PATCH v3 05/12] x86: Merge early_reserve_initrd for 32bit and 64bit

2012-11-20 Thread Yinghai Lu

The 32bit and 64bit versions are the same, so move them out of
head32.c/head64.c into setup.c.

We are using memblock, which handles overlapping reservations
properly, so we don't need to reserve the initrd very early just to
hold its location. We only need to make sure it is reserved before
memblock is used to find free memory.
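
The range handed to memblock_reserve() can be illustrated with a small userspace calculation. The 4096-byte page size and the names `PAGE_ALIGN_SKETCH` and `initrd_reserve_len` are assumptions of this sketch; the early-return conditions mirror the checks in the merged early_reserve_initrd().

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096ULL	/* assumed page size */
#define PAGE_ALIGN_SKETCH(x) \
	(((x) + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1))

/*
 * Length to reserve for an initrd starting at `image`, or 0 when the
 * bootloader provided none. Only the end is rounded up, matching the
 * "assume only end is not page aligned" comment in the patch.
 */
static uint64_t initrd_reserve_len(uint8_t type_of_loader, uint64_t image,
				   uint64_t size)
{
	if (!type_of_loader || !image || !size)
		return 0;
	return PAGE_ALIGN_SKETCH(image + size) - image;
}
```

For an image at 0x100000 with size 0x1234, the end rounds up to 0x102000 and the reserved length is 0x2000.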

Signed-off-by: Yinghai Lu 
---
 arch/x86/kernel/head32.c |   11 ---
 arch/x86/kernel/head64.c |   11 ---
 arch/x86/kernel/setup.c  |   22 ++
 3 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index c18f59d..4c52efc 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -33,17 +33,6 @@ void __init i386_start_kernel(void)
memblock_reserve(__pa_symbol(&_text),
 __pa_symbol(&__bss_stop) - __pa_symbol(&_text));
 
-#ifdef CONFIG_BLK_DEV_INITRD
-   /* Reserve INITRD */
-   if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
-   /* Assume only end is not page aligned */
-   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
-   u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
-   u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
-   memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
-   }
-#endif
-
/* Call the subarch specific early setup function */
switch (boot_params.hdr.hardware_subarch) {
case X86_SUBARCH_MRST:
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 037df57..00e612a 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -100,17 +100,6 @@ void __init x86_64_start_reservations(char *real_mode_data)
memblock_reserve(__pa_symbol(&_text),
 __pa_symbol(&__bss_stop) - __pa_symbol(&_text));
 
-#ifdef CONFIG_BLK_DEV_INITRD
-   /* Reserve INITRD */
-   if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
-   /* Assume only end is not page aligned */
-   unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
-   unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
-       unsigned long ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
-   memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
-   }
-#endif
-
reserve_ebda_region();
 
/*
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6d29d1f..ee6d267 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -364,6 +364,19 @@ static u64 __init get_mem_size(unsigned long limit_pfn)
 
return mapped_pages << PAGE_SHIFT;
 }
+static void __init early_reserve_initrd(void)
+{
+   /* Assume only end is not page aligned */
+   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+   u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
+   u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
+
+   if (!boot_params.hdr.type_of_loader ||
+   !ramdisk_image || !ramdisk_size)
+   return; /* No initrd provided by bootloader */
+
+   memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
+}
 static void __init reserve_initrd(void)
 {
/* Assume only end is not page aligned */
@@ -390,10 +403,6 @@ static void __init reserve_initrd(void)
if (pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
PFN_DOWN(ramdisk_end))) {
/* All are mapped, easy case */
-   /*
-* don't need to reserve again, already reserved early
-* in i386_start_kernel
-*/
initrd_start = ramdisk_image + PAGE_OFFSET;
initrd_end = initrd_start + ramdisk_size;
return;
@@ -404,6 +413,9 @@ static void __init reserve_initrd(void)
memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
 }
 #else
+static void __init early_reserve_initrd(void)
+{
+}
 static void __init reserve_initrd(void)
 {
 }
@@ -665,6 +677,8 @@ early_param("reservelow", parse_reservelow);
 
 void __init setup_arch(char **cmdline_p)
 {
+   early_reserve_initrd();
+
 #ifdef CONFIG_X86_32
    memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
visws_early_detect();
-- 
1.7.7


[PATCH v3 06/12] x86: add get_ramdisk_image/size

2012-11-20 Thread Yinghai Lu

There are several places that look up ramdisk information early, for
reserving and relocating.

Use functions to make the code more readable and consistent.

A later patch will add ext_ramdisk_image/size to those functions to
support loading the ramdisk above 4G.

Signed-off-by: Yinghai Lu 
---
 arch/x86/kernel/setup.c |   29 +
 1 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ee6d267..194e151 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -298,12 +298,25 @@ static void __init reserve_brk(void)
 
 #ifdef CONFIG_BLK_DEV_INITRD
 
+static u64 __init get_ramdisk_image(void)
+{
+   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+
+   return ramdisk_image;
+}
+static u64 __init get_ramdisk_size(void)
+{
+   u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+
+   return ramdisk_size;
+}
+
 #define MAX_MAP_CHUNK  (NR_FIX_BTMAPS << PAGE_SHIFT)
 static void __init relocate_initrd(void)
 {
/* Assume only end is not page aligned */
-   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
-   u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
+   u64 ramdisk_image = get_ramdisk_image();
+   u64 ramdisk_size  = get_ramdisk_size();
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -342,8 +355,8 @@ static void __init relocate_initrd(void)
ramdisk_size  -= clen;
}
 
-   ramdisk_image = boot_params.hdr.ramdisk_image;
-   ramdisk_size  = boot_params.hdr.ramdisk_size;
+   ramdisk_image = get_ramdisk_image();
+   ramdisk_size  = get_ramdisk_size();
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -367,8 +380,8 @@ static u64 __init get_mem_size(unsigned long limit_pfn)
 static void __init early_reserve_initrd(void)
 {
/* Assume only end is not page aligned */
-   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
-   u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
+   u64 ramdisk_image = get_ramdisk_image();
+   u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
 
if (!boot_params.hdr.type_of_loader ||
@@ -380,8 +393,8 @@ static void __init early_reserve_initrd(void)
 static void __init reserve_initrd(void)
 {
/* Assume only end is not page aligned */
-   u64 ramdisk_image = boot_params.hdr.ramdisk_image;
-   u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
+   u64 ramdisk_image = get_ramdisk_image();
+   u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;
 
-- 
1.7.7


[PATCH v3 09/12] x86, boot: update cmd_line_ptr to unsigned long

2012-11-20 Thread Yinghai Lu

boot/compressed/misc.c can be built as 64bit code, and cmd_line_ptr
could then be above 4G.

So change the type to unsigned long instead, which is 64bit wide in
64bit code and 32bit wide in 32bit code.

Signed-off-by: Yinghai Lu 
---
 arch/x86/boot/boot.h|8 
 arch/x86/boot/cmdline.c |4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/boot.h b/arch/x86/boot/boot.h
index 7fadf80..5b75319 100644
--- a/arch/x86/boot/boot.h
+++ b/arch/x86/boot/boot.h
@@ -285,11 +285,11 @@ struct biosregs {
 void intcall(u8 int_no, const struct biosregs *ireg, struct biosregs *oreg);
 
 /* cmdline.c */
-int __cmdline_find_option(u32 cmdline_ptr, const char *option, char *buffer, int bufsize);
-int __cmdline_find_option_bool(u32 cmdline_ptr, const char *option);
+int __cmdline_find_option(unsigned long cmdline_ptr, const char *option, char *buffer, int bufsize);
+int __cmdline_find_option_bool(unsigned long cmdline_ptr, const char *option);
 static inline int cmdline_find_option(const char *option, char *buffer, int bufsize)
 {
-   u32 cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+   unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
 
    if (cmd_line_ptr >= 0x100000)
        return -1;  /* inaccessible */
@@ -299,7 +299,7 @@ static inline int cmdline_find_option(const char *option, char *buffer, int bufs
 
 static inline int cmdline_find_option_bool(const char *option)
 {
-   u32 cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+   unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
 
    if (cmd_line_ptr >= 0x100000)
        return -1;  /* inaccessible */
diff --git a/arch/x86/boot/cmdline.c b/arch/x86/boot/cmdline.c
index 768f00f..625d21b 100644
--- a/arch/x86/boot/cmdline.c
+++ b/arch/x86/boot/cmdline.c
@@ -27,7 +27,7 @@ static inline int myisspace(u8 c)
  * Returns the length of the argument (regardless of if it was
  * truncated to fit in the buffer), or -1 on not found.
  */
-int __cmdline_find_option(u32 cmdline_ptr, const char *option, char *buffer, int bufsize)
+int __cmdline_find_option(unsigned long cmdline_ptr, const char *option, char *buffer, int bufsize)
 {
addr_t cptr;
char c;
@@ -99,7 +99,7 @@ int __cmdline_find_option(u32 cmdline_ptr, const char *option, char *buffer, int
  * Returns the position of that option (starts counting with 1)
  * or 0 on not found
  */
-int __cmdline_find_option_bool(u32 cmdline_ptr, const char *option)
+int __cmdline_find_option_bool(unsigned long cmdline_ptr, const char *option)
 {
addr_t cptr;
char c;
-- 
1.7.7


[PATCH v3 10/12] x86: use io_remap to access real_mode_data

2012-11-20 Thread Yinghai Lu

When a 64bit bootloader puts the real mode data above 4G, we cannot
access it directly, because arch/x86/kernel/head_64.S only sets up
identity mapping for 0-1G and for the kernel code/data/bss.

So move the early_ioremap_init() call from setup_arch
to x86_64_start_kernel.

Also use rsi/rdi instead of esi/edi.

Signed-off-by: Yinghai Lu 
---
 arch/x86/kernel/head64.c  |   17 ++---
 arch/x86/kernel/head_64.S |4 ++--
 arch/x86/kernel/setup.c   |2 ++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 3ac6cad..735cd47 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -52,12 +52,21 @@ static void __init copy_bootdata(char *real_mode_data)
 {
char * command_line;
unsigned long cmd_line_ptr;
+   char *p;
 
-   memcpy(&boot_params, real_mode_data, sizeof boot_params);
+   /*
+    * for 64bit bootloader path, those data could be above 4G,
+    * and we do not set ident mapping for them in head_64.S.
+    * So need to ioremap to access them.
+    */
+   p = early_memremap((unsigned long)real_mode_data, sizeof(boot_params));
+   memcpy(&boot_params, p, sizeof(boot_params));
+   early_iounmap(p, sizeof(boot_params));
cmd_line_ptr = get_cmd_line_ptr();
if (cmd_line_ptr) {
-   command_line = __va(cmd_line_ptr);
+   command_line = early_memremap(cmd_line_ptr, COMMAND_LINE_SIZE);
memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
+   early_iounmap(command_line, COMMAND_LINE_SIZE);
}
 }
 
@@ -104,7 +113,9 @@ void __init x86_64_start_kernel(char * real_mode_data)
 
 void __init x86_64_start_reservations(char *real_mode_data)
 {
-   copy_bootdata(__va(real_mode_data));
+   early_ioremap_init();
+
+   copy_bootdata(real_mode_data);
 
memblock_reserve(__pa_symbol(&_text),
 __pa_symbol(&__bss_stop) - __pa_symbol(&_text));
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 32fa9d0..14c5de2 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -262,9 +262,9 @@ ENTRY(secondary_startup_64)
    movl    initial_gs+4(%rip),%edx
wrmsr   
 
-   /* esi is pointer to real mode structure with interesting info.
+   /* rsi is pointer to real mode structure with interesting info.
   pass it to C */
-   movl    %esi, %edi
+   movq    %rsi, %rdi

/* Finally jump to run C code and to be on real kernel address
 * Since we are running on identity-mapped space we have to jump
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 194e151..573fa7d7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -718,7 +718,9 @@ void __init setup_arch(char **cmdline_p)
 
early_trap_init();
early_cpu_init();
+#ifdef CONFIG_X86_32
early_ioremap_init();
+#endif
 
setup_olpc_ofw_pgd();
 
-- 
1.7.7


[PATCH v3 12/12] x86: remove 1024g limitation for kexec buffer on 64bit

2012-11-20 Thread Yinghai Lu

The 64bit kernel now supports more than 1T of RAM, and kexec tools
can find a buffer above 1T, so remove that obsolete limitation
and use MAXMEM instead.

Tested on a system with more than 1024G of RAM.
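
The arithmetic behind the old and new limits can be checked with a short userspace sketch. The old constant is 2^40 - 1, the last byte below 1 TiB. The value 46 for `MAX_PHYSMEM_BITS_SKETCH` is an assumption about the x86_64 configuration of that time, and `fits_under_limit` is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

#define OLD_KEXEC_LIMIT 0xFFFFFFFFFFULL	/* 2^40 - 1: last byte below 1 TiB */
#define MAX_PHYSMEM_BITS_SKETCH 46		/* assumed x86_64 value */
#define MAXMEM_SKETCH (1ULL << MAX_PHYSMEM_BITS_SKETCH)

/* True if a buffer whose last byte is at `end` lies within `limit`. */
static int fits_under_limit(uint64_t end, uint64_t limit)
{
	return end <= limit;
}
```

Under these assumptions a buffer at exactly 1 TiB is rejected by the old limit but accepted by MAXMEM-1.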

Signed-off-by: Yinghai Lu 
Cc: "Eric W. Biederman" 
---
 arch/x86/include/asm/kexec.h |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 317ff17..11bfdc5 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -48,11 +48,11 @@
 # define vmcore_elf_check_arch_cross(x) ((x)->e_machine == EM_X86_64)
 #else
 /* Maximum physical address we can use pages from */
-# define KEXEC_SOURCE_MEMORY_LIMIT  (0xFFUL)
+# define KEXEC_SOURCE_MEMORY_LIMIT  (MAXMEM-1)
 /* Maximum address we can reach in physical address mode */
-# define KEXEC_DESTINATION_MEMORY_LIMIT (0xFFUL)
+# define KEXEC_DESTINATION_MEMORY_LIMIT (MAXMEM-1)
 /* Maximum address we can use for the control pages */
-# define KEXEC_CONTROL_MEMORY_LIMIT (0xFFUL)
+# define KEXEC_CONTROL_MEMORY_LIMIT (MAXMEM-1)
 
 /* Allocate one page for the pdp and the second for the code */
 # define KEXEC_CONTROL_PAGE_SIZE  (4096UL + 4096UL)
-- 
1.7.7


[PATCH v3 04/12] x86, 64bit: add support for loading kernel above 512G

2012-11-20 Thread Yinghai Lu

Currently the kernel is not allowed to be loaded above 512g; it
considers such an address too big.

We only need to add one extra spare page for the needed level3 table
to point to another 512g range.

We need to check the _text range, set the level4 pg entry to point to
that spare level3 page, and set level3 to point to the level2 page, to
cover [_text, _end] with an extra mapping.

We need this to put a relocatable bzImage high above 512g.
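
The index arithmetic behind this can be sketched in C. The constants below are the usual 4-level x86_64 paging values (assumed here, not taken from the patch), and the sk_ helpers are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Assumed 4-level x86_64 paging constants (4 KiB base pages). */
#define SK_PMD_SHIFT   21   /* one level2 (PMD) entry maps 2 MiB   */
#define SK_PUD_SHIFT   30   /* one level3 (PUD) entry maps 1 GiB   */
#define SK_PGDIR_SHIFT 39   /* one level4 (PGD) entry maps 512 GiB */
#define SK_PTRS        512  /* entries per table at every level    */

static unsigned sk_pgd_index(uint64_t pa)
{
    return (unsigned)((pa >> SK_PGDIR_SHIFT) & (SK_PTRS - 1));
}

static unsigned sk_pud_index(uint64_t pa)
{
    return (unsigned)((pa >> SK_PUD_SHIFT) & (SK_PTRS - 1));
}

static unsigned sk_pmd_index(uint64_t pa)
{
    return (unsigned)((pa >> SK_PMD_SHIFT) & (SK_PTRS - 1));
}

/*
 * Nonzero when the load address falls outside the first 512 GiB, i.e.
 * when level4 slot 0 cannot cover it and a spare level3 page is needed.
 */
static int sk_needs_spare_level3(uint64_t pa)
{
    return sk_pgd_index(pa) != 0;
}
```

For a kernel loaded at 600 GiB, the level4 index is 1 and the level3 index is 88, which is the case the removed "address too large" check used to reject.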

Signed-off-by: Yinghai Lu 
Cc: "Eric W. Biederman" 
---
 arch/x86/kernel/head_64.S |   34 +++---
 1 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index efc0c08..32fa9d0 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -78,12 +78,6 @@ startup_64:
testl   %eax, %eax
jnz bad_address
 
-   /* Is the address too large? */
-   leaq    _text(%rip), %rdx
-   movq    $PGDIR_SIZE, %rax
-   cmpq    %rax, %rdx
-   jae bad_address
-
/* Fixup the physical addresses in the page table
 */
addq    %rbp, init_level4_pgt + 0(%rip)
@@ -102,12 +96,35 @@ startup_64:
andq    $PMD_PAGE_MASK, %rdi
 
movq    %rdi, %rax
+   shrq    $PGDIR_SHIFT, %rax
+   andq    $(PTRS_PER_PGD - 1), %rax
+   jz  skip_level3_spare
+
+   /* Set level3 at first */
+   leaq    (level3_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
+   leaq    init_level4_pgt(%rip), %rbx
+   movq    %rdx, 0(%rbx, %rax, 8)
+   addq    $L4_PAGE_OFFSET, %rax
+   movq    %rdx, 0(%rbx, %rax, 8)
+
+   /* always need to set level2 */
+   movq    %rdi, %rax
+   shrq    $PUD_SHIFT, %rax
+   andq    $(PTRS_PER_PUD - 1), %rax
+   leaq    level3_spare_pgt(%rip), %rbx
+   jmp set_level2_spare
+
+skip_level3_spare:
+   movq    %rdi, %rax
shrq    $PUD_SHIFT, %rax
andq    $(PTRS_PER_PUD - 1), %rax
jz  ident_complete
 
-   leaq    (level2_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
+   /* only set level2 with out level3 spare */
leaq    level3_ident_pgt(%rip), %rbx
+
+set_level2_spare:
+   leaq    (level2_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
movq    %rdx, 0(%rbx, %rax, 8)
 
movq    %rdi, %rax
@@ -435,6 +452,9 @@ NEXT_PAGE(level2_kernel_pgt)
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
KERNEL_IMAGE_SIZE/PMD_SIZE)
 
+NEXT_PAGE(level3_spare_pgt)
+   .fill   512, 8, 0
+
 NEXT_PAGE(level2_spare_pgt)
.fill   512, 8, 0
 
-- 
1.7.7


[PATCH v3 00/12] x86, boot, 64bit: Add support for loading ramdisk and bzImage high

2012-11-20 Thread Yinghai Lu

Currently the kdump reserved region is limited to under 896M because of
a kexec limitation, and the bzImage also needs to stay under 4g.

To let kexec/kdump use ranges above 4g, we need to allow the bzImage and
ramdisk to be loaded above 4g.
During boot the bzImage will be unpacked at the same position and stay high.

The patches add fields to the boot header to
1. get info about a ramdisk position above 4g from the bootloader/kexec
2. set xloadflags bit0 in the header for the bzImage, so that the
   bootloader/kexec can check it to decide whether to put the bzImage high.

These patches were tested with kexec tools with local changes, which have
been sent to the kexec list.

They can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git 
for-x86-boot

and it is on top of for-x86-mm

-v2: add ext_cmd_line_ptr support, and handle the case where
 boot_param/cmd_line is above 4G.
-v3: according to hpa, use xloadflags instead of code32_start_offset.
 0x200 will not be changed...

Thanks

Yinghai

Yinghai Lu (12):
  x86, boot: move verify_cpu.S after 0x200
  x86, boot: Move lldt/ltr out of 64bit code section
  x86, 64bit: set extra ident page table for whole kernel range
  x86, 64bit: add support for loading kernel above 512G
  x86: Merge early_reserve_initrd for 32bit and 64bit
  x86: add get_ramdisk_image/size
  x86, boot: add get_cmd_line_ptr()
  x86, boot: Don't check if cmd_line_ptr is accessible in misc/decompressor()
  x86, boot: update cmd_line_ptr to unsigned long
  x86: use io_remap to access real_mode_data
  x86, boot: add fields to support load bzImage and ramdisk high
  x86: remove 1024g limitation for kexec buffer on 64bit

 Documentation/x86/boot.txt |   40 +-
 arch/x86/boot/boot.h   |   18 +--
 arch/x86/boot/cmdline.c|   12 
 arch/x86/boot/compressed/cmdline.c |   13 +++-
 arch/x86/boot/compressed/head_64.S |   14 ++---
 arch/x86/boot/header.S |   16 +-
 arch/x86/include/asm/bootparam.h   |6 +++-
 arch/x86/include/asm/kexec.h   |6 ++--
 arch/x86/kernel/head32.c   |   11 ---
 arch/x86/kernel/head64.c   |   42 +--
 arch/x86/kernel/head_64.S  |   49 +--
 arch/x86/kernel/setup.c|   55 +--
 12 files changed, 212 insertions(+), 70 deletions(-)

-- 
1.7.7


[PATCH v3 11/12] x86, boot: add fields to support load bzImage and ramdisk high

2012-11-20 Thread Yinghai Lu

ext_ramdisk_image/size will record the high 32 bits of the ramdisk info.

xloadflags bit0 will be set if the kernel is relocatable with 64bit.

Let get_ramdisk_image/size use ext_ramdisk_image/size to get the
right position for the ramdisk.

The bootloader will fill in ext_ramdisk_image/size when it loads the
ramdisk high.

The bootloader will also check whether xloadflags bit0 is set to decide
if it can load the ramdisk high above 4G.

Update header version to 2.12.

-v2: add ext_cmd_line_ptr for above 4G support.
-v3: update to xloadflags from HPA
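
How the split fields combine can be sketched as follows. The struct is a trimmed, hypothetical stand-in for the setup header (only the fields discussed here, not the real layout), and the sk_ names and SK_ macros are illustrative; the version test mirrors the 0x020c check used for ext_cmd_line_ptr in this patch.

```c
#include <stdint.h>

#define SK_PROTO_2_12   0x020c  /* boot protocol 2.12 */
#define SK_XLF_ABOVE_4G 0x1     /* xloadflags bit 0   */

/* Hypothetical, trimmed view of the header fields involved. */
struct sk_hdr {
    uint16_t version;
    uint16_t xloadflags;
    uint32_t ramdisk_image;      /* low 32 bits                  */
    uint32_t ramdisk_size;       /* low 32 bits                  */
    uint32_t ext_ramdisk_image;  /* high 32 bits, protocol 2.12+ */
    uint32_t ext_ramdisk_size;   /* high 32 bits, protocol 2.12+ */
};

/* Kernel side: rebuild the full 64-bit ramdisk address. */
static uint64_t sk_ramdisk_image(const struct sk_hdr *h)
{
    uint64_t addr = h->ramdisk_image;

    if (h->version >= SK_PROTO_2_12)
        addr |= (uint64_t)h->ext_ramdisk_image << 32;
    return addr;
}

/* Bootloader side: may the kernel and ramdisk be placed above 4G? */
static int sk_may_load_high(const struct sk_hdr *h)
{
    return h->version >= SK_PROTO_2_12 &&
           (h->xloadflags & SK_XLF_ABOVE_4G) != 0;
}
```

With an older header version the ext_ fields are ignored, so an old bootloader that leaves them uninitialized does not produce a bogus high address.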

Signed-off-by: Yinghai Lu 
Cc: Rob Landley 
Cc: Matt Fleming 
---
 Documentation/x86/boot.txt |   40 +++-
 arch/x86/boot/compressed/cmdline.c |3 ++
 arch/x86/boot/header.S |   16 -
 arch/x86/include/asm/bootparam.h   |6 -
 arch/x86/kernel/head64.c   |3 ++
 arch/x86/kernel/setup.c|6 +
 6 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
index 9efceff..a8263f7 100644
--- a/Documentation/x86/boot.txt
+++ b/Documentation/x86/boot.txt
@@ -57,6 +57,9 @@ Protocol 2.10:(Kernel 2.6.31) Added a protocol for 
relaxed alignment
 Protocol 2.11: (Kernel 3.6) Added a field for offset of EFI handover
protocol entry point.
 
+Protocol 2.12: (Kernel 3.9) Added three fields for loading bzImage and
+ramdisk above 4G with 64bit.
+
  MEMORY LAYOUT
 
 The traditional memory map for the kernel loader, used for Image or
@@ -182,7 +185,7 @@ Offset  Proto   NameMeaning
 0230/4 2.05+   kernel_alignment Physical addr alignment required for kernel
 0234/1 2.05+   relocatable_kernel Whether kernel is relocatable or not
 0235/1 2.10+   min_alignment   Minimum alignment, as a power of two
-0236/2 N/A pad3Unused
+0236/2 2.12+   xloadflags  Boot protocol option flags
 0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
 023C/4 2.07+   hardware_subarch Hardware subarchitecture
 0240/8 2.07+   hardware_subarch_data Subarchitecture-specific data
@@ -193,6 +196,9 @@ Offset  Proto   NameMeaning
 0258/8 2.10+   pref_addressPreferred loading address
 0260/4 2.10+   init_size   Linear memory required during initialization
 0264/4 2.11+   handover_offset Offset of handover entry point
+0268/4 2.12+   ext_ramdisk_image ramdisk_image high 32 bits
+026C/4 2.12+   ext_ramdisk_size ramdisk_size high 32 bits
+0270/4 2.12+   ext_cmd_line_ptr cmd_line_ptr high 32 bits
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
 real value is 4.
@@ -581,6 +587,16 @@ Protocol:  2.10+
   misaligned kernel.  Therefore, a loader should typically try each
   power-of-two alignment from kernel_alignment down to this alignment.
 
+Field name: xloadflags
+Type:   modify (obligatory)
+Offset/size:0x236/2
+Protocol:   2.12+
+
+  This field is a bitmask.
+
+  Bit 0 (read): LOADED_ABOVE_4G
+- If 1, kernel/boot_params/cmdline/ramdisk could be above 4g
+
 Field name:cmdline_size
 Type:  read
 Offset/size:   0x238/4
@@ -707,6 +723,28 @@ Offset/size:   0x264/4
 
   See EFI HANDOVER PROTOCOL below for more details.
 
+Field name:ext_ramdisk_image
+Type:  write
+Offset/size:   0x268/4
+Protocol:  2.12+
+
+  The high 32-bit linear address of the initial ramdisk or ramfs.  Leave at
+  zero if there is no initial ramdisk/ramfs, or under 4G.
+
+Field name:ext_ramdisk_size
+Type:  write
+Offset/size:   0x26c/4
+Protocol:  2.12+
+
+  High 32-bit size of the initial ramdisk or ramfs.  Leave at zero if there
+  is no initial ramdisk/ramfs.
+
+Field name:ext_cmd_line_ptr
+Type:  write
+Offset/size:   0x270/4
+Protocol:  2.12+
+
+  cmd_line_ptr high 32 bits. Leave at zero if under 4G.
 
  THE IMAGE CHECKSUM
 
diff --git a/arch/x86/boot/compressed/cmdline.c 
b/arch/x86/boot/compressed/cmdline.c
index b4c913c..00678d3 100644
--- a/arch/x86/boot/compressed/cmdline.c
+++ b/arch/x86/boot/compressed/cmdline.c
@@ -17,6 +17,9 @@ static unsigned long get_cmd_line_ptr(void)
 {
unsigned long cmd_line_ptr = real_mode->hdr.cmd_line_ptr;
 
+   if (real_mode->hdr.version >= 0x020c)
+   cmd_line_ptr |= (u64)real_mode->hdr.ext_cmd_line_ptr << 32;
+
return cmd_line_ptr;
 }
 int cmdline_find_option(const char *option, char *buffer, int bufsize)
diff --git a/arch/x86/boot/header.S b/arch/x86/boot/header.S
index 2a01744..598cba5 100644
--- a/arch/x86/boot/header.S
+++ b/arch/x86/boot/header.S
@@ -279,7 +279,7 @@ _start:
# Part 2 of the header, from the old setup.S
 
.ascii  "HdrS"  # header signature
-   .word   0x020b  # header version number (>= 0x0105)
+   .word   0x020c  # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)

[PATCH v3 03/12] x86, 64bit: set extra ident page table for whole kernel range

2012-11-20 Thread Yinghai Lu

Currently, when the kernel is loaded above 1G, only [_text, _text+2M]
is set up with the extra ident page table.
That is not enough; some variables that could be used early are
out of that range (like the gdt...).

Just set up the mapping for [_text, _end], which includes text/data/bss/brk...
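
The loop the patch adds in assembly corresponds roughly to the C sketch below. It is an illustration only: the sk_ names are hypothetical, 'flags' stands in for the page attribute bits, and (as in the patch) the range is assumed not to cross a 1 GiB boundary.

```c
#include <stdint.h>

#define SK_PMD_SHIFT    21
#define SK_PMD_SIZE     (UINT64_C(1) << SK_PMD_SHIFT)  /* 2 MiB */
#define SK_PTRS_PER_PMD 512

/*
 * Fill one level2 table with 2 MiB identity mappings that cover
 * [text, end). Returns the number of entries written.
 * Assumes the whole range lies inside a single 1 GiB region.
 */
static int sk_map_kernel_range(uint64_t table[SK_PTRS_PER_PMD],
                               uint64_t text, uint64_t end,
                               uint64_t flags)
{
    uint64_t addr = text & ~(SK_PMD_SIZE - 1);
    unsigned first = (unsigned)((text >> SK_PMD_SHIFT) & (SK_PTRS_PER_PMD - 1));
    unsigned last = (unsigned)(((end - 1) >> SK_PMD_SHIFT) & (SK_PTRS_PER_PMD - 1));
    int written = 0;

    for (unsigned i = first; i <= last; i++) {
        table[i] = addr | flags;   /* map this 2 MiB chunk onto itself */
        addr += SK_PMD_SIZE;
        written++;
    }
    return written;
}
```

The old code wrote only the first entry; the loop extends the mapping through the entry that contains the last byte before _end.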

Signed-off-by: Yinghai Lu 
Cc: "Eric W. Biederman" 
---
 arch/x86/kernel/head_64.S |   11 ++-
 1 files changed, 10 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 94bf9cc..efc0c08 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -115,7 +115,16 @@ startup_64:
andq    $(PTRS_PER_PMD - 1), %rax
leaq    __PAGE_KERNEL_IDENT_LARGE_EXEC(%rdi), %rdx
leaq    level2_spare_pgt(%rip), %rbx
-   movq    %rdx, 0(%rbx, %rax, 8)
+   leaq    _end(%rip), %r8
+   decq    %r8
+   shrq    $PMD_SHIFT, %r8
+   andq    $(PTRS_PER_PMD - 1), %r8
+1: movq    %rdx, 0(%rbx, %rax, 8)
+   addq    $PMD_SIZE, %rdx
+   incq    %rax
+   cmp %r8, %rax
+   jle 1b
+
 ident_complete:
 
/*
-- 
1.7.7


[PATCH resend] TPM: Issue TPM_STARTUP at driver load if the TPM has not been started

2012-11-20 Thread Jason Gunthorpe

The TPM will respond to TPM_GET_CAP with TPM_ERR_INVALID_POSTINIT if
TPM_STARTUP has not been issued. Detect this and automatically
issue TPM_STARTUP.

This is for embedded applications where the kernel is the first thing
to touch the TPM.
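
The control flow this describes can be sketched in isolation. Everything below is a mock: sk_tpm, sk_get_cap and sk_startup are hypothetical stand-ins for the chip and for transmit_cmd-based helpers, and only the error value 38 comes from the patch.

```c
#define SK_TPM_ERR_INVALID_POSTINIT 38

/* Hypothetical stand-in for a TPM that may not have been started yet. */
struct sk_tpm {
    int started;
    int startup_calls;
};

/* Mock TPM_GET_CAP: fails with INVALID_POSTINIT until startup was issued. */
static int sk_get_cap(struct sk_tpm *t)
{
    return t->started ? 0 : SK_TPM_ERR_INVALID_POSTINIT;
}

/* Mock TPM_STARTUP(ST_CLEAR): always succeeds in this sketch. */
static int sk_startup(struct sk_tpm *t)
{
    t->startup_calls++;
    t->started = 1;
    return 0;
}

/* Query the capability; on INVALID_POSTINIT, start the TPM and retry once. */
static int sk_get_timeouts(struct sk_tpm *t)
{
    int rc = sk_get_cap(t);

    if (rc == SK_TPM_ERR_INVALID_POSTINIT) {
        if (sk_startup(t))
            return rc;          /* startup failed: report the first error */
        rc = sk_get_cap(t);     /* retry the original query */
    }
    return rc;
}
```

On a TPM that firmware already started, the startup path is never taken, so the change should be invisible on typical PC platforms.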

Signed-off-by: Jason Gunthorpe 
---
 drivers/char/tpm/tpm.c |   43 +++
 drivers/char/tpm/tpm.h |6 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

Discussion on this got a bit sidetracked talking about
suspend/resume.. To be clear, this fixes a real, serious, problem on
normal embedded cases where the kernel refuses to attach the TPM driver
at all.

Key, are you happy with this as-is for the next merge window?

This version is rebased and retested to 3.7-rc6. Tested on Atmel
and Nuvoton LPC TPMs on PPC32.

diff --git a/drivers/char/tpm/tpm.c b/drivers/char/tpm/tpm.c
index 93211df..c576718 100644
--- a/drivers/char/tpm/tpm.c
+++ b/drivers/char/tpm/tpm.c
@@ -468,7 +468,7 @@ static ssize_t transmit_cmd(struct tpm_chip *chip, struct 
tpm_cmd_t *cmd,
return -EFAULT;
 
err = be32_to_cpu(cmd->header.out.return_code);
-   if (err != 0)
+   if (err != 0 && desc)
dev_err(chip->dev, "A TPM error (%d) occurred %s\n", err, desc);
 
return err;
@@ -485,6 +485,16 @@ static const struct tpm_input_header tpm_getcap_header = {
.ordinal = TPM_ORD_GET_CAP
 };
 
+#define TPM_ORD_STARTUP cpu_to_be32(153)
+#define TPM_ST_CLEAR cpu_to_be16(1)
+#define TPM_ST_STATE cpu_to_be16(2)
+#define TPM_ST_DEACTIVATED cpu_to_be16(3)
+static const struct tpm_input_header tpm_startup_header = {
+   .tag = TPM_TAG_RQU_COMMAND,
+   .length = cpu_to_be32(12),
+   .ordinal = TPM_ORD_STARTUP
+};
+
 ssize_t tpm_getcap(struct device *dev, __be32 subcap_id, cap_t *cap,
   const char *desc)
 {
@@ -528,6 +538,15 @@ void tpm_gen_interrupt(struct tpm_chip *chip)
 }
 EXPORT_SYMBOL_GPL(tpm_gen_interrupt);
 
+static int tpm_startup(struct tpm_chip *chip, __be16 startup_type)
+{
+   struct tpm_cmd_t start_cmd;
+   start_cmd.header.in = tpm_startup_header;
+   start_cmd.params.startup_in.startup_type = startup_type;
+   return transmit_cmd(chip, &start_cmd, TPM_INTERNAL_RESULT_SIZE,
+   "attempting to start the TPM");
+}
+
 int tpm_get_timeouts(struct tpm_chip *chip)
 {
struct tpm_cmd_t tpm_cmd;
@@ -541,11 +560,27 @@ int tpm_get_timeouts(struct tpm_chip *chip)
tpm_cmd.params.getcap_in.cap = TPM_CAP_PROP;
tpm_cmd.params.getcap_in.subcap_size = cpu_to_be32(4);
tpm_cmd.params.getcap_in.subcap = TPM_CAP_PROP_TIS_TIMEOUT;
+   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);
 
-   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE,
-   "attempting to determine the timeouts");
-   if (rc)
+   if (rc == TPM_ERR_INVALID_POSTINIT) {
+   /* The TPM is not started, we are the first to talk to it.
+  Execute a startup command. */
+   dev_info(chip->dev, "Issuing TPM_STARTUP");
+   if (tpm_startup(chip, TPM_ST_CLEAR))
+   return rc;
+
+   tpm_cmd.header.in = tpm_getcap_header;
+   tpm_cmd.params.getcap_in.cap = TPM_CAP_PROP;
+   tpm_cmd.params.getcap_in.subcap_size = cpu_to_be32(4);
+   tpm_cmd.params.getcap_in.subcap = TPM_CAP_PROP_TIS_TIMEOUT;
+   rc = transmit_cmd(chip, &tpm_cmd, TPM_INTERNAL_RESULT_SIZE, 0);
+   }
+   if (rc) {
+   dev_err(chip->dev,
+   "A TPM error (%d) occurred attempting to determine the timeouts\n",
+   rc);
goto duration;
+   }
 
if (be32_to_cpu(tpm_cmd.header.out.return_code) != 0 ||
be32_to_cpu(tpm_cmd.header.out.length)
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 8ef7649..8971b12 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -47,6 +47,7 @@ enum tpm_addr {
 #define TPM_WARN_DOING_SELFTEST 0x802
 #define TPM_ERR_DEACTIVATED 0x6
 #define TPM_ERR_DISABLED0x7
+#define TPM_ERR_INVALID_POSTINIT 38
 
 #define TPM_HEADER_SIZE10
 extern ssize_t tpm_show_pubek(struct device *, struct device_attribute *attr,
@@ -291,6 +292,10 @@ struct tpm_getrandom_in {
__be32 num_bytes;
 }__attribute__((packed));
 
+struct tpm_startup_in {
+   __be16  startup_type;
+} __packed;
+
 typedef union {
struct  tpm_getcap_params_out getcap_out;
struct  tpm_readpubek_params_out readpubek_out;
@@ -301,6 +306,7 @@ typedef union {
struct  tpm_pcrextend_in pcrextend_in;
struct  tpm_getrandom_in getrandom_in;
struct  tpm_getrandom_out getrandom_out;
+   struct tpm_startup_in startup_in;
 } tpm_cmd_params;
 
 struct tpm_cmd_t {
-- 
1.7.4.1


Re: [GIT PULL] at91: tiny cleanup for 3.8

2012-11-20 Thread Olof Johansson

On Mon, Nov 19, 2012 at 06:38:24PM +0100, Nicolas Ferre wrote:
> Arnd, Olof,
> 
> A very little cleanup single patch for AT91.
> 
> The following changes since commit 77b67063bb6bce6d475e910d3b886a606d0d91f7:
> 
>   Linux 3.7-rc5 (2012-11-11 13:44:33 +0100)
> 
> are available in the git repository at:
> 
>   git://github.com/at91linux/linux-at91.git tags/at91-for-next-cleanup

Pulled, thanks.


-Olof

Re: [GIT PULL] at91: header fix for 3.8

2012-11-20 Thread Olof Johansson

On Mon, Nov 19, 2012 at 05:59:04PM +0100, Nicolas Ferre wrote:
> Arnd, Olof,
> 
> A little fix that goes on top of the modification of headers made
> by Jean-Christophe. Obviously, I have placed this one on top of your current
> at91/header branch.
> 
> The following changes since commit 75984df05d86956541795f01e62d7dc67bc522fd:
> 
>   arm: at91: move at91rm9200 rtc header in drivers/rtc (2012-11-06 20:30:52 
> +0800)
> 
> are available in the git repository at:
> 
>   git://github.com/at91linux/linux-at91.git tags/at91-header

Pulled, thanks.

-Olof

RE: [PATCH v2 0/3] mtd: nand: OMAP: ELM error correction support for BCH ecc

2012-11-20 Thread Philip, Avinash

On Mon, Nov 19, 2012 at 18:13:56, Philip, Avinash wrote:
> 
> On Thu, Nov 15, 2012 at 16:52:14, Artem Bityutskiy wrote:
> > On Wed, 2012-10-31 at 12:38 +0530, Philip, Avinash wrote:
> > > Support to use ELM as BCH 4 & 8 bit error correction module. Also 
> > > performance
> > > enhancement by adding single shot read_page and write_page functions for 
> > > the
> > > nand flashes with page size less than 4 KB.
> > > 
> > > ELM module can be used to correct errors reported by BCH 4, 8 & 16 bit
> > > ECC scheme. For now only 4 & 8 bit support is added.
> > > 
> > > BCH 4 & 8 bit error detection support is already available in mainline
> > > kernel and works with software error correction.
> > > 
> > > This series is based on [1] and tested with RFC: OMAP GPMC bindings
> > > patch series
> > > 
> > > 1. linux-next/20121030
> > 
> > Would you please re-send a version which cleanly applies to the
> > l2-mtd.git tree? This series has many conflicts. Thanks!
> 
> Artem,
> 
> Omap nand driver is being changed considerably with Afzal's omap-gpmc cleanup
> series for common arm zImage [2] and those changes move many
> of the nand related code from platform folders to omap nand driver.
> 
> Omap-gpmc changes are present in Tony's "omap-for-v3.8/cleanup-headers-gpmc"
> branch [3] and are present in linux-next also, but are not present in l2-mtd.
> Tony has a signed tag including the omap-gpmc cleanup series,
> "omap-for-v3.8/cleanup-headers-prepare-multiplatform-v3-signed" [4]
> 
> If this series is made over l2-mtd, it would cause a lot of conflicts
> with the omap-gpmc cleanup series.
> 
> I am not sure how this dependency has to be handled for this series,
> let me know whether you still want it to be made over l2-mtd?

Artem,

Is it possible for you to give ack for these patches so that these patches
can go in Tony's tree where Omap-gpmc changes are present?

Thanks
Avinash
> 
> 2. 
> http://markmail.org/message/ev67wm7irgc2qc5d#query:+page:1+mid:wgjdv6fsfghnua5z+state:results
> 3. 
> http://git.kernel.org/?p=linux/kernel/git/tmlind/linux-omap.git;a=shortlog;h=refs/heads/omap-for-v3.8/cleanup-headers-gpmc
> 4. 
> http://git.kernel.org/?p=linux/kernel/git/tmlind/linux-omap.git;a=tag;h=refs/tags/omap-for-v3.8/cleanup-headers-prepare-multiplatform-v3-signed
> 
> 
> Thanks
> Avinash
> 
> > 
> > git://git.infradead.org/users/dedekind/l2-mtd.git
> > 
> > -- 
> > Best Regards,
> > Artem Bityutskiy
> > 
> 
>

Re: [GIT PULL v2] at91: fixes for 3.7-rc7

2012-11-20 Thread Olof Johansson

Hi,

On Tue, Nov 20, 2012 at 09:59:27AM +0100, Nicolas Ferre wrote:
> Arnd, Olof,
> 
> Just for the record, I do not want to put pressure at a such late time in
> the 3.7-rc process. So, I just reworked that pull-request because the previous
> one was wrong:
> - wrong patch content (DT nodes with wrong size)
> - not all tags in patches (Jean-Christophe and Arnd tags were missing...)
> 
> Just to start from a sane base if I have to rebase this work for 3.8, I let 
> you know
> that I have updated this tag...
> 
> The following changes since commit 641f3ce64b050961d454a0716bb6dbf528315aac:
> 
>   ARM: at91/usbh: fix overcurrent gpio setup (2012-11-16 10:46:29 +0100)
> 
> are available in the git repository at:
> 
>   git://github.com/at91linux/linux-at91.git tags/at91-fixes

The new patches seem to belong in an at91/dt branch, not in a fixes one.

I can pull in the previous fixes branch as an at91/fixes-non-critical for 3.8
if you want. There's no need to rebase them for this, is there? What is the
pinctrl dependency that you are talking about, are some of these patches needed
as prerequisites for pinctrl changes or the other way around?

Sorry if I've missed more elaborate emails on this and am asking repeat
questions. ;)

-Olof

[PATCH v3] TPM: Provide a tpm_tis OF driver

2012-11-20 Thread Jason Gunthorpe

This provides an Open Firmware driver binding for tpm_tis. OF
is useful on arches where PNP is not used.

Allow the tpm_tis driver to be selected if PNP or OF are compiled in.

Signed-off-by: Jason Gunthorpe 
---
 drivers/char/tpm/Kconfig   |2 +-
 drivers/char/tpm/tpm_tis.c |   76 ---
 2 files changed, 64 insertions(+), 14 deletions(-)

v3 changes
 - Rebase and retest on PPC against 3.7-rc6
 - Compile test on x86-64
 - Include of.h so that of_match_ptr works when CONFIG_OF=n
 - Drop the errant consts

Sorry for the long delay getting these fixes turned around.

diff --git a/drivers/char/tpm/Kconfig b/drivers/char/tpm/Kconfig
index 915875e..31faef6 100644
--- a/drivers/char/tpm/Kconfig
+++ b/drivers/char/tpm/Kconfig
@@ -26,7 +26,7 @@ if TCG_TPM
 
 config TCG_TIS
tristate "TPM Interface Specification 1.2 Interface"
-   depends on X86
+   depends on X86 || OF
---help---
  If you have a TPM security chip that is compliant with the
  TCG TIS 1.2 TPM specification say Yes and it will be accessible
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 6bdf267..1ebba01 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -27,6 +27,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "tpm.h"
 
 enum tis_access {
@@ -507,7 +509,7 @@ module_param(interrupts, bool, 0444);
 MODULE_PARM_DESC(interrupts, "Enable interrupts");
 
 static int tpm_tis_init(struct device *dev, resource_size_t start,
-   resource_size_t len, unsigned int irq)
+   resource_size_t len, int irq, int irq_autoprobe)
 {
u32 vendor, intfcaps, intmask;
int rc, i, irq_s, irq_e, probe;
@@ -605,9 +607,12 @@ static int tpm_tis_init(struct device *dev, 
resource_size_t start,
iowrite32(intmask,
  chip->vendor.iobase +
  TPM_INT_ENABLE(chip->vendor.locality));
-   if (interrupts)
-   chip->vendor.irq = irq;
-   if (interrupts && !chip->vendor.irq) {
+   if (!interrupts) {
+   irq = 0;
+   irq_autoprobe = 0;
+   }
+   chip->vendor.irq = irq;
+   if (irq == 0 && irq_autoprobe) {
irq_s =
ioread8(chip->vendor.iobase +
TPM_INT_VECTOR(chip->vendor.locality));
@@ -740,13 +745,11 @@ static int __devinit tpm_tis_pnp_init(struct pnp_dev 
*pnp_dev,
 
if (pnp_irq_valid(pnp_dev, 0))
irq = pnp_irq(pnp_dev, 0);
-   else
-   interrupts = 0;
 
if (is_itpm(pnp_dev))
itpm = 1;
 
-   return tpm_tis_init(&pnp_dev->dev, start, len, irq);
+   return tpm_tis_init(&pnp_dev->dev, start, len, irq, 0);
 }
 
 static int tpm_tis_pnp_suspend(struct pnp_dev *dev, pm_message_t msg)
@@ -822,12 +825,52 @@ static int tpm_tis_resume(struct device *dev)
 
 static SIMPLE_DEV_PM_OPS(tpm_tis_pm, tpm_pm_suspend, tpm_tis_resume);
 
-static struct platform_driver tis_drv = {
+#ifdef CONFIG_OF
+static const struct of_device_id tis_of_platform_match[] __devinitdata = {
+   {.compatible = "tcg,tpm_tis"},
+   {},
+};
+MODULE_DEVICE_TABLE(of, tis_of_platform_match);
+
+static int __devinit tis_of_probe_one(struct platform_device *pdev)
+{
+   const struct resource *res;
+   int irq;
+
+   if (!pdev->dev.of_node)
+   return -ENODEV;
+
+   res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+   if (!res)
+   return -ENODEV;
+
+   irq = platform_get_irq(pdev, 0);
+   if (irq < 0)
+   irq = 0;
+   return tpm_tis_init(&pdev->dev, res->start, res->end - res->start + 1,
+   irq, 0);
+}
+
+static int __devexit tis_of_remove_one(struct platform_device *odev)
+{
+   struct tpm_chip *chip = dev_get_drvdata(&odev->dev);
+   tpm_dev_vendor_release(chip);
+   kfree(chip);
+   return 0;
+}
+#endif
+
+static struct platform_driver tis_driver = {
.driver = {
.name = "tpm_tis",
.owner  = THIS_MODULE,
.pm = &tpm_tis_pm,
+   .of_match_table = of_match_ptr(tis_of_platform_match),
},
+#ifdef CONFIG_OF
+   .probe = tis_of_probe_one,
+   .remove = __devexit_p(tis_of_remove_one)
+#endif
 };
 
 static struct platform_device *pdev;
@@ -843,15 +886,22 @@ static int __init init_tis(void)
return pnp_register_driver(&tis_pnp_driver);
 #endif
 
-   rc = platform_driver_register(&tis_drv);
+   rc = platform_driver_register(&tis_driver);
if (rc < 0)
return rc;
-   if (IS_ERR(pdev=platform_device_register_simple("tpm_tis", -1, NULL, 0)))
+   /* TIS_MEM_BASE is only going to work on x86.. */
+#ifndef CONFIG_OF
+   pdev = platform_device_register_simple("tpm_tis", -1, NULL, 0);
+   if (IS_ERR(pdev)) {
+   platform_driver_unregister(&tis_driver);

Re: [PATCH v4 4/4] input: misc: introduce retu-pwrbutton

2012-11-20 Thread Dmitry Torokhov

Hi Aaro,

On Sun, Nov 18, 2012 at 06:36:22PM +0200, Aaro Koskinen wrote:
> Add Retu power button driver.
>

This patch (with minor edits) has been queued for 3.8.

Thanks!

-- 
Dmitry

[PATCH] lib/vsprintf.c: Fix handling of %zd when using ssize_t

2012-11-20 Thread Jason Gunthorpe

Documentation/printk-formats.txt says to use %zd for a ssize_t argument
and some drivers do. Unfortunately this prints a positive number for
negative values, e.g.:

tpm_tis 7003.tpm_tis: tpm_transmit: tpm_send: error 4294967234

Add a case that fetches the argument with va_arg as ssize_t if the
interpretation should be signed.
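
The effect can be reproduced outside the kernel. The sketch below emulates a 32-bit size_t/ssize_t with fixed-width types so it behaves the same on any host; the sk_ names are illustrative and this is not the vsnprintf code itself.

```c
#include <stdint.h>

/* Emulate the 32-bit types of a PPC32 build on any host. */
typedef uint32_t sk_size_t;
typedef int32_t  sk_ssize_t;

/* Old behaviour: the value is always fetched as unsigned, then widened. */
static long long sk_widen_old(sk_ssize_t v)
{
    return (long long)(sk_size_t)v;
}

/* Patched behaviour for signed conversions: sign-extend when widening. */
static long long sk_widen_signed(sk_ssize_t v)
{
    return (long long)v;
}
```

Passing -62 through the old path yields 4294967234, the value in the log line above; the signed path keeps -62.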

Tested on PPC32.

Signed-off-by: Jason Gunthorpe 
---
 lib/vsprintf.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 39c99fe..41da074 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -1485,7 +1485,10 @@ int vsnprintf(char *buf, size_t size, const char *fmt, 
va_list args)
num = va_arg(args, long);
break;
case FORMAT_TYPE_SIZE_T:
-   num = va_arg(args, size_t);
+   if (spec.flags & SIGN)
+   num = va_arg(args, ssize_t);
+   else
+   num = va_arg(args, size_t);
break;
case FORMAT_TYPE_PTRDIFF:
num = va_arg(args, ptrdiff_t);
-- 
1.7.4.1


Re: fadvise interferes with readahead

2012-11-20 Thread Jaegeuk Hanse


On 11/20/2012 11:15 PM, Fengguang Wu wrote:

On Tue, Nov 20, 2012 at 10:11:54PM +0800, Jaegeuk Hanse wrote:

On 11/20/2012 04:04 PM, Fengguang Wu wrote:

Hi Claudio,

Thanks for the detailed problem description!

Hi Fengguang,

Another question, thanks in advance.

What's the meaning of interleaved reads? If the first process

It's access patterns like

 1, 1001, 2, 1002, 3, 1003, ...

in which there are two (or more) mixed sequential read streams.


readahead from start ~ start + size - async_size, another process
read start + size - async_size + 1, then what will happen? It seems
that variable hit_readahead_marker is false, and the related code can't
run. What am I missing?

Yes hit_readahead_marker will be false. However on reading 1002,
hit_readahead_marker()/count_history_pages() will find the previous
page 1001 already in page cache and trigger context readahead.
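
The history check can be illustrated with a toy model. This is a simplified sketch, not the kernel's implementation: the page cache is a plain byte array and sk_count_history is a hypothetical helper.

```c
#include <stddef.h>

/*
 * Count how many pages immediately before 'offset' are already cached,
 * looking back at most 'max' pages. cached[i] != 0 means page i is in
 * the (toy) page cache.
 */
static size_t sk_count_history(const unsigned char *cached,
                               size_t offset, size_t max)
{
    size_t n = 0;

    while (n < max && n < offset && cached[offset - 1 - n])
        n++;
    return n;
}
```

In the interleaved pattern above, a miss on page 1002 finds page 1001 cached, so the count is nonzero and the stream can be treated as sequential even without a readahead marker hit.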


Hi Fengguang,

Thanks for your explanation. The comment in function
ondemand_readahead says "Hit a marked page without valid readahead state".
What's the meaning of "without valid readahead state"?


Regards,
Jaegeuk



Thanks,
Fengguang


On Fri, Nov 09, 2012 at 04:30:32PM -0300, Claudio Freire wrote:

Hi. First of all, I'm not subscribed to this list, so I'd suggest all
replies copy me personally.

I have been trying to implement some I/O pipelining in Postgres (ie:
read the next data page asynchronously while working on the current
page), and stumbled upon some puzzling behavior involving the
interaction between fadvise and readahead.

I'm running kernel 3.0.0 (debian testing), on a single-disk system
which, though unsuitable for database workloads, is slow enough to let
me experiment with these read-ahead issues.

Typical random I/O performance is on the order of 150 r/s to
200 r/s (ballpark 7200rpm I'd say), with throughput around 1.5MB/s.
Sequential I/O can go up to 60MB/s, though it tends to be around 50.

Now onto the problem. In order to parallelize I/O with computation,
I've made postgres fadvise(willneed) the pages it will read next. How
far ahead is configurable, and I've tested with a number of
configurations.

The prefetching logic is aware of the OS and pg-specific cache, so it
will only fadvise a block once. fadvise calls will stay 1 (or a
configurable N) real I/O ahead of read calls, and there's no fadvising
of pages that won't be read eventually, in the same order. I checked
with strace.

However, performance when fadvising drops considerably for a specific
yet common access pattern:

When a nested loop with two index scans happens, access is random
locally, but eventually whole ranges of a file get read (in this
random order). Think block "1 6 8 100 34 299 3 7 68 24" followed by "2
4 5 101 298 301". Though random, there are ranges there that can be
merged in one read-request.

The kernel seems to do the merge by applying some form of readahead,
not sure if it's context, ondemand or adaptive readahead on the 3.0.0
kernel. Anyway, it seems to do readahead, as iostat says:

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00    4.40  224.20   2.00   4.16   0.03     37.86      1.91   8.43     8.00    56.80   4.40  99.44

(notice the avgrq-sz of 37.8)

With fadvise calls, the thing looks a lot different:

Device:  rrqm/s  wrqm/s     r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0.00   18.00  226.80   1.00   1.80   0.07     16.81      4.00  17.52    17.23    82.40   4.39  99.92

FYI, there is a readahead tracing/stats patchset that can provide far
more accurate numbers about what's going on with readahead, which will
help eliminate a lot of the guesswork here.

https://lwn.net/Articles/472798/


Notice the avgrq-sz of 16.8. Assuming it's 512-byte sectors, that's
spot-on with a postgres page (8k). So, fadvise seems to carry out the
requests verbatim, while read manages to merge at least two of them.
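
The arithmetic behind that can be checked with a few lines of C. This is only a sketch under the same assumption of 512-byte sectors; the helper names are invented here, and the second helper treats iostat's "MB" as 1024 KiB, which is also an assumption.

```c
/* Convert an iostat avgrq-sz value (in sectors) to KiB per request,
 * assuming 512-byte sectors. */
double avgrq_kib(double avgrq_sectors)
{
    return avgrq_sectors * 512.0 / 1024.0;
}

/* Cross-check from the read columns alone: average KiB per read request,
 * given rMB/s and r/s (assuming 1 MB = 1024 KiB). */
double kib_per_read(double rmb_per_s, double reads_per_s)
{
    return rmb_per_s * 1024.0 / reads_per_s;
}
```

With the numbers above, avgrq_kib(16.81) is about 8.4 KiB and kib_per_read(1.80, 226.80) is about 8.1 KiB for the fadvise run, versus roughly 18.9 KiB and 19.0 KiB for the plain read run (37.86 sectors; 4.16 MB/s at 224.20 r/s). Both views agree that the fadvise run issues one 8k page per request.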

The random nature of reads makes me think the scheduler is failing to
merge the requests in both cases (rrqm/s = 0), because it only looks
at successive requests (I'm only guessing here though).

I guess it's not a merging problem, but that the kernel readahead code
manages to submit larger IO requests in the first place.


Looking into the kernel code, it seems the problem could be related to
how fadvise works in conjunction with readahead. fadvise seems to call
the function in readahead.c that schedules the asynchronous I/O[0]. It
doesn't seem subject to readahead logic itself[1], which in and of itself
doesn't seem bad. But it does, I assume (not knowing the code that
well), prevent the readahead logic[2] from eventually seeing the pattern. It
effectively disables readahead altogether.

You are right. If user space does fadvise() and the fadvised pages
cover all read() pages, the kernel readahead code will not run at all.

So the title is actually a bit misleading. The kernel

Re: [PATCH v3] devtmpfs: mount with noexec and nosuid

2012-11-20 Thread Roland Eggner

On 2012-11-20 Tuesday at 13:50 -0800 Kees Cook wrote:
> Since devtmpfs is writable, make the default noexec,nosuid as well. This
> protects from the case of a privileged process having an arbitrary file
> write flaw and an argumentless arbitrary execution (i.e. it would lack
> the ability to run "mount -o remount,exec,suid /dev").
> 
> Rather than relying on userspace "mount -o remount,noexec,nosuid /dev",
> accomplish this from the kernel. This means no additional exec during
> (potentially time-sensitive) boot is needed. The kernel is responsible
> for this mount, so the mount flags should be configurable.
> 
> Cc: ellyjo...@chromium.org
> Cc: Kay Sievers 
> Cc: Roland Eggner 
> Signed-off-by: Kees Cook 
> 
> ---
> v3:
> - use a single define for the mount flags, suggested by Greg K.H.
> v2:
> - use CONFIG_DEVTMPFS_SAFE to wrap the logic.
> ---
>  drivers/base/Kconfig|   12 
>  drivers/base/devtmpfs.c |   11 +--
>  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index b34b5cd..a37fcf2 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -56,6 +56,18 @@ config DEVTMPFS_MOUNT
> rescue mode with init=/bin/sh, even when the /dev directory
> on the rootfs is completely empty.
>  
> +config DEVTMPFS_SAFE

Can we afford 2 additional characters and name it “DEVTMPFS_NOEXEC”?

> + bool "Use nosuid,noexec mount options on devtmpfs"
> + depends on DEVTMPFS
> + help
> +   This instructs the kernel to include the MS_NOEXEC and
> +   MS_NOSUID mount flags when mounting devtmpfs. This prevents
> +   certain kinds of code-execution attacks on embedded platforms.
> +
> +   Notice: If enabled, things like /dev/mem cannot be mmapped
> +   with the PROT_EXEC flag. This can break, for example, non-KMS
> +   video drivers.
Proposal:
help
  This instructs the kernel to include the MS_NOEXEC and MS_NOSUID mount
  flags when mounting devtmpfs.
  In-kernel separation of executable and non-executable code combined
  with a proper executability policy is a basic technique to protect
  against exploits by buggy or malicious code or hardware errors.  In
  terms of overhead it is a low-cost-high-effect technique especially on
  platforms with dedicated hardware support, e.g. x86_64 (look for "NX"
  feature in BIOS settings).  Mounting devtmpfs with MS_NOEXEC flag is
  an essential building-block for this security technique.

  Notice:  If enabled, software which depends on execution of
  runtime-generated code can only be used with restricted feature set or
  not at all, e.g. proprietary video drivers, JIT-compilers, most modern
  web browsers.  The grsecurity-patchset provides exception mechanisms to
  solve this problem for e.g. desktop systems.

  For server and embedded systems with HA-requirements consider Y.
  For desktop systems say N unless you know what you do.

Apart from that …
Acked-by: Roland Eggner

> +
>  config STANDALONE
>   bool "Select only drivers that don't need compile-time external 
> firmware" if EXPERIMENTAL
>   default y
> diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> index 147d1a4..e44ca1d 100644
> --- a/drivers/base/devtmpfs.c
> +++ b/drivers/base/devtmpfs.c
> @@ -25,6 +25,12 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_DEVTMPFS_SAFE
> +# define DEVTMPFS_MFLAGS (MS_SILENT | MS_NOEXEC | MS_NOSUID)
> +#else
> +# define DEVTMPFS_MFLAGS MS_SILENT
> +#endif
> +
>  static struct task_struct *thread;
>  
>  #if defined CONFIG_DEVTMPFS_MOUNT
> @@ -347,7 +353,8 @@ int devtmpfs_mount(const char *mntdir)
>   if (!thread)
>   return 0;
>  
> - err = sys_mount("devtmpfs", (char *)mntdir, "devtmpfs", MS_SILENT, 
> NULL);
> + err = sys_mount("devtmpfs", (char *)mntdir, "devtmpfs",
> + DEVTMPFS_MFLAGS, NULL);
>   if (err)
>   printk(KERN_INFO "devtmpfs: error mounting %i\n", err);
>   else
> @@ -372,7 +379,7 @@ static int devtmpfsd(void *p)
>   *err = sys_unshare(CLONE_NEWNS);
>   if (*err)
>   goto out;
> - *err = sys_mount("devtmpfs", "/", "devtmpfs", MS_SILENT, options);
> + *err = sys_mount("devtmpfs", "/", "devtmpfs", DEVTMPFS_MFLAGS, options);
>   if (*err)
>   goto out;
>   sys_chdir("/.."); /* will traverse into overmounted root */
> -- 
> 1.7.9.5
> 
> 
> -- 
> Kees Cook
> Chrome OS Security



[PATCH] gpiolib: Fix use after free in gpiochip_add_pin_range

2012-11-20 Thread Axel Lin

This is introduced by commit 9ab6e988
"gpiolib: return any error code from range creation".

Signed-off-by: Axel Lin 
---
This patch is against LinusW's linux-pinctrl tree, for-next branch.
Axel
 drivers/gpio/gpiolib.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index 317ff04..8370214 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1201,6 +1201,7 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, const 
char *pinctl_name,
   unsigned int npins)
 {
struct gpio_pin_range *pin_range;
+   int ret;
 
pin_range = kzalloc(sizeof(*pin_range), GFP_KERNEL);
if (!pin_range) {
@@ -1219,10 +1220,11 @@ int gpiochip_add_pin_range(struct gpio_chip *chip, 
const char *pinctl_name,
pin_range->pctldev = pinctrl_find_and_add_gpio_range(pinctl_name,
&pin_range->range);
if (IS_ERR(pin_range->pctldev)) {
+   ret = PTR_ERR(pin_range->pctldev);
pr_err("%s: GPIO chip: could not create pin range\n",
   chip->label);
kfree(pin_range);
-   return PTR_ERR(pin_range->pctldev);
+   return ret;
}
pr_debug("%s: GPIO chip: created GPIO range %d->%d ==> PIN %d->%d\n",
 chip->label, offset, offset + npins - 1,
-- 
1.7.9.5
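
For anyone skimming the diff, the pattern being fixed can be reproduced in plain userspace C. The sketch below is not gpiolib code: the struct, the `simulate_failure` flag and the function name are stand-ins invented for illustration. It shows the same ordering rule the patch applies, which is to copy the error code out of the object before freeing it.

```c
#include <errno.h>
#include <stdlib.h>

struct pin_range {
    int err;   /* stands in for the error encoded in pin_range->pctldev */
};

/*
 * Returns 0 on success or a negative errno on failure. The error code is
 * copied out of the object before the object is freed; reading r->err
 * after free(r) would be a use after free.
 */
int add_pin_range(int simulate_failure)
{
    struct pin_range *r = calloc(1, sizeof(*r));
    int ret;

    if (!r)
        return -ENOMEM;

    r->err = simulate_failure ? -ENODEV : 0;
    if (r->err) {
        ret = r->err;   /* save the value first ... */
        free(r);        /* ... then release the object */
        return ret;     /* r is never touched again */
    }

    free(r);            /* sketch only: real code would keep the range */
    return 0;
}
```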



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/5] x86: Get pg_data_t's memory from other node

2012-11-20 Thread Yasuaki Ishimatsu

Hi Tang,

2012/11/21 14:58, Tang Chen wrote:
> Hi Ishimatsu-san,
> 
> Thanks for the comments.
> 
> And I also found the some algorithm problems in patch2 ~ patch3.
> I am working on it, and a v2 patchset is coming soon. :)

O.K.
I'm waiting for the new patch-set.

Thanks,
Yasuaki Ishimatsu

> 
> Thanks.
> 
> On 11/21/2012 01:46 PM, Yasuaki Ishimatsu wrote:
>> Hi Tang,
>>
>> 2012/11/19 23:27, Tang Chen wrote:
>>> From: Yasuaki Ishimatsu
>>>
>>> If system can create movable node which all memory of the
>>> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
>>> allocate memory for the node's pg_data_t.
>>> So when memblock_alloc_nid() fails, setup_node_data() retries
>>> memblock_alloc().
>>>
>>> Signed-off-by: Yasuaki Ishimatsu
>>> Signed-off-by: Lai Jiangshan
>>> Signed-off-by: Tang Chen
>>> Reviewed-by: Wen Congyang
>>> Tested-by: Lin Feng
>>> ---
>>> arch/x86/mm/numa.c |9 +++--
>>> 1 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>>> index 2d125be..ae2e76e 100644
>>> --- a/arch/x86/mm/numa.c
>>> +++ b/arch/x86/mm/numa.c
>>> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, 
>>> u64 end)
>>> } else {
>>> nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, 
>>> nid);
>>> if (!nd_pa) {
>>> -   pr_err("Cannot find %zu bytes in node %d\n",
>>
>>> +   printk(KERN_WARNING "Cannot find %zu bytes in node 
>>> %d\n",
>>>nd_size, nid)
>>
>> Please change to use pr_warn().
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>>> -   return;
>>> +   nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
>>> +   if (!nd_pa) {
>>> +   pr_err("Cannot find %zu bytes in other node\n",
>>> +  nd_size);
>>> +   return;
>>> +   }
>>> }
>>> nd = __va(nd_pa);
>>> }
>>>
>>
>>
>>
> 



Re: [PATCH 2/5] page_alloc: Add movablecore_map boot option.

2012-11-20 Thread Yasuaki Ishimatsu

Hi Tang,

When I applied the patch, the following error occurred.

mm/page_alloc.c: In function ‘insert_movablecore_map’:
mm/page_alloc.c:5061: error: label ‘out’ used but not defined

Thanks,
Yasuaki Ishimatsu

2012/11/19 23:27, Tang Chen wrote:
> This patch adds functions to parse movablecore_map boot option. Since the
> option could be specified more then once, all the maps will be stored in
> the global variable movablecore_map.map array.
> 
> And also, we keep the array in monotonic increasing order by start_pfn.
> And merge all overlapped ranges.
> 
> Signed-off-by: Tang Chen 
> Reviewed-by: Wen Congyang 
> Tested-by: Lin Feng 
> ---
>   Documentation/kernel-parameters.txt |   17 
>   include/linux/mm.h  |   11 +++
>   mm/page_alloc.c |  146 
> +++
>   3 files changed, 174 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 9776f06..0718976 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1620,6 +1620,23 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   that the amount of memory usable for all allocations
>   is not too small.
>   
> + movablecore_map=nn[KMG]@ss[KMG]
> + [KNL,X86,IA-64,PPC] This parameter is similar to
> + memmap except it specifies the memory map of
> + ZONE_MOVABLE.
> + If more areas are all within one node, then from
> + lowest ss to the end of the node will be ZONE_MOVABLE.
> + If an area covers two or more nodes, the area from
> + ss to the end of the 1st node will be ZONE_MOVABLE,
> + and all the rest nodes will only have ZONE_MOVABLE.
> + If memmap is specified at the same time, the
> + movablecore_map will be limited within the memmap
> + areas. If kernelcore or movablecore is also specified,
> + movablecore_map will have higher priority to be
> + satisfied. So the administrator should be careful that
> + the amount of movablecore_map areas are not too large.
> + Otherwise kernel won't have enough memory to start.
> +
>   MTD_Partition=  [MTD]
>   Format: ,,,
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa06804..e4541b4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1328,6 +1328,17 @@ extern void free_bootmem_with_active_regions(int nid,
>   unsigned long max_low_pfn);
>   extern void sparse_memory_present_with_active_regions(int nid);
>   
> +#define MOVABLECORE_MAP_MAX MAX_NUMNODES
> +struct movablecore_entry {
> + unsigned long start;/* start pfn of memory segment */
> + unsigned long end;  /* end pfn of memory segment */
> +};
> +
> +struct movablecore_map {
> + __u32 nr_map;
> + struct movablecore_entry map[MOVABLECORE_MAP_MAX];
> +};
> +
>   #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>   
>   #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b74de6..198106f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -198,6 +198,9 @@ static unsigned long __meminitdata nr_all_pages;
>   static unsigned long __meminitdata dma_reserve;
>   
>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +/* Movable memory segments, will also be used by memblock subsystem. */
> +struct movablecore_map movablecore_map;
> +
>   static unsigned long __meminitdata 
> arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
>   static unsigned long __meminitdata 
> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>   static unsigned long __initdata required_kernelcore;
> @@ -4986,6 +4989,149 @@ static int __init cmdline_parse_movablecore(char *p)
>   early_param("kernelcore", cmdline_parse_kernelcore);
>   early_param("movablecore", cmdline_parse_movablecore);
>   
> +/**
> + * insert_movablecore_map - Insert a memory range in to movablecore_map.map.
> + * @start_pfn: start pfn of the range
> + * @end_pfn: end pfn of the range
> + *
> + * This function will also merge the overlapped ranges, and sort the array
> + * by start_pfn in monotonic increasing order.
> + */
> +static void __init insert_movablecore_map(unsigned long start_pfn,
> +   unsigned long end_pfn)
> +{
> + int i, pos_start, pos_end, remove;
> + bool merge = true;
> +
> + if (!movablecore_map.nr_map) {
> + movablecore_map.map[0].start = start_pfn;
> + movablecore_map.map[0].end = end_pfn;
> + movablecore_map.nr_map++;
> + return;
> + }
> +
> + /*
> +  * pos_start at the 1st overlapped

Re: [PATCH 1/5] x86: Get pg_data_t's memory from other node

2012-11-20 Thread Tang Chen

Hi Ishimatsu-san,

Thanks for the comments.

And I also found some algorithm problems in patch2 ~ patch3.
I am working on it, and a v2 patchset is coming soon. :)

Thanks.

On 11/21/2012 01:46 PM, Yasuaki Ishimatsu wrote:
> Hi Tang,
> 
> 2012/11/19 23:27, Tang Chen wrote:
>> From: Yasuaki Ishimatsu
>>
>> If system can create movable node which all memory of the
>> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
>> allocate memory for the node's pg_data_t.
>> So when memblock_alloc_nid() fails, setup_node_data() retries
>> memblock_alloc().
>>
>> Signed-off-by: Yasuaki Ishimatsu
>> Signed-off-by: Lai Jiangshan
>> Signed-off-by: Tang Chen
>> Reviewed-by: Wen Congyang
>> Tested-by: Lin Feng
>> ---
>>arch/x86/mm/numa.c |9 +++--
>>1 files changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index 2d125be..ae2e76e 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, 
>> u64 end)
>>  } else {
>>  nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>>  if (!nd_pa) {
>> -pr_err("Cannot find %zu bytes in node %d\n",
> 
>> +printk(KERN_WARNING "Cannot find %zu bytes in node 
>> %d\n",
>> nd_size, nid)
> 
> Please change to use pr_warn().
> 
> Thanks,
> Yasuaki Ishimatsu
> 
>> -return;
>> +nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
>> +if (!nd_pa) {
>> +pr_err("Cannot find %zu bytes in other node\n",
>> +   nd_size);
>> +return;
>> +}
>>  }
>>  nd = __va(nd_pa);
>>  }
>>
> 
> 
> 


Re: [PATCH 1/5] x86: Get pg_data_t's memory from other node

2012-11-20 Thread Yasuaki Ishimatsu

Hi Tang,

2012/11/19 23:27, Tang Chen wrote:
> From: Yasuaki Ishimatsu 
> 
> If system can create movable node which all memory of the
> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for the node's pg_data_t.
> So when memblock_alloc_nid() fails, setup_node_data() retries
> memblock_alloc().
> 
> Signed-off-by: Yasuaki Ishimatsu 
> Signed-off-by: Lai Jiangshan 
> Signed-off-by: Tang Chen 
> Reviewed-by: Wen Congyang 
> Tested-by: Lin Feng 
> ---
>   arch/x86/mm/numa.c |9 +++--
>   1 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2d125be..ae2e76e 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -224,9 +224,14 @@ static void __init setup_node_data(int nid, u64 start, 
> u64 end)
>   } else {
>   nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>   if (!nd_pa) {
> - pr_err("Cannot find %zu bytes in node %d\n",

> + printk(KERN_WARNING "Cannot find %zu bytes in node 
> %d\n",
>  nd_size, nid)

Please change to use pr_warn().

Thanks,
Yasuaki Ishimatsu

> - return;
> + nd_pa = memblock_alloc(nd_size, SMP_CACHE_BYTES);
> + if (!nd_pa) {
> + pr_err("Cannot find %zu bytes in other node\n",
> +nd_size);
> + return;
> + }
>   }
>   nd = __va(nd_pa);
>   }
> 



Re: [PATCH 2/5] page_alloc: Add movablecore_map boot option.

2012-11-20 Thread Yasuaki Ishimatsu

Hi Tang,

The patch has two extra whitespaces.

2012/11/19 23:27, Tang Chen wrote:
> This patch adds functions to parse movablecore_map boot option. Since the
> option could be specified more then once, all the maps will be stored in
> the global variable movablecore_map.map array.
> 
> And also, we keep the array in monotonic increasing order by start_pfn.
> And merge all overlapped ranges.
> 
> Signed-off-by: Tang Chen 
> Reviewed-by: Wen Congyang 
> Tested-by: Lin Feng 
> ---
>   Documentation/kernel-parameters.txt |   17 
>   include/linux/mm.h  |   11 +++
>   mm/page_alloc.c |  146 
> +++
>   3 files changed, 174 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 9776f06..0718976 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1620,6 +1620,23 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   that the amount of memory usable for all allocations
>   is not too small.
>   
> + movablecore_map=nn[KMG]@ss[KMG]
> + [KNL,X86,IA-64,PPC] This parameter is similar to
> + memmap except it specifies the memory map of
> + ZONE_MOVABLE.
> + If more areas are all within one node, then from
> + lowest ss to the end of the node will be ZONE_MOVABLE.
> + If an area covers two or more nodes, the area from
> + ss to the end of the 1st node will be ZONE_MOVABLE,
> + and all the rest nodes will only have ZONE_MOVABLE.
   ^ 
here
> + If memmap is specified at the same time, the
> + movablecore_map will be limited within the memmap
> + areas. If kernelcore or movablecore is also specified,
> + movablecore_map will have higher priority to be
> + satisfied. So the administrator should be careful that
> + the amount of movablecore_map areas are not too large.
> + Otherwise kernel won't have enough memory to start.
> +
>   MTD_Partition=  [MTD]
>   Format: ,,,
>   
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa06804..e4541b4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1328,6 +1328,17 @@ extern void free_bootmem_with_active_regions(int nid,
>   unsigned long max_low_pfn);
>   extern void sparse_memory_present_with_active_regions(int nid);
>   
> +#define MOVABLECORE_MAP_MAX MAX_NUMNODES
> +struct movablecore_entry {
> + unsigned long start;/* start pfn of memory segment */
> + unsigned long end;  /* end pfn of memory segment */
> +};
> +
> +struct movablecore_map {
> + __u32 nr_map;
> + struct movablecore_entry map[MOVABLECORE_MAP_MAX];
> +};
> +
>   #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>   
>   #if !defined(CONFIG_HAVE_MEMBLOCK_NODE_MAP) && \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b74de6..198106f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -198,6 +198,9 @@ static unsigned long __meminitdata nr_all_pages;
>   static unsigned long __meminitdata dma_reserve;
>   
>   #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +/* Movable memory segments, will also be used by memblock subsystem. */
> +struct movablecore_map movablecore_map;
> +
>   static unsigned long __meminitdata 
> arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
>   static unsigned long __meminitdata 
> arch_zone_highest_possible_pfn[MAX_NR_ZONES];
>   static unsigned long __initdata required_kernelcore;
> @@ -4986,6 +4989,149 @@ static int __init cmdline_parse_movablecore(char *p)
>   early_param("kernelcore", cmdline_parse_kernelcore);
>   early_param("movablecore", cmdline_parse_movablecore);
>   
> +/**
> + * insert_movablecore_map - Insert a memory range in to movablecore_map.map.
> + * @start_pfn: start pfn of the range
> + * @end_pfn: end pfn of the range
> + *
> + * This function will also merge the overlapped ranges, and sort the array
> + * by start_pfn in monotonic increasing order.
> + */
> +static void __init insert_movablecore_map(unsigned long start_pfn,
> +   unsigned long end_pfn)
> +{
> + int i, pos_start, pos_end, remove;
> + bool merge = true;
> +
> + if (!movablecore_map.nr_map) {
> + movablecore_map.map[0].start = start_pfn;
> + movablecore_map.map[0].end = end_pfn;
> + movablecore_map.nr_map++;
> + return;
> + }
> +
> + /*
> +  * pos_start at the 1st overlapped segment if merge_start is true,
> +  * or at the next unoverlapped segment if

Re: [PATCH -next] mfd: twl6040: remove duplicated include from twl6040.c

2012-11-20 Thread Sachin Kamat

Hi Wei,

Similar patch already submitted:
http://www.spinics.net/lists/kernel/msg1439539.html

Thanks,
Sachin

On 21 November 2012 11:01, Wei Yongjun  wrote:
> From: Wei Yongjun 
>
> Remove duplicated include.
>
> dpatch engine is used to auto generate this patch.
> (https://github.com/weiyj/dpatch)
>
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/mfd/twl6040.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/mfd/twl6040.c b/drivers/mfd/twl6040.c
> index e5f7b79..583be76 100644
> --- a/drivers/mfd/twl6040.c
> +++ b/drivers/mfd/twl6040.c
> @@ -37,7 +37,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
>
>

RE: [PATCH v2] pwm: Device tree support for PWM polarity.

2012-11-20 Thread Philip, Avinash

On Tue, Nov 20, 2012 at 20:37:55, Thierry Reding wrote:
> On Mon, Nov 19, 2012 at 11:21:12PM +0530, Philip, Avinash wrote:
> [...]
> > diff --git a/include/linux/pwm.h b/include/linux/pwm.h
> > index 112b314..70756f2 100644
> > --- a/include/linux/pwm.h
> > +++ b/include/linux/pwm.h
> > @@ -78,6 +78,10 @@ enum {
> > PWMF_ENABLED = 1 << 1,
> >  };
> >  
> > +/* flags in the third cell of the DT PWM specifier */
> > +#define PWM_SPEC_POLARITY  (1 << 0)
> > +
> > +
> 
> This doesn't belong in this header. It should go into core.c in
> drivers/pwm.

I will move.

> 
> >  struct pwm_device {
> > const char  *label;
> > unsigned long   flags;
> > @@ -176,6 +180,8 @@ void pwm_put(struct pwm_device *pwm);
> >  
> >  struct pwm_device *devm_pwm_get(struct device *dev, const char *consumer);
> >  void devm_pwm_put(struct device *dev, struct pwm_device *pwm);
> > +struct pwm_device *of_pwm_xlate_with_flags(struct pwm_chip *pc,
> > +   const struct of_phandle_args *args);
> 
> The placement of this prototype is odd. I think a better place would be
> between pwm_request_from_chip() and pwm_get(), separated by blank lines
> to make it stand out as an OF specific function.

Ok I will move to between pwm_request_from_chip() and pwm_get().

> 
> >  #else
> >  static inline int pwm_set_chip_data(struct pwm_device *pwm, void *data)
> >  {
> > @@ -223,6 +229,12 @@ static inline struct pwm_device *devm_pwm_get(struct 
> > device *dev,
> >  static inline void devm_pwm_put(struct device *dev, struct pwm_device *pwm)
> >  {
> >  }
> > +
> > +static inline struct pwm_device *of_pwm_xlate_with_flags(struct pwm_chip 
> > *pc,
> > +   const struct of_phandle_args *args)
> > +{
> > +   return ERR_PTR(-ENODEV);
> > +}
> 
> This function should only be used by PWM drivers and therefore doesn't
> need to have a dummy implementation such as this.

Ok I will remove.

Thanks
Avinash

> 
> Thierry
> 


[PATCH 1/1] tty: vt: Remove redundant null check before kfree.

2012-11-20 Thread Sachin Kamat

kfree on a NULL pointer is a no-op.

Signed-off-by: Sachin Kamat 
---
 drivers/tty/vt/consolemap.c |6 ++
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 2aaa0c2..248381b 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -410,10 +410,8 @@ static void con_release_unimap(struct uni_pagedir *p)
kfree(p->inverse_translations[i]);
p->inverse_translations[i] = NULL;
}
-   if (p->inverse_trans_unicode) {
-   kfree(p->inverse_trans_unicode);
-   p->inverse_trans_unicode = NULL;
-   }
+   kfree(p->inverse_trans_unicode);
+   p->inverse_trans_unicode = NULL;
 }
 
 /* Caller must hold the console lock */
-- 
1.7.4.1
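
The same simplification has a direct userspace analogue, since the C standard defines free(NULL) as a no-op just as the commit message says of kfree(NULL). A minimal sketch, with a made-up struct standing in for uni_pagedir:

```c
#include <stdlib.h>

struct unimap {
    unsigned short *inverse;   /* may legitimately be NULL */
};

/* Release the table without a NULL check: free(NULL) does nothing. */
void release_unimap(struct unimap *p)
{
    free(p->inverse);    /* safe even when p->inverse == NULL */
    p->inverse = NULL;   /* avoid a dangling pointer and double free */
}
```

Calling release_unimap() twice in a row on the same object is harmless, which is the property the NULL check was guarding.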


Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Wen Congyang

At 11/21/2012 01:03 PM, Jaegeuk Hanse Wrote:
> On 11/21/2012 12:42 PM, Wen Congyang wrote:
>> At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote:
>>> On 11/21/2012 11:05 AM, Wen Congyang wrote:
>>>> At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:
>>>>> On 11/01/2012 05:44 PM, Wen Congyang wrote:
>>>>>> From: Yasuaki Ishimatsu 
>>>>>>
>>>>>> Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
>>>>>> even if
>>>>>> we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.
>>>>>>
>>>>>> So the patch add unregister_memory_section() into __remove_section().
>>>>> Hi Yasuaki,
>>>>>
>>>>> I have a question about these sparse vmemmap memory related
>>>>> patches. Hot
>>>>> add memory need allocated vmemmap pages, but this time is allocated by
>>>>> buddy system. How can gurantee virtual address is continuous to the
>>>>> address allocated before? If not continuous, page_to_pfn and
>>>>> pfn_to_page
>>>>> can't work correctly.
>>>> vmemmap has its virtual address range:
>>>> ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
>>>>
>>>> We allocate memory from buddy system to store struct page, and its
>>>> virtual
>>>> address isn't in this range. So we should update the page table:
>>>>
>>>> kmalloc_section_memmap()
>>>>   sparse_mem_map_populate()
>>>>   pfn_to_page() // get the virtual address in the vmemmap range
>>>>   vmemmap_populate() // we update page table here
>>>>
>>>> When we use vmemmap, page_to_pfn() always returns address in the
>>>> vmemmap
>>>> range, not the address that kmalloc() returns. So the virtual address
>>>> is continuous.
>>> Hi Congyang,
>>>
>>> Another question about memory hotplug. During hot remove memory, it will
>>> also call memblock_remove to remove related memblock.
>> IIRC, we don't touch memblock when hot-add/hot-remove memory. memblock is
>> only used for bootmem allocator. I think it isn't used after booting.
> 
> In IBM pseries servers.
> 
> pseries_remove_memory()
> pseries_remove_memblock()
> memblock_remove()

It seems that pseries servers don't use ACPI (ACPI is only supported for
ia64 and x86 now; arm will be supported in the future).

I am not a ppc expert, and I don't know why we touch memblock when hot-adding
memory in the ppc case. But IIRC, we don't need memblock after the machine has
booted up in the x86 case. So there is no need to touch it when hot-adding or
hot-removing memory in the x86 case.

Thanks
Wen Congyang

> 
> Furthermore, memblock is set to record available memory ranges get from
> e820 map(you can check it in memblock_x86_fill()) in x86 case, after
> hot-remove memory, this range of memory can't be available, why not
> remove them as pseries servers' codes do.
> 
>>> memblock_remove()
>>> __memblock_remove()
>>>
>>> memblock_isolate_range()
>>> memblock_remove_region()
>>>
>>> But memblock_isolate_range() only record fully contained regions,
>>> regions which are partial overlapped just be splitted instead of record.
>>> So these partial overlapped regions can't be removed. Where I miss?
>> No, memblock_isolate_range() can deal with partial overlapped region.
>> =
>> if (rbase < base) {
>> /*
>>  * @rgn intersects from below.  Split and continue
>>  * to process the next region - the new top half.
>>  */
>> rgn->base = base;
>> rgn->size -= base - rbase;
>> type->total_size -= base - rbase;
>> memblock_insert_region(type, i, rbase, base - rbase,
>>memblock_get_region_node(rgn));
>> } else if (rend > end) {
>> /*
>>  * @rgn intersects from above.  Split and redo the
>>  * current region - the new bottom half.
>>  */
>> rgn->base = end;
>> rgn->size -= end - rbase;
>> type->total_size -= end - rbase;
>> memblock_insert_region(type, i--, rbase, end - rbase,
>>memblock_get_region_node(rgn));
>> =
>>
>> If the region is partial overlapped region, we will split the old
>> region into
>> two regions. After doing this, it is full contained region now.
> 
> You are right, I misunderstand the codes.
> 
>>
>> Thanks
>> Wen Congyang
>>
>>> Regards,
>>> Jaegeuk
>>>
>>>> Thanks
>>>> Wen Congyang
>>>>> Regards,
>>>>> Jaegeuk
>>>>>
>>>>>> CC: David Rientjes 
>>>>>> CC: Jiang Liu 
>>>>>> CC: Len Brown 
>>>>>> CC: Christoph Lameter 
>>>>>> Cc: Minchan Kim 
>>>>>> CC: Andrew Morton 
>>>>>> CC: KOSAKI Motohiro 
>>>>>> CC: Wen Congyang 
>>>>>> Signed-off-by: Yasuaki Ishimatsu 
>>>>>> ---
>>>>>> mm/memory_hotplug.c | 13 -
>>>>>> 1 file changed, 8 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>>> index ca07433..66a79a7 100644
>>>>>> ---

Re: [PATCH 11/42] ARM: shmobile: Register PFC platform device

2012-11-20 Thread Simon Horman

On Wed, Nov 21, 2012 at 03:27:12AM +0100, Laurent Pinchart wrote:
> Add arch code to register the PFC platform device instead of calling the
> driver directly. Platform device registration in the sh-pfc driver will
> be removed.

I'm not really sure that I understand the motivation for
moving platform device registration from the driver into
mach-shmobile. Could you explain this a little?

Re: [PATCH 1/1] ARM: pxa: fix pxa25x gpio wakeup setting

2012-11-20 Thread Haojian Zhuang

On Wed, Nov 21, 2012 at 8:37 AM, Andrea Adami  wrote:
> * Since 3.3 gpio wakeup is broken on pxa25x (tested on corgi and poodle).
> * Use gpio_set_wake like done for pxa27x with commit id
> * b95ace54a23e2f8ebb032744cebb17c9f43bf651
>
> Signed-off-by: Andrea Adami 
> ---
>  arch/arm/mach-pxa/pxa25x.c |5 +
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm/mach-pxa/pxa25x.c b/arch/arm/mach-pxa/pxa25x.c
> index 3352b37..aeb913e 100644
> --- a/arch/arm/mach-pxa/pxa25x.c
> +++ b/arch/arm/mach-pxa/pxa25x.c
> @@ -338,6 +338,10 @@ void __init pxa25x_map_io(void)
> pxa25x_get_clk_frequency_khz(1);
>  }
>
> +static struct pxa_gpio_platform_data pxa25x_gpio_info __initdata = {
> +   .gpio_set_wake = gpio_set_wake,
> +};
> +
>  static struct platform_device *pxa25x_devices[] __initdata = {
> _device_udc,
> _device_pmu,
> @@ -370,6 +374,7 @@ static int __init pxa25x_init(void)
> register_syscore_ops(_mfp_syscore_ops);
> register_syscore_ops(_clock_syscore_ops);
>
> +   pxa_register_device(_device_gpio, _gpio_info);
> ret = platform_add_devices(pxa25x_devices,
>ARRAY_SIZE(pxa25x_devices));
> if (ret)
> --
> 1.7.8.6
>

Acked & applied

Regards
Haojian

[PATCH 1/2] PM/devfreq: Fix incorrect argument in error message

2012-11-20 Thread Sachin Kamat

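When the governor lookup fails, 'g' holds an error code cast to a pointer,
so it must not be dereferenced. Doing so gives the following error, which
this patch fixes by printing governor->name instead.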

drivers/devfreq/devfreq.c:645 devfreq_remove_governor() error:
'g' dereferencing possible ERR_PTR()

Signed-off-by: Sachin Kamat 
---
 drivers/devfreq/devfreq.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 45e053e..83c2129 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -643,7 +643,7 @@ int devfreq_remove_governor(struct devfreq_governor 
*governor)
g = find_devfreq_governor(governor->name);
if (IS_ERR(g)) {
pr_err("%s: governor %s not registered\n", __func__,
-  g->name);
+  governor->name);
err = -EINVAL;
goto err_out;
}
-- 
1.7.4.1


[PATCH 2/2] PM/devfreq: Fix return value in devfreq_remove_governor()

2012-11-20 Thread Sachin Kamat

Use the value obtained from the function instead of -EINVAL.

Signed-off-by: Sachin Kamat 
---
 drivers/devfreq/devfreq.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 83c2129..2bd9ab0 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -644,7 +644,7 @@ int devfreq_remove_governor(struct devfreq_governor 
*governor)
if (IS_ERR(g)) {
pr_err("%s: governor %s not registered\n", __func__,
   governor->name);
-   err = -EINVAL;
+   err = PTR_ERR(g);
goto err_out;
}
list_for_each_entry(devfreq, _list, node) {
-- 
1.7.4.1
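
The error-pointer convention behind the two devfreq patches above can be
sketched in userspace C. This is a simplified illustration, not the
kernel's include/linux/err.h: the ERR_PTR helpers are re-implemented
here, and struct governor, find_governor(), remove_governor() and the
governor table are hypothetical names invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the last MAX_ERRNO values of the address space. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct governor {
	const char *name;
};

static struct governor known[] = { { "performance" }, { "powersave" } };

/* Hypothetical lookup: returns an entry or an encoded errno, never NULL. */
struct governor *find_governor(const char *name)
{
	size_t i;

	if (!name)
		return ERR_PTR(-EINVAL);
	for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
		if (!strcmp(known[i].name, name))
			return &known[i];
	return ERR_PTR(-ENODEV);
}

int remove_governor(const struct governor *gov)
{
	struct governor *g = find_governor(gov->name);

	if (IS_ERR(g)) {
		/* g is not a valid object here, so print the caller's name. */
		fprintf(stderr, "%s: governor %s not registered\n",
			__func__, gov->name);
		/* Propagate the real error instead of a fixed code. */
		return (int)PTR_ERR(g);
	}
	return 0;
}
```

The two comments in remove_governor() correspond to the two fixes: the
message uses the argument's name rather than the error pointer, and the
return value carries whatever error the lookup reported.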


[PATCH v3 12/12] iommu/exynos: add debugfs entries for System MMU

2012-11-20 Thread Cho KyongHo

This commit adds a debugfs directory and nodes for inspecting the internal
state of the System MMU.

Change-Id: I4afcdd925609d381e7329ec118ffe52e38dc340e
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 204 +--
 1 file changed, 198 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 886fae5..5dd49c6 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -26,12 +26,17 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
 
 #include 
 
+#define MODULE_NAME "exynos-sysmmu"
+
 /* We does not consider super section mapping (16MB) */
 #define SECT_ORDER 20
 #define LPAGE_ORDER 16
@@ -237,6 +242,7 @@ struct sysmmu_drvdata {
spinlock_t lock;
struct sysmmu_prefbuf pbufs[MAX_NUM_PBUF];
int num_pbufs;
+   struct dentry *debugfs_root;
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
@@ -1093,6 +1099,8 @@ static void __init __sysmmu_init_mmuname(struct device 
*sysmmu,
}
 }
 
+static void __create_debugfs_entry(struct sysmmu_drvdata *drvdata);
+
 static int __init exynos_sysmmu_probe(struct platform_device *pdev)
 {
int i, ret;
@@ -1163,6 +1171,8 @@ static int __init exynos_sysmmu_probe(struct 
platform_device *pdev)
 
__set_fault_handler(data, _fault_handler);
 
+   __create_debugfs_entry(data);
+
platform_set_drvdata(pdev, data);
 
dev->archdata.iommu = _placeholder;
@@ -1267,7 +1277,7 @@ static struct platform_driver exynos_sysmmu_driver 
__refdata = {
.probe  = exynos_sysmmu_probe,
.driver = {
.owner  = THIS_MODULE,
-   .name   = "exynos-sysmmu",
+   .name   = MODULE_NAME,
.pm = &__pm_ops,
.of_match_table = of_match_ptr(sysmmu_of_match),
}
@@ -1644,6 +1654,8 @@ static struct iommu_ops exynos_iommu_ops = {
.pgsize_bitmap = SECT_SIZE | LPAGE_SIZE | SPAGE_SIZE,
 };
 
+static struct dentry *sysmmu_debugfs_root; /* /sys/kernel/debug/sysmmu */
+
 static int __init exynos_iommu_init(void)
 {
int ret;
@@ -1655,17 +1667,197 @@ static int __init exynos_iommu_init(void)
return -ENOMEM;
}
 
-   ret = platform_driver_register(_sysmmu_driver);
+   ret = bus_set_iommu(_bus_type, _iommu_ops);
+   if (ret) {
+   kmem_cache_destroy(lv2table_kmem_cache);
+   pr_err("%s: Failed to register IOMMU ops\n", __func__);
+   return -EFAULT;
+   }
 
-   if (ret == 0)
-   ret = bus_set_iommu(_bus_type, _iommu_ops);
+   sysmmu_debugfs_root = debugfs_create_dir("sysmmu", NULL);
+   if (!sysmmu_debugfs_root)
+   pr_err("%s: Failed to create debugfs entry, 'sysmmu'\n",
+   __func__);
+   if (IS_ERR(sysmmu_debugfs_root))
+   sysmmu_debugfs_root = NULL;
 
+   ret = platform_driver_register(_sysmmu_driver);
if (ret) {
-   pr_err("%s: Failed to register exynos-iommu driver.\n",
-   __func__);
kmem_cache_destroy(lv2table_kmem_cache);
+   pr_err("%s: Failed to register System MMU driver\n", __func__);
}
 
return ret;
 }
 subsys_initcall(exynos_iommu_init);
+
+static int debug_string_show(struct seq_file *s, void *unused)
+{
+   char *str = s->private;
+
+   seq_printf(s, "%s\n", str);
+
+   return 0;
+}
+
+static int debug_sysmmu_list_show(struct seq_file *s, void *unused)
+{
+   struct sysmmu_drvdata *drvdata = s->private;
+   struct platform_device *pdev = to_platform_device(drvdata->sysmmu);
+   int idx, maj, min, ret;
+
+   seq_printf(s, "SysMMU Name | Ver | SFR Base\n");
+
+   if (pm_runtime_enabled(drvdata->sysmmu)) {
+   ret = pm_runtime_get_sync(drvdata->sysmmu);
+   if (ret < 0)
+   return ret;
+   }
+
+   for (idx = 0; idx < drvdata->nsfrs; idx++) {
+   struct resource *res;
+
+   res = platform_get_resource(pdev, IORESOURCE_MEM, idx);
+   if (!res)
+   break;
+
+   maj = __sysmmu_version(drvdata, idx, );
+
+   if (drvdata->mmuname) {
+   if (maj == 0)
+   seq_printf(s, "%11.s | N/A | 0x%08x\n",
+   drvdata->mmuname[idx], res->start);
+   else
+   seq_printf(s, "%11.s | %d.%d | 0x%08x\n",
+   drvdata->mmuname[idx], maj, min, res->start);
+   } else {
+   if (maj == 0)
+

Re: [RFC PATCH] mm: trace filemap add and del

2012-11-20 Thread Hugh Dickins

On Thu, 8 Nov 2012, Robert Jarzmik wrote:
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -467,6 +471,7 @@ int add_to_page_cache_locked(struct page *page, struct 
> address_space *mapping,
>   } else {
>   page->mapping = NULL;
>   /* Leave page->index set: truncation relies upon it */
> + trace_mm_filemap_add_to_page_cache(page);
>   spin_unlock_irq(>tree_lock);
>   mem_cgroup_uncharge_cache_page(page);
>   page_cache_release(page);

I doubt if you really want your tracepoint sited just in this error path.

Hugh

[PATCH v3 11/12] iommu/exynos: add literal name of System MMU for debugging

2012-11-20 Thread Cho KyongHo

This commit adds the System MMU name to the driver data of each System
MMU. The name is used when reporting fault information.

Change-Id: If6720b69609880873ebaf160188f1e726a67b806
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 100 ---
 1 file changed, 76 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index bcfa9b0..886fae5 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -150,15 +151,21 @@ enum exynos_sysmmu_inttype {
SYSMMU_FAULTS_NUM
 };
 
-/*
+/**
+ * fault handler function type
+ * @dev: the client device
+ * @mmuname: name of the System MMU that generates fault
  * @itype: type of fault.
  * @pgtable_base: the physical address of page table base. This is 0 if @itype
  *is SYSMMU_BUSERROR.
  * @fault_addr: the device (virtual) address that the System MMU tried to
  * translated. This is 0 if @itype is SYSMMU_BUSERROR.
  */
-typedef int (*sysmmu_fault_handler_t)(enum exynos_sysmmu_inttype itype,
-   unsigned long pgtable_base, unsigned long fault_addr);
+typedef int (*sysmmu_fault_handler_t)(struct device *dev,
+ const char *mmuname,
+ enum exynos_sysmmu_inttype itype,
+ unsigned long pgtable_base,
+ unsigned long fault_addr);
 
 static unsigned short fault_reg_offset[SYSMMU_FAULTS_NUM] = {
REG_PAGE_FAULT_ADDR,
@@ -234,6 +241,7 @@ struct sysmmu_drvdata {
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
bool runtime_active;
+   const char **mmuname;
void __iomem *sfrbases[0];
 };
 
@@ -611,16 +619,18 @@ void exynos_sysmmu_set_fault_handler(struct device *dev,
spin_unlock_irqrestore(>lock, flags);
 }
 
-static int default_fault_handler(enum exynos_sysmmu_inttype itype,
-unsigned long pgtable_base, unsigned long fault_addr)
+static int default_fault_handler(struct device *dev, const char *mmuname,
+   enum exynos_sysmmu_inttype itype,
+   unsigned long pgtable_base,
+   unsigned long fault_addr)
 {
unsigned long *ent;
 
if ((itype >= SYSMMU_FAULTS_NUM) || (itype < SYSMMU_PAGEFAULT))
itype = SYSMMU_FAULT_UNKNOWN;
 
-   pr_err("%s occurred at 0x%lx(Page table base: 0x%lx)\n",
-   sysmmu_fault_name[itype], fault_addr, pgtable_base);
+   dev_err(dev, "%s occured at 0x%lx by '%s'(Page table base: 0x%lx)\n",
+   sysmmu_fault_name[itype], fault_addr, mmuname, pgtable_base);
 
ent = section_entry(__va(pgtable_base), fault_addr);
pr_err("\tLv1 entry: 0x%lx\n", *ent);
@@ -641,25 +651,30 @@ static irqreturn_t exynos_sysmmu_irq(int irq, void 
*dev_id)
 {
/* SYSMMU is in blocked when interrupt occurred. */
struct sysmmu_drvdata *data = dev_id;
-   struct resource *irqres;
-   struct platform_device *pdev;
+   struct exynos_iommu_owner *owner = NULL;
enum exynos_sysmmu_inttype itype;
unsigned long addr = -1;
-
+   const char *mmuname = NULL;
int i, ret = -ENOSYS;
 
-   spin_lock(>lock);
+   if (data->master)
+   owner = data->master->archdata.iommu;
+
+   if (owner)
+   spin_lock(>lock);
 
WARN_ON(!is_sysmmu_active(data));
 
-   pdev = to_platform_device(data->sysmmu);
-   for (i = 0; i < (pdev->num_resources / 2); i++) {
-   irqres = platform_get_resource(pdev, IORESOURCE_IRQ, i);
+   for (i = 0; i < data->nsfrs; i++) {
+   struct resource *irqres;
+   irqres = platform_get_resource(
+   to_platform_device(data->sysmmu),
+   IORESOURCE_IRQ, i);
if (irqres && ((int)irqres->start == irq))
break;
}
 
-   if (i == pdev->num_resources) {
+   if (i == data->nsfrs) {
itype = SYSMMU_FAULT_UNKNOWN;
} else {
itype = (enum exynos_sysmmu_inttype)
@@ -671,28 +686,34 @@ static irqreturn_t exynos_sysmmu_irq(int irq, void 
*dev_id)
data->sfrbases[i] + fault_reg_offset[itype]);
}
 
-   if (data->domain)
-   ret = report_iommu_fault(data->domain, data->master,
+   if (data->domain) /* owner is always set if data->domain exists */
+   ret = report_iommu_fault(data->domain, owner->dev,
addr, itype);
 
if ((ret == -ENOSYS) && data->fault_handler) {
unsigned long base = data->pgtable;
+   mmuname = (data->mmuname) ?

[PATCH v3 10/12] iommu/exynos: add support for System MMU 3.2 and 3.3

2012-11-20 Thread Cho KyongHo

System MMU 3.2 and 3.3 have more than two prefetch buffers, so the
existing function for setting prefetch buffers, exynos_sysmmu_set_prefbuf(),
cannot support them.
This commit removes exynos_sysmmu_set_prefbuf() and introduces a new
interface, exynos_sysmmu_set_pbuf(), that can pass information about
more than two buffers. It is safe to remove the existing function
because no device driver in the kernel calls it yet.

Change-Id: I364016b48e0d1f3d6869fbcf9b498d9da42c29b7
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 336 +--
 1 file changed, 290 insertions(+), 46 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 8d95505..bcfa9b0 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -80,6 +80,13 @@
 #define CTRL_BLOCK 0x7
 #define CTRL_DISABLE   0x0
 
+#define CFG_LRU0x1
+#define CFG_QOS(n) ((n & 0xF) << 7)
+#define CFG_MASK   0x0050 /* Selecting bit 0-15, 20, 22 */
+#define CFG_SYSSEL (1 << 22) /* System MMU 3.2 only */
+#define CFG_FLPDCACHE  (1 << 20) /* System MMU 3.2+ only */
+#define CFG_SHAREABLE  (1 << 12) /* System MMU 3.x only */
+
 #define REG_MMU_CTRL   0x000
 #define REG_MMU_CFG0x004
 #define REG_MMU_STATUS 0x008
@@ -88,6 +95,10 @@
 #define REG_PT_BASE_ADDR   0x014
 #define REG_INT_STATUS 0x018
 #define REG_INT_CLEAR  0x01C
+#define REG_PB_INFO0x400
+#define REG_PB_LMM 0x404
+#define REG_PB_INDICATE0x408
+#define REG_PB_CFG 0x40C
 
 #define REG_PAGE_FAULT_ADDR0x024
 #define REG_AW_FAULT_ADDR  0x028
@@ -99,10 +110,9 @@
 #define MMU_MAJ_VER(reg)   (reg >> 28)
 #define MMU_MIN_VER(reg)   ((reg >> 21) & 0x7F)
 
-#define REG_PB0_SADDR  0x04C
-#define REG_PB0_EADDR  0x050
-#define REG_PB1_SADDR  0x054
-#define REG_PB1_EADDR  0x058
+#define MAX_NUM_PBUF   6
+
+#define NUM_MINOR_OF_SYSMMU_V3 4
 
 static void *sysmmu_placeholder; /* Inidcate if a device is System MMU */
 
@@ -197,6 +207,19 @@ struct sysmmu_version {
unsigned char minor;
 };
 
+#define SYSMMU_PBUFCFG_TLB_UPDATE  (1 << 16)
+#define SYSMMU_PBUFCFG_ASCENDING   (1 << 12)
+#define SYSMMU_PBUFCFG_DSECENDING  (0 << 12) /* default */
+#define SYSMMU_PBUFCFG_PREFETCH(1 << 8)
+#define SYSMMU_PBUFCFG_WRITE   (1 << 4)
+#define SYSMMU_PBUFCFG_READ(0 << 4) /* default */
+
+struct sysmmu_prefbuf {
+   unsigned long base;
+   unsigned long size;
+   unsigned long config;
+};
+
 struct sysmmu_drvdata {
struct device *sysmmu;  /* System MMU's device descriptor */
struct device *master;  /* Client device that needs System MMU */
@@ -205,6 +228,8 @@ struct sysmmu_drvdata {
int activations;
struct sysmmu_version ver;
spinlock_t lock;
+   struct sysmmu_prefbuf pbufs[MAX_NUM_PBUF];
+   int num_pbufs;
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
@@ -291,59 +316,277 @@ static void __sysmmu_set_ptbase(void __iomem *sfrbase,
__sysmmu_tlb_invalidate(sfrbase);
 }
 
-static void __sysmmu_set_prefbuf(void __iomem *sfrbase, unsigned long base,
-   unsigned long size, int idx)
+static void __sysmmu_set_prefbuf(void __iomem *pbufbase, unsigned long base,
+   unsigned long size, int idx)
+{
+   __raw_writel(base, pbufbase + idx * 8);
+   __raw_writel(size - 1 + base,  pbufbase + 4 + idx * 8);
+}
+
+/*
+ * Offset of prefetch buffer setting registers are different
+ * between SysMMU 3.1 and 3.2. 3.3 has a single prefetch buffer setting.
+ */
+static unsigned short
+   pbuf_offset[NUM_MINOR_OF_SYSMMU_V3] = {0x04C, 0x04C, 0x070, 0x410};
+
+/**
+ * __sysmmu_sort_prefbuf - sort the given @prefbuf in descending order.
+ * @prefbuf: array of buffer information
+ * @nbufs: number of elements of @prefbuf
+ * @check_size: whether to compare buffer sizes. See below description.
+ *
+ * return value is valid if @check_size is ture. If the size of first buffer
+ * in @prefbuf is larger than or equal to the sum of the sizes of the other
+ * buffers, returns 1. If the size of the first buffer is smaller than the
+ * sum of other sizes, returns -1. Returns 0, otherwise.
+ */
+static int __sysmmu_sort_prefbuf(struct sysmmu_prefbuf prefbuf[],
+   int nbufs, bool check_size)
+{
+   int i;
+
+   for (i = 0; i < nbufs; i++) {
+   int j;
+   for (j = i + 1; j < nbufs; j++)
+   if (prefbuf[i].size < prefbuf[j].size)
+   swap(prefbuf[i], prefbuf[j]);
+   }
+
+   if (check_size) {
+   unsigned long sum = 0;
+   for (i = 1; i
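
The patch text is cut off here in the archive. As a rough guide to what
the kernel-doc comment on __sysmmu_sort_prefbuf() describes, here is a
userspace sketch. struct prefbuf and sort_prefbuf() are hypothetical
names. The patch comment is ambiguous about the case where the largest
buffer equals the sum of the others, so returning 0 there is an
assumption of this sketch, not something the truncated patch confirms.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced form of struct sysmmu_prefbuf. */
struct prefbuf {
	unsigned long base;
	unsigned long size;
};

/*
 * Sort bufs[] by size, largest first, with the same simple exchange
 * sort as the visible part of the patch. When check_size is set, compare
 * the largest buffer with the sum of the rest:
 * 1 if larger, -1 if smaller, 0 if equal (assumed) or not checked.
 */
int sort_prefbuf(struct prefbuf bufs[], int nbufs, bool check_size)
{
	unsigned long sum = 0;
	int i, j;

	assert(nbufs >= 0);

	for (i = 0; i < nbufs; i++) {
		for (j = i + 1; j < nbufs; j++) {
			if (bufs[i].size < bufs[j].size) {
				struct prefbuf tmp = bufs[i];

				bufs[i] = bufs[j];
				bufs[j] = tmp;
			}
		}
	}

	if (!check_size || nbufs == 0)
		return 0;

	for (i = 1; i < nbufs; i++)
		sum += bufs[i].size;

	if (bufs[0].size > sum)
		return 1;
	if (bufs[0].size < sum)
		return -1;
	return 0;
}
```

With at most MAX_NUM_PBUF (6) entries, the quadratic sort costs nothing
noticeable, which is presumably why the patch does not use a library sort.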

[PATCH v3 09/12] iommu/exynos: add support for runtime pm and suspend/resume

2012-11-20 Thread Cho KyongHo

This change frees client device drivers from tracking the state of
the System MMU, because its internal state is now controlled by the
runtime PM and suspend/resume callback functions.

Change-Id: Ic04c8f259d8b8af2846175dd7b98dbc4e463c96e
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 175 ++-
 1 file changed, 89 insertions(+), 86 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index f7dff54..8d95505 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -208,6 +208,7 @@ struct sysmmu_drvdata {
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
+   bool runtime_active;
void __iomem *sfrbases[0];
 };
 
@@ -477,7 +478,8 @@ static bool __sysmmu_disable(struct sysmmu_drvdata *drvdata)
drvdata->pgtable = 0;
drvdata->domain = NULL;
 
-   __sysmmu_disable_nocount(drvdata);
+   if (drvdata->runtime_active)
+   __sysmmu_disable_nocount(drvdata);
 
dev_dbg(drvdata->sysmmu, "Disabled\n");
} else  {
@@ -490,30 +492,6 @@ static bool __sysmmu_disable(struct sysmmu_drvdata 
*drvdata)
return disabled;
 }
 
-static bool __exynos_sysmmu_disable(struct device *dev)
-{
-   unsigned long flags;
-   bool disabled = true;
-   struct exynos_iommu_owner *owner = dev->archdata.iommu;
-   struct device *sysmmu;
-
-   BUG_ON(!has_sysmmu(dev));
-
-   spin_lock_irqsave(>lock, flags);
-
-   /* Every call to __sysmmu_disable() must return same result */
-   for_each_sysmmu(dev, sysmmu) {
-   struct sysmmu_drvdata *drvdata = dev_get_drvdata(sysmmu);
-   disabled = __sysmmu_disable(drvdata);
-   if (disabled)
-   drvdata->master = NULL;
-   }
-
-   spin_unlock_irqrestore(>lock, flags);
-
-   return disabled;
-}
-
 static void __sysmmu_enable_nocount(struct sysmmu_drvdata *drvdata)
 {
int i;
@@ -554,7 +532,8 @@ static int __sysmmu_enable(struct sysmmu_drvdata *drvdata,
drvdata->pgtable = pgtable;
drvdata->domain = domain;
 
-   __sysmmu_enable_nocount(drvdata);
+   if (drvdata->runtime_active)
+   __sysmmu_enable_nocount(drvdata);
 
dev_dbg(drvdata->sysmmu, "Enabled\n");
} else {
@@ -610,42 +589,31 @@ static int __exynos_sysmmu_enable(struct device *dev, 
unsigned long pgtable,
 
 int exynos_sysmmu_enable(struct device *dev, unsigned long pgtable)
 {
-   int ret;
-   struct device *sysmmu;
-
BUG_ON(!memblock_is_memory(pgtable));
 
-   for_each_sysmmu(dev, sysmmu) {
-   ret = pm_runtime_get_sync(sysmmu);
-   if (ret < 0)
-   break;
-   }
-
-   if (ret < 0) {
-   struct device *start;
-   for_each_sysmmu_until(dev, start, sysmmu)
-   pm_runtime_put(start);
-
-   return ret;
-   }
-
-   ret = __exynos_sysmmu_enable(dev, pgtable, NULL);
-   if (ret < 0)
-   for_each_sysmmu(dev, sysmmu)
-   pm_runtime_put(sysmmu);
-
-   return ret;
+   return __exynos_sysmmu_enable(dev, pgtable, NULL);
 }
 
 bool exynos_sysmmu_disable(struct device *dev)
 {
-   bool disabled;
+   unsigned long flags;
+   bool disabled = true;
+   struct exynos_iommu_owner *owner = dev->archdata.iommu;
struct device *sysmmu;
 
-   disabled = __exynos_sysmmu_disable(dev);
+   BUG_ON(!has_sysmmu(dev));
 
-   for_each_sysmmu(dev, sysmmu)
-   pm_runtime_put(sysmmu);
+   spin_lock_irqsave(>lock, flags);
+
+   /* Every call to __sysmmu_disable() must return same result */
+   for_each_sysmmu(dev, sysmmu) {
+   struct sysmmu_drvdata *drvdata = dev_get_drvdata(sysmmu);
+   disabled = __sysmmu_disable(drvdata);
+   if (disabled)
+   drvdata->master = NULL;
+   }
+
+   spin_unlock_irqrestore(>lock, flags);
 
return disabled;
 }
@@ -661,7 +629,8 @@ static void sysmmu_tlb_invalidate_entry(struct device *dev, 
unsigned long iova)
data = dev_get_drvdata(sysmmu);
 
spin_lock_irqsave(>lock, flags);
-   if (is_sysmmu_active(data)) {
+   if (is_sysmmu_active(data) &&
+   data->runtime_active) {
int i;
for (i = 0; i < data->nsfrs; i++) {
if (sysmmu_block(data->sfrbases[i])) {
@@ -893,6 +862,7 @@ static int __init exynos_sysmmu_probe(struct 
platform_device *pdev)
 
ret = __sysmmu_setup(dev, data);
if (!ret) {
+   data->runtime_active = !pm_runtime_enabled(dev);
data->sysmmu
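
The diff is cut off here in the archive. The idea visible in the hunks
above is that enable and disable requests always update the software
state, but touch the hardware only while runtime_active is set. A
userspace sketch of that idea follows. struct mmu, the hw_enabled flag
and the two runtime PM callbacks are hypothetical; the real callbacks are
not visible in the truncated patch, so their behaviour here is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>

struct mmu {
	int activations;	/* outstanding enable requests from the client */
	bool runtime_active;	/* true while the block is powered */
	bool hw_enabled;	/* stand-in for the real control register */
};

/* Client request: always record it, program hardware only if powered. */
void mmu_enable(struct mmu *m)
{
	if (m->activations++ == 0 && m->runtime_active)
		m->hw_enabled = true;
}

void mmu_disable(struct mmu *m)
{
	assert(m->activations > 0);
	if (--m->activations == 0 && m->runtime_active)
		m->hw_enabled = false;
}

/* Assumed runtime PM resume: replay the recorded state into hardware. */
void mmu_runtime_resume(struct mmu *m)
{
	m->runtime_active = true;
	if (m->activations > 0)
		m->hw_enabled = true;
}

/* Assumed runtime PM suspend: hardware state is lost, software state kept. */
void mmu_runtime_suspend(struct mmu *m)
{
	m->hw_enabled = false;
	m->runtime_active = false;
}
```

Because the activation count survives a suspend, the client never has to
re-enable the MMU after a resume, which matches the goal stated in the
commit message.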

[PATCH v3 08/12] iommu/exynos: set System MMU as the parent of client device

2012-11-20 Thread Cho KyongHo

This commit sets the System MMU as the parent of the client device for
power management. If the System MMU is the parent of a device, it is
guaranteed that the System MMU is suspended later than the device and
resumed earlier. Runtime suspend/resume on the device is also
propagated to the System MMU.
If a device is configured to have more than one System MMU, the same
power management advantage applies: the System MMUs themselves form a
parent-child chain, and the client device remains the descendant of
all of its System MMUs.

Change-Id: Idd1a28e95e5610feaaa3f43ee1bd12390e5f171c
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 538 ---
 1 file changed, 358 insertions(+), 180 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index e39ddac..f7dff54 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -104,6 +104,17 @@
 #define REG_PB1_SADDR  0x054
 #define REG_PB1_EADDR  0x058
 
+static void *sysmmu_placeholder; /* Inidcate if a device is System MMU */
+
+#define is_sysmmu(sysmmu) (sysmmu->archdata.iommu == _placeholder)
+#define has_sysmmu(dev)
\
+   (dev->parent && dev->archdata.iommu && is_sysmmu(dev->parent))
+#define for_each_sysmmu(dev, sysmmu)   \
+   for (sysmmu = dev->parent; sysmmu && is_sysmmu(sysmmu); \
+   sysmmu = sysmmu->parent)
+#define for_each_sysmmu_until(dev, sysmmu, until)  \
+   for (sysmmu = dev->parent; sysmmu != until; sysmmu = sysmmu->parent)
+
 static struct kmem_cache *lv2table_kmem_cache;
 
 static unsigned long *section_entry(unsigned long *pgtable, unsigned long iova)
@@ -170,6 +181,16 @@ struct exynos_iommu_domain {
spinlock_t pgtablelock; /* lock for modifying page table @ pgtable */
 };
 
+/* exynos_iommu_owner
+ * Metadata attached to the owner of a group of System MMUs that belong
+ * to the same owner device.
+ */
+struct exynos_iommu_owner {
+   struct list_head client; /* entry of exynos_iommu_domain.clients */
+   struct device *dev;
+   spinlock_t lock;/* Lock to preserve consistency of System MMU */
+};
+
 struct sysmmu_version {
unsigned char major; /* major = 0 means that driver must use MMU_VERSION
register instead of this structure */
@@ -177,9 +198,8 @@ struct sysmmu_version {
 };
 
 struct sysmmu_drvdata {
-   struct list_head node; /* entry of exynos_iommu_domain.clients */
struct device *sysmmu;  /* System MMU's device descriptor */
-   struct device *dev; /* Owner of system MMU */
+   struct device *master;  /* Client device that needs System MMU */
int nsfrs;
struct clk *clk;
int activations;
@@ -281,62 +301,70 @@ void exynos_sysmmu_set_prefbuf(struct device *dev,
unsigned long base0, unsigned long size0,
unsigned long base1, unsigned long size1)
 {
-   struct sysmmu_drvdata *data = dev_get_drvdata(dev->archdata.iommu);
-   unsigned long flags;
-   int i;
+   struct device *sysmmu;
 
-   BUG_ON((base0 + size0) <= base0);
-   BUG_ON((size1 > 0) && ((base1 + size1) <= base1));
+   for_each_sysmmu(dev, sysmmu) {
+   int i;
+   unsigned long flags;
+   struct sysmmu_drvdata *data = dev_get_drvdata(sysmmu);
 
-   spin_lock_irqsave(>lock, flags);
-   if (!is_sysmmu_active(data))
-   goto finish;
+   BUG_ON((base0 + size0) <= base0);
+   BUG_ON((size1 > 0) && ((base1 + size1) <= base1));
 
-   for (i = 0; i < data->nsfrs; i++) {
-   if (__sysmmu_version(data, i, NULL) == 3) {
-   if (!sysmmu_block(data->sfrbases[i]))
-   continue;
-
-   if (size1 == 0) {
-   if (size0 <= SZ_128K) {
-   base1 = base0;
-   size1 = size0;
-   } else {
-   size1 = size0 -
+   spin_lock_irqsave(>lock, flags);
+   if (!is_sysmmu_active(data)) {
+   spin_unlock_irqrestore(>lock, flags);
+   continue;
+   }
+
+   for (i = 0; i < data->nsfrs; i++) {
+   if (__sysmmu_version(data, i, NULL) == 3) {
+   if (!sysmmu_block(data->sfrbases[i]))
+   continue;
+
+   if (size1 == 0) {
+   if (size0 <= SZ_128K) {
+   base1 = base0;
+   size1 = size0;
+
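
The patch is cut off here in the archive. The parent-chain walk that
for_each_sysmmu() performs can be tried out in userspace with the sketch
below. The reduced struct device, the sysmmu_marker variable and
count_sysmmus() are hypothetical; the macros follow the shape of the ones
in the patch, under the assumption that a System MMU is recognised by its
iommu pointer being equal to the address of a placeholder object.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct device; iommu mimics dev->archdata.iommu. */
struct device {
	struct device *parent;
	void *iommu;
};

static int sysmmu_marker;	/* only its address matters */

#define is_sysmmu(d)	((d)->iommu == &sysmmu_marker)
#define has_sysmmu(d)	\
	((d)->parent && (d)->iommu && is_sysmmu((d)->parent))
#define for_each_sysmmu(dev, s)	\
	for ((s) = (dev)->parent; (s) && is_sysmmu(s); (s) = (s)->parent)

/* Count the System MMUs stacked directly above a client device. */
int count_sysmmus(struct device *dev)
{
	struct device *s;
	int n = 0;

	assert(dev != NULL);
	for_each_sysmmu(dev, s)
		n++;
	return n;
}
```

The walk stops at the first ancestor that is not a System MMU, so a
client with two System MMUs sees exactly two, however deep the rest of
the device tree is.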

Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Wen Congyang

At 11/21/2012 01:03 PM, Jaegeuk Hanse Wrote:
> On 11/21/2012 12:42 PM, Wen Congyang wrote:
>> At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote:
>>> On 11/21/2012 11:05 AM, Wen Congyang wrote:
 At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:
> On 11/01/2012 05:44 PM, Wen Congyang wrote:
>> From: Yasuaki Ishimatsu 
>>
>> Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
>> even if
>> we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.
>>
>> So the patch add unregister_memory_section() into __remove_section().
> Hi Yasuaki,
>
> I have a question about these sparse vmemmap memory related
> patches. Hot
> add memory need allocated vmemmap pages, but this time is allocated by
> buddy system. How can gurantee virtual address is continuous to the
> address allocated before? If not continuous, page_to_pfn and
> pfn_to_page
> can't work correctly.
 vmemmap has its virtual address range:
 ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

 We allocate memory from buddy system to store struct page, and its
 virtual
 address isn't in this range. So we should update the page table:

 kmalloc_section_memmap()
   sparse_mem_map_populate()
   pfn_to_page() // get the virtual address in the vmemmap range
   vmemmap_populate() // we update page table here

 When we use vmemmap, page_to_pfn() always returns address in the
 vmemmap
 range, not the address that kmalloc() returns. So the virtual address
 is continuous.
>>> Hi Congyang,
>>>
>>> Another question about memory hotplug. During hot remove memory, it will
>>> also call memblock_remove to remove related memblock.
>> IIRC, we don't touch memblock when hot-add/hot-remove memory. memblock is
>> only used for bootmem allocator. I think it isn't used after booting.
> 
> In IBM pseries servers.
> 
> pseries_remove_memory()
> pseries_remove_memblock()
> memblock_remove()
> 
> Furthermore, memblock is set to record available memory ranges get from
> e820 map(you can check it in memblock_x86_fill()) in x86 case, after
> hot-remove memory, this range of memory can't be available, why not
> remove them as pseries servers' codes do.

Oh, it is powerpc, and I haven't read this code. I will check it now.

Thanks for pointing it out.

Wen Congyang

> 
>>> memblock_remove()
>>> __memblock_remove()
>>>
>>> memblock_isolate_range()
>>> memblock_remove_region()
>>>
>>> But memblock_isolate_range() only record fully contained regions,
>>> regions which are partial overlapped just be splitted instead of record.
>>> So these partial overlapped regions can't be removed. Where I miss?
>> No, memblock_isolate_range() can deal with partial overlapped region.
>> =
>> if (rbase < base) {
>> /*
>>  * @rgn intersects from below.  Split and continue
>>  * to process the next region - the new top half.
>>  */
>> rgn->base = base;
>> rgn->size -= base - rbase;
>> type->total_size -= base - rbase;
>> memblock_insert_region(type, i, rbase, base - rbase,
>>memblock_get_region_node(rgn));
>> } else if (rend > end) {
>> /*
>>  * @rgn intersects from above.  Split and redo the
>>  * current region - the new bottom half.
>>  */
>> rgn->base = end;
>> rgn->size -= end - rbase;
>> type->total_size -= end - rbase;
>> memblock_insert_region(type, i--, rbase, end - rbase,
>>memblock_get_region_node(rgn));
>> =
>>
>> If the region is partial overlapped region, we will split the old
>> region into
>> two regions. After doing this, it is full contained region now.
> 
> You are right, I misunderstand the codes.
> 
>>
>> Thanks
>> Wen Congyang
>>
>>> Regards,
>>> Jaegeuk
>>>
 Thanks
 Wen Congyang
> Regards,
> Jaegeuk
>
>> CC: David Rientjes 
>> CC: Jiang Liu 
>> CC: Len Brown 
>> CC: Christoph Lameter 
>> Cc: Minchan Kim 
>> CC: Andrew Morton 
>> CC: KOSAKI Motohiro 
>> CC: Wen Congyang 
>> Signed-off-by: Yasuaki Ishimatsu 
>> ---
>> mm/memory_hotplug.c | 13 -
>> 1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index ca07433..66a79a7 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
>> struct zone *zone,
>> #ifdef CONFIG_SPARSEMEM_VMEMMAP
>> static int __remove_section(struct zone *zone, struct mem_section
>> *ms)
>>
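
The quoted patch is cut off here in the archive. The vmemmap point made
earlier in this thread (struct page addresses are contiguous because they
are computed from the PFN, not taken from the allocator) can be shown
with plain address arithmetic. The sketch below assumes the x86_64
vmemmap base of 0xffffea0000000000 used by kernels of that era and a
64-byte struct page; both constants and the two helper names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed start of the vmemmap window and assumed sizeof(struct page). */
#define VMEMMAP_START		0xffffea0000000000ULL
#define STRUCT_PAGE_SIZE	64ULL

/* Virtual address of the struct page describing page frame pfn. */
uint64_t pfn_to_page_addr(uint64_t pfn)
{
	return VMEMMAP_START + pfn * STRUCT_PAGE_SIZE;
}

/* Inverse mapping: recover the page frame number from that address. */
uint64_t page_addr_to_pfn(uint64_t addr)
{
	assert(addr >= VMEMMAP_START);
	return (addr - VMEMMAP_START) / STRUCT_PAGE_SIZE;
}
```

Whatever physical pages back the new part of the map, hot-add only has to
install page table entries for the computed addresses, which is the job
of vmemmap_populate() in the call chain quoted above.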

[PATCH v3 07/12] iommu/exynos: change rwlock to spinlock

2012-11-20 Thread Cho KyongHo

Since read_lock is not acquired more frequently than write_lock, there
is no benefit in using an rwlock. This commit changes the rwlock to a spinlock.

Change-Id: Ia3365ccec0744e735b71f0389e5c56a0243bcd2c
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 0bb194e..e39ddac 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -184,7 +184,7 @@ struct sysmmu_drvdata {
struct clk *clk;
int activations;
struct sysmmu_version ver;
-   rwlock_t lock;
+   spinlock_t lock;
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
@@ -288,7 +288,7 @@ void exynos_sysmmu_set_prefbuf(struct device *dev,
BUG_ON((base0 + size0) <= base0);
BUG_ON((size1 > 0) && ((base1 + size1) <= base1));
 
-   read_lock_irqsave(>lock, flags);
+   spin_lock_irqsave(>lock, flags);
if (!is_sysmmu_active(data))
goto finish;
 
@@ -318,7 +318,7 @@ void exynos_sysmmu_set_prefbuf(struct device *dev,
}
}
 finish:
-   read_unlock_irqrestore(>lock, flags);
+   spin_unlock_irqrestore(>lock, flags);
 }
 
 static void __set_fault_handler(struct sysmmu_drvdata *data,
@@ -326,9 +326,9 @@ static void __set_fault_handler(struct sysmmu_drvdata *data,
 {
unsigned long flags;
 
-   write_lock_irqsave(>lock, flags);
+   spin_lock_irqsave(>lock, flags);
data->fault_handler = handler;
-   write_unlock_irqrestore(>lock, flags);
+   spin_unlock_irqrestore(>lock, flags);
 }
 
 void exynos_sysmmu_set_fault_handler(struct device *dev,
@@ -376,7 +376,7 @@ static irqreturn_t exynos_sysmmu_irq(int irq, void *dev_id)
 
int i, ret = -ENOSYS;
 
-   read_lock(>lock);
+   spin_lock(>lock);
 
WARN_ON(!is_sysmmu_active(data));
 
@@ -420,7 +420,7 @@ static irqreturn_t exynos_sysmmu_irq(int irq, void *dev_id)
if (itype != SYSMMU_FAULT_UNKNOWN)
sysmmu_unblock(data->sfrbases[i]);
 
-   read_unlock(>lock);
+   spin_unlock(>lock);
 
return IRQ_HANDLED;
 }
@@ -431,7 +431,7 @@ static bool __exynos_sysmmu_disable(struct sysmmu_drvdata 
*data)
bool disabled = false;
int i;
 
-   write_lock_irqsave(>lock, flags);
+   spin_lock_irqsave(>lock, flags);
 
if (!set_sysmmu_inactive(data))
goto finish;
@@ -446,7 +446,7 @@ static bool __exynos_sysmmu_disable(struct sysmmu_drvdata 
*data)
data->pgtable = 0;
data->domain = NULL;
 finish:
-   write_unlock_irqrestore(>lock, flags);
+   spin_unlock_irqrestore(>lock, flags);
 
if (disabled)
dev_dbg(data->sysmmu, "Disabled\n");
@@ -469,7 +469,7 @@ static int __exynos_sysmmu_enable(struct sysmmu_drvdata *data,
int i, ret = 0;
unsigned long flags;
 
-   write_lock_irqsave(&data->lock, flags);
+   spin_lock_irqsave(&data->lock, flags);
 
if (!set_sysmmu_active(data)) {
if (WARN_ON(pgtable != data->pgtable)) {
@@ -506,7 +506,7 @@ static int __exynos_sysmmu_enable(struct sysmmu_drvdata *data,
 
dev_dbg(data->sysmmu, "Enabled\n");
 finish:
-   write_unlock_irqrestore(&data->lock, flags);
+   spin_unlock_irqrestore(&data->lock, flags);
 
return ret;
 }
@@ -553,7 +553,7 @@ static void sysmmu_tlb_invalidate_entry(struct device *dev, unsigned long iova)
unsigned long flags;
struct sysmmu_drvdata *data = dev_get_drvdata(dev->archdata.iommu);
 
-   read_lock_irqsave(&data->lock, flags);
+   spin_lock_irqsave(&data->lock, flags);
 
if (is_sysmmu_active(data)) {
int i;
@@ -569,7 +569,7 @@ static void sysmmu_tlb_invalidate_entry(struct device *dev, unsigned long iova)
"Disabled. Skipping invalidating TLB.\n");
}
 
-   read_unlock_irqrestore(&data->lock, flags);
+   spin_unlock_irqrestore(&data->lock, flags);
 }
 
 void exynos_sysmmu_tlb_invalidate(struct device *dev)
@@ -577,7 +577,7 @@ void exynos_sysmmu_tlb_invalidate(struct device *dev)
unsigned long flags;
struct sysmmu_drvdata *data = dev_get_drvdata(dev->archdata.iommu);
 
-   read_lock_irqsave(&data->lock, flags);
+   spin_lock_irqsave(&data->lock, flags);
 
if (is_sysmmu_active(data)) {
int i;
@@ -592,7 +592,7 @@ void exynos_sysmmu_tlb_invalidate(struct device *dev)
"Disabled. Skipping invalidating TLB.\n");
}
 
-   read_unlock_irqrestore(&data->lock, flags);
+   spin_unlock_irqrestore(&data->lock, flags);
 }
 
 static int __init __sysmmu_init_clock(struct device *sysmmu,
@@ -748,7 +748,7 @@ static int __init exynos_sysmmu_probe(struct platform_device *pdev)
ret = __sysmmu_setup(dev, data);
if (!ret) {
data->sysmmu = dev;
-

[PATCH v3 06/12] iommu/exynos: allocate lv2 page table from own slab

2012-11-20 Thread Cho KyongHo

kmalloc() does not guarantee 1KB alignment for a 1KB allocation, so
the lv2 page table must be allocated from a dedicated slab cache that
guarantees 1KB alignment.
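
The alignment requirement can be sketched in user space. This is only an illustration, not part of the patch: C11 aligned_alloc() stands in for the dedicated slab cache, LV2TABLE_SIZE is assumed to be 1KB as the commit message states, and the helper names lv2table_is_aligned() and lv2table_alloc() are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed from the commit message: a lv2 table is 1KB and must be 1KB-aligned. */
#define LV2TABLE_SIZE 1024u

/* Nonzero if p is aligned to LV2TABLE_SIZE (same mask test as the BUG_ON in the patch). */
static int lv2table_is_aligned(const void *p)
{
	return ((uintptr_t)p & (LV2TABLE_SIZE - 1)) == 0;
}

/* Hypothetical stand-in for kmem_cache_zalloc(): a zeroed, size-aligned table. */
static void *lv2table_alloc(void)
{
	void *p = aligned_alloc(LV2TABLE_SIZE, LV2TABLE_SIZE);

	if (p)
		memset(p, 0, LV2TABLE_SIZE);
	return p;
}
```

In the patch itself the same guarantee comes from passing LV2TABLE_SIZE as the alignment argument of kmem_cache_create().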

Change-Id: Ia25642c7c0143d2c50a8ed5a3d0dd9067f324c4e
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 4061b17..0bb194e 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -104,6 +104,8 @@
 #define REG_PB1_SADDR  0x054
 #define REG_PB1_EADDR  0x058
 
+static struct kmem_cache *lv2table_kmem_cache;
+
 static unsigned long *section_entry(unsigned long *pgtable, unsigned long iova)
 {
return pgtable + lv1ent_offset(iova);
@@ -865,7 +867,8 @@ static void exynos_iommu_domain_destroy(struct iommu_domain *domain)
 
for (i = 0; i < NUM_LV1ENTRIES; i++)
if (lv1ent_page(priv->pgtable + i))
-   kfree(__va(lv2table_base(priv->pgtable + i)));
+   kmem_cache_free(lv2table_kmem_cache,
+   __va(lv2table_base(priv->pgtable + i)));
 
free_pages((unsigned long)priv->pgtable, 2);
free_pages((unsigned long)priv->lv2entcnt, 1);
@@ -959,7 +962,7 @@ static unsigned long *alloc_lv2entry(unsigned long *sent, unsigned long iova,
if (lv1ent_fault(sent)) {
unsigned long *pent;
 
-   pent = kzalloc(LV2TABLE_SIZE, GFP_ATOMIC);
+   pent = kmem_cache_zalloc(lv2table_kmem_cache, GFP_ATOMIC);
BUG_ON((unsigned long)pent & (LV2TABLE_SIZE - 1));
if (!pent)
return NULL;
@@ -982,7 +985,7 @@ static int lv1set_section(unsigned long *sent, phys_addr_t paddr, short *pgcnt)
if (*pgcnt != NUM_LV2ENTRIES)
return -EADDRINUSE;
 
-   kfree(page_entry(sent, 0));
+   kmem_cache_free(lv2table_kmem_cache, page_entry(sent, 0));
 
*pgcnt = 0;
}
@@ -1168,10 +1171,23 @@ static int __init exynos_iommu_init(void)
 {
int ret;
 
+   lv2table_kmem_cache = kmem_cache_create("exynos-iommu-lv2table",
+   LV2TABLE_SIZE, LV2TABLE_SIZE, 0, NULL);
+   if (!lv2table_kmem_cache) {
+   pr_err("%s: failed to create kmem cache\n", __func__);
+   return -ENOMEM;
+   }
+
ret = platform_driver_register(&exynos_sysmmu_driver);
 
if (ret == 0)
-   bus_set_iommu(&platform_bus_type, &exynos_iommu_ops);
+   ret = bus_set_iommu(&platform_bus_type, &exynos_iommu_ops);
+
+   if (ret) {
+   pr_err("%s: Failed to register exynos-iommu driver.\n",
+   __func__);
+   kmem_cache_destroy(lv2table_kmem_cache);
+   }
 
return ret;
 }
-- 
1.8.0


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v3 05/12] iommu/exynos: pass version information from DT

2012-11-20 Thread Cho KyongHo

The System MMUs in some Exynos implementations do not report correct
version information. If the version information is incorrect, the
exynos-iommu driver cannot take advantage of features of higher System
MMU versions, such as prefetching page table entries prior to a TLB
miss.

This commit allows passing version information from DT to the driver.
If DT does not pass version information, the driver will read the
information from System MMU.
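
The fallback rule can be sketched in user space as follows. This is an illustration under stated assumptions, not the driver code: the bit layout is taken from the MMU_MAJ_VER/MMU_MIN_VER macros in the patch, the reg parameter stands in for the value the driver would read from REG_MMU_VERSION, and the function name sysmmu_version() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit layout as in the MMU_MAJ_VER/MMU_MIN_VER macros of the patch. */
static unsigned int mmu_maj_ver(uint32_t reg) { return reg >> 28; }
static unsigned int mmu_min_ver(uint32_t reg) { return (reg >> 21) & 0x7F; }

struct sysmmu_version {
	unsigned char major; /* 0 means "not provided by DT" */
	unsigned char minor;
};

/*
 * Hypothetical analogue of __sysmmu_version(): prefer the DT-provided
 * version, fall back to decoding the register value when major is 0.
 */
static unsigned int sysmmu_version(const struct sysmmu_version *dt,
				   uint32_t reg, unsigned int *minor)
{
	if (dt->major == 0) {
		if (minor)
			*minor = mmu_min_ver(reg);
		return mmu_maj_ver(reg);
	}
	if (minor)
		*minor = dt->minor;
	return dt->major;
}
```

Using small functions instead of macros avoids the unparenthesized macro argument; whether that matters for the driver is a style question for the patch author.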

Change-Id: I944e7a8f1402fdc0cb3ea45414a77b7079c8c84c
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/exynos-iommu.c | 40 ++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 53972c8..4061b17 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -96,6 +96,9 @@
 
 #define REG_MMU_VERSION0x034
 
+#define MMU_MAJ_VER(reg)   (reg >> 28)
+#define MMU_MIN_VER(reg)   ((reg >> 21) & 0x7F)
+
 #define REG_PB0_SADDR  0x04C
 #define REG_PB0_EADDR  0x050
 #define REG_PB1_SADDR  0x054
@@ -165,6 +168,12 @@ struct exynos_iommu_domain {
spinlock_t pgtablelock; /* lock for modifying page table @ pgtable */
 };
 
+struct sysmmu_version {
+   unsigned char major; /* major = 0 means that driver must use MMU_VERSION
+   register instead of this structure */
+   unsigned char minor;
+};
+
 struct sysmmu_drvdata {
struct list_head node; /* entry of exynos_iommu_domain.clients */
struct device *sysmmu;  /* System MMU's device descriptor */
@@ -172,6 +181,7 @@ struct sysmmu_drvdata {
int nsfrs;
struct clk *clk;
int activations;
+   struct sysmmu_version ver;
rwlock_t lock;
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
@@ -198,6 +208,25 @@ static bool is_sysmmu_active(struct sysmmu_drvdata *data)
return data->activations > 0;
 }
 
+static unsigned int __sysmmu_version(struct sysmmu_drvdata *drvdata,
+   int idx, unsigned int *minor)
+{
+   unsigned int major;
+
+   if (drvdata->ver.major == 0) {
+   major = readl(
+   drvdata->sfrbases[idx] + REG_MMU_VERSION);
+   if (minor)
+   *minor = MMU_MIN_VER(major);
+   major = MMU_MAJ_VER(major);
+   } else {
+   major = drvdata->ver.major;
+   if (minor)
+   *minor = drvdata->ver.minor;
+   }
+   return major;
+}
+
 static void sysmmu_unblock(void __iomem *sfrbase)
 {
__raw_writel(CTRL_ENABLE, sfrbase + REG_MMU_CTRL);
@@ -262,7 +291,7 @@ void exynos_sysmmu_set_prefbuf(struct device *dev,
goto finish;
 
for (i = 0; i < data->nsfrs; i++) {
-   if ((readl(data->sfrbases[i] + REG_MMU_VERSION) >> 28) == 3) {
+   if (__sysmmu_version(data, i, NULL) == 3) {
if (!sysmmu_block(data->sfrbases[i]))
continue;
 
@@ -460,7 +489,7 @@ static int __exynos_sysmmu_enable(struct sysmmu_drvdata *data,
for (i = 0; i < data->nsfrs; i++) {
__sysmmu_set_ptbase(data->sfrbases[i], pgtable);
 
-   if ((readl(data->sfrbases[i] + REG_MMU_VERSION) >> 28) == 3) {
+   if (__sysmmu_version(data, i, NULL)  == 3) {
/* System MMU version is 3.x */
__raw_writel((1 << 12) | (2 << 28),
data->sfrbases[i] + REG_MMU_CFG);
@@ -618,8 +647,15 @@ static int __init __sysmmu_setup(struct device *sysmmu,
const char *compat;
struct platform_device *pmaster = NULL;
u32 master_inst_no = -1;
+   u32 ver[2];
int ret;
 
+   if (!of_property_read_u32_array(sysmmu->of_node, "version", ver, 2)) {
+   drvdata->ver.major = (unsigned char)ver[0];
+   drvdata->ver.minor = (unsigned char)ver[1];
+   dev_dbg(sysmmu, "Found version %d.%d\n", ver[0], ver[1]);
+   }
+
master_node = of_parse_phandle(sysmmu->of_node, "mmu-master", 0);
if (!master_node && !of_property_read_string(
sysmmu->of_node, "mmu-master-compat", &compat)) {
-- 
1.8.0



[PATCH v3 04/12] iommu/exynos: support for device tree

2012-11-20 Thread Cho KyongHo

This commit adds device tree support for System MMU.

Change-Id: If695448af4bd7829ad1543814281dfa8ce1e7aae
Signed-off-by: KyongHo Cho 
---
 drivers/iommu/Kconfig|   2 +-
 drivers/iommu/exynos-iommu.c | 289 ++-
 2 files changed, 177 insertions(+), 114 deletions(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index e39f9db..64586f1 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -168,7 +168,7 @@ config TEGRA_IOMMU_SMMU
 
 config EXYNOS_IOMMU
bool "Exynos IOMMU Support"
-   depends on ARCH_EXYNOS && EXYNOS_DEV_SYSMMU
+   depends on ARCH_EXYNOS
select IOMMU_API
help
  Support for the IOMMU(System MMU) of Samsung Exynos application
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 7fe44f8..53972c8 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -1,6 +1,6 @@
-/* linux/drivers/iommu/exynos_iommu.c
+/* linux/drivers/iommu/exynos-iommu.c
  *
- * Copyright (c) 2011 Samsung Electronics Co., Ltd.
+ * Copyright (c) 2011-2012 Samsung Electronics Co., Ltd.
  * http://www.samsung.com
  *
  * This program is free software; you can redistribute it and/or modify
@@ -12,6 +12,7 @@
 #define DEBUG
 #endif
 
+#include 
 #include 
 #include 
 #include 
@@ -25,11 +26,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
-#include 
-
-#include 
 
 /* We does not consider super section mapping (16MB) */
 #define SECT_ORDER 20
@@ -169,15 +169,14 @@ struct sysmmu_drvdata {
struct list_head node; /* entry of exynos_iommu_domain.clients */
struct device *sysmmu;  /* System MMU's device descriptor */
struct device *dev; /* Owner of system MMU */
-   char *dbgname;
int nsfrs;
-   void __iomem **sfrbases;
-   struct clk *clk[2];
+   struct clk *clk;
int activations;
rwlock_t lock;
struct iommu_domain *domain;
sysmmu_fault_handler_t fault_handler;
unsigned long pgtable;
+   void __iomem *sfrbases[0];
 };
 
 static bool set_sysmmu_active(struct sysmmu_drvdata *data)
@@ -384,8 +383,8 @@ static irqreturn_t exynos_sysmmu_irq(int irq, void *dev_id)
if (!ret && (itype != SYSMMU_FAULT_UNKNOWN))
__raw_writel(1 << itype, data->sfrbases[i] + REG_INT_CLEAR);
else
-   dev_dbg(data->sysmmu, "(%s) %s is not handled.\n",
-   data->dbgname, sysmmu_fault_name[itype]);
+   dev_dbg(data->sysmmu, "%s is not handled.\n",
+   sysmmu_fault_name[itype]);
 
if (itype != SYSMMU_FAULT_UNKNOWN)
sysmmu_unblock(data->sfrbases[i]);
@@ -409,10 +408,8 @@ static bool __exynos_sysmmu_disable(struct sysmmu_drvdata *data)
for (i = 0; i < data->nsfrs; i++)
__raw_writel(CTRL_DISABLE, data->sfrbases[i] + REG_MMU_CTRL);
 
-   if (data->clk[1])
-   clk_disable(data->clk[1]);
-   if (data->clk[0])
-   clk_disable(data->clk[0]);
+   if (data->clk)
+   clk_disable(data->clk);
 
disabled = true;
data->pgtable = 0;
@@ -421,10 +418,10 @@ finish:
write_unlock_irqrestore(&data->lock, flags);
 
if (disabled)
-   dev_dbg(data->sysmmu, "(%s) Disabled\n", data->dbgname);
+   dev_dbg(data->sysmmu, "Disabled\n");
else
-   dev_dbg(data->sysmmu, "(%s) %d times left to be disabled\n",
-   data->dbgname, data->activations);
+   dev_dbg(data->sysmmu, "%d times left to be disabled\n",
+   data->activations);
 
return disabled;
 }
@@ -451,14 +448,12 @@ static int __exynos_sysmmu_enable(struct sysmmu_drvdata *data,
ret = 1;
}
 
-   dev_dbg(data->sysmmu, "(%s) Already enabled\n", data->dbgname);
+   dev_dbg(data->sysmmu, "Already enabled\n");
goto finish;
}
 
-   if (data->clk[0])
-   clk_enable(data->clk[0]);
-   if (data->clk[1])
-   clk_enable(data->clk[1]);
+   if (data->clk)
+   clk_enable(data->clk);
 
data->pgtable = pgtable;
 
@@ -478,7 +473,7 @@ static int __exynos_sysmmu_enable(struct sysmmu_drvdata *data,
 
data->domain = domain;
 
-   dev_dbg(data->sysmmu, "(%s) Enabled\n", data->dbgname);
+   dev_dbg(data->sysmmu, "Enabled\n");
 finish:
write_unlock_irqrestore(&data->lock, flags);
 
@@ -494,7 +489,7 @@ int exynos_sysmmu_enable(struct device *dev, unsigned long pgtable)
 
ret = pm_runtime_get_sync(data->sysmmu);
if (ret < 0) {
-   dev_dbg(data->sysmmu, "(%s) Failed to enable\n", data->dbgname);
+   dev_dbg(data->sysmmu, "Failed to enable\n");
return ret;
}
 
@@ -502,8 +497,8 @@ int

[PATCH v3 03/12] ARM: EXYNOS: add System MMU definition to DT

2012-11-20 Thread Cho KyongHo

This commit adds System MMU nodes to DT of Exynos SoCs.

Change-Id: I30ea7adcc9c0ded876618f372ed1a5c5e935ee20
Signed-off-by: KyongHo Cho 
---
 .../devicetree/bindings/arm/exynos/system-mmu.txt  |  86 
 arch/arm/boot/dts/exynos4210.dtsi  |  96 ++
 arch/arm/boot/dts/exynos4x12.dtsi  | 124 +
 arch/arm/boot/dts/exynos5250.dtsi  | 147 -
 4 files changed, 451 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/arm/exynos/system-mmu.txt

diff --git a/Documentation/devicetree/bindings/arm/exynos/system-mmu.txt b/Documentation/devicetree/bindings/arm/exynos/system-mmu.txt
new file mode 100644
index 000..9c30a36
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/exynos/system-mmu.txt
@@ -0,0 +1,86 @@
+* Samsung Exynos System MMU
+
+Samsung's Exynos architecture includes System MMU that enables scattered
+physical chunks to be visible as a contiguous region to DMA-capable peripheral
+devices like MFC, FIMC, FIMD, GScaler, FIMC-IS and so forth.
+
+System MMU is a sort of IOMMU and supports a translation table format identical to
+ARMv7 translation tables with a minimum set of page properties including access
+permissions, shareability and security protection. In addition, System MMU has
+other capabilities like L2 TLB or block-fetch buffers to minimize translation
+latency.
+
+Each System MMU is included in the H/W block of a peripheral device. Thus, it is
+important to specify that a System MMU is dedicated to which peripheral device
+before using System MMU. System initialization must specify the relationships
+between a System MMU and a peripheral device that owns the System MMU.
+
+Some device drivers may control several peripheral devices with a single device
+descriptor like MFC. Since handling a System MMU with IOMMU API requires a
+device descriptor that needs the System MMU, it is best to combine the System
+MMUs of the peripheral devices and control them with a single System MMU device
+descriptor. If it is unable to combine them into a single device descriptor,
+they can be linked with each other by the means of device.parent relationship.
+
+Required properties:
+- compatible: Should be "samsung,exynos-sysmmu".
+- reg: Tuples of base address and size of System MMU registers. The number of
+   tuples can be more than one if two or more System MMUs are controlled
+   by a single device descriptor.
+- interrupt-parent: The phandle of the interrupt controller of System MMU
+- interrupts: Tuples of numbers that indicates the interrupt source. The
+  number of elements in the tuple is dependent upon
+ 'interrupt-parent' property. The number of tuples in this property
+ must be the same with 'reg' property.
+
+Optional properties:
+- mmuname: Strings of the name of System MMU for debugging purpose. The number
+  of strings must be the same with the number of tuples in 'reg'
+  property.
+- mmu-master: phandle to the device node that owns System MMU. Only the device
+  that is specified with this property can control System MMU with
+  IOMMU API.
+
+Examples:
+
+MFC has 2 System MMUs for each port that MFC is attached. Thus it seems natural
+to define 2 System MMUs for each port of the MFC:
+
+   sysmmu-mfc-l {
+   mmuname = "mfc_l";
+   reg = <0x1121 0x1000>;
+   compatible = "samsung,exynos-sysmmu";
+   interrupt-parent = <>;
+   interrupts = <8 5>;
+   mmu-master = <>;
+   };
+
+   sysmmu-mfc-r {
+   mmuname = "mfc_r";
+   reg = <0x1120 0x1000>;
+   compatible = "samsung,exynos-sysmmu";
+   interrupt-parent = <>;
+   interrupts = <6 2>;
+   mmu-master = <>;
+   };
+
+Actually, MFC device driver requires sub-devices that represents each port and
+above 'mmu-master' properties of sysmmu-mfc-l and sysmmu-mfc-r have the phandles
+to those sub-devices.
+
+However, it is also a good idea that treats the above System MMUs as one System
+MMU because those System MMUs are actually required by the MFC device:
+
+   sysmmu-mfc {
+   mmuname = "mfc_l", "mfc_r";
+   reg = <0x1121 0x1000
+  0x1120 0x1000>;
+   compatible = "samsung,exynos-sysmmu";
+   interrupt-parent = <>;
+   interrupts = <8 5
+ 6 2>;
+   mmu-master = <>;
+   };
+
+If System MMU of MFC is defined like the above, the number of elements and the
+order of list in 'mmuname', 'reg' and 'interrupts' must be the same.
diff --git a/arch/arm/boot/dts/exynos4210.dtsi b/arch/arm/boot/dts/exynos4210.dtsi
index 939f639..d7a7a06 100644
--- a/arch/arm/boot/dts/exynos4210.dtsi
+++ b/arch/arm/boot/dts/exynos4210.dtsi
@@ -71,4 +71,100 @@

Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse


On 11/21/2012 12:42 PM, Wen Congyang wrote:

At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote:

On 11/21/2012 11:05 AM, Wen Congyang wrote:

At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

I have a question about these sparse vmemmap memory-related patches.
Hot-added memory needs vmemmap pages, but this time they are allocated by
the buddy system. How can we guarantee that the virtual address is contiguous
with the addresses allocated before? If it is not contiguous, page_to_pfn and
pfn_to_page can't work correctly.

vmemmap has its virtual address range:
ea00 - eaff (=40 bits) virtual memory map (1TB)

We allocate memory from the buddy system to store struct page, and its
virtual address isn't in this range. So we should update the page table:

kmalloc_section_memmap()
  sparse_mem_map_populate()
  pfn_to_page() // get the virtual address in the vmemmap range
  vmemmap_populate() // we update page table here

When we use vmemmap, pfn_to_page() always returns an address in the vmemmap
range, not the address that kmalloc() returns. So the virtual addresses
are contiguous.
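
The conversion under discussion is plain pointer arithmetic on a fixed base, which is why contiguity does not depend on where the backing pages were allocated. A minimal user-space sketch, assuming a static array in place of the kernel's fixed vmemmap virtual range and a hypothetical one-field struct page:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Stand-in for the fixed vmemmap virtual range (a constant base address in the kernel). */
static struct page vmemmap_area[1024];
static struct page *const vmemmap = vmemmap_area;

/* pfn -> struct page: index into the virtually contiguous array. */
static struct page *pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* struct page -> pfn: offset from the base of the array. */
static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

In the kernel, vmemmap_populate() is what maps physical pages behind the part of that range being used; the sketch skips that step because the array is already backed.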

Hi Congyang,

Another question about memory hotplug. Hot-removing memory will also
call memblock_remove() to remove the related memblock.

IIRC, we don't touch memblock when hot-add/hot-remove memory. memblock is
only used for bootmem allocator. I think it isn't used after booting.


In IBM pseries servers.

pseries_remove_memory()
pseries_remove_memblock()
memblock_remove()

Furthermore, on x86 memblock records the available memory ranges obtained
from the e820 map (you can check this in memblock_x86_fill()). After memory
is hot-removed, that range is no longer available, so why not remove it as
the pseries code does?



memblock_remove()
__memblock_remove()

memblock_isolate_range()
memblock_remove_region()

But memblock_isolate_range() only records fully contained regions;
partially overlapping regions are just split instead of being recorded.
So these partially overlapping regions can't be removed. What am I missing?

No, memblock_isolate_range() can deal with partially overlapping regions.
=
if (rbase < base) {
/*
 * @rgn intersects from below.  Split and continue
 * to process the next region - the new top half.
 */
rgn->base = base;
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
   memblock_get_region_node(rgn));
} else if (rend > end) {
/*
 * @rgn intersects from above.  Split and redo the
 * current region - the new bottom half.
 */
rgn->base = end;
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
   memblock_get_region_node(rgn));
=

If a region partially overlaps the range, we split the old region into
two regions. After that, the part inside the range is a fully contained region.
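
The split-then-record behaviour can be sketched in user space. This is a simplified illustration modelled on the quoted snippet, not the kernel function: the fixed-size array, the struct names and isolate_range() itself are assumptions, and node IDs and total_size accounting are left out.

```c
#include <stddef.h>

#define MAX_REGIONS 16

struct region {
	unsigned long base, size;
};

struct region_list {
	int cnt;
	struct region r[MAX_REGIONS];
};

/* Insert a region at idx, shifting later entries up. Caller guarantees capacity. */
static void insert_region(struct region_list *l, int idx,
			  unsigned long base, unsigned long size)
{
	int i;

	for (i = l->cnt; i > idx; i--)
		l->r[i] = l->r[i - 1];
	l->r[idx].base = base;
	l->r[idx].size = size;
	l->cnt++;
}

/*
 * Split regions straddling [base, end) so that every region overlapping
 * the range is fully contained; [*start, *end_idx) then indexes them.
 */
static void isolate_range(struct region_list *l, unsigned long base,
			  unsigned long end, int *start, int *end_idx)
{
	int i;

	*start = *end_idx = 0;
	for (i = 0; i < l->cnt; i++) {
		struct region *rgn = &l->r[i];
		unsigned long rbase = rgn->base;
		unsigned long rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;

		if (rbase < base) {
			/* Overlaps from below: split, the top half is visited next. */
			rgn->base = base;
			rgn->size -= base - rbase;
			insert_region(l, i, rbase, base - rbase);
		} else if (rend > end) {
			/* Overlaps from above: split and redo the bottom half. */
			rgn->base = end;
			rgn->size -= end - rbase;
			insert_region(l, i, rbase, end - rbase);
			i--;
		} else {
			/* Fully contained: record it. */
			if (*end_idx == 0)
				*start = i;
			*end_idx = i + 1;
		}
	}
}
```

Isolating [30, 60) from a single region [0, 100) leaves three regions, [0, 30), [30, 60) and [60, 100), with only the middle one recorded for removal.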


You are right, I misunderstood the code.



Thanks
Wen Congyang


Regards,
Jaegeuk


Thanks
Wen Congyang

Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
mm/memory_hotplug.c | 13 -
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
struct zone *zone,
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section
*ms)
{
-/*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- *  This should be removed later.
- */
-return -EBUSY;
+int ret = -EINVAL;
+
+if (!valid_section(ms))
+return ret;
+
+ret = unregister_memory_section(ms);
+
+return ret;
}
#else
static int __remove_section(struct zone *zone, struct mem_section
*ms)





[PATCH v3 02/12] ARM: EXYNOS: Add clk_ops for gating clocks of System MMU

2012-11-20 Thread Cho KyongHo

Accessing some System MMUs requires the clock of their master device to
be enabled first. This commit adds clk_ops.set_parent to the gating
clocks of System MMU so that the gating clock of a System MMU's master
device is enabled whenever the gating clock of the System MMU is enabled.
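
The intended effect can be sketched with a toy clock tree in user space. This is an illustration only: struct clk here is a hypothetical two-field structure, not the Samsung clock framework's, and the enable/disable helpers assume that enabling a clock first enables its parent, which is the behaviour the commit message relies on.

```c
#include <stddef.h>

/* Hypothetical, simplified clock; not the kernel's struct clk. */
struct clk {
	struct clk *parent;
	int usecount;
};

/* Same idea as exynos5_gate_clk_set_parent() in the patch: record the parent. */
static int gate_clk_set_parent(struct clk *clk, struct clk *parent)
{
	clk->parent = parent;
	return 0;
}

/* Assumed framework behaviour: enable the parent chain before the clock itself. */
static void clk_enable_sketch(struct clk *clk)
{
	if (!clk)
		return;
	clk_enable_sketch(clk->parent);
	clk->usecount++;
}

/* Disable in the reverse order: the clock first, then its parent chain. */
static void clk_disable_sketch(struct clk *clk)
{
	if (!clk)
		return;
	clk->usecount--;
	clk_disable_sketch(clk->parent);
}
```

Once the master device's gating clock is set as the parent, enabling the System MMU clock is enough to bring up both.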

Change-Id: Icd58b12f599e92692c032516331a444f4703ba6b
Signed-off-by: KyongHo Cho 
---
 arch/arm/mach-exynos/clock-exynos5.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/arch/arm/mach-exynos/clock-exynos5.c b/arch/arm/mach-exynos/clock-exynos5.c
index 9e815ae..9dfb845 100644
--- a/arch/arm/mach-exynos/clock-exynos5.c
+++ b/arch/arm/mach-exynos/clock-exynos5.c
@@ -613,6 +613,16 @@ static struct clksrc_clk exynos5_clk_aclk_300_gscl = {
.reg_src = { .reg = EXYNOS5_CLKSRC_TOP3, .shift = 10, .size = 1 },
 };
 
+static int exynos5_gate_clk_set_parent(struct clk *clk, struct clk *parent)
+{
+   clk->parent = parent;
+   return 0;
+}
+
+static struct clk_ops exynos5_gate_clk_ops = {
+   .set_parent = exynos5_gate_clk_set_parent
+};
+
 static struct clk exynos5_init_clocks_off[] = {
{
.name   = "timers",
@@ -854,76 +864,91 @@ static struct clk exynos5_init_clocks_off[] = {
.name   = "sysmmu",
.devname= "exynos-sysmmu.0",
.enable = _clk_ip_mfc_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 1),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.1",
.enable = _clk_ip_mfc_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 2),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.2",
.enable = _clk_ip_disp1_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 9)
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.3",
.enable = _clk_ip_gen_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 7),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.4",
.enable = _clk_ip_gen_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 6)
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.5",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 7),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.6",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 8),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.7",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 9),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.8",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 10),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.9",
.enable = _clk_ip_isp0_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (0x3F << 8),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.10",
.enable = _clk_ip_isp1_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (0xF << 4),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.11",
.enable = _clk_ip_disp1_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 8)
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.12",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 11),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.13",
.enable = _clk_ip_gscl_ctrl,
+   .ops= _gate_clk_ops,
.ctrlbit= (1 << 12),
}, {
.name   = "sysmmu",
.devname= "exynos-sysmmu.14",
.enable = _clk_ip_acp_ctrl,
+   .ops=

[PATCH v3 01/12] ARM: EXYNOS: remove system mmu initialization from exynos tree

2012-11-20 Thread Cho KyongHo

This removes System MMU initialization, except the gating clock
definitions, from arch/arm/mach-exynos/ and moves it to DT and the
exynos-iommu driver.

Change-Id: Ie29f587c01c645f28fc0e0b94eb3631a0170ebf5
Signed-off-by: KyongHo Cho 
---
 arch/arm/mach-exynos/Kconfig   |   5 -
 arch/arm/mach-exynos/Makefile  |   1 -
 arch/arm/mach-exynos/clock-exynos4.c   |  41 +++--
 arch/arm/mach-exynos/clock-exynos4210.c|   9 +-
 arch/arm/mach-exynos/clock-exynos4212.c|  23 ++-
 arch/arm/mach-exynos/clock-exynos5.c   |  62 ---
 arch/arm/mach-exynos/dev-sysmmu.c  | 274 -
 arch/arm/mach-exynos/include/mach/sysmmu.h |  66 ---
 arch/arm/mach-exynos/mach-exynos4-dt.c |  34 
 arch/arm/mach-exynos/mach-exynos5-dt.c |  30 
 10 files changed, 137 insertions(+), 408 deletions(-)
 delete mode 100644 arch/arm/mach-exynos/dev-sysmmu.c
 delete mode 100644 arch/arm/mach-exynos/include/mach/sysmmu.h

diff --git a/arch/arm/mach-exynos/Kconfig b/arch/arm/mach-exynos/Kconfig
index bb3b09a..d5157d7 100644
--- a/arch/arm/mach-exynos/Kconfig
+++ b/arch/arm/mach-exynos/Kconfig
@@ -94,11 +94,6 @@ config EXYNOS4_SETUP_FIMD0
help
  Common setup code for FIMD0.
 
-config EXYNOS_DEV_SYSMMU
-   bool
-   help
- Common setup code for SYSTEM MMU in EXYNOS platforms
-
 config EXYNOS4_DEV_DWMCI
bool
help
diff --git a/arch/arm/mach-exynos/Makefile b/arch/arm/mach-exynos/Makefile
index 1797dee..7460ba2 100644
--- a/arch/arm/mach-exynos/Makefile
+++ b/arch/arm/mach-exynos/Makefile
@@ -53,7 +53,6 @@ obj-$(CONFIG_EXYNOS4_DEV_AHCI)+= dev-ahci.o
 obj-$(CONFIG_EXYNOS4_DEV_DWMCI)+= dev-dwmci.o
 obj-$(CONFIG_EXYNOS_DEV_DMA)   += dma.o
 obj-$(CONFIG_EXYNOS4_DEV_USB_OHCI) += dev-ohci.o
-obj-$(CONFIG_EXYNOS_DEV_SYSMMU)+= dev-sysmmu.o
 
 obj-$(CONFIG_ARCH_EXYNOS)  += setup-i2c0.o
 obj-$(CONFIG_EXYNOS4_SETUP_FIMC)   += setup-fimc.o
diff --git a/arch/arm/mach-exynos/clock-exynos4.c b/arch/arm/mach-exynos/clock-exynos4.c
index efead60..c81a0ca 100644
--- a/arch/arm/mach-exynos/clock-exynos4.c
+++ b/arch/arm/mach-exynos/clock-exynos4.c
@@ -24,7 +24,6 @@
 
 #include 
 #include 
-#include 
 
 #include "common.h"
 #include "clock-exynos4.h"
@@ -709,53 +708,53 @@ static struct clk exynos4_init_clocks_off[] = {
.enable = exynos4_clk_ip_peril_ctrl,
.ctrlbit= (1 << 14),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(mfc_l, 0),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.0",
.enable = exynos4_clk_ip_mfc_ctrl,
.ctrlbit= (1 << 1),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(mfc_r, 1),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.1",
.enable = exynos4_clk_ip_mfc_ctrl,
.ctrlbit= (1 << 2),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(tv, 2),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.2",
.enable = exynos4_clk_ip_tv_ctrl,
.ctrlbit= (1 << 4),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(jpeg, 3),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.3",
.enable = exynos4_clk_ip_cam_ctrl,
.ctrlbit= (1 << 11),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(rot, 4),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.4",
.enable = exynos4_clk_ip_image_ctrl,
.ctrlbit= (1 << 4),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(fimc0, 5),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.5",
.enable = exynos4_clk_ip_cam_ctrl,
.ctrlbit= (1 << 7),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(fimc1, 6),
+   .name   = "sysmmu",
+   .devname= "exynos-sysmmu.6",
.enable = exynos4_clk_ip_cam_ctrl,
.ctrlbit= (1 << 8),
}, {
-   .name   = SYSMMU_CLOCK_NAME,
-   .devname= SYSMMU_CLOCK_DEVNAME(fimc2, 7),
+   .name   =

[PATCH v3 00/12] iommu/exynos: Fixes and Enhancements of System MMU driver with DT

2012-11-20 Thread Cho KyongHo

The current exynos-iommu (System MMU) driver does not work autonomously
because it lacks support for power management of peripheral blocks.
For example, the MFC device driver must ensure that its System MMU is
disabled before the MFC block is powered down, so that the IOTLB in the
System MMU is not invalidated when the I/O memory mapping is changed.
Because a System MMU resides in the same H/W block, access to the control
registers of the System MMU while the H/W block is turned off must be
prohibited.

This set of changes solves the above problem by setting each System MMU
as the parent of the device that owns it, so that the System MMU is
informed when the device is turned off or on.

Another big change to the driver is support for devicetree.
The bindings for System MMU are described in
Documentation/devicetree/bindings/arm/exynos/system-mmu.txt

In addition, this patchset also includes several bug fixes and enhancements
of the current driver.

Change log:
v2:
- Split the patch to iommu/exynos into 9 patches
- Support for System MMU 3.3
- Some code compaction

v3:
- Fix prefetch buffer flag definition for System MMU 3.3 (patch 10/12)
- Fix incorrect setting for SET_RUNTIME_PM_OPS (patch 09/12)
   Thanks to Prathyush.

Patch summary:
[PATCH v3 01/12] ARM: EXYNOS: remove system mmu initialization from exynos tree
[PATCH v3 02/12] ARM: EXYNOS: Add clk_ops for gating clocks of System MMU
[PATCH v3 03/12] ARM: EXYNOS: add System MMU definition to DT
[PATCH v3 04/12] iommu/exynos: support for device tree
[PATCH v3 05/12] iommu/exynos: pass version information from DT
[PATCH v3 06/12] iommu/exynos: allocate lv2 page table from own slab
[PATCH v3 07/12] iommu/exynos: change rwlock to spinlock
[PATCH v3 08/12] iommu/exynos: set System MMU as the parent of client device
[PATCH v3 09/12] iommu/exynos: add supoort for runtime pm and suspend/resume
[PATCH v3 10/12] iommu/exynos: add support for System MMU 3.2 and 3.3
[PATCH v3 11/12] iommu/exynos: add literal name of System MMU for debugging
[PATCH v3 12/12] iommu/exynos: add debugfs entries for System MMU


Diffstats:
 .../devicetree/bindings/arm/exynos/system-mmu.txt  |   86 ++
 arch/arm/boot/dts/exynos4210.dtsi  |   96 ++
 arch/arm/boot/dts/exynos4x12.dtsi  |  124 ++
 arch/arm/boot/dts/exynos5250.dtsi  |  147 +-
 arch/arm/mach-exynos/Kconfig   |5 -
 arch/arm/mach-exynos/Makefile  |1 -
 arch/arm/mach-exynos/clock-exynos4.c   |   41 +-
 arch/arm/mach-exynos/clock-exynos4210.c|9 +-
 arch/arm/mach-exynos/clock-exynos4212.c|   23 +-
 arch/arm/mach-exynos/clock-exynos5.c   |   87 +-
 arch/arm/mach-exynos/dev-sysmmu.c  |  274 
 arch/arm/mach-exynos/include/mach/sysmmu.h |   66 -
 arch/arm/mach-exynos/mach-exynos4-dt.c |   34 +
 arch/arm/mach-exynos/mach-exynos5-dt.c |   30 +
 drivers/iommu/Kconfig  |2 +-
 drivers/iommu/exynos-iommu.c   | 1424 +++-
 16 files changed, 1718 insertions(+), 731 deletions(-)


[PATCH v2 1/3] perf session: Free environment information when deleting session

2012-11-20 Thread Namhyung Kim

From: Namhyung Kim 

The perf session environment information is saved (and so allocated)
during perf_session__open, but was never freed.  As free(3) handles a
NULL pointer properly, this won't cause an issue for writing modes,
e.g. perf record.

Cc: Feng Tang 
Signed-off-by: Namhyung Kim 
---
 tools/perf/util/session.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index ce6f51162386..d5fb60760bac 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -204,11 +204,28 @@ static void perf_session__delete_threads(struct perf_session *session)
machine__delete_threads(&session->host_machine);
 }
 
+static void perf_session_env__delete(struct perf_session_env *env)
+{
+   free(env->hostname);
+   free(env->os_release);
+   free(env->version);
+   free(env->arch);
+   free(env->cpu_desc);
+   free(env->cpuid);
+
+   free(env->cmdline);
+   free(env->sibling_cores);
+   free(env->sibling_threads);
+   free(env->numa_nodes);
+   free(env->pmu_mappings);
+}
+
 void perf_session__delete(struct perf_session *self)
 {
perf_session__destroy_kernel_maps(self);
perf_session__delete_dead_threads(self);
perf_session__delete_threads(self);
+   perf_session_env__delete(&self->header.env);
machine__exit(&self->host_machine);
close(self->fd);
free(self);
-- 
1.7.11.7
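
For readers less familiar with the idiom: the teardown helper in the patch can call free() on every field unconditionally because free(NULL) is defined to do nothing. The following is a minimal userspace sketch of that pattern, not the perf code itself. `struct demo_env`, `demo_strdup()` and `demo_env_delete()` are hypothetical names, and the sketch also resets the pointers afterwards, which the patch does not do.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct perf_session_env. */
struct demo_env {
	char *hostname;
	char *arch;
	char *cmdline;
};

/* Duplicate a string with malloc(); returns NULL on allocation failure. */
static char *demo_strdup(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Release every string owned by env. Fields that were never filled in
 * are NULL, and free(NULL) is a no-op, so no per-field checks are needed.
 * The pointers are reset so that calling this twice is harmless.
 */
static void demo_env_delete(struct demo_env *env)
{
	free(env->hostname);
	free(env->arch);
	free(env->cmdline);

	env->hostname = NULL;
	env->arch = NULL;
	env->cmdline = NULL;
}
```

A caller that only ever filled in `hostname` can still pass the whole structure to `demo_env_delete()`; the untouched fields are simply skipped by free().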


Re: [PATCH V3 11/11] ARM: delete struct sys_timer

2012-11-20 Thread Haojian Zhuang

On Tue, Nov 20, 2012 at 2:31 AM, Stephen Warren  wrote:
> From: Stephen Warren 
>
> Now that the only field in struct sys_timer is .init, delete the struct,
> and replace the machine descriptor .timer field with the initialization
> function itself.
>
> This will enable moving timer drivers into drivers/clocksource without
> having to place a public prototype of each struct sys_timer object into
> include/linux; the intent is to create a single of_clocksource_init()
> function that determines which timer driver to initialize by scanning
> the device dtree, much like the proposed irqchip_init() at:
> http://www.spinics.net/lists/arm-kernel/msg203686.html
>
> Signed-off-by: Stephen Warren 
> Tested-by: Robert Jarzmik 
> ---
> v3: Minor merge conflicts due to rebasing onto next-20121115.
> v2: Converted all platforms, not just Tegra.
>
> The patch is very large, so I've trimmed it for the mailing list, leaving
> only the core ARM changes, changes outside arch/arm, and a single machine
> example. The full series can be found at:
>
> git://nv-tegra.nvidia.com/user/swarren/linux-2.6 arm_timer_rework
> ---
>  492 files changed, 622 insertions(+), 1199 deletions(-)

I checked the patch for mach-mmp.

@@ -69,7 +65,7 @@ static const char *mmp_dt_board_compat[] __initdata = {
 DT_MACHINE_START(PXA168_DT, "Marvell PXA168 (Device Tree Support)")
.map_io = mmp_map_io,
.init_irq   = mmp_dt_irq_init,
-   .timer  = &mmp_dt_timer,
+   .init_time  = mmp_dt_init_timer,
.init_machine   = pxa168_dt_init,
.dt_compat  = mmp_dt_board_compat,
 MACHINE_END
@@ -77,7 +73,7 @@ MACHINE_END
 DT_MACHINE_START(PXA910_DT, "Marvell PXA910 (Device Tree Support)")
.map_io = mmp_map_io,
.init_irq   = mmp_dt_irq_init,
-   .timer  = &mmp_dt_timer,
+   .init_time  = mmp_dt_timer_init,
.init_machine   = pxa910_dt_init,
.dt_compat  = mmp_dt_board_compat,
 MACHINE_END

The first init_time is set to mmp_dt_init_timer, but the second
init_time is set to mmp_dt_timer_init. I think it's a typo. Could you
fix this?

Regards
Haojian
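
To illustrate the conversion being reviewed: once the descriptor holds a plain function pointer instead of a pointer to a struct sys_timer, every board entry has to name an existing init function, and two boards that share a timer should name the same one. This is a reduced userspace sketch under that assumption; `struct demo_machine_desc`, `demo_dt_timer_init()` and `demo_time_init()` are hypothetical and only mirror the shape of the `.init_time` field shown in the hunks above.

```c
/* Hypothetical, heavily reduced machine descriptor. */
struct demo_machine_desc {
	const char *name;
	void (*init_time)(void);	/* replaces the old .timer pointer */
};

static int demo_timer_ready;

/* One timer init function shared by both boards. */
static void demo_dt_timer_init(void)
{
	demo_timer_ready = 1;
}

static const struct demo_machine_desc demo_pxa168 = {
	.name		= "demo PXA168",
	.init_time	= demo_dt_timer_init,
};

static const struct demo_machine_desc demo_pxa910 = {
	.name		= "demo PXA910",
	.init_time	= demo_dt_timer_init,
};

/* Boot code calls the hook directly if the board provides one. */
static void demo_time_init(const struct demo_machine_desc *desc)
{
	if (desc->init_time)
		desc->init_time();
}
```

In a sketch like this, spelling the function name differently in one of the two descriptors would be expected to fail at build time, since no such symbol is declared.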

[PATCH 2/2] extcon: max77693: Fix coding style

2012-11-20 Thread Sachin Kamat

As per kernel coding style, if one branch of a conditional statement has
braces, the other branch should have them too.

Signed-off-by: Sachin Kamat 
---
 drivers/extcon/extcon-max77693.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/extcon/extcon-max77693.c b/drivers/extcon/extcon-max77693.c
index 1da4ad4..8bf5e48 100644
--- a/drivers/extcon/extcon-max77693.c
+++ b/drivers/extcon/extcon-max77693.c
@@ -665,9 +665,9 @@ static int __devinit max77693_muic_probe(struct platform_device *pdev)
}
info->dev = &pdev->dev;
info->max77693 = max77693;
-   if (info->max77693->regmap_muic)
+   if (info->max77693->regmap_muic) {
dev_dbg(&pdev->dev, "allocate register map\n");
-   else {
+   } else {
info->max77693->regmap_muic = devm_regmap_init_i2c(
info->max77693->muic,
&max77693_muic_regmap_config);
-- 
1.7.4.1


[PATCH 1/2] extcon: max77693: Fix uninitialised variable warning

2012-11-20 Thread Sachin Kamat

Signed-off-by: Sachin Kamat 
---
Hi Chanwoo,

Please merge this patch with the previous one titled
"extcon: max77693: Use devm_kzalloc"
---
 drivers/extcon/extcon-max77693.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/extcon/extcon-max77693.c b/drivers/extcon/extcon-max77693.c
index 3c29bb7..1da4ad4 100644
--- a/drivers/extcon/extcon-max77693.c
+++ b/drivers/extcon/extcon-max77693.c
@@ -672,9 +672,10 @@ static int __devinit max77693_muic_probe(struct platform_device *pdev)
info->max77693->muic,
&max77693_muic_regmap_config);
if (IS_ERR(info->max77693->regmap_muic)) {
+   ret = PTR_ERR(info->max77693->regmap_muic);
dev_err(max77693->dev,
"failed to allocate register map: %d\n", ret);
-   return PTR_ERR(info->max77693->regmap_muic);
+   return ret;
}
}
platform_set_drvdata(pdev, info);
-- 
1.7.4.1
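
The diff above moves the PTR_ERR() call in front of dev_err() so that ret holds the error code before it is printed. The ordering can be sketched in userspace as follows. The `demo_*` helpers are simplified stand-ins written for this example (they imitate the kernel's error-pointer convention but are not the kernel macros), and `demo_regmap_init()` is a hypothetical function that always fails.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top DEMO_MAX_ERRNO addresses. */
static int demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical init function that always fails with -ENOMEM. */
static void *demo_regmap_init(void)
{
	return demo_err_ptr(-ENOMEM);
}

static int demo_last_logged;	/* stands in for the dev_err() output */

static int demo_probe(void)
{
	int ret;
	void *map = demo_regmap_init();

	if (demo_is_err(map)) {
		/* Assign first, then log: ret is never read uninitialised. */
		ret = (int)demo_ptr_err(map);
		demo_last_logged = ret;
		return ret;
	}
	return 0;
}
```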


[git pull] Please pull powerpc.git merge branch

2012-11-20 Thread Benjamin Herrenschmidt

Hi Linus !

Here are small 52xx fixes that Anatolij asked me to pull a while back
and that I completely missed. The stuff is local to that platform code,
and was in next for a while, so it should still go into 3.7.

Thanks,
Ben.

The following changes since commit 8c23f406c6d86808726ace580657186bc3b44587:

  Merge git://git.kernel.org/pub/scm/virt/kvm/kvm (2012-11-01 08:27:02 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

for you to fetch changes up to d6dc24613c222f9057131ccbd5264a10bcba9f97:

  Merge remote-tracking branch 'agust/merge' into merge (2012-11-21 13:24:49 +1100)



Anatolij Gustschin (1):
  powerpc/mpc5200: move lpbfifo node and fix its interrupt property

Benjamin Herrenschmidt (1):
  Merge remote-tracking branch 'agust/merge' into merge

Eric Millbrandt (1):
  powerpc/pcm030: add pcm030-audio-fabric to dts

Wolfram Sang (1):
  powerpc: 52xx: nop out unsupported critical IRQs

 arch/powerpc/boot/dts/mpc5200b.dtsi       |    6 ++
 arch/powerpc/boot/dts/o2d.dtsi            |    6 --
 arch/powerpc/boot/dts/pcm030.dts          |    7 ++-
 arch/powerpc/platforms/52xx/mpc52xx_pic.c |    9 +
 4 files changed, 17 insertions(+), 11 deletions(-)



Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Wen Congyang

At 11/21/2012 12:22 PM, Jaegeuk Hanse Wrote:
> On 11/21/2012 11:05 AM, Wen Congyang wrote:
>> At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:
>>> On 11/01/2012 05:44 PM, Wen Congyang wrote:
 From: Yasuaki Ishimatsu 

 Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
 even if
 we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

 So the patch add unregister_memory_section() into __remove_section().
>>> Hi Yasuaki,
>>>
>>> I have a question about these sparse vmemmap memory related patches. Hot-adding
>>> memory needs vmemmap pages to be allocated, but this time they are allocated by
>>> the buddy system. How can we guarantee that the virtual address is contiguous
>>> with the addresses allocated before? If it is not contiguous, page_to_pfn and
>>> pfn_to_page can't work correctly.
>> vmemmap has its virtual address range:
>> ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
>>
>> We allocate memory from buddy system to store struct page, and its
>> virtual
>> address isn't in this range. So we should update the page table:
>>
>> kmalloc_section_memmap()
>>  sparse_mem_map_populate()
>>  pfn_to_page() // get the virtual address in the vmemmap range
>>  vmemmap_populate() // we update page table here
>>
>> When we use vmemmap, pfn_to_page() always returns an address in the vmemmap
>> range, not the address that kmalloc() returns. So the virtual address
>> is contiguous.
> 
> Hi Congyang,
> 
> Another question about memory hotplug. During hot remove memory, it will
> also call memblock_remove to remove related memblock.

IIRC, we don't touch memblock when hot-adding or hot-removing memory. memblock
is only used by the bootmem allocator. I think it isn't used after booting.

> memblock_remove()
>__memblock_remove()
>memblock_isolate_range()
>memblock_remove_region()
> 
> But memblock_isolate_range() only records fully contained regions;
> regions which partially overlap are just split instead of being recorded.
> So these partially overlapping regions can't be removed. What am I missing?

No, memblock_isolate_range() can deal with partially overlapping regions.
=
	if (rbase < base) {
		/*
		 * @rgn intersects from below.  Split and continue
		 * to process the next region - the new top half.
		 */
		rgn->base = base;
		rgn->size -= base - rbase;
		type->total_size -= base - rbase;
		memblock_insert_region(type, i, rbase, base - rbase,
				       memblock_get_region_node(rgn));
	} else if (rend > end) {
		/*
		 * @rgn intersects from above.  Split and redo the
		 * current region - the new bottom half.
		 */
		rgn->base = end;
		rgn->size -= end - rbase;
		type->total_size -= end - rbase;
		memblock_insert_region(type, i--, rbase, end - rbase,
				       memblock_get_region_node(rgn));
=

If a region only partially overlaps the range, we split the old region into
two regions. After the split, the overlapping part is a fully contained region.
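
The splitting can be tried out in userspace with a small model. This is a simplified sketch that follows the structure of the excerpt above, not the kernel function: it uses a fixed-size array, drops node ids and total_size accounting, and all `demo_*` names are invented for the example.

```c
#include <string.h>

#define DEMO_MAX_REGIONS 16

struct demo_region {
	unsigned long base;
	unsigned long size;
};

/* Sorted, non-overlapping regions, as in a memblock type. */
struct demo_regions {
	int cnt;
	struct demo_region r[DEMO_MAX_REGIONS];
};

/* Insert [base, base + size) at index idx, shifting later entries up. */
static void demo_insert_region(struct demo_regions *t, int idx,
			       unsigned long base, unsigned long size)
{
	memmove(&t->r[idx + 1], &t->r[idx],
		(size_t)(t->cnt - idx) * sizeof(t->r[0]));
	t->r[idx].base = base;
	t->r[idx].size = size;
	t->cnt++;
}

/*
 * Split regions at base and base + size so that the range is covered
 * only by fully contained regions; report them as [*start, *end_idx).
 */
static int demo_isolate_range(struct demo_regions *t, unsigned long base,
			      unsigned long size, int *start, int *end_idx)
{
	unsigned long end = base + size;
	int i;

	*start = *end_idx = 0;
	if (!size)
		return 0;
	if (t->cnt + 2 > DEMO_MAX_REGIONS)
		return -1;	/* at most two new regions are created */

	for (i = 0; i < t->cnt; i++) {
		struct demo_region *rgn = &t->r[i];
		unsigned long rbase = rgn->base;
		unsigned long rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;

		if (rbase < base) {
			/* Overlaps from below: keep the top half, move on. */
			rgn->base = base;
			rgn->size -= base - rbase;
			demo_insert_region(t, i, rbase, base - rbase);
		} else if (rend > end) {
			/* Overlaps from above: split, then redo this index. */
			rgn->base = end;
			rgn->size -= end - rbase;
			demo_insert_region(t, i--, rbase, end - rbase);
		} else {
			/* Fully contained: record it. */
			if (!*end_idx)
				*start = i;
			*end_idx = i + 1;
		}
	}
	return 0;
}
```

With regions [0, 100) and [100, 200), isolating [50, 150) leaves four regions of 50 units each, and the two middle ones (indices 1 and 2) are reported as fully contained and could then be removed.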

Thanks
Wen Congyang

> 
> Regards,
> Jaegeuk
> 
>> Thanks
>> Wen Congyang
>>> Regards,
>>> Jaegeuk
>>>
 CC: David Rientjes 
 CC: Jiang Liu 
 CC: Len Brown 
 CC: Christoph Lameter 
 Cc: Minchan Kim 
 CC: Andrew Morton 
 CC: KOSAKI Motohiro 
 CC: Wen Congyang 
 Signed-off-by: Yasuaki Ishimatsu 
 ---
mm/memory_hotplug.c | 13 -
1 file changed, 8 insertions(+), 5 deletions(-)

 diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
 index ca07433..66a79a7 100644
 --- a/mm/memory_hotplug.c
 +++ b/mm/memory_hotplug.c
 @@ -286,11 +286,14 @@ static int __meminit __add_section(int nid,
 struct zone *zone,
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section
 *ms)
{
 -/*
 - * XXX: Freeing memmap with vmemmap is not implement yet.
 - *  This should be removed later.
 - */
 -return -EBUSY;
 +int ret = -EINVAL;
 +
 +if (!valid_section(ms))
 +return ret;
 +
 +ret = unregister_memory_section(ms);
 +
 +return ret;
}
#else
static int __remove_section(struct zone *zone, struct mem_section
 *ms)
>>>
> 
> 


[PATCH v5] Thermal: exynos: Add sysfs node supporting exynos's emulation mode.

2012-11-20 Thread Jonghwa Lee

This patch adds support for exynos's emulation mode through a newly created
sysfs node. Exynos 4x12 (4212, 4412) and 5 series provide an emulation mode for
the thermal management unit (TMU). Thermal emulation mode supports software
debugging of the TMU's operation. The user can set the temperature manually
from software, and the TMU will then read the current temperature from the
user-supplied value instead of the sensor's value.
This patch also includes documentation placed under Documentation/thermal/.

Signed-off-by: Jonghwa Lee 
---
v5
 - Rebase the patch at -next branch of zhang rui's git.
 - Fix EXYNOS_EMULATION_MODE Kconfig option as Amit's comment.
 - Show emulation temperature in millicelsius.
   Storing emulation temperature supports both celsius and mcelsius for input.

v4
 - Fix Typo.
 - Remove unnecessary codes.
 - Add comments about feature of exynos emulation operation to the document.

v3
 - Remove unnecessary variables.
 - Do some code cleanup in exynos_tmu_emulation_store().
 - Add a wrapper around the sysfs node creation function to keep
   #ifdefs to a minimum.

v2
 exynos_thermal.c
 - Fix build error caused by wrong emulation control register name.
 - Remove exynos5410 dependent codes.
 exynos_thermal_emulation
 - Align indentation.
 Documentation/thermal/exynos_thermal_emulation |   56 +
 drivers/thermal/Kconfig                        |    9 ++
 drivers/thermal/exynos_thermal.c               |  103
 3 files changed, 168 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/thermal/exynos_thermal_emulation

diff --git a/Documentation/thermal/exynos_thermal_emulation b/Documentation/thermal/exynos_thermal_emulation
new file mode 100644
index 000..bc9b057
--- /dev/null
+++ b/Documentation/thermal/exynos_thermal_emulation
@@ -0,0 +1,56 @@
+EXYNOS EMULATION MODE
+
+
+Copyright (C) 2012 Samsung Electronics
+
+Written by Jonghwa Lee 
+
+Description
+---
+
+Exynos 4x12 (4212, 4412) and 5 series provide emulation mode for thermal management unit.
+Thermal emulation mode supports software debug for TMU's operation. User can set temperature
+manually with software code and TMU will read current temperature from user value not from
+sensor's value.
+
+Enabling CONFIG_EXYNOS_THERMAL_EMUL option will make this support in available.
+When it's enabled, sysfs node will be created under
+/sys/bus/platform/devices/'exynos device name'/ with name of 'emulation'.
+
+The sysfs node, 'emulation', will contain value 0 for the initial state. When you input any
+temperature you want to update to sysfs node, it automatically enable emulation mode and
+current temperature will be changed into it.
+(Exynos also supports user changeable delay time which would be used to delay of
+ changing temperature. However, this node only uses same delay of real sensing time, 938us.)
+
+Exynos emulation mode requires synchronous of value changing and enabling. It means when you
+want to update the any value of delay or next temperature, then you have to enable emulation
+mode at the same time. (Or you have to keep the mode enabling.) If you don't, it fails to
+change the value to updated one and just use last successful value repeatedly. That's why
+this node gives users the right to change temperature only. Just one interface makes it more
+simply to use.
+
+Disabling emulation mode only requires writing value 0 to sysfs node.
+
+
+TEMP   120 |
+           |
+       100 |
+           |
+        80 |
+           |                         +-----------
+        60 |                         |          |
+           |           +-------------|          |
+        40 |           |             |          |
+           |           |             |          |
+        20 |           |             |          +----------
+           |           |             |          |          |
+         0 |___________|_____________|__________|__________|_________
+                       A             A          A          A     TIME
+                       |<----->|     |<----->|  |<----->|  |
+                       | 938us |     |       |  |       |  |
+emulation    :  0  50  |       70    |       20 |          0
+current temp :   sensor        50            70         20      sensor
+
+
+
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index d96da07..61ab206 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -101,6 +101,15 @@ config EXYNOS_THERMAL
  If you say yes here you get support for TMU (Thermal Managment
  Unit) on SAMSUNG EXYNOS series of SoC.
 
+config EXYNOS_THERMAL_EMUL
+   bool "EXYNOS TMU emulation mode support"
+   depends on EXYNOS_THERMAL
+   help
+ Exynos 4412 and 4414 and 5 series has emulation mode on TMU.
+ Enable this option will be make sysfs node in exynos thermal platform
+ device directory to support emulation mode. With emulation mode sysfs
+ node, you can manually input

Re: [PATCH 00/42] SH pin control and GPIO rework

2012-11-20 Thread Paul Mundt

On Wed, Nov 21, 2012 at 03:27:01AM +0100, Laurent Pinchart wrote:
> Hi everybody,
> 
> Here's a pretty large patch series that reworks pin control and GPIO support
> for SH and ARM SH/Renesas Mobile/Car platforms. The patches are based on top
> of v3.7-rc6. You can get them from my git tree at
> 
>   git://linuxtv.org/pinchartl/fbdev.git pinmux
> 
> The idea behind these patches is to move SoC-specific pin control code from
> arch/ to drivers/pinctrl/ and use the Linux device model to instantiate the
> pin control device. This is required to add device tree support for the pin
> control device.
> 
> The code has been compile-tested on all modified platforms except SH7264 and
> SH7269, and runtime tested on SH7372 (Mackerel), SH73A0 (KZM-A9-GT) and
> R8A7740 (Armadillo) so far. I will runtime test it on R8A7779 (Marzen).
> 
> The SH7264 and SH7269 platforms have no gpiolib support so the PFC code can't
> be compiled for them. As the currently implemented arch-level pinmux support
> also depends on generic GPIO, we're moving from a situation where the code
> isn't used to a different situation where the code isn't used. I don't
> consider that as a regression.
> 
> Sorry for the numerous checkpatch warnings, patches that move code around or
> rename files don't modify the content to make review easier, and thus carry
> warnings from the existing code.
> 
> Currently missing from this series are DT bindings. I will send patches for
> those a bit later. As they will build on top of this series I would appreciate
> reviews (and hopefully ack's).
> 
I've only given it a quick look, but in general it looks good!

For the series:

Acked-by: Paul Mundt 

Re: [PATCH] dt: add helper function to read u8 & u16 variables & arrays

2012-11-20 Thread Rob Herring

On 11/19/2012 10:45 PM, Viresh Kumar wrote:
> This adds following helper routines:
> - of_property_read_u8_array()
> - of_property_read_u16_array()
> - of_property_read_u8()
> - of_property_read_u16()
> 
> This expects arrays from DT to be passed as:
> - u8 array:
>   property = /bits/ 8 <0x50 0x60 0x70>;
> - u16 array:
>   property = /bits/ 16 <0x5000 0x6000 0x7000>;
> 
> Signed-off-by: Viresh Kumar 

Applied.

Rob
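
As a rough illustration of what the 16-bit helper in the quoted patch does: device tree property data is stored big-endian, so each element is converted on the way out, and the read is refused if the property holds fewer bytes than requested. The sketch below models this in userspace. `struct demo_property` and `demo_read_u16_array()` are hypothetical stand-ins, and the byte-wise decode replaces the kernel's be16_to_cpup().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct property: raw bytes plus length. */
struct demo_property {
	const void *value;
	size_t length;		/* in bytes */
};

/*
 * Read sz big-endian 16-bit values from prop into out.
 * Error codes mirror the ones documented in the quoted patch.
 */
static int demo_read_u16_array(const struct demo_property *prop,
			       uint16_t *out, size_t sz)
{
	const uint8_t *p;

	if (!prop)
		return -EINVAL;
	if (!prop->value)
		return -ENODATA;
	if (sz * sizeof(*out) > prop->length)
		return -EOVERFLOW;

	p = prop->value;
	while (sz--) {
		/* Most significant byte comes first in the blob. */
		*out++ = (uint16_t)((p[0] << 8) | p[1]);
		p += 2;
	}
	return 0;
}
```

For a property written as `/bits/ 16 <0x5000 0x6000 0x7000>` the blob would hold the bytes 50 00 60 00 70 00; asking for three elements succeeds, while asking for four returns -EOVERFLOW and leaves the output buffer untouched.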

> ---
> V2->V3:
> - Expect u8 & u16 arrays to be passed using: /bits/ 8 or 16
> - remove common macro, as not much common now :(
> - Tested on ARM platform.
> 
>  drivers/of/base.c  | 77 ++
>  include/linux/of.h | 30 +
>  2 files changed, 107 insertions(+)
> 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index af3b22a..f564e31 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -671,12 +671,89 @@ struct device_node *of_find_node_by_phandle(phandle handle)
>  EXPORT_SYMBOL(of_find_node_by_phandle);
>  
>  /**
> + * of_property_read_u8_array - Find and read an array of u8 from a property.
> + *
> + * @np:  device node from which the property value is to be read.
> + * @propname:name of the property to be searched.
> + * @out_value:   pointer to return value, modified only if return value is 0.
> + * @sz:  number of array elements to read
> + *
> + * Search for a property in a device node and read 8-bit value(s) from
> + * it. Returns 0 on success, -EINVAL if the property does not exist,
> + * -ENODATA if property does not have a value, and -EOVERFLOW if the
> + * property data isn't large enough.
> + *
> + * dts entry of array should be like:
> + *   property = /bits/ 8 <0x50 0x60 0x70>;
> + *
> + * The out_value is modified only if a valid u8 value can be decoded.
> + */
> +int of_property_read_u8_array(const struct device_node *np,
> + const char *propname, u8 *out_values, size_t sz)
> +{
> + struct property *prop = of_find_property(np, propname, NULL);
> + const u8 *val;
> +
> + if (!prop)
> + return -EINVAL;
> + if (!prop->value)
> + return -ENODATA;
> + if ((sz * sizeof(*out_values)) > prop->length)
> + return -EOVERFLOW;
> +
> + val = prop->value;
> + while (sz--)
> + *out_values++ = *val++;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(of_property_read_u8_array);
> +
> +/**
> + * of_property_read_u16_array - Find and read an array of u16 from a property.
> + *
> + * @np:  device node from which the property value is to be read.
> + * @propname:name of the property to be searched.
> + * @out_value:   pointer to return value, modified only if return value is 0.
> + * @sz:  number of array elements to read
> + *
> + * Search for a property in a device node and read 16-bit value(s) from
> + * it. Returns 0 on success, -EINVAL if the property does not exist,
> + * -ENODATA if property does not have a value, and -EOVERFLOW if the
> + * property data isn't large enough.
> + *
> + * dts entry of array should be like:
> + *   property = /bits/ 16 <0x5000 0x6000 0x7000>;
> + *
> + * The out_value is modified only if a valid u16 value can be decoded.
> + */
> +int of_property_read_u16_array(const struct device_node *np,
> + const char *propname, u16 *out_values, size_t sz)
> +{
> + struct property *prop = of_find_property(np, propname, NULL);
> + const __be16 *val;
> +
> + if (!prop)
> + return -EINVAL;
> + if (!prop->value)
> + return -ENODATA;
> + if ((sz * sizeof(*out_values)) > prop->length)
> + return -EOVERFLOW;
> +
> + val = prop->value;
> + while (sz--)
> + *out_values++ = be16_to_cpup(val++);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(of_property_read_u16_array);
> +
> +/**
>   * of_property_read_u32_array - Find and read an array of 32 bit integers
>   * from a property.
>   *
>   * @np:  device node from which the property value is to be read.
>   * @propname:name of the property to be searched.
>   * @out_value:   pointer to return value, modified only if return value is 0.
> + * @sz:  number of array elements to read
>   *
>   * Search for a property in a device node and read 32-bit value(s) from
>   * it. Returns 0 on success, -EINVAL if the property does not exist,
> diff --git a/include/linux/of.h b/include/linux/of.h
> index b4e50d5..bfdc130 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -223,6 +223,10 @@ extern struct device_node *of_find_node_with_property(
>  extern struct property *of_find_property(const struct device_node *np,
>const char *name,
>int *lenp);
> +extern int of_property_read_u8_array(const struct device_node *np,
> + const char

Re: [PATCH] vhost-blk: Add vhost-blk support v5

2012-11-20 Thread Asias He

On 11/20/2012 09:37 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 20, 2012 at 02:39:40PM +0800, Asias He wrote:
>> On 11/20/2012 04:26 AM, Michael S. Tsirkin wrote:
>>> On Mon, Nov 19, 2012 at 04:53:42PM +0800, Asias He wrote:
 vhost-blk is an in-kernel virtio-blk device accelerator.

 Due to the lack of a proper in-kernel AIO interface, this version converts
 the guest's I/O requests to bios and uses submit_bio() to submit I/O directly.
 So this version only supports raw block devices as the guest's disk image,
 e.g. /dev/sda, /dev/ram0. We can add file based image support to
 vhost-blk once we have an in-kernel AIO interface. There is some work in
 progress on an in-kernel AIO interface from Dave Kleikamp and Zach Brown:

http://marc.info/?l=linux-fsdevel&m=133312234313122

 Performance evaluation:
 -
 1) LKVM
 Fio with libaio ioengine on Fusion IO device using kvm tool
 IOPS(k)Before   After   Improvement
 seq-read   107  121 +13.0%
 seq-write  130  179 +37.6%
 rnd-read   102  122 +19.6%
 rnd-write  125  159 +27.0%

 2) QEMU
 Fio with libaio ioengine on Fusion IO device using QEMU
 IOPS(k)Before   After   Improvement
 seq-read   76   123 +61.8%
 seq-write  139  173 +24.4%
 rnd-read   73   120 +64.3%
 rnd-write  75   156 +108.0%
>>>
>>> Could you compare with dataplane qemu as well please?
>>
>>
>> Well, I will try to collect it.
>>
>>>

 Userspace bits:
 -
 1) LKVM
 The latest vhost-blk userspace bits for kvm tool can be found here:
 g...@github.com:asias/linux-kvm.git blk.vhost-blk

 2) QEMU
 The latest vhost-blk userspace prototype for QEMU can be found here:
 g...@github.com:asias/qemu.git blk.vhost-blk

 Changes in v5:
 - Do not assume the buffer layout
 - Fix wakeup race

 Changes in v4:
 - Mark req->status as userspace pointer
 - Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
 - Add if (need_resched()) schedule() in blk thread
 - Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
 - Use vq_err() instead of pr_warn()
 - Fail on unsupported requests
 - Add flush in vhost_blk_set_features()

 Changes in v3:
 - Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
 - Check file passed by user is a raw block device file

 Signed-off-by: Asias He 
>>>
>>> Since there are files shared by this and vhost net
>>> it's easiest for me to merge this all through the
>>> vhost tree.
>>>
>>> Jens, could you ack this and the bio usage in this driver
>>> please?
>>>
 ---
  drivers/vhost/Kconfig |   1 +
  drivers/vhost/Kconfig.blk |  10 +
  drivers/vhost/Makefile|   2 +
  drivers/vhost/blk.c   | 697 
 ++
  drivers/vhost/blk.h   |   8 +
  5 files changed, 718 insertions(+)
  create mode 100644 drivers/vhost/Kconfig.blk
  create mode 100644 drivers/vhost/blk.c
  create mode 100644 drivers/vhost/blk.h

 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 index 202bba6..acd8038 100644
 --- a/drivers/vhost/Kconfig
 +++ b/drivers/vhost/Kconfig
 @@ -11,4 +11,5 @@ config VHOST_NET
  
  if STAGING
  source "drivers/vhost/Kconfig.tcm"
 +source "drivers/vhost/Kconfig.blk"
  endif
 diff --git a/drivers/vhost/Kconfig.blk b/drivers/vhost/Kconfig.blk
 new file mode 100644
 index 000..ff8ab76
 --- /dev/null
 +++ b/drivers/vhost/Kconfig.blk
 @@ -0,0 +1,10 @@
 +config VHOST_BLK
 +  tristate "Host kernel accelerator for virtio blk (EXPERIMENTAL)"
 +  depends on BLOCK &&  EXPERIMENTAL && m
 +  ---help---
 +This kernel module can be loaded in host kernel to accelerate
 +guest block with virtio_blk. Not to be confused with virtio_blk
 +module itself which needs to be loaded in guest kernel.
 +
 +To compile this driver as a module, choose M here: the module will
 +be called vhost_blk.
 diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
 index a27b053..1a8a4a5 100644
 --- a/drivers/vhost/Makefile
 +++ b/drivers/vhost/Makefile
 @@ -2,3 +2,5 @@ obj-$(CONFIG_VHOST_NET) += vhost_net.o
  vhost_net-y := vhost.o net.o
  
  obj-$(CONFIG_TCM_VHOST) += tcm_vhost.o
 +obj-$(CONFIG_VHOST_BLK) += vhost_blk.o
 +vhost_blk-y := blk.o
 diff --git a/drivers/vhost/blk.c b/drivers/vhost/blk.c
 new file mode 100644
 index 000..f0f118a
 --- /dev/null
 +++ b/drivers/vhost/blk.c
 @@ -0,0 +1,697 @@
 +/*
 + * Copyright (C) 2011 Taobao, Inc.
 + * Author: Liu Yuan 
 + *
 + * Copyright (C) 2012 Red Hat, Inc.
 + * Author:

Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences

2012-11-20 Thread Alex Courbot

Hi Grant,

On Wednesday 21 November 2012 05:54:29 Grant Likely wrote:
> > With the advent of the device tree and of ARM kernels that are not
> > board-tied, we cannot rely on these board-specific hooks anymore but
> 
> This isn't strictly true. It is still perfectly fine to have board
> specific code when necessary. However, there is strong encouragement to
> enable that code in device drivers as much as possible and new board
> files need to have very strong justification.

But doesn't introducing board-specific code into the kernel just defeat the 
purpose of the DT? If we extend this logic, we are heading straight back to 
board-definition files. To a lesser extent than before, I agree, but the problem 
is fundamentally the same.

> > need a way to implement these sequences in a portable manner. This patch
> > introduces a simple interpreter that can execute such power sequences
> > encoded either as platform data or within the device tree. It also
> > introduces first support for regulator, GPIO and PWM resources.
> 
> This is where I start getting nervous. Up to now I've strongly resisted
> adding any kind of interpreted code to the device tree. The model is to
> identify hardware, but require the driver to know how to control it. (as
> compared to ACPI which is entirely designed around executable
> bytecode).
> 
> While the power sequences described here certainly cannot be confused
> with a Turing complete bytecode, it is a step in that direction.

Technically speaking, power sequences are a step towards an interpreter, but it 
is a very small one and it should not go much further than the current state. 
I understand the concern about having "code" in the DT, but I really think it 
should be viewed from a different angle.

Powering sequences are special in that they can be affected by the board design 
or by device variations. For instance, hundreds of different panels with 
backlights are currently compatible with the pwm-backlight driver. The only 
thing that differentiates them is how the backlight is powered on and off. If 
you are to build a kernel that is supposed to support all these panels, you 
would need to embed all the powering sequences in the kernel even though only 
one of them will be used by one specific board. Power sequences in the DT help 
prevent that.

With that stated, it is clear that we should not need to define more than the 
short, simple sequences of actions that cannot be elegantly handled by the 
driver. Anything beyond that should be handled by the driver itself. In 
particular, here are a few things I do *not* want to see included in power 
seqs:

- conditionals/jumps (or it's not a sequence anymore).
- direct access to hardware. Resources must at least be abstracted in some 
way. You shall not e.g. access the address space directly.
- support for non-power related resources - that is out of the special case of 
powering sequences and should be done by the driver

That should keep the "grammar" simple, and the sequences short enough that 
we can consider them as data belonging to the device, and not as code that is 
interpreted.

> I think this will get very verbose in a hurry. Already this simple
> example is 45 lines long. Using the device tree structure to encode the
> language doesn't look like a very good fit. Not to mention that the
> order of operations is entirely based on the node name. Want to insert
> an operation between step0 and step1? Need to rename step1, step2, and
> step3 to do so.

I don't like that step numbering thing either, but it seems to be the best 
way to do it so far.

As for the DT structure not being adapted for this - I would agree if we 
wanted to implement a complete interpreter, but that's precisely not the case. 
More about this later.

> This implementation also isn't very consistent. The gpio is referenced
> with a phandle in step3/step0, but the regulator and pwm are referenced
> by id.

Tomi made the same remark - the reason for using the phandle for GPIOs is 
that the GPIO framework does not support referencing GPIOs by name yet. I 
wanted the DT bindings to reflect the underlying framework as much as possible 
until we have a function like gpio_get(device, id).

However I agree that this makes things inconsistent at the moment and would 
require a bindings change. And in the case of the DT this is actually easy to 
implement (I did it in some previous versions). I'll make sure to do it.

> As an alternative, what about something like the following?
> 
>   backlight {
>   compatible = "pwm-backlight";
>   ...
> 
>   /* resources used by the power sequences */
>   pwms = < 2 500>;
>   pwm-names = "backlight";
>   regulators = <_reg>;
>   gpios = < 28 0>;
> 
>   power-on-sequence = "r0e;d1m;p0e;g0s";
>   power-off-sequence = "g0c;p0d;d1m;r0d";
>   };

Well, *now* it really looks like bytecode. :)
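
To illustrate why, here is a rough user-space sketch of what decoding such a
string would take. The meaning given to each letter is only my guess from the
example (r/p/g/d for regulator, PWM, GPIO and delay; e/d/s/c/m for enable,
disable, set, clear and milliseconds); nothing in the proposal defines it.
struct pseq_step, parse_step() and parse_sequence() are hypothetical names.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical decoded form of one step such as "r0e" or "d1m". */
struct pseq_step {
	char type;		/* assumed: 'r' regulator, 'p' pwm, 'g' gpio, 'd' delay */
	unsigned long arg;	/* resource index, or the delay value for 'd' */
	char action;		/* assumed: 'e', 'd', 's', 'c' or 'm' (see above) */
};

/*
 * Decode one step starting at *s, then advance *s past it and past a
 * trailing ';' if there is one. Returns 0 on success, -1 on bad input.
 */
static int parse_step(const char **s, struct pseq_step *step)
{
	const char *p = *s;
	char *end;

	if (!isalpha((unsigned char)p[0]) || !isdigit((unsigned char)p[1]))
		return -1;
	step->type = p[0];
	step->arg = strtoul(p + 1, &end, 10);
	if (!isalpha((unsigned char)*end))
		return -1;
	step->action = *end++;
	if (*end == ';')
		end++;
	*s = end;
	return 0;
}

/* Decode a whole sequence; returns the number of steps or -1 on error. */
static int parse_sequence(const char *s, struct pseq_step *steps, int max)
{
	int n = 0;

	while (*s) {
		if (n == max || parse_step(&s, &steps[n]))
			return -1;
		n++;
	}
	return n;
}
```

Even this small decoder is a tokenizer plus a dispatch on opcode letters, which
is the kind of interpreter I would like to keep out of the bindings.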

>

Re: [PATCH v3 06/12] memory-hotplug: unregister memory section on SPARSEMEM_VMEMMAP

2012-11-20 Thread Jaegeuk Hanse


On 11/21/2012 11:05 AM, Wen Congyang wrote:

At 11/20/2012 07:16 PM, Jaegeuk Hanse Wrote:

On 11/01/2012 05:44 PM, Wen Congyang wrote:

From: Yasuaki Ishimatsu 

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But
even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch adds unregister_memory_section() into __remove_section().

Hi Yasuaki,

I have a question about these sparse vmemmap memory related patches. Hot-added
memory needs vmemmap pages to be allocated, but at this point they are
allocated by the buddy system. How can we guarantee that their virtual address
is contiguous with the addresses allocated before? If it is not contiguous,
page_to_pfn and pfn_to_page can't work correctly.

vmemmap has its virtual address range:
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

We allocate memory from buddy system to store struct page, and its virtual
address isn't in this range. So we should update the page table:

kmalloc_section_memmap()
 sparse_mem_map_populate()
 pfn_to_page() // get the virtual address in the vmemmap range
 vmemmap_populate() // we update page table here

When we use vmemmap, pfn_to_page() always returns an address in the vmemmap
range, not the address that kmalloc() returns. So the virtual address
is contiguous.
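
The arithmetic behind this can be sketched in user space as follows. This is
only a model: VMEMMAP_START is the x86_64 base of that time, STRUCT_PAGE_SIZE
is an assumed sizeof(struct page), and the model_*() helpers are hypothetical
names, not the kernel macros.

```c
#include <stdint.h>

/* Assumed values for illustration; the real ones are architecture specific. */
#define VMEMMAP_START		0xffffea0000000000ULL
#define STRUCT_PAGE_SIZE	64ULL	/* assumed sizeof(struct page) */

/* Model of pfn_to_page(): the result depends only on the pfn. */
static uint64_t model_pfn_to_page(uint64_t pfn)
{
	return VMEMMAP_START + pfn * STRUCT_PAGE_SIZE;
}

/* Model of page_to_pfn(): the inverse of the function above. */
static uint64_t model_page_to_pfn(uint64_t page_addr)
{
	return (page_addr - VMEMMAP_START) / STRUCT_PAGE_SIZE;
}
```

Where the backing memory physically comes from never enters the calculation;
it only matters when the page table for that part of the range is populated.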


Hi Congyang,

Another question about memory hotplug. During memory hot remove,
memblock_remove() is also called to remove the related memblock.

memblock_remove()
   __memblock_remove()
   memblock_isolate_range()
   memblock_remove_region()

But memblock_isolate_range() only records fully contained regions;
regions that partially overlap are just split instead of being recorded.
So it seems these partially overlapping regions can't be removed. What am I missing?
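
For reference, below is a simplified user-space model of the isolate step,
written from my reading of mm/memblock.c and not copied from it. The struct
and function names are hypothetical and the array is a fixed size. If the
model is faithful, a split region is looked at again on the following
iteration, and the piece that lies inside the range is then fully contained.

```c
#include <string.h>

#define MAX_REGIONS 16

struct region { unsigned long base, size; };

struct region_list {
	int cnt;
	struct region regions[MAX_REGIONS];
};

/* Insert a region at idx, shifting later entries up. Caller ensures room. */
static void insert_region(struct region_list *t, int idx,
			  unsigned long base, unsigned long size)
{
	memmove(&t->regions[idx + 1], &t->regions[idx],
		(t->cnt - idx) * sizeof(t->regions[0]));
	t->regions[idx].base = base;
	t->regions[idx].size = size;
	t->cnt++;
}

/*
 * Split regions at base and base + size so that the range is covered only
 * by whole regions, and report those as indexes [*start_rgn, *end_rgn).
 */
static void isolate_range(struct region_list *t, unsigned long base,
			  unsigned long size, int *start_rgn, int *end_rgn)
{
	unsigned long end = base + size;
	int i;

	*start_rgn = *end_rgn = 0;
	for (i = 0; i < t->cnt; i++) {
		struct region *rgn = &t->regions[i];
		unsigned long rbase = rgn->base;
		unsigned long rend = rbase + rgn->size;

		if (rbase >= end)
			break;
		if (rend <= base)
			continue;

		if (rbase < base) {
			/* Overlaps from below: split off the lower part; the
			 * upper part is examined on the next iteration. */
			rgn->base = base;
			rgn->size -= base - rbase;
			insert_region(t, i, rbase, base - rbase);
		} else if (rend > end) {
			/* Overlaps from above: split off the lower part and
			 * redo this index, where it is now fully contained. */
			rgn->base = end;
			rgn->size -= end - rbase;
			insert_region(t, i--, rbase, end - rbase);
		} else {
			/* Fully contained: record it. */
			if (!*end_rgn)
				*start_rgn = i;
			*end_rgn = i + 1;
		}
	}
}

/* Remove [base, base + size): isolate it, then drop the recorded regions. */
static void remove_range(struct region_list *t, unsigned long base,
			 unsigned long size)
{
	int start, end, i;

	isolate_range(t, base, size, &start, &end);
	for (i = end - 1; i >= start; i--) {
		memmove(&t->regions[i], &t->regions[i + 1],
			(t->cnt - (i + 1)) * sizeof(t->regions[0]));
		t->cnt--;
	}
}
```

In this model, removing 20..50 from a single region 0..100 leaves 0..20 and
50..100, so the partially overlapping region does lose its overlapping part.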


Regards,
Jaegeuk


Thanks
Wen Congyang

Regards,
Jaegeuk


CC: David Rientjes 
CC: Jiang Liu 
CC: Len Brown 
CC: Christoph Lameter 
Cc: Minchan Kim 
CC: Andrew Morton 
CC: KOSAKI Motohiro 
CC: Wen Congyang 
Signed-off-by: Yasuaki Ishimatsu 
---
   mm/memory_hotplug.c | 13 ++++++++-----
   1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca07433..66a79a7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -286,11 +286,14 @@ static int __meminit __add_section(int nid, struct zone *zone,
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static int __remove_section(struct zone *zone, struct mem_section *ms)
 {
-	/*
-	 * XXX: Freeing memmap with vmemmap is not implement yet.
-	 *  This should be removed later.
-	 */
-	return -EBUSY;
+	int ret = -EINVAL;
+
+	if (!valid_section(ms))
+		return ret;
+
+	ret = unregister_memory_section(ms);
+
+	return ret;
 }
 #else
 static int __remove_section(struct zone *zone, struct mem_section *ms)




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] mm: trace filemap add and del

2012-11-20 Thread Dave Chinner

On Tue, Nov 20, 2012 at 03:57:35PM -0800, Andrew Morton wrote:
> On Thu,  8 Nov 2012 20:54:10 +0100
> Robert Jarzmik  wrote:
.
> > +   __field(dev_t, s_dev)
> 
> Perhaps use super_block.s_id here
> 
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->page = page;
> > +   __entry->i_no = page->mapping->host->i_ino;
> > +   __entry->pageofs = page->index;
> > +   if (page->mapping->host->i_sb)
> > +   __entry->s_dev = page->mapping->host->i_sb->s_dev;
> > +   else
> > +   __entry->s_dev = page->mapping->host->i_rdev;
> 
> and hence avoid all this stuff.

We actually have an informal convention for formatting filesystem
trace events, and that is to use the device number

> 
> > +   ),
> > +
> > +   TP_printk("page=%p pfn=%lu blk=%d:%d inode+ofs=%lu+%lu",

... and to prefix messages like:

TP_printk("dev %d:%d ino 0x%llx 
  MAJOR(__entry->dev), MINOR(__entry->dev),

i.e. the start of the event message has all the identifying
information where it is easy to grep for and get all the events for
a specific dev/inode combination without even having to think about
it.

XFS, ext3/4, jbd/jbd2 and gfs2 follow this convention, so we should
keep propagating that pattern in the name of consistency, rather
than having different trace formats for different parts of the
VFS/FS layers...
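
As a rough user-space illustration of that prefix (not the tracepoint code
itself), the sketch below formats the same leading fields. It assumes glibc's
major()/minor() from <sys/sysmacros.h> in place of the kernel's MAJOR()/MINOR(),
and format_event_prefix() is a hypothetical helper name.

```c
#include <stdio.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Format the identifying prefix of an event as "dev MAJ:MIN ino 0x...". */
static int format_event_prefix(char *buf, size_t len, dev_t dev,
			       unsigned long long ino)
{
	return snprintf(buf, len, "dev %u:%u ino 0x%llx",
			major(dev), minor(dev), ino);
}
```

With every event starting this way, a single grep for "dev 8:1 ino 0x1234"
pulls out everything logged for that inode on that device.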

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [patch] mm, memcg: avoid unnecessary function call when memcg is disabled fix

2012-11-20 Thread Kamezawa Hiroyuki


(2012/11/21 11:48), David Rientjes wrote:

Move the check for !mm out of line as suggested by Andrew.

Signed-off-by: David Rientjes 


Thank you very much !

Acked-by: KAMEZAWA Hiroyuki 



---
 include/linux/memcontrol.h |    2 +-
 mm/memcontrol.c            |    3 +++
  2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -185,7 +185,7 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 					     enum vm_event_item idx)
 {
-	if (mem_cgroup_disabled() || !mm)
+	if (mem_cgroup_disabled())
 		return;
 	__mem_cgroup_count_vm_event(mm, idx);
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1021,6 +1021,9 @@ void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *memcg;
 
+	if (!mm)
+		return;
+
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
 	if (unlikely(!memcg))




