Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support

2007-06-06 Thread Jan Beulich
>--- a/arch/i386/xen/time.c
>+++ b/arch/i386/xen/time.c
>@@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct
>   preempt_enable();
> }
> 
>-static void setup_runstate_info(void)
>+static void setup_runstate_info(int cpu)
> {
>   struct vcpu_register_runstate_memory_area area;
> 
>-  area.addr.v = &__get_cpu_var(runstate);
>+  area.addr.v = &per_cpu(runstate, cpu);
> 
>   if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
>  smp_processor_id(), &area))

Shouldn't this be 'cpu' rather than 'smp_processor_id()'?

>   BUG();
>-
>-  get_runstate_snapshot(&__get_cpu_var(runstate_snapshot));
> }
> 
> static void do_stolen_accounting(void)

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 21/33] xen: Xen SMP guest support

2007-06-06 Thread Jeremy Fitzhardinge
Jan Beulich wrote:
>> --- a/arch/i386/xen/time.c
>> +++ b/arch/i386/xen/time.c
>> @@ -105,17 +105,15 @@ static void get_runstate_snapshot(struct
>>  preempt_enable();
>> }
>>
>> -static void setup_runstate_info(void)
>> +static void setup_runstate_info(int cpu)
>> {
>>  struct vcpu_register_runstate_memory_area area;
>>
>> -area.addr.v = &__get_cpu_var(runstate);
>> +area.addr.v = &per_cpu(runstate, cpu);
>>
>>  if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
>> smp_processor_id(), &area))
>> 
>
> Shouldn't this be 'cpu' rather than 'smp_processor_id()'?
>   

Yes.  I'd noticed that, thought it got fixed later in the series, and
looks like I ultimately got confused.  Not sure when this crept in; it
has been correct in the past.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
>+cycle_t xen_clocksource_read(void)
>+{
>+  struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
>+  cycle_t ret;
>+
>+  get_time_values_from_xen();
>+
>+  ret = shadow->system_timestamp + get_nsec_offset(shadow);
>+
>+  put_cpu_var(shadow_time);
>+
>+  return ret;
>+}

I'm afraid this mechanism is pretty unreliable on SMP: getnstimeofday() and
do_gettimeofday() both use the difference between the last snapshot taken
and the current value read from the clock source. Since I had added this
clocksource code to our kernel, I had reproducible hangs on one of the
systems I regularly work with (you may have seen the respective thread
on xen-devel), which recently I finally found time to look into. The issue is
that on that system, transition into ACPI mode takes over 600ms (SMM
execution, and hence no interrupts delivered during that time), and with
Xen using the PIT (PM timer support was added by Keir as a result of this,
but that doesn't cure the problem here, it just reduces the likelihood it'll
be encountered) platform time and local time got pretty much out of sync.

Xen itself knows to deal with this (by using an error correction factor to
slow down the local [TSC-based] clock), but for the kernel such a situation
may be fatal: If clocksource->cycle_last was most recently set on a CPU
with shadow->tsc_to_nsec_mul sufficiently different from that where
getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will
calculate a huge nanosecond value (due to cyc2ns() doing unsigned
operations), worth abut 4000s. This value may then be used to set a
timeout that was intended to be a few milliseconds, effectively yielding
a hung app (and perhaps system).

I'm sure the time keeping code can't deal with negative values returned
from __get_nsec_offset() (timespec_add_ns() is an example, used in
__get_realtime_clock_ts()), otherwise a potential solution might have
been to set the clock source's multiplier and shift to one and zero
respectively. But I think that a clock source can be expected to be
monotonic anyway, which Xen's interpolation mechanism doesn't
guarantee across multiple CPUs. (I'm actually beginning to think that
this might also be the reason for certain test suites occasionally reporting
timeouts to fire early.)

Unfortunately so far I haven't been able to think of a reasonable solution
to this - a simplistic approach like making xen_clocksource_read() check
the value it is about to return against the last value it returned doesn't
seem to be a good idea (time might appear to have stopped over some
period of time otherwise), nor does attempting to adjust the shadowed
tsc_to_nsec_mul values (because the kernel can't know whether it should
boost the lagging CPU or throttle the rushing one).

Jan



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Keir Fraser
On 6/6/07 09:39, "Jan Beulich" <[EMAIL PROTECTED]> wrote:

> The issue is
> that on that system, transition into ACPI mode takes over 600ms (SMM
> execution, and hence no interrupts delivered during that time), and with
> Xen using the PIT (PM timer support was added by Keir as a result of this,
> but that doesn't cure the problem here, it just reduces the likelihood it'll
> be encountered) platform time and local time got pretty much out of sync.

If you have an ACPI PM timer in your system (and if you have SMM then your
system is almost certainly modern enough to have one) then surely the
problem is fixed for all practical purposes? The problem was overflow of a
fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM
timer will wrap only every ~4s. It would be quite unreasonable for SMM to
take the CPU away for multiple seconds, even as a one-time boot operation.

 -- Keir

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
>>> Keir Fraser <[EMAIL PROTECTED]> 06.06.07 10:54 >>>
>On 6/6/07 09:39, "Jan Beulich" <[EMAIL PROTECTED]> wrote:
>
>> The issue is
>> that on that system, transition into ACPI mode takes over 600ms (SMM
>> execution, and hence no interrupts delivered during that time), and with
>> Xen using the PIT (PM timer support was added by Keir as a result of this,
>> but that doesn't cure the problem here, it just reduces the likelihood it'll
>> be encountered) platform time and local time got pretty much out of sync.
>
>If you have an ACPI PM timer in your system (and if you have SMM then your
>system is almost certainly modern enough to have one) then surely the
>problem is fixed for all practical purposes? The problem was overflow of a
>fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM
>timer will wrap only every ~4s. It would be quite unreasonable for SMM to
>take the CPU away for multiple seconds, even as a one-time boot operation.

No, I don't think the problem's gone with the PM timer - it is just much less
likely. Since you depend on the TSC (which must generally be assumed be
unsyncronized across CPUs) and on the error correction factor (which shows
non-zero values every few seconds), getting the interpolated times on two
CPUs out of sync is still possible, and given the way the time keeping code
works even being off by just a single nanosecond may be fatal.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Keir Fraser
On 6/6/07 10:30, "Jan Beulich" <[EMAIL PROTECTED]> wrote:

>> If you have an ACPI PM timer in your system (and if you have SMM then your
>> system is almost certainly modern enough to have one) then surely the
>> problem is fixed for all practical purposes? The problem was overflow of a
>> fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM
>> timer will wrap only every ~4s. It would be quite unreasonable for SMM to
>> take the CPU away for multiple seconds, even as a one-time boot operation.
> 
> No, I don't think the problem's gone with the PM timer - it is just much less
> likely. Since you depend on the TSC (which must generally be assumed be
> unsyncronized across CPUs) and on the error correction factor (which shows
> non-zero values every few seconds), getting the interpolated times on two
> CPUs out of sync is still possible, and given the way the time keeping code
> works even being off by just a single nanosecond may be fatal.

If the error across CPUS is +/- just a few microseconds at worst then having
the clocksource clamp to no less than the last timestamp returned seems a
reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and
that should always be a tiny value.

 -- Keir

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
>>  But I think that a clock source can be expected to be
>> monotonic anyway, which Xen's interpolation mechanism doesn't
>> guarantee across multiple CPUs. (I'm actually beginning to think that
>> this might also be the reason for certain test suites occasionally reporting
>> timeouts to fire early.)
>>   
>
>Does the kernel expect the tsc clocksource to be completely monotonic
>across cpus?  Any form of cpu-local clocksource is going to have this
>problem; I wonder if clocksources can really only be useful if they're
>always referring to a single system-wide time reference - seems like a
>bit of a limitation.

I suppose so - the clock source's rating gets set to zero if it can be predicted
that the TSCs aren't all synchronized, which should pretty much exclude the
use of this clock source for any other than fallback if there's really nothing
else available.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jeremy Fitzhardinge
Jan Beulich wrote:
> Xen itself knows to deal with this (by using an error correction factor to
> slow down the local [TSC-based] clock), but for the kernel such a situation
> may be fatal: If clocksource->cycle_last was most recently set on a CPU
> with shadow->tsc_to_nsec_mul sufficiently different from that where
> getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will
> calculate a huge nanosecond value (due to cyc2ns() doing unsigned
> operations), worth abut 4000s. This value may then be used to set a
> timeout that was intended to be a few milliseconds, effectively yielding
> a hung app (and perhaps system).
>   

Hm.  I had a similar situation in the stolen time code, and I ended up
using signed values so I could clamp at zero.  Though that might have
been another bug; either way, the clamp is still there.

I wonder if cyc2ns might not be better using signed operations?  Or
perhaps better, the time code should endevour to do things on a
completely per-cpu basis (haven't really given this any thought).

> I'm sure the time keeping code can't deal with negative values returned
> from __get_nsec_offset() (timespec_add_ns() is an example, used in
> __get_realtime_clock_ts()), otherwise a potential solution might have
> been to set the clock source's multiplier and shift to one and zero
> respectively.

I don't quite follow you here, but wouldn't setting the multiplier/shift
to 1/0 preclude being able to warp the clocksource with ntp?

>  But I think that a clock source can be expected to be
> monotonic anyway, which Xen's interpolation mechanism doesn't
> guarantee across multiple CPUs. (I'm actually beginning to think that
> this might also be the reason for certain test suites occasionally reporting
> timeouts to fire early.)
>   

Does the kernel expect the tsc clocksource to be completely monotonic
across cpus?  Any form of cpu-local clocksource is going to have this
problem; I wonder if clocksources can really only be useful if they're
always referring to a single system-wide time reference - seems like a
bit of a limitation.

> Unfortunately so far I haven't been able to think of a reasonable solution
> to this - a simplistic approach like making xen_clocksource_read() check
> the value it is about to return against the last value it returned doesn't
> seem to be a good idea (time might appear to have stopped over some
> period of time otherwise), nor does attempting to adjust the shadowed
> tsc_to_nsec_mul values (because the kernel can't know whether it should
> boost the lagging CPU or throttle the rushing one).

I once had some code in there to do that, implemented in very boneheaded
way with a spinlock to protect the "last time returned" variable.  I
expect there's a better way to implement it.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Andi Kleen
On Wednesday 06 June 2007 12:05:22 Jeremy Fitzhardinge wrote:
> Jan Beulich wrote:
> > Xen itself knows to deal with this (by using an error correction factor to
> > slow down the local [TSC-based] clock), but for the kernel such a situation
> > may be fatal: If clocksource->cycle_last was most recently set on a CPU
> > with shadow->tsc_to_nsec_mul sufficiently different from that where
> > getnstimeofday() is being used, timekeeping.c's __get_nsec_offset() will
> > calculate a huge nanosecond value (due to cyc2ns() doing unsigned
> > operations), worth abut 4000s. This value may then be used to set a
> > timeout that was intended to be a few milliseconds, effectively yielding
> > a hung app (and perhaps system).
> >   
> 
> Hm.  I had a similar situation in the stolen time code, and I ended up
> using signed values so I could clamp at zero.  Though that might have
> been another bug; either way, the clamp is still there.
> 
> I wonder if cyc2ns might not be better using signed operations?  Or
> perhaps better, the time code should endevour to do things on a
> completely per-cpu basis (haven't really given this any thought).

This is being worked on.


> > Unfortunately so far I haven't been able to think of a reasonable solution
> > to this - a simplistic approach like making xen_clocksource_read() check
> > the value it is about to return against the last value it returned doesn't
> > seem to be a good idea (time might appear to have stopped over some
> > period of time otherwise), nor does attempting to adjust the shadowed
> > tsc_to_nsec_mul values (because the kernel can't know whether it should
> > boost the lagging CPU or throttle the rushing one).
> 
> I once had some code in there to do that, implemented in very boneheaded
> way with a spinlock to protect the "last time returned" variable.  I
> expect there's a better way to implement it.

But any per CPU setup likely needs this to avoid non monotonicity 

-Andi
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
>>> Keir Fraser <[EMAIL PROTECTED]> 06.06.07 11:56 >>>
>On 6/6/07 10:30, "Jan Beulich" <[EMAIL PROTECTED]> wrote:
>
>>> If you have an ACPI PM timer in your system (and if you have SMM then your
>>> system is almost certainly modern enough to have one) then surely the
>>> problem is fixed for all practical purposes? The problem was overflow of a
>>> fixed-width platform counter. The PIT wraps every ~50ms, but the ACPI PM
>>> timer will wrap only every ~4s. It would be quite unreasonable for SMM to
>>> take the CPU away for multiple seconds, even as a one-time boot operation.
>> 
>> No, I don't think the problem's gone with the PM timer - it is just much less
>> likely. Since you depend on the TSC (which must generally be assumed be
>> unsyncronized across CPUs) and on the error correction factor (which shows
>> non-zero values every few seconds), getting the interpolated times on two
>> CPUs out of sync is still possible, and given the way the time keeping code
>> works even being off by just a single nanosecond may be fatal.
>
>If the error across CPUS is +/- just a few microseconds at worst then having
>the clocksource clamp to no less than the last timestamp returned seems a
>reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and
>that should always be a tiny value.

Are you sure this is also true when e.g. a CPU gets throttled due to thermal
conditions? It is my understanding that both the duty cycle adjustment and
the frequency reduction would yield a reduced rate TSC, which would be
accounted for only the next time the local clock gets calibrated. Otherwise,
immediate calibration (and vcpu update) would need to be forced out of the
thermal interrupt.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Keir Fraser



On 6/6/07 12:00, "Jan Beulich" <[EMAIL PROTECTED]> wrote:

>> If the error across CPUS is +/- just a few microseconds at worst then having
>> the clocksource clamp to no less than the last timestamp returned seems a
>> reasonable fix. Time won't 'stop' for longer than the cross-CPU error, and
>> that should always be a tiny value.
> 
> Are you sure this is also true when e.g. a CPU gets throttled due to thermal
> conditions? It is my understanding that both the duty cycle adjustment and
> the frequency reduction would yield a reduced rate TSC, which would be
> accounted for only the next time the local clock gets calibrated. Otherwise,
> immediate calibration (and vcpu update) would need to be forced out of the
> thermal interrupt.

Yes, this could be an issue. Is there any way to get an interrupt or MCE
when thermal throttling occurs?

 -- Keir

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Andi Kleen

> 
> Yes, this could be an issue. Is there any way to get an interrupt or MCE
> when thermal throttling occurs?

Yes you can get an thermal interrupt from the local APIC. See the Linux
kernel source. Of course there would be still a race window.

On the other hand some timing issues on throttling are probably 
the smallest of the users' problems when it really happens.

Standard Linux just ignores it.

-Andi


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jan Beulich
>>> Andi Kleen <[EMAIL PROTECTED]> 06.06.07 14:18 >>>
>
>> 
>> Yes, this could be an issue. Is there any way to get an interrupt or MCE
>> when thermal throttling occurs?
>
>Yes you can get an thermal interrupt from the local APIC. See the Linux
>kernel source. Of course there would be still a race window.
>
>On the other hand some timing issues on throttling are probably 
>the smallest of the users' problems when it really happens.

Not if this results in your box hanging - I think throttling is exactly intended
to keep the box alive as long as possible (and I've seen throttling in action,
with the box happily recovering from the situation - after having seen it a
few times I checked and found the fan covered with dust).

>Standard Linux just ignores it.

Jan


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Andi Kleen
On Wednesday 06 June 2007 14:46:59 Jan Beulich wrote:
> >>> Andi Kleen <[EMAIL PROTECTED]> 06.06.07 14:18 >>>
> >
> >> 
> >> Yes, this could be an issue. Is there any way to get an interrupt or MCE
> >> when thermal throttling occurs?
> >
> >Yes you can get an thermal interrupt from the local APIC. See the Linux
> >kernel source. Of course there would be still a race window.
> >
> >On the other hand some timing issues on throttling are probably 
> >the smallest of the users' problems when it really happens.
> 
> Not if this results in your box hanging 

Yes it shouldn't hang. Just saying that some non monotonicity in the returned
values under this abnormal condition is probably not the world's end.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Keir Fraser



On 6/6/07 13:46, "Jan Beulich" <[EMAIL PROTECTED]> wrote:

>> On the other hand some timing issues on throttling are probably
>> the smallest of the users' problems when it really happens.
> 
> Not if this results in your box hanging - I think throttling is exactly
> intended
> to keep the box alive as long as possible (and I've seen throttling in action,
> with the box happily recovering from the situation - after having seen it a
> few times I checked and found the fan covered with dust).

Clamping to last returned timestamp value would be fine here. Time would go
moderately haywire for a while (lurch forwards and then stop for a while),
but wouldn't go backwards and should recover reasonably within the timescale
of the thermal event itself. I don't see an issue with just implementing
this clamping if it is needed.

 -- Keir

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [patch 14/33] xen: xen time implementation

2007-06-06 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
>> I once had some code in there to do that, implemented in very boneheaded
>> way with a spinlock to protect the "last time returned" variable.  I
>> expect there's a better way to implement it.
>> 
>
> But any per CPU setup likely needs this to avoid non monotonicity 

Yeah.  The point I didn't quite make was that this should be something
that the clock core should handle rather than dealing with it in every
clocksource.

J

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH UPDATE] xen: use iret directly where possible

2007-06-06 Thread Jeremy Fitzhardinge
[ Expand, correct and clarify comments; minor code change. ]

Most of the time we can simply use the iret instruction to exit the
kernel, rather than having to use the iret hypercall - the only
exception is if we're returning into vm86 mode, or from delivering an
NMI (which we don't support yet).

When running native, iret has the behaviour of testing for a pending
interrupt atomically with re-enabling interrupts.  Unfortunately
there's no way to do this with Xen, so there's a window in which we
could get a recursive exception after enabling events but before
actually returning to userspace.

This causes a problem: if the nested interrupt causes one of the
task's TIF_WORK_MASK flags to be set, they will not be checked again
before returning to userspace.  This means that pending work may be
left pending indefinitely, until the process enters and leaves the
kernel again.  The net effect is that a pending signal or reschedule
event could be delayed for an unbounded amount of time.

To deal with this, the xen event upcall handler checks to see if the
EIP is within the critical section of the iret code, after events
are (potentially) enabled up to the iret itself.  If its within this
range, it calls the iret critical section fixup, which adjusts the
stack to deal with any unrestored registers, and then shifts the
stack frame up to replace the previous invocation.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/asm-offsets.c |1 
 arch/i386/kernel/entry.S   |   16 +++
 arch/i386/xen/enlighten.c  |1 
 arch/i386/xen/xen-asm.S|  185 +++-
 arch/i386/xen/xen-ops.h|1 
 5 files changed, 199 insertions(+), 5 deletions(-)

===
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -65,6 +65,7 @@ void foo(void)
OFFSET(TI_addr_limit, thread_info, addr_limit);
OFFSET(TI_restart_block, thread_info, restart_block);
OFFSET(TI_sysenter_return, thread_info, sysenter_return);
+   OFFSET(TI_cpu, thread_info, cpu);
BLANK();
 
OFFSET(GDS_size, Xgt_desc_struct, size);
===
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -1030,7 +1030,21 @@ ENTRY(xen_hypervisor_callback)
CFI_ADJUST_CFA_OFFSET 4
SAVE_ALL
TRACE_IRQS_OFF
-   mov %esp, %eax
+
+   /* Check to see if we got the event in the critical
+  region in xen_iret_direct, after we've reenabled
+  events and checked for pending events.  This simulates
+  iret instruction's behaviour where it delivers a
+  pending interrupt when enabling interrupts. */
+   movl PT_EIP(%esp),%eax
+   cmpl $xen_iret_start_crit,%eax
+   jb   1f
+   cmpl $xen_iret_end_crit,%eax
+   jae  1f
+
+   call xen_iret_crit_fixup
+
+1: mov %esp, %eax
call xen_evtchn_do_upcall
jmp  ret_from_intr
CFI_ENDPROC
===
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -827,6 +827,7 @@ void __init xen_setup_vcpu_info_placemen
paravirt_ops.irq_disable = xen_irq_disable_direct;
paravirt_ops.irq_enable = xen_irq_enable_direct;
paravirt_ops.read_cr2 = xen_read_cr2_direct;
+   paravirt_ops.iret = xen_iret_direct;
}
 }
 
===
--- a/arch/i386/xen/xen-asm.S
+++ b/arch/i386/xen/xen-asm.S
@@ -12,14 +12,20 @@
  */
 
 #include 
+
 #include 
 #include 
 #include 
-#include 
 #include 
+#include 
+
+#include 
 
 #define RELOC(x, v).globl x##_reloc; x##_reloc=v
 #define ENDPATCH(x).globl x##_end; x##_end=.
+
+/* Pseudo-flag used for virtual NMI, which we don't implement yet */
+#define XEN_EFLAGS_NMI 0x8000
 
 /*
Enable events.  This clears the event mask and tests the pending
@@ -81,13 +87,12 @@ ENDPATCH(xen_save_fl_direct)
  */
 ENTRY(xen_restore_fl_direct)
testb $X86_EFLAGS_IF>>8, %ah
-   setz %al
-   movb %al, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask
+   setz PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_mask
/* Preempt here doesn't matter because that will deal with
   any pending interrupts.  The pending check may end up being
   run on the wrong CPU, but that doesn't hurt. */
 
-   /* check for pending but unmasked */
+   /* check for unmasked and pending */
cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info)+XEN_vcpu_info_pending
jz 1f
 2: call check_events
@@ -97,6 +102,178 @@ ENDPATCH(xen_restore_fl_direct)
ENDPROC(xen_restore_fl_direct)
RELOC(xen_restore_fl_direct, 2b+1)
 
+/*
+   This is run where a normal iret would be run, with the same stack setup:
+ 8: eflags
+ 4: cs
+ 

[PATCH] xen: fix xen-smp.patch: setup_runstate_info

2007-06-06 Thread Jeremy Fitzhardinge
Somehow an smp_processor_id() survived the transition to passing the
cpu around.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Jan Beulich <[EMAIL PROTECTED]>

---
 arch/i386/xen/time.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

===
--- a/arch/i386/xen/time.c
+++ b/arch/i386/xen/time.c
@@ -110,7 +110,7 @@ static void setup_runstate_info(int cpu)
area.addr.v = &per_cpu(runstate, cpu);
 
if (HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area,
-  smp_processor_id(), &area))
+  cpu, &area))
BUG();
 }
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] paravirt: fix error handling in paravirt_disable_iospace

2007-06-06 Thread Jeremy Fitzhardinge
Make sure everything is backed out if it fails.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -254,8 +254,11 @@ int paravirt_disable_iospace(void)
int ret;
 
ret = request_resource(&ioport_resource, &reserve_ioports);
-   if (ret == 0)
+   if (ret == 0) {
ret = request_resource(&iomem_resource, &reserve_iomem);
+   if (ret)
+   release_resource(&reserve_ioports);
+   }
 
return ret;
 }


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 0/7] proposed updates to boot protocol and paravirt booting

2007-06-06 Thread Jeremy Fitzhardinge
This series:
 1. Updates the boot protocol to version 2.07
 2. Clean up the existing build process, to get rid of tools/build and
make the linker do more heavy lifting
 3. Make the bzImage payload an ELF file.  The bootloader can extract
this as a naked ELF file by skipping over boot_params.setup_sects worth
of 16-bit setup code.
 4. Update the boot_params to 2.07, and update the kernel's head.S to
jump to the appropriate subarch-specific kernel entrypoint.  The
very earliest code is common (copy boot_params, clear bss); the
split happens just before the initial pagetable setup.
+ random little changes to make it all hang together

This boots native for me, so everything basically works.  But I haven't
tested it end-to-end yet, because I haven't done the Xen bits yet.
Perhaps Rusty can do the lguest version to verify that its all sound in
principle (hint hint ;).

So, how does it look?

J
-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 2/7] add WEAK() for creating weak asm labels

2007-06-06 Thread Jeremy Fitzhardinge
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 include/linux/linkage.h |6 ++
 1 file changed, 6 insertions(+)

===
--- a/include/linux/linkage.h
+++ b/include/linux/linkage.h
@@ -34,6 +34,12 @@
   name:
 #endif
 
+#ifndef WEAK
+#define WEAK(name)\
+   .weak name;\
+   name:
+#endif
+
 #define KPROBE_ENTRY(name) \
   .pushsection .kprobes.text, "ax"; \
   ENTRY(name)

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 1/7] update boot spec to 2.07

2007-06-06 Thread Jeremy Fitzhardinge
Proposed updates for version 2.07 of the boot protocol.  This includes:

load_flags.KEEP_SEGMENTS- flag to request/inhibit segment reloads
hardware_subarch- what subarchitecture we're booting under
hardware_subarch_data   - per-architecture data
kernel_payload  - address of the raw kernel blob

The intention of these changes is to make booting a paravirtualized
kernel work via the normal Linux boot protocol.  The intention is that
the bzImage payload can be a properly formed ELF file, so that the
bootloader can use its ELF notes and Phdrs to get more metadata about
the kernel and its requirements.

The ELF file could be the uncompressed kernel vmlinux itself; it would
only take small buildsystem changes to implement this.

kernel_payload was added so that a bootloader can just get to the raw
bits of the kernel, so that it can do its own decompression/relocation
if it wishes.  This is not particularly well-defined yet; I just added
it with the hope that it keeps HPA happy.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 Documentation/i386/boot.txt|   43 +++-
 arch/i386/kernel/asm-offsets.c |7 ++
 include/asm-i386/bootparam.h   |   10 +++--
 3 files changed, 57 insertions(+), 3 deletions(-)

===
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -168,6 +168,9 @@ 0234/1  2.05+   relocatable_kernel Whether 
 0234/1 2.05+   relocatable_kernel Whether kernel is relocatable or not
 0235/3 N/A pad2Unused
 0238/4 2.06+   cmdline_sizeMaximum size of the kernel command line
+023C/4 2.07+   hardware_subarch Hardware subarchitecture
+0240/8 2.07+   hardware_subarch_data Subarchitecture-specific data
+0248/4 2.07+   kernel_payload  Pointer to raw kernel data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
 real value is 4.
@@ -204,7 +207,7 @@ boot loaders can ignore those fields.
 
 The byte order of all fields is littleendian (this is x86, after all.)
 
-Field name:setup_secs
+Field name:setup_sects
 Type:  read
 Offset/size:   0x1f1/1
 Protocol:  ALL
@@ -356,6 +359,13 @@ Protocol:  2.00+
- If 0, the protected-mode code is loaded at 0x1.
- If 1, the protected-mode code is loaded at 0x10.
 
+  Bit 6 (write): KEEP_SEGMENTS
+   Protocol: 2.07+
+   - if 0, reload the segment registers in the 32bit entry point.
+   - if 1, do not reload the segment registers in the 32bit entry point.
+   Assume that %cs %ds %ss %es are all set to flat segments with
+   a base of 0 (or the equivalent for their environment).
+
   Bit 7 (write): CAN_USE_HEAP
Set this bit to 1 to indicate that the value entered in the
heap_end_ptr is valid.  If this field is clear, some setup code
@@ -479,6 +489,37 @@ Protocol:  2.06+
   zero. This means that the command line can contain at most
   cmdline_size characters. With protocol version 2.05 and earlier, the
   maximum size was 255.
+
+Field name:hardware_subarch
+Type:  write
+Offset/size:   0x23c/4
+Protocol:  2.07+
+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers needs to be done differently.
+
+  This field allows the bootloader to inform the kernel we are in one
+  one of those environments.
+
+  0x   The default x86/PC environment
+  0x0001   lguest
+  0x0002   Xen
+
+Field name:hardware_subarch_data
+Type:  write
+Offset/size:   0x240/8
+Protocol:  2.07+
+
+  A pointer to data that is specific to hardware subarch
+
+Field name:kernel_payload
+Type:  read
+Offset/size:   0x248/4
+Protocol:  2.07+
+
+  The relocated pointer to the actual kernel payload, in whatever form
+  it exists in (gzip image, normally).
 
 
  THE KERNEL COMMAND LINE
===
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -143,4 +144,10 @@ void foo(void)
OFFSET(LGUEST_PAGES_regs_errcode, lguest_pages, regs.errcode);
OFFSET(LGUEST_PAGES_regs, lguest_pages, regs);
 #endif
+
+   BLANK();
+   OFFSET(BP_scratch, boot_params, scratch);
+   OFFSET(BP_loadflags, boot_params, hdr.loadflags);
+   OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
+   OFFSET(BP_version, boot_params, hdr.version);
 }
===
--- a/include/asm-i386/bootparam.h
+++ b/include/asm-i386/bootparam.h
@@ -24,8 +24,9 @@ struct

[PATCH RFC 5/7] i386: clean up bzImage generation

2007-06-06 Thread Jeremy Fitzhardinge
This patch cleans up image generation in several ways:
 - Firstly, it removes tools/build, and uses binutils to do all the
   final construction of the bzImage.  This removes a chunk of code
   and makes the image generation more flexible, since we can compute
   various numbers rather than be forced to use fixed constants.

 - Rename compressed/vmlinux to compressed/blob, to make it a
   bit clearer that it's the compressed kernel image + decompressor
   (now all the files named "vmlinux*" are directly derived from
   the kernel vmlinux).

 - Rather than using objcopy to wrap the compressed kernel into an
   object file, simply use the assembler: payload.S does a .incbin
   of the blob.bin file, which allows us to easily place
   it into a section, and it makes the Makefile dependency a little
   clearer.

 - Similarly, use the same technique to create compressed/piggy.o,
   which cleans things up even more, since the .S file can also
   set the input and output_size symbols without further linker
   script hackery; it also removes a complete linker script.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 arch/i386/boot/Makefile   |   31 +-
 arch/i386/boot/compressed/Makefile|   13 --
 arch/i386/boot/compressed/piggy.S |   10 +
 arch/i386/boot/compressed/vmlinux.scr |   10 -
 arch/i386/boot/header.S   |6 -
 arch/i386/boot/payload.S  |3 
 arch/i386/boot/setup.ld   |   39 ---
 arch/i386/boot/tools/.gitignore   |1 
 arch/i386/boot/tools/build.c  |  168 -
 9 files changed, 56 insertions(+), 225 deletions(-)

===
--- a/arch/i386/boot/Makefile
+++ b/arch/i386/boot/Makefile
@@ -25,12 +25,13 @@ SVGA_MODE := -DSVGA_MODE=NORMAL_VGA
 
 #RAMDISK := -DRAMDISK=512
 
-targets:= vmlinux.bin setup.bin setup.elf zImage bzImage
+targets:= blob.bin setup.elf zImage bzImage
 subdir-:= compressed
 
 setup-y+= a20.o apm.o cmdline.o copy.o cpu.o cpucheck.o edd.o
-setup-y+= header.o main.o mca.o memory.o pm.o pmjump.o
-setup-y+= printf.o string.o tty.o video.o version.o voyager.o
+setup-y+= header.o main.o mca.o memory.o payload.o pm.o
+setup-y+= pmjump.o printf.o string.o tty.o video.o version.o
+setup-y+= voyager.o
 
 # The link order of the video-*.o modules can matter.  In particular,
 # video-vga.o *must* be listed first, followed by video-vesa.o.
@@ -39,10 +40,6 @@ setup-y  += video-vga.o
 setup-y+= video-vga.o
 setup-y+= video-vesa.o
 setup-y+= video-bios.o
-
-hostprogs-y:= tools/build
-
-HOSTCFLAGS_build.o := $(LINUXINCLUDE)
 
 # ---
 
@@ -65,18 +62,12 @@ AFLAGS  := $(CFLAGS) -D__ASSEMBLY__
 $(obj)/bzImage: IMAGE_OFFSET := 0x10
 $(obj)/bzImage: EXTRA_CFLAGS := -D__BIG_KERNEL__
 $(obj)/bzImage: EXTRA_AFLAGS := $(SVGA_MODE) $(RAMDISK) -D__BIG_KERNEL__
-$(obj)/bzImage: BUILDFLAGS   := -b
 
-quiet_cmd_image = BUILD   $@
-cmd_image = $(obj)/tools/build $(BUILDFLAGS) $(obj)/setup.bin \
-   $(obj)/vmlinux.bin $(ROOT_DEV) > $@
-
-$(obj)/zImage $(obj)/bzImage: $(obj)/setup.bin \
- $(obj)/vmlinux.bin $(obj)/tools/build FORCE
-   $(call if_changed,image)
+$(obj)/zImage $(obj)/bzImage: $(obj)/setup.elf FORCE
+   $(call if_changed,objcopy)
@echo 'Kernel: $@ is ready' ' (#'`cat .version`')'
 
-$(obj)/vmlinux.bin: $(obj)/compressed/vmlinux FORCE
+$(obj)/blob.bin: $(obj)/compressed/blob FORCE
$(call if_changed,objcopy)
 
 SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
@@ -85,12 +76,10 @@ LDFLAGS_setup.elf   := -T
 $(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
$(call if_changed,ld)
 
-OBJCOPYFLAGS_setup.bin := -O binary
+$(obj)/payload.o:  EXTRA_AFLAGS := -Wa,-I$(obj)
+$(obj)/payload.o: $(src)/payload.S $(obj)/blob.bin
 
-$(obj)/setup.bin: $(obj)/setup.elf FORCE
-   $(call if_changed,objcopy)
-
-$(obj)/compressed/vmlinux: FORCE
+$(obj)/compressed/blob: FORCE
$(Q)$(MAKE) $(build)=$(obj)/compressed IMAGE_OFFSET=$(IMAGE_OFFSET) $@
 
 # Set this if you want to pass append arguments to the zdisk/fdimage/isoimage 
kernel
===
--- a/arch/i386/boot/compressed/Makefile
+++ b/arch/i386/boot/compressed/Makefile
@@ -4,11 +4,10 @@
 # create a compressed vmlinux image from the original vmlinux
 #
 
-targets:= vmlinux vmlinux.bin vmlinux.bin.gz head.o misc.o 
piggy.o \
+targets:= blob vmlinux.bin vmlinux.bin.gz head.o misc.o 
piggy.o \
  

[PATCH RFC 7/7] i386: paravirt boot sequence

2007-06-06 Thread Jeremy Fitzhardinge
This patch uses the updated boot protocol to do paravirtualized boot.
If the boot version is >= 2.07, then it will do two things:

 1. Check the bootparams loadflags to see if we should reload the
segment registers and clear interrupts.  This is appropriate
for normal native boot and some paravirtualized environments, but
inappropraite for others.

 2. Check the hardware architecture, and dispatch to the appropriate
kernel entrypoint.  If the bootloader doesn't set this, then we
simply do the normal boot sequence.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 arch/i386/boot/header.S |9 -
 arch/i386/kernel/head.S |   47 +++
 2 files changed, 51 insertions(+), 5 deletions(-)

===
--- a/arch/i386/boot/header.S
+++ b/arch/i386/boot/header.S
@@ -119,7 +119,7 @@ 1:
# Part 2 of the header, from the old setup.S
 
.ascii  "HdrS"  # header signature
-   .word   0x0206  # header version number (>= 0x0105)
+   .word   0x0207  # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
 realmode_swtch:.word   0, 0# default_switch, SETUPSEG
@@ -209,6 +209,13 @@ cmdline_size:   .long   COMMAND_LINE_SIZ
 #added with boot protocol
 #version 2.06
 
+hardware_subarch:  .long 0 # subarchitecture, added with 
2.07
+   # default to 0 for normal x86 PC
+
+hardware_subarch_data: .quad 0
+
+kernel_payload:.long blob_payload  # raw kernel data
+
 # End of setup header #
 
.section ".inittext", "ax"
===
--- a/arch/i386/kernel/head.S
+++ b/arch/i386/kernel/head.S
@@ -71,28 +71,37 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + 
  */
 .section .text.head,"ax",@progbits
 ENTRY(startup_32)
+   /* check to see if KEEP_SEGMENTS flag is meaningful */
+   cmpw $0x207, BP_version(%esi)
+   jb 1f
+
+   /* test KEEP_SEGMENTS flag to see if the bootloader is asking
+   us to not reload segments */
+   testb $(1<<6), BP_loadflags(%esi)
+   jnz 2f
 
 /*
  * Set segments to known values.
  */
-   cld
-   lgdt boot_gdt_descr - __PAGE_OFFSET
+1: lgdt boot_gdt_descr - __PAGE_OFFSET
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
+2:
 
 /*
  * Clear BSS first so that there are no surprises...
- * No need to cld as DF is already clear from cld above...
- */
+ */
+   cld
xorl %eax,%eax
movl $__bss_start - __PAGE_OFFSET,%edi
movl $__bss_stop - __PAGE_OFFSET,%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
+
 /*
  * Copy bootup parameters out of the way.
  * Note: %esi still has the pointer to the real-mode data.
@@ -120,6 +129,35 @@ 2:
movsl
 1:
 
+#ifdef CONFIG_PARAVIRT
+   cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+   jb default_entry
+
+   /* Paravirt-compatible boot parameters.  Look to see what architecture
+   we're booting under. */
+   movl (boot_params + BP_hardware_subarch - __PAGE_OFFSET), %eax
+   cmpl $num_subarch_entries, %eax
+   jae bad_subarch
+
+   movl subarch_entries - __PAGE_OFFSET(,%eax,4), %eax
+   subl $__PAGE_OFFSET, %eax
+   jmp *%eax
+
+bad_subarch:
+WEAK(lguest_entry)
+WEAK(xen_entry)
+   /* Unknown implementation; there's really
+  nothing we can do at this point. */
+   ud2a
+.data
+subarch_entries:
+   .long default_entry /* normal x86/PC */
+   .long lguest_entry  /* lguest hypervisor */
+   .long xen_entry /* Xen hypervisor */
+num_subarch_entries = (. - subarch_entries) / 4
+.previous
+#endif /* CONFIG_PARAVIRT */
+
 /*
  * Initialize page tables.  This creates a PDE and a set of page
  * tables, which are located immediately beyond _end.  The variable
@@ -132,6 +170,7 @@ 1:
  */
 page_pde_offset = (__PAGE_OFFSET >> 20);
 
+default_entry:
movl $(pg0 - __PAGE_OFFSET), %edi
movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
movl $0x007, %eax   /* 0x007 = PRESENT+RW+USER */

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 3/7] allow linux/elf.h to be included in assembler

2007-06-06 Thread Jeremy Fitzhardinge
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 include/linux/elf.h |   24 +++-
 1 file changed, 19 insertions(+), 5 deletions(-)

===
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -1,9 +1,10 @@
 #ifndef _LINUX_ELF_H
 #define _LINUX_ELF_H
 
+#include 
+#ifndef __ASSEMBLY__
 #include 
 #include 
-#include 
 #include 
 
 struct file;
@@ -31,6 +32,7 @@ typedef __u32 Elf64_Word;
 typedef __u32  Elf64_Word;
 typedef __u64  Elf64_Xword;
 typedef __s64  Elf64_Sxword;
+#endif /* __ASSEMBLY__ */
 
 /* These constants are for the segment types stored in the image headers */
 #define PT_NULL0
@@ -123,6 +125,7 @@ typedef __s64   Elf64_Sxword;
 #define ELF64_ST_BIND(x)   ELF_ST_BIND(x)
 #define ELF64_ST_TYPE(x)   ELF_ST_TYPE(x)
 
+#ifndef __ASSEMBLY__
 typedef struct dynamic{
   Elf32_Sword d_tag;
   union{
@@ -138,6 +141,7 @@ typedef struct {
 Elf64_Addr d_ptr;
   } d_un;
 } Elf64_Dyn;
+#endif /* __ASSEMBLY__ */
 
 /* The following are used with relocations */
 #define ELF32_R_SYM(x) ((x) >> 8)
@@ -146,6 +150,7 @@ typedef struct {
 #define ELF64_R_SYM(i) ((i) >> 32)
 #define ELF64_R_TYPE(i)((i) & 0x)
 
+#ifndef __ASSEMBLY__
 typedef struct elf32_rel {
   Elf32_Addr   r_offset;
   Elf32_Word   r_info;
@@ -185,11 +190,12 @@ typedef struct elf64_sym {
   Elf64_Addr st_value; /* Value of the symbol */
   Elf64_Xword st_size; /* Associated symbol size */
 } Elf64_Sym;
-
+#endif /* __ASSEMBLY__ */
 
 #define EI_NIDENT  16
 
-typedef struct elf32_hdr{
+#ifndef __ASSEMBLY__
+typedef struct elf32_hdr {
   unsigned chare_ident[EI_NIDENT];
   Elf32_Half   e_type;
   Elf32_Half   e_machine;
@@ -222,6 +228,7 @@ typedef struct elf64_hdr {
   Elf64_Half e_shnum;
   Elf64_Half e_shstrndx;
 } Elf64_Ehdr;
+#endif /* __ASSEMBLY__ */
 
 /* These constants define the permissions on sections in the program
header, p_flags. */
@@ -229,7 +236,8 @@ typedef struct elf64_hdr {
 #define PF_W   0x2
 #define PF_X   0x1
 
-typedef struct elf32_phdr{
+#ifndef __ASSEMBLY__
+typedef struct elf32_phdr {
   Elf32_Word   p_type;
   Elf32_Offp_offset;
   Elf32_Addr   p_vaddr;
@@ -250,6 +258,7 @@ typedef struct elf64_phdr {
   Elf64_Xword p_memsz; /* Segment size in memory */
   Elf64_Xword p_align; /* Segment alignment, file & memory */
 } Elf64_Phdr;
+#endif /* __ASSEMBLY__ */
 
 /* sh_type */
 #define SHT_NULL   0
@@ -284,7 +293,8 @@ typedef struct elf64_phdr {
 #define SHN_ABS0xfff1
 #define SHN_COMMON 0xfff2
 #define SHN_HIRESERVE  0x
- 
+
+#ifndef __ASSEMBLY__
 typedef struct {
   Elf32_Word   sh_name;
   Elf32_Word   sh_type;
@@ -310,6 +320,7 @@ typedef struct elf64_shdr {
   Elf64_Xword sh_addralign;/* Section alignment */
   Elf64_Xword sh_entsize;  /* Entry size if section holds table */
 } Elf64_Shdr;
+#endif /* __ASSEMBLY__ */
 
 #defineEI_MAG0 0   /* e_ident[] indexes */
 #defineEI_MAG1 1
@@ -343,6 +354,7 @@ typedef struct elf64_shdr {
 
 #define ELFOSABI_NONE  0
 #define ELFOSABI_LINUX 3
+#define ELFOSABI_STANDALONE255
 
 #ifndef ELF_OSABI
 #define ELF_OSABI ELFOSABI_NONE
@@ -357,6 +369,7 @@ typedef struct elf64_shdr {
 #define NT_PRXFPREG 0x46e62b7f  /* copied from 
gdb5.1/include/elf/common.h */
 
 
+#ifndef __ASSEMBLY__
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
   Elf32_Word   n_namesz;   /* Name size */
@@ -396,5 +409,6 @@ static inline void arch_write_notes(stru
 #define ELF_CORE_EXTRA_NOTES_SIZE arch_notes_size()
 #define ELF_CORE_WRITE_EXTRA_NOTES arch_write_notes(file)
 #endif /* ARCH_HAVE_EXTRA_ELF_NOTES */
+#endif /* __ASSEMBLY__ */
 
 #endif /* _LINUX_ELF_H */

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 4/7] define ELF notes for adding to a boot image

2007-06-06 Thread Jeremy Fitzhardinge
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>

---
 include/linux/elf_boot.h |   15 +++
 1 file changed, 15 insertions(+)

===
--- /dev/null
+++ b/include/linux/elf_boot.h
@@ -0,0 +1,15 @@
+#ifndef ELF_BOOT_H
+#define ELF_BOOT_H
+
+/* Elf notes to help bootloaders identify what program they are booting.
+ */
+
+/* Standardized Elf image notes for booting... The name for all of these is 
ELFBoot */
+#define ELF_NOTE_BOOT  ELFBoot
+
+#define EIN_PROGRAM_NAME   1 /* The program in this ELF file */
+#define EIN_PROGRAM_VERSION2 /* The version of the program in this ELF 
file */
+#define EIN_PROGRAM_CHECKSUM   3 /* ip style checksum of the memory image. */
+#define EIN_ARGUMENT_STYLE 4 /* String identifying argument passing style 
*/
+
+#endif /* ELF_BOOT_H */

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread Jeremy Fitzhardinge
This patch makes the payload of the bzImage file an ELF file.  In
other words, the bzImage is structured as follows:
 - boot sector
 - 16bit setup code
 - ELF header
  - decompressor
  - compressed kernel

A bootloader may find the start of the ELF file by looking at the
setup_size entry in the boot params, and using that to find the offset
of the ELF header.  The ELF Phdrs contain all the mapped memory
required to decompress and start booting the kernel.

One slightly complex part of this is that the bzImage boot_params need
to know about the internal structure of the ELF file, at least to the
extent of being able to point the core32_start entry at the ELF file's
entrypoint, so that loaders which use this field will still work.

Similarly, the ELF header needs to know how big the kernel vmlinux's
bss segment is, in order to make sure is is mapped properly.

To handle these two cases, we generate abstracted versions of the
object files which only contain the symbols we care about (generated
with objcopy --strip-all --keep-symbol=X), and then include those
symbol tables with ld -R.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: "Eric W. Biederman" <[EMAIL PROTECTED]>
Cc: H. Peter Anvin <[EMAIL PROTECTED]>
Cc: Vivek Goyal <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 arch/i386/boot/Makefile   |   11 --
 arch/i386/boot/compressed/Makefile|   29 +--
 arch/i386/boot/compressed/elfhdr.S|   60 +
 arch/i386/boot/compressed/head.S  |9 ++--
 arch/i386/boot/compressed/notes.S |7 +++
 arch/i386/boot/compressed/vmlinux.lds |   24 ++---
 arch/i386/boot/header.S   |7 ---
 arch/i386/boot/setup.ld   |5 ++
 arch/i386/kernel/head.S   |1 
 arch/i386/kernel/vmlinux.lds.S|1 
 10 files changed, 131 insertions(+), 23 deletions(-)

===
--- a/arch/i386/boot/Makefile
+++ b/arch/i386/boot/Makefile
@@ -72,14 +72,19 @@ AFLAGS  := $(CFLAGS) -D__ASSEMBLY__
 
 SETUP_OBJS = $(addprefix $(obj)/,$(setup-y))
 
-LDFLAGS_setup.elf  := -T
-$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS) FORCE
+$(obj)/zImage $(obj)/bzImage:  \
+   LDFLAGS :=  \
+   -R $(obj)/compressed/blob-syms  \
+   --defsym IMAGE_OFFSET=$(IMAGE_OFFSET) -T
+
+$(obj)/setup.elf: $(src)/setup.ld $(SETUP_OBJS)\
+   $(obj)/compressed/blob-syms FORCE
$(call if_changed,ld)
 
 $(obj)/payload.o:  EXTRA_AFLAGS := -Wa,-I$(obj)
 $(obj)/payload.o: $(src)/payload.S $(obj)/blob.bin
 
-$(obj)/compressed/blob: FORCE
+$(obj)/compressed/blob $(obj)/compressed/blob-syms: FORCE
$(Q)$(MAKE) $(build)=$(obj)/compressed IMAGE_OFFSET=$(IMAGE_OFFSET) $@
 
 # Set this if you want to pass append arguments to the zdisk/fdimage/isoimage 
kernel
===
--- a/arch/i386/boot/compressed/Makefile
+++ b/arch/i386/boot/compressed/Makefile
@@ -4,21 +4,42 @@
 # create a compressed vmlinux image from the original vmlinux
 #
 
-targets:= blob vmlinux.bin vmlinux.bin.gz head.o misc.o 
piggy.o \
+targets:= blob vmlinux.bin vmlinux.bin.gz \
+   elfhdr.o head.o misc.o notes.o piggy.o \
vmlinux.bin.all vmlinux.relocs
 
-LDFLAGS_blob   := -T
 hostprogs-y:= relocs
 
 CFLAGS  := -m32 -D__KERNEL__ $(LINUX_INCLUDE) -O2 \
   -fno-strict-aliasing -fPIC \
   $(call cc-option,-ffreestanding) \
   $(call cc-option,-fno-stack-protector)
-LDFLAGS := -m elf_i386
+LDFLAGS := -R $(obj)/vmlinux-syms --defsym IMAGE_OFFSET=$(IMAGE_OFFSET) -T
 
-$(obj)/blob: $(src)/vmlinux.lds $(obj)/head.o $(obj)/misc.o $(obj)/piggy.o 
FORCE
+OBJS=$(addprefix $(obj)/,elfhdr.o head.o misc.o notes.o piggy.o)
+
+$(obj)/blob: $(src)/vmlinux.lds $(obj)/vmlinux-syms $(OBJS) FORCE
$(call if_changed,ld)
@:
+
+# Generate a stripped-down object including only the symbols needed
+# so that we can get them with ld -R. Direct stderr to /dev/null to
+# shut useless warning up.
+quiet_cmd_symextract = SYMEXT $@
+  cmd_symextract = objcopy -S \
+   $(addprefix -j,$(EXTRACTSECTS)) \
+   $(addprefix -K,$(EXTRACTSYMS)) \
+   $< $@ 2>/dev/null
+
+$(obj)/blob-syms: EXTRACTSYMS := blob_entry blob_payload
+$(obj)/blob-syms: EXTRACTSECTS := .text.head .data.compressed
+$(obj)/blob-syms: $(obj)/blob FORCE
+   $(call if_changed,symextract)
+
+$(obj)/vmlinux-syms: EXTRACTSYMS := __reserved_end
+$(obj)/vmlinux-syms: EXTRACTSECTS := .bss
+$(obj)/vmlinux-syms: vmlinux FORCE
+   $(call if_changed,symextract)
 
 $(obj)/vmlinux.bin: vmlinux FORCE
$(call if_changed,objcopy)
===
--

Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread H. Peter Anvin
Jeremy Fitzhardinge wrote:
> This patch makes the payload of the bzImage file an ELF file.  In
> other words, the bzImage is structured as follows:
>  - boot sector
>  - 16bit setup code
>  - ELF header
>   - decompressor
>   - compressed kernel
> 
> A bootloader may find the start of the ELF file by looking at the
> setup_size entry in the boot params, and using that to find the offset
> of the ELF header.  The ELF Phdrs contain all the mapped memory
> required to decompress and start booting the kernel.
> 
> One slightly complex part of this is that the bzImage boot_params need
> to know about the internal structure of the ELF file, at least to the
> extent of being able to point the core32_start entry at the ELF file's
> entrypoint, so that loaders which use this field will still work.
> 
> Similarly, the ELF header needs to know how big the kernel vmlinux's
> bss segment is, in order to make sure is is mapped properly.
> 
> To handle these two cases, we generate abstracted versions of the
> object files which only contain the symbols we care about (generated
> with objcopy --strip-all --keep-symbol=X), and then include those
> symbol tables with ld -R.

I still believe that we should provide, in effect, vmlinux as a
(compressed) ELF file rather than provide the intermediate stage.  It
would reduce the complexity of testing (all information provided about a
stage have to be both guaranteed to even make sense in the future as
well as be tested to conform to such information) as well as cover a
larger number of environments -- any environment where injecting data
into memory is cheaper than execution is quite unhappy about the current
system.  Such environments include heterogeneous embedded systems (think
a slow CPU on a PCI card where the host CPU has direct access to the
memory on the card) as well as simulators/emulators.

For environments where so is appropriate it would even be possible to
run the setup, invoke the code32_setup hook to do the decompression (and
relocation, if appropriate) in host space.

This makes vmlinux (normally stripped) recoverable from the bzImage file
and so anything that is currently booting vmlinux would be serviced by
this scheme.

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
> I still believe that we should provide, in effect, vmlinux as a
> (compressed) ELF file rather than provide the intermediate stage.  It
> would reduce the complexity of testing (all information provided about a
> stage have to be both guaranteed to even make sense in the future as
> well as be tested to conform to such information

I'm not sure I follow you.  Sure, you're right that the Phdr info
contained within the bzImage needs to be tested for correctness.  This
wouldn't normally happen when booting native, but when booting under the
most constrained environment - Xen - it will be tested (and I intend
making the Xen loader as strict as possible).  Of course, it won't help
if the Phdrs are overmap too much, but I don't think that matters too
much, so long as the mappings are not excessively large.

I'm not sure what you mean about "make sense in the future".  If you're
booting the kernel in a new paravirtualized environment, you've
presumably modified the kernel to understand that environment, and
perhaps had to update the boot image format a bit to deal with its
requirements.  I agree that updating the bzImage format may require
retesting in all the other environments, but I think that's probably
true for your scheme as well.

After all, you're assuming that the vmlinux itself provides all
necessary information to be loaded in any environment, which is not
necessarily true (it may need extra ELF notes, for example).  But if
there are any major structural changes needed in the vmlinux, then that
will be equally problematic for both directly using vmlinux and using
ELF-in-bzImage.  So I don't think your argument convincingly sways in
any particular direction.

> ) as well as cover a
> larger number of environments -- any environment where injecting data
> into memory is cheaper than execution is quite unhappy about the current
> system.  Such environments include heterogeneous embedded systems (think
> a slow CPU on a PCI card where the host CPU has direct access to the
> memory on the card) as well as simulators/emulators.
>   

Well, nothing in this scheme precludes the ELF file from being a plain
uncompressed kernel image.  If that's what these environments want, its
easy to provide with a small update to the Makefiles.

> For environments where so is appropriate it would even be possible to
> run the setup, invoke the code32_setup hook to do the decompression (and
> relocation, if appropriate) in host space.
>   

Well, that's what we currently have, and we can't break backwards
compatibility.

> This makes vmlinux (normally stripped) recoverable from the bzImage file
> and so anything that is currently booting vmlinux would be serviced by
> this scheme.
>   

I'm not sure I fully understand the mechanism you're proposing.  You
have the 16-bit setup code, the 32-bit decompressor, and an ELF.gz. Once
the decompressor has extracted the actual ELF file, are you proposing
that it properly parse the ELF file and follow its instuctions to put
the segments in the appropriate places, or are you assuming that the
decompressor can just skip that part and plonk the ELF file where it wants?

In other words, do you see the Phdrs as being descriptive or prescriptive?

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread H. Peter Anvin
Jeremy Fitzhardinge wrote:
> 
> I'm not sure I fully understand the mechanism you're proposing.  You
> have the 16-bit setup code, the 32-bit decompressor, and an ELF.gz. Once
> the decompressor has extracted the actual ELF file, are you proposing
> that it properly parse the ELF file and follow its instuctions to put
> the segments in the appropriate places, or are you assuming that the
> decompressor can just skip that part and plonk the ELF file where it wants?
> 
> In other words, do you see the Phdrs as being descriptive or prescriptive?
> 

I was thinking prescriptive, having the decompressor read the output
stream and interpret it as ELF.  I guess a descriptive approach could be
made to work, too (I haven't really thought about that avenue of
approach), but the prescriptive model seems more powerful, at least to me.

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
> I was thinking prescriptive, having the decompressor read the output
> stream and interpret it as ELF.  I guess a descriptive approach could be
> made to work, too (I haven't really thought about that avenue of
> approach), but the prescriptive model seems more powerful, at least to me.

Certainly, but much harder to implement.  The ELF parser needs to be
prepared to move itself around to get out of the way of the ELF file. 
It's a fairly large change from how it works now.

I was thinking of making the ELF file entirely descriptive, since its
just a set of ELF headers inserted into the existing bzImage structure,
and it still relies on the bzImage being build properly in the first place.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread H. Peter Anvin
Jeremy Fitzhardinge wrote:
> 
> Certainly, but much harder to implement.  The ELF parser needs to be
> prepared to move itself around to get out of the way of the ELF file. 
> It's a fairly large change from how it works now.
> 

It doesn't if we simply declare that a certain chunk of memory is
available to it, for the case where it runs in the native configuration.
Since it doesn't have to support *any* ELF file, just the kernel one,
that's an option.

On the other hand, I guess with the decompressor/ELF parser being PIC,
one would simply look for the highest used address, and relocate itself
above that point.  It's not really all that different from what the
decompressor does today, except that it knows the address a priori.

> I was thinking of making the ELF file entirely descriptive, since its
> just a set of ELF headers inserted into the existing bzImage structure,
> and it still relies on the bzImage being build properly in the first place.

Again, it's an option.  The downside is that you don't get the automatic
test coverage of having it be exercised as often as possible.

-hpa

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread Jeremy Fitzhardinge
H. Peter Anvin wrote:
> It doesn't if we simply declare that a certain chunk of memory is
> available to it, for the case where it runs in the native configuration.
> Since it doesn't have to support *any* ELF file, just the kernel one,
> that's an option.
>   

I suppose.  But given that its always built at the same time as - and
linked to - the kernel itself, it can have private knowledge about the
kernel.

> On the other hand, I guess with the decompressor/ELF parser being PIC,
> one would simply look for the highest used address, and relocate itself
> above that point.  It's not really all that different from what the
> decompressor does today, except that it knows the address a priori.
>   

Yes, it would have to decompress the ELF file into a temp buffer, and
then rearrange itself and the decompressed ELF file to make space for
the ELF file's final location.  Seems a bit more complex because it has
to be done in the middle of execution rather that at start of day.  But
perhaps that doesn't matter very much.

>> I was thinking of making the ELF file entirely descriptive, since its
>> just a set of ELF headers inserted into the existing bzImage structure,
>> and it still relies on the bzImage being build properly in the first place.
>> 
>
> Again, it's an option.  The downside is that you don't get the automatic
> test coverage of having it be exercised as often as possible.

I don't follow your argument at all.

I'm proposing the kernel take the same code path regardless of how its
booted, with the only two variations:

   1. boot all the way up from 16-bit mode, or
   2. start directly in 32-bit mode

which is essentially the current situation (setup vs code32_start).  All
I'm adding is a bit more metadata for the domain builder to work with. 
The code will get exercised on every boot in every environment, and the
metadata will be tested by whichever environment cares about it.

You're proposing that we add a third booting variation, where the
bootloader takes on the responsibility for decompressing and loading the
kernel's ELF image.  In addition, you're proposing changing the existing
32-bit portion of the boot to perform the same job as the third method,
but in a way which is not reusable by a paravirtual domain builder. 
This means that the boot path is unique for each boot environment, and
so will overall get less coverage.

Given that one axis of the test matrix - "number of subarchtectures" -
is the same in both cases, and the other axis - "number of ways of
booting" - is larger in your proposal, it seems to me that your's has
the higher testing burden.

Anyway, I added an extra pointer in the boot_params so that you can
implement it that way if you really want (no real reason you can have
ELF within ELF within bzImage, but it starts to look a bit
engineering-by-compromise at that point).  It isn't, however, the
approach I want to take with Xen.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread Rob Landley
On Wednesday 06 June 2007 7:41 pm, H. Peter Anvin wrote:
> This makes vmlinux (normally stripped) recoverable from the bzImage file
> and so anything that is currently booting vmlinux would be serviced by
> this scheme.

Would this make it sane to strip the initramfs image out of vmlinux with 
objdump and replace it with another one, or are there offsets resolved during 
the build that stop that for vmlinux?

Rob
-- 
The Google cluster became self-aware at 2:14am EDT August 29, 2007...
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH RFC 6/7] i386: make the bzImage payload an ELF file

2007-06-06 Thread H. Peter Anvin
Rob Landley wrote:
> On Wednesday 06 June 2007 7:41 pm, H. Peter Anvin wrote:
>> This makes vmlinux (normally stripped) recoverable from the bzImage file
>> and so anything that is currently booting vmlinux would be serviced by
>> this scheme.
> 
> Would this make it sane to strip the initramfs image out of vmlinux with 
> objdump and replace it with another one, or are there offsets resolved during 
> the build that stop that for vmlinux?
> 

There probably are offsets resolved during the build.  However, that
wouldn't be all that hard to fix.  Still, one can argue whether or not
it is sane under any definition to do this kind of unpacking-repacking
of ELF files.

-hpa

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization