On 01/22/15 03:32, Jan Beulich wrote:
>>>> On 21.01.15 at 18:52, <dsl...@verizon.com> wrote:
>> On 01/16/15 05:09, Jan Beulich wrote:
>>>>>> On 03.10.14 at 00:40, <dsl...@verizon.com> wrote:
>>>> This is a new domain_create() flag, DOMCRF_vmware_port.  It is
>>>> passed to domctl as XEN_DOMCTL_CDF_vmware_port.
>>> Can you explain why a HVM param isn't suitable here?
>>>
>> The issue is that you need this flag during construct_vmcb() and
>> construct_vmcs().  While Intel has vmx_update_exception_bitmap()
>> AMD does not.  So when HVM param's are setup and/or changed there
>> currently is no way to adjust AMD's exception bitmap.
>>
>> So this is the simpler way.
> But the less desirable one from a design/consistency perspective.
> Unless other maintainers disagree, I'd like to see this changed.

Ok, but I will wait some time to see whether other maintainers disagree.

>>>> This is both a more complete support than is currently provided by
>>>> QEMU and/or KVM and less.  The missing part requires QEMU changes
>>>> and has been left out until the QEMU patches are accepted upstream.
>>> I vaguely recall the question having been asked before, but I can't
>>> find the answer to it: If qemu has support for this, why can't
>>> you build on that rather than adding everything in the hypervisor?
>>>
>> The v10 version of this patch set (which is waiting for an adjusted
>> QEMU; the released 2.2.0 is one) does use QEMU for more VMware port
>> support.  The issues are:
> Was there a newer version of these posted than the v8 I looked at?
> If so, I must have overlooked the posting (as otherwise I would of
> course have commented on the newer version).
>

The newer version was being worked on, but had not been posted (and had no
changes to this patch).  Since it was never posted, I will just continue
getting v9 (instead of v10) into shape to post.


>> 1) QEMU needs access to parts of CPU registers to handle VMware port.
>> 2) You need to allow ring 3 access to this 1 I/O port.
>> 3) There is more state in xen that would need to also be sent to
>>    QEMU if all support is moved to QEMU.
> Understood.
>
>>>> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>>>>       return;
>>>>   }
>>>>
>>>> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> +                                    struct vcpu *v)
>>>> +{
>>>> +    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>>>> +    /*
>>>> +     * Just use 15 for the instruction length; vmport_gp_check will
>>>> +     * adjust it.  This is because
>>>> +     * __get_instruction_length_from_list() has issues, and may
>>>> +     * require a double read of the instruction bytes.  At some
>>>> +     * point a new routine could be added that is based on the code
>>>> +     * in vmport_gp_check with extensions to make it more general.
>>>> +     * Since that routine is the only user of this code this can be
>>>> +     * done later.
>>>> +     */
>>>> +    unsigned long inst_len = 15;
>>> Surely this can be unsigned int?
>> The code is smaller this way.  In vmx_vmexit_gp_intercept():
>>
>>     unsigned long inst_len;
>> ...
>>     __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>> ...
>>     rc = vmport_gp_check(regs, v, &inst_len, inst_addr,
>> ...
>>
>> So changing the argument to vmport_gp_check() to "unsigned int" would
>> add code there.
> So be it then. Generic code shouldn't use odd types just because of
> vendor specific code needs it, unless this makes things _a lot_ more
> complicated.
>

Ok.  Since it looks like I will not be using get_instruction_length(), I will
change this to "unsigned int" (or should I use "unsigned short" or
"unsigned byte"?).

>>>> +    int rc = X86EMUL_OKAY;
>>>> +
>>>> +    if ( regs->_eax == BDOOR_MAGIC )
>>> With this, is handling other than 32-bit in/out really meaningful/
>>> correct?
>>>
>> Yes. Harder to use, but since VMware allows it, I allow it also.
> But then a comment explaining the non-architectural (from an
> instruction set perspective) behavior is the minimum that's
> needed for future readers (and reviewers) to understand this.

Ok will add.


>>>> +        case BDOOR_CMD_GETHWVERSION:
>>>> +            /* vmware_hw */
>>>> +            regs->_eax = 0;
>>>> +            if ( is_hvm_vcpu(curr) )
>>> Since you can't get here for PV, I can't see what you need this
>>> conditional for.
>>>
>> Since I was not 100% sure, I was being safe.  Would converting
>> this to be a "debug=y" check be ok?
> ASSERT() would indeed be the right vehicle.
>

Will do.

>>>> +            {
>>>> +                struct hvm_domain *hd = &d->arch.hvm_domain;
>>>> +
>>>> +                regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
>>>> +            }
>>>> +            if ( !regs->_eax )
>>>> +                regs->_eax = 4;  /* Act like version 4 */
>>> Why version 4?
>> That is the 1st version that VMware was more consistent in the handling
>> of the "VMware hardware version".  Any value between 1 and 6 would be
>> ok.  This should only happen in strange configs.
> Please make the comment say so then.
>

Will do.

>>>> +            /* hostUsecs */
>>>> +            regs->_ebx = value / 1000000ULL;
>>>> +            /* maxTimeLag */
>>>> +            regs->_ecx = 1000000;
>>>> +            break;
>>> Perhaps this should share code with BDOOR_CMD_GETTIME; I have
>>> to admit though that I can't make any sense of why the latter one
>>> has a FULL suffix when it returns _less_ information.
>>>
>> Sharing of code is not simple.
>> Since I did not pick the names, VMware did.
>>
>> Bug found.  The FULL variant returns data in si & dx.
>> Will fix.  This also makes sharing more complex than not.
> Of course if the current code is incomplete, sharing makes less sense
> once completed.
>
>>>> +        unsigned char bytes[MAX_INST_LEN];
>>>> +        unsigned int fetch_len;
>>>> +        int frc;
>>>> +
>>>> +        /* in or out are limited to 32bits */
>>>> +        if ( byte_cnt > 4 )
>>>> +            byte_cnt = 4;
>>>> +
>>>> +        /*
>>>> +         * Fetch up to the next page break; we'll fetch from the
>>>> +         * next page later if we have to.
>>>> +         */
>>>> +        fetch_len = min_t(unsigned int, *inst_len,
>>>> +                          PAGE_SIZE - (inst_addr  & ~PAGE_MASK));
>>>> +        frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
>>>> +                                                PFEC_page_present);
>>>> +        if ( frc != HVMCOPY_okay )
>>>> +        {
>>>> +            gdprintk(XENLOG_WARNING,
>>>> +                     "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
>>>> +                     (unsigned long) inst_addr, frc, *inst_len, fetch_len);
>>> Pointless cast. But the value of log messages like this one is
>>> questionable anyway.
>>>
>> Will drop cast.  I am not sure it is possible to get here. The best I
>> have come up with is to change the GDT entry for CS to fault, then do
>> this instruction.  Not sure it would fault, and clearly is an attempt
>> to break in.
>>
>> I do know that if Xen is running under VMware (Why anyone would do
>> this?) this is possible.
>>
>> With all this, should I just drop this message (or make it debug=y
>> only)?
> Yes - dropping would be preferred by me, but I'd accept a debug=y
> only one too.

Ok, will drop.
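
For illustration only (not part of the patch): a minimal sketch of the
"fetch up to the next page break" clamp from the quoted hunk.  The names
clamp_to_page(), SKETCH_PAGE_SIZE and SKETCH_PAGE_MASK are hypothetical
stand-ins, and a 4 KiB page size is assumed.

```c
#include <assert.h>

/* Assumed 4 KiB pages; stand-ins for PAGE_SIZE / PAGE_MASK. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Return how many bytes, at most 'want', can be fetched starting at
 * 'addr' without crossing into the next page.
 */
static unsigned int clamp_to_page(unsigned long addr, unsigned int want)
{
    /* Bytes left in the current page, always in 1..SKETCH_PAGE_SIZE. */
    unsigned long left = SKETCH_PAGE_SIZE - (addr & ~SKETCH_PAGE_MASK);

    assert(left >= 1 && left <= SKETCH_PAGE_SIZE);
    return want < left ? want : (unsigned int)left;
}
```

With a 15-byte request starting 3 bytes before a page end, only 3 bytes
are fetched first; the remainder would come from the next page if needed.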

>>>> +
>>>> +        /* Only adjust byte_cnt 1 time */
>>>> +        if ( bytes[0] == 0x66 )     /* operand size prefix */
>>>> +        {
>>>> +            if ( byte_cnt == 4 )
>>>> +                byte_cnt = 2;
>>>> +            else
>>>> +                byte_cnt = 4;
>>>> +        }
>>> Iirc REX.W set following 0x66 cancels the effect of the latter. Another
>>> thing x86emul would be taking care of for you if you used it.
>> I did not know this.  Most of my testing was done without any check
>> for prefix(s).  I.E. (Open) VMware Tools only uses the inl.  I do
>> not know of anybody using 16bit segments and VMware tools.
> But this isn't the perspective to take when adding code to the
> hypervisor - you should always consider what a (perhaps
> malicious) guest could do.

Ok, but my read of this statement does not help decide which way
to go.  I see several options:

1) Only allow #GP to work for 0xed (inl (%dx),%eax).
    Pros: No attack surface for malicious guest.
          No need for get_instruction_length().
          No need for Intel to confirm the necessary hardware
          behaviour as being architectural.
    Cons: There may exist user apps that work on VMware and
          not on Xen (16-bit segments, real mode, vm86, etc.).

2) Only allow #GP to work for all 4 I/O instructions without prefix.
    Pros: No attack surface for malicious guest.
          No need for get_instruction_length().
          No need for Intel to confirm the necessary hardware
          behaviour as being architectural.
    Cons: There may exist user apps that work on VMware and
          not on Xen (16-bit segments, real mode, vm86, etc.).

3) Only allow zero or one 0x66 prefix and 0xed (inl (%dx),%eax).
    Pros: No attack surface for malicious guest.
          No need for get_instruction_length().
          No need for Intel to confirm the necessary hardware
          behaviour as being architectural.
    Cons: There may exist user apps that work on VMware and
          not on Xen (using too many prefixes, using other
          opcodes).

4) Only allow zero or one 0x66 prefix and all 4 I/O instructions.
    Pros: No attack surface for malicious guest.
          No need for get_instruction_length().
          No need for Intel to confirm the necessary hardware
          behaviour as being architectural.
    Cons: There may exist user apps that work on VMware and
          not on Xen (using too many prefixes).

5) Only allow zero to 14 0x66 prefixes and 0xed (inl (%dx),%eax).
    Pros: No attack surface for malicious guest.
    Cons: There may exist user apps that work on VMware and
          not on Xen (using unneeded prefixes, using other
          opcodes).
    5a:   Would be cleaner with get_instruction_length() on Intel,
          but would need Intel to confirm the necessary hardware
          behaviour as being architectural.
    5b:   Always pass in MAX_INST_LEN.

6) Only allow zero to 14 0x66 prefixes and all 4 I/O instructions.
    Pros: No attack surface for malicious guest.
    Cons: There may exist user apps that work on VMware and
          not on Xen (using unneeded prefixes).
    6a:   Would be cleaner with get_instruction_length() on Intel,
          but would need Intel to confirm the necessary hardware
          behaviour as being architectural.
    6b:   Always pass in MAX_INST_LEN.

7) Add complete prefix handling, and all 4 I/O instructions.
    Pros: Limited attack surface for malicious guest (the handling
          of all prefixes greatly increases the complexity of the
          code).
    Cons: Lots of added code.
    7a:   Would be cleaner with get_instruction_length() on Intel,
          but would need Intel to confirm the necessary hardware
          behaviour as being architectural.
    7b:   Always pass in MAX_INST_LEN.

8) Use hvm_emulate_one().
    Pros: Shares code, reduces new code.
    Cons: Adds a lot of attack surface for malicious guest.

I had picked #6, you asked for #8, but I read your answer as do not
do #8.

I would be happy to go with any of the 8 ways (or a way I did not list
above); I just need to know which one to focus on.
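
To make the comparison concrete, here is a rough sketch of what a decoder
along the lines of option 6b could look like.  It is illustrative only
and not taken from the patch: decode_port_insn(), SKETCH_MAX_INST_LEN
and the def_opsize parameter (default operand size of the code segment,
in bytes) are hypothetical names.  REX and all other prefixes are simply
rejected here, which is an assumption of this sketch and not a statement
about what the hardware does with them.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_INST_LEN 15

/*
 * Accept zero to 14 0x66 prefixes followed by one of the four
 * in/out (%dx) opcodes 0xec..0xef.  On success, return the instruction
 * length and store the access width in *byte_cnt.  Return 0 for
 * anything else (other prefixes, other opcodes, truncated buffer).
 */
static unsigned int decode_port_insn(const unsigned char *bytes, size_t len,
                                     unsigned int def_opsize,
                                     unsigned int *byte_cnt)
{
    size_t i = 0;
    int opsize_override = 0;

    assert(def_opsize == 2 || def_opsize == 4);

    if ( len > SKETCH_MAX_INST_LEN )
        len = SKETCH_MAX_INST_LEN;

    /* Repeated 0x66 prefixes count once: they set a flag, not a toggle. */
    while ( i < len && bytes[i] == 0x66 )
    {
        opsize_override = 1;
        i++;
    }
    if ( i >= len )
        return 0; /* No room left for an opcode. */

    switch ( bytes[i] )
    {
    case 0xec: /* in  (%dx),%al */
    case 0xee: /* out %al,(%dx) */
        *byte_cnt = 1;
        break;
    case 0xed: /* in  (%dx),%eax or %ax */
    case 0xef: /* out %eax or %ax,(%dx) */
        /* The override flips between the 2 and 4 byte operand sizes. */
        *byte_cnt = opsize_override ? (6 - def_opsize) : def_opsize;
        break;
    default:
        return 0;
    }

    return (unsigned int)(i + 1);
}
```

Because the caller always passes a bounded buffer, nothing here depends
on the hardware-reported instruction length.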


>>>> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> +                                    struct vcpu *v)
>>>> +{
>>>> +    unsigned long exit_qualification;
>>>> +    unsigned long inst_len;
>>>> +    unsigned long inst_addr = vmx_rip2pointer(regs, v);
>>>> +    unsigned long ecode;
>>>> +    int rc;
>>>> +#ifndef NDEBUG
>>>> +    unsigned long orig_inst_len;
>>>> +    unsigned long vector;
>>>> +
>>>> +    __vmread(VM_EXIT_INTR_INFO, &vector);
>>>> +    BUG_ON(!(vector & INTR_INFO_VALID_MASK));
>>>> +    BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
>>>> +#endif
>>> If you use ASSERT() instead of BUG_ON(), I think you can avoid most
>>> of this preprocessor conditional.
>>>
>> I do not see how.  vector only exists in "debug=y".  So yes if using
>> ASSERT() I could move the #endif up 2 lines, but that does not
>> look better to me.
> I don't follow - ASSERT() is intentionally coded in a way such that
> variables used only by it don't cause compiler warnings. And the
> optimizer ought to be able to eliminate the then unnecessary
> __vmread().
>

I am more used to explicit conditional code and to not depending on the
compiler to optimize the code correctly.  Will change.

Since the most likely case is that I will stop using
get_instruction_length() (__vmread(VM_EXIT_INSTRUCTION_LEN,...)), this
also drops the need for orig_inst_len.
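
For reference, a small sketch of the idea as I understand it.  This is
one common way to write such a macro, not a copy of Xen's definition:
SKETCH_ASSERT, SKETCH_DEBUG, sketch_read_info() and the two mask values
are all hypothetical stand-ins.  In the non-debug variant the expression
is still compiled, so a variable used only inside it counts as used, but
it is never evaluated.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(p)                                         \
    do {                                                         \
        if ( !(p) )                                              \
        {                                                        \
            fprintf(stderr, "Assertion '%s' failed\n", #p);      \
            abort();                                             \
        }                                                        \
    } while (0)
#else
/* Compiled but never evaluated: no "unused variable" warning. */
#define SKETCH_ASSERT(p) do { if ( 0 && (p) ) {} } while (0)
#endif

/* Hypothetical mask values, for illustration only. */
#define SKETCH_VALID_MASK        0x80000000UL
#define SKETCH_DELIVER_CODE_MASK 0x00000800UL

/* Hypothetical stand-in for reading the exit interruption info. */
static unsigned long sketch_read_info(void)
{
    return 0x80000b0dUL; /* valid, error code delivered, vector 13 */
}

static int sketch_intercept(void)
{
    unsigned long vector = sketch_read_info();

    /* No #ifndef NDEBUG block needed around 'vector'. */
    SKETCH_ASSERT(vector & SKETCH_VALID_MASK);
    SKETCH_ASSERT(vector & SKETCH_DELIVER_CODE_MASK);

    return 0;
}
```

Whether the read itself is optimized away in a non-debug build is then
up to the compiler, which is the part I was wary of depending on.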

>>>> +    __vmread(EXIT_QUALIFICATION, &exit_qualification);
>>>> +    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>>> get_instruction_length(). But is it architecturally defined that
>>> #GP intercept vmexits actually set this field?
>>>
>> I could not find a clear statement.
> That's the point of my comment.
>
>>  My reading of (directly out of
>> "Intel® 64 and IA-32 Architectures
>> Software Developer’s Manual
>> Volume 3 (3A, 3B & 3C):
>> [...]
>> to me says that yes, this field is set on a #GP exit on an IN.  But the
>> #GP case is not called out by name.
> And it is not any of the cases mentioned.
>
>> My read is that a #GP fault is a "VM Exits Unconditionally" based on the
>> setting of the exception bit mask.
> Right, but it's not exactly an instruction based exit. Unless Intel
> confirms that your extending of what the manual says is correct, I'd
> rather recommend against relying on unspecified behavior. If
> any CPU model ends up behaving differently, this might be
> rather hard to diagnose I'm afraid.
>
>> So not using get_instruction_length() does avoid a possible BUG_ON()
>> if I am wrong.
>>
>> So there are 3 options here:
>> 1) Add an ASSERT() like the BUG_ON() in get_instruction_length()
>> 2) Switch to using get_instruction_length()
>> 3) Switch to using MAX_INST_LEN.
>>
>> Let me know which way to go.
> As said above - use get_instruction_length() if Intel confirms the
> necessary hardware behavior as being architectural. If they
> don't, 3) looks like the only viable option.


So what is the procedure for getting "Intel confirms the necessary hardware
behaviour as being architectural"?



>>>> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>>>>               if ( v->fpu_dirtied )
>>>>                   nvcpu->nv_vmexit_pending = 1;
>>>>           }
>>>> +        else if ( vector == TRAP_gp_fault )
>>>> +            nvcpu->nv_vmexit_pending = 1;
>>> Doesn't that mean an unconditional vmexit even if the L1 hypervisor
>>> didn't ask for such?
>> It might.  I have not done any testing here for the nested VMX case.
>> I could just drop this for now and decide what to do for this code later.
> If dropping the code is safe without also forbidding the combination
> of nested and VMware emulation.

Will have to do a lot more testing to know.  At the time I started the
coding it was still considered experimental.  Looks like for 4.6 I will
need it to be fully unit tested.

    -Don Slutz


> Jan

