On 01/22/15 03:32, Jan Beulich wrote:
>>>> On 21.01.15 at 18:52, <dsl...@verizon.com> wrote:
>> On 01/16/15 05:09, Jan Beulich wrote:
>>>>>> On 03.10.14 at 00:40, <dsl...@verizon.com> wrote:
>>>> This is a new domain_create() flag, DOMCRF_vmware_port. It is
>>>> passed to domctl as XEN_DOMCTL_CDF_vmware_port.
>>> Can you explain why a HVM param isn't suitable here?
>>>
>> The issue is that you need this flag during construct_vmcb() and
>> construct_vmcs(). While Intel has vmx_update_exception_bitmap()
>> AMD does not. So when HVM param's are setup and/or changed there
>> currently is no way to adjust AMD's exception bitmap.
>>
>> So this is the simpler way.
> But the less desirable one from a design/consistency perspective.
> Unless other maintainers disagree, I'd like to see this changed.

Ok, but I will wait some time to see if the "Unless other maintainers
disagree" case applies.

>>>> This is both a more complete support then in currently provided by
>>>> QEMU and/or KVM and less. The missing part requires QEMU changes
>>>> and has been left out until the QEMU patches are accepted upstream.
>>> I vaguely recall the question having been asked before, but I can't
>>> find it to the answer to it: If qemu has support for this, why can't
>>> you build on that rather than adding everything in the hypervisor?
>>>
>> The v10 version of this patch set (which is waiting for an adjusted
>> QEMU (the released 2.2.0 is one) does use QEMU for more VMware port
>> support. The issues are:
> Was there a newer version of these posted than the v8 I looked at?
> If so, I must have overlooked the posting (as otherwise I would of
> course have commented on the newer version).
>

The newer version was being worked on, but had not been posted (and had
no changes to this patch). Since it was never posted, I will just
continue getting v9 (instead of v10) into shape to post.

>> 1) QEMU needs access to parts of CPU registers to handle VMware port.
>> 2) You need to allow ring 3 access to this 1 I/O port.
>> 3) There is more state in xen that would need to also be sent to
>> QEMU if all support is moved to QEMU.
> Understood.
>
>>>> @@ -2111,6 +2112,31 @@ svm_vmexit_do_vmsave(struct vmcb_struct *vmcb,
>>>>      return;
>>>>  }
>>>>
>>>> +static void svm_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> +                                    struct vcpu *v)
>>>> +{
>>>> +    struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb;
>>>> +    /*
>>>> +     * Just use 15 for the instruction length; vmport_gp_check will
>>>> +     * adjust it. This is because
>>>> +     * __get_instruction_length_from_list() has issues, and may
>>>> +     * require a double read of the instruction bytes. At some
>>>> +     * point a new routine could be added that is based on the code
>>>> +     * in vmport_gp_check with extensions to make it more general.
>>>> +     * Since that routine is the only user of this code this can be
>>>> +     * done later.
>>>> +     */
>>>> +    unsigned long inst_len = 15;
>>> Surely this can be unsigned int?
>> The code is smaller this way. In vmx_vmexit_gp_intercept():
>>
>> unsigned long inst_len;
>> ...
>> __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>> ...
>> rc = vmport_gp_check(regs, v, &inst_len, inst_addr,
>> ...
>>
>> So changing the argument to vmport_gp_check() to "unsigned int" would
>> add code there.
> So be it then. Generic code shouldn't use odd types just because of
> vendor specific code needs it, unless this makes things _a lot_ more
> complicated.
>

Ok. Since it looks like I will not be using get_instruction_length(),
I will change this to "unsigned int" (or should I use "unsigned short"
or "unsigned byte"?).

>>>> +    int rc = X86EMUL_OKAY;
>>>> +
>>>> +    if ( regs->_eax == BDOOR_MAGIC )
>>> With this, is handling other than 32-bit in/out really meaningful/
>>> correct?
>>>
>> Yes. Harder to use, but since VMware allows it, I allow it also.
> But then a comment explaining the non-architectural (from an
> instruction set perspective) behavior is the minimum that's
> needed for future readers (and reviewers) to understand this.

Ok, will add.

>>>> +    case BDOOR_CMD_GETHWVERSION:
>>>> +        /* vmware_hw */
>>>> +        regs->_eax = 0;
>>>> +        if ( is_hvm_vcpu(curr) )
>>> Since you can't get here for PV, I can't see what you need this
>>> conditional for.
>>>
>> Since I was not 100% sure, I was being safe. Would converting
>> this to be a "debug=y" check be ok?
> ASSERT() would indeed be the right vehicle.
>

Will do.

>>>> +        {
>>>> +            struct hvm_domain *hd = &d->arch.hvm_domain;
>>>> +
>>>> +            regs->_eax = hd->params[HVM_PARAM_VMWARE_HW];
>>>> +        }
>>>> +        if ( !regs->_eax )
>>>> +            regs->_eax = 4; /* Act like version 4 */
>>> Why version 4?
>> That is the 1st version that VMware was more consistent in the handling
>> of the "VMware hardware version".
>> Any value between 1 and 6 would be ok. This should only happen in
>> strange configs.
> Please make the comment say so then.
>

Will do.

>>>> +        /* hostUsecs */
>>>> +        regs->_ebx = value / 1000000ULL;
>>>> +        /* maxTimeLag */
>>>> +        regs->_ecx = 1000000;
>>>> +        break;
>>> Perhaps this should share code with BDOOR_CMD_GETTIME; I have
>>> to admit though that I can't make any sense of why the latter one
>>> has a FULL suffix when it returns _less_ information.
>>>
>> Sharing of code is not simple.
>> Since I did not pick the names, VMware did.
>>
>> Bug found. The full returns data in si & dx.
>> will fix. And also makes sharing more complex then not.
> Of course if the current code is incomplete, sharing makes less sense
> once completed.
>
>>>> +    unsigned char bytes[MAX_INST_LEN];
>>>> +    unsigned int fetch_len;
>>>> +    int frc;
>>>> +
>>>> +    /* in or out are limited to 32bits */
>>>> +    if ( byte_cnt > 4 )
>>>> +        byte_cnt = 4;
>>>> +
>>>> +    /*
>>>> +     * Fetch up to the next page break; we'll fetch from the
>>>> +     * next page later if we have to.
>>>> +     */
>>>> +    fetch_len = min_t(unsigned int, *inst_len,
>>>> +                      PAGE_SIZE - (inst_addr & ~PAGE_MASK));
>>>> +    frc = hvm_fetch_from_guest_virt_nofault(bytes, inst_addr, fetch_len,
>>>> +                                            PFEC_page_present);
>>>> +    if ( frc != HVMCOPY_okay )
>>>> +    {
>>>> +        gdprintk(XENLOG_WARNING,
>>>> +                 "Bad instruction fetch at %#lx (frc=%d il=%lu fl=%u)\n",
>>>> +                 (unsigned long) inst_addr, frc, *inst_len, fetch_len);
>>> Pointless cast. But the value of log messages like this one is
>>> questionable anyway.
>>>
>> Will drop cast. I am not sure it is possible to get here. The best I
>> have come up with is to change the GDT entry for CS to fault, then do
>> this instruction. Not sure it would fault, and clearly is an attempt
>> to break in.
>>
>> I do know that if Xen is running under VMware (Why anyone would do
>> this?) this is possible.
>>
>> With all this, should I just drop this message (or make it debug=y
>> only)?
> Yes - dropping would be preferred by me, but I'd accept a debug=y
> only one too.

Ok, will drop.

>>>> +
>>>> +    /* Only adjust byte_cnt 1 time */
>>>> +    if ( bytes[0] == 0x66 ) /* operand size prefix */
>>>> +    {
>>>> +        if ( byte_cnt == 4 )
>>>> +            byte_cnt = 2;
>>>> +        else
>>>> +            byte_cnt = 4;
>>>> +    }
>>> Iirc REX.W set following 0x66 cancels the effect of the latter. Another
>>> thing x86emul would be taking care of for you if you used it.
>> I did not know this. Most of my testing was done without any check
>> for prefix(s). I.E. (Open) VMware Tools only uses the inl. I do
>> not know of anybody using 16bit segments and VMware tools.
> But this isn't the perspective to take when adding code to the
> hypervisor - you should always consider what a (perhaps
> malicious) guest could do.

Ok, but my read of this statement does not help decide which way to go.
I see several options:

1) Only allow #GP to work for 0xed (inl (%dx),%eax).

   Pros:
     No attack surface for malicious guest.
     No need for get_instruction_length().
     No need for Intel to confirm the necessary hardware behaviour
     as being architectural.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (16bit segments, realmode, vm86, etc).

2) Only allow #GP to work for all 4 I/O instructions without prefix.

   Pros:
     No attack surface for malicious guest.
     No need for get_instruction_length().
     No need for Intel to confirm the necessary hardware behaviour
     as being architectural.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (16bit segments, realmode, vm86, etc).

3) Only allow zero or one 0x66 prefix and 0xed (inl (%dx),%eax).

   Pros:
     No attack surface for malicious guest.
     No need for get_instruction_length().
     No need for Intel to confirm the necessary hardware behaviour
     as being architectural.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (using too many prefixes, using other opcodes).

4) Only allow zero or one 0x66 prefix and all 4 I/O instructions.

   Pros:
     No attack surface for malicious guest.
     No need for get_instruction_length().
     No need for Intel to confirm the necessary hardware behaviour
     as being architectural.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (using too many prefixes).

5) Only allow zero to 14 0x66 prefixes and 0xed (inl (%dx),%eax).

   Pros:
     No attack surface for malicious guest.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (using unneeded prefixes, using other opcodes).

   5a: Would be cleaner with get_instruction_length() on Intel, but
       would need Intel to confirm the necessary hardware behaviour
       as being architectural.
   5b: Always pass in MAX_INST_LEN.

6) Only allow zero to 14 0x66 prefixes and all 4 I/O instructions.

   Pros:
     No attack surface for malicious guest.

   Cons:
     There may exist user apps. that work on VMware and not on xen
     (using unneeded prefixes).

   6a: Would be cleaner with get_instruction_length() on Intel, but
       would need Intel to confirm the necessary hardware behaviour
       as being architectural.
   6b: Always pass in MAX_INST_LEN.

7) Add complete prefix handling, and all 4 I/O instructions.

   Pros:
     Limited attack surface for malicious guest (the handling of
     all prefixes greatly increases the complexity of the code).

   Cons:
     Lots of added code.

   7a: Would be cleaner with get_instruction_length() on Intel, but
       would need Intel to confirm the necessary hardware behaviour
       as being architectural.
   7b: Always pass in MAX_INST_LEN.

8) Use hvm_emulate_one().

   Pros:
     Shares code, reduces new code.

   Cons:
     Adds a lot of attack surface for malicious guest.

I had picked #6, you asked for #8, but I read your answer as do not do
#8. I would be happy to go with any of the 8 ways (or a way I did not
list above); I just need to know which one to focus on.
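
For reference, here is a minimal sketch of what the decode for option 6
could look like. It is not code from the patch: vmport_decode() and
struct vmport_insn are hypothetical names used only for illustration.
It assumes a 32-bit default operand size (no 16-bit segments, no REX
handling) and that repeated 0x66 prefixes have no additional effect,
matching the "Only adjust byte_cnt 1 time" comment in the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INST_LEN   15    /* x86 instruction length limit */
#define OPSIZE_PREFIX  0x66  /* operand size override prefix */

/* Hypothetical result type for this sketch; not part of the patch. */
struct vmport_insn {
    unsigned int len;       /* total instruction length in bytes */
    unsigned int byte_cnt;  /* I/O access width: 1, 2 or 4 */
    int is_out;             /* 1 for OUT, 0 for IN */
};

/*
 * Hypothetical helper: returns 1 and fills *out if the bytes start with
 * zero to 14 0x66 prefixes followed by 0xec/0xed/0xee/0xef, otherwise
 * returns 0. Assumes a 32-bit default operand size.
 */
static int vmport_decode(const uint8_t *bytes, size_t avail,
                         struct vmport_insn *out)
{
    size_t i = 0;
    int have_prefix;
    uint8_t opcode;

    assert(bytes != NULL && out != NULL);

    /* Never look past the architectural instruction length limit. */
    if ( avail > MAX_INST_LEN )
        avail = MAX_INST_LEN;

    /* Skip operand size prefixes; repeats are treated as a single one. */
    while ( i < avail && bytes[i] == OPSIZE_PREFIX )
        i++;

    if ( i >= avail )
        return 0;  /* ran out of bytes before finding an opcode */

    have_prefix = (i > 0);
    opcode = bytes[i];

    switch ( opcode )
    {
    case 0xec: /* in  (%dx),%al */
    case 0xee: /* out %al,(%dx) */
        out->byte_cnt = 1;
        break;
    case 0xed: /* in  (%dx),%eax (or %ax with 0x66) */
    case 0xef: /* out %eax,(%dx) (or %ax with 0x66) */
        out->byte_cnt = have_prefix ? 2 : 4;
        break;
    default:
        return 0;  /* anything else is left to the normal #GP path */
    }

    out->is_out = (opcode & 0x02) != 0;  /* 0xee and 0xef are OUT */
    out->len = (unsigned int)i + 1;
    return 1;
}
```

With something like this, the #GP intercept would only need the fetched
bytes and could compute the instruction length itself (the 6b variant),
so it would not depend on VM_EXIT_INSTRUCTION_LEN being set.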

>>>> +static void vmx_vmexit_gp_intercept(struct cpu_user_regs *regs,
>>>> +                                    struct vcpu *v)
>>>> +{
>>>> +    unsigned long exit_qualification;
>>>> +    unsigned long inst_len;
>>>> +    unsigned long inst_addr = vmx_rip2pointer(regs, v);
>>>> +    unsigned long ecode;
>>>> +    int rc;
>>>> +#ifndef NDEBUG
>>>> +    unsigned long orig_inst_len;
>>>> +    unsigned long vector;
>>>> +
>>>> +    __vmread(VM_EXIT_INTR_INFO, &vector);
>>>> +    BUG_ON(!(vector & INTR_INFO_VALID_MASK));
>>>> +    BUG_ON(!(vector & INTR_INFO_DELIVER_CODE_MASK));
>>>> +#endif
>>> If you use ASSERT() instead of BUG_ON(), I think you can avoid most
>>> of this preprocessor conditional.
>>>
>> I do not see how. vector only exists in "debug=y". So yes if using
>> ASSERT() I could move the #endif up 2 lines, but that does not
>> look better to me.
> I don't follow - ASSERT() is intentionally coded in a way such that
> variables used only by it don't cause compiler warnings. And the
> optimizer ought to be able to eliminate the then unnecessary
> __vmread().
>

I am more used to explicit conditional code and to not depending on the
compiler to optimize the code correctly. Will change.

Since the most likely case is that I will stop using
get_instruction_length() (__vmread(VM_EXIT_INSTRUCTION_LEN,...)),
this drops the need for orig_inst_len also.

>>>> +    __vmread(EXIT_QUALIFICATION, &exit_qualification);
>>>> +    __vmread(VM_EXIT_INSTRUCTION_LEN, &inst_len);
>>> get_instruction_length(). But is it architecturally defined that
>>> #GP intercept vmexits actually set this field?
>>>
>> I could not find a clear statement.
> That's the point of my comment.
>
>> My reading of (directly out of
>> "Intel® 64 and IA-32 Architectures
>> Software Developer’s Manual
>> Volume 3 (3A, 3B & 3C):
>> [...]
>> to me says that yes, this field is set on a #GP exit on an IN. But the
>> #GP case is not called out by name.
> And it is not any of the cases mentioned.
>
>> My read is that a #GP fault is a "VM Exits Unconditionally" based on the
>> setting of the exception bit mask.
> Right, but it's not exactly an instruction based exit. Unless Intel
> confirms that your extending of the manual says is correct, I'd
> rather recommend against relying on unspecified behavior. If
> any CPU model ends up behaving differently, this might be
> rather hard to diagnose I'm afraid.
>
>> So not using get_instruction_length() does avoid a possible BUG_ON()
>> if I am wrong.
>>
>> So there are 3 options here:
>> 1) Add an ASSERT() like the BUG_ON() in get_instruction_length()
>> 2) Switch to using get_instruction_length()
>> 3) Switch to using MAX_INST_LEN.
>>
>> Let me know which way to go.
> As said above - use get_instruction_length() if Intel confirms the
> necessary hardware behavior as being architectural. If they
> don't, 3) looks like the only viable option.

So what is the procedure for getting "Intel confirms the necessary
hardware behaviour as being architectural"?

>>>> @@ -2182,6 +2183,8 @@ int nvmx_n2_vmexit_handler(struct cpu_user_regs *regs,
>>>>              if ( v->fpu_dirtied )
>>>>                  nvcpu->nv_vmexit_pending = 1;
>>>>          }
>>>> +        else if ( vector == TRAP_gp_fault )
>>>> +            nvcpu->nv_vmexit_pending = 1;
>>> Doesn't that mean an unconditional vmexit even if the L1 hypervisor
>>> didn't ask for such?
>> I might. I have not done any testing here for the nested VMX case.
>> I could just drop this for now and deside what to do for this code later.
> If dropping the code is safe without also forbidding the combination
> of nested and VMware emulation.

I will have to do a lot more testing to know. At the time I started the
coding it was still considered experimental. It looks like for 4.6 I
will need it to be fully unit tested.

   -Don Slutz

> Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel