Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-05-02 Thread Christoph Hellwig
On Tue, May 01, 2007 at 11:12:11PM +0200, Arnd Bergmann wrote:
> Not sure about that point. If you need to do atomic operations on the
> first 32 bits, you shouldn't need to invent your own abstractions for
> those, and it's highly unlikely that the implementation of atomic_t changes.

I disagree.  Using an atomic_t in a hardware structure is against all
the abstractions we've built.  It's much better to have separate macros
to atomically modify a word in this hardware strcuture, even if they end
up exactly the same as the atomic_ macros - at least this way we clearly
document their intent.


-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-05-01 Thread Avi Kivity
Arnd Bergmann wrote:
> On Sunday 29 April 2007, Heiko Carstens wrote:
>
>   
>>> Is this data structure extensible? If it is, you probably need
>>> some sort of versioning information to make sure that user space
>>> doesn't rely on fields that the kernel doesn't know about.
>>>   
>> I don't think we can put in some versioning information here. If
>> the kernel decides to increase the version then old userspace
>> code would break?
>> We rather need some mechanism so userpace can ask the kernel
>> "do you support feature x?" and dependent on the answer some
>> fields are used or unused.
>> 
>
> You could do it the way that ext2 handles compatible and incompatible
> features in the on-disk layout:
>
> Assign a number of bits in the read-only part of the mapping to flags
> that the user application can test. A bit in the compatible range mean
> that a feature is available to the user application if it wants to
> use it. A bit in the incompatible range means that the user space needs
> to understand how to use a feature in order to run correctly.
>   

The current ioctl() interface has a feature test call, basically you
send down a feature number get see if it's supported or not.  Fairly
similar to feature bit except it isn't limited in the number of bits.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-05-01 Thread Arnd Bergmann
On Sunday 29 April 2007, Heiko Carstens wrote:

> > Is this data structure extensible? If it is, you probably need
> > some sort of versioning information to make sure that user space
> > doesn't rely on fields that the kernel doesn't know about.
> 
> I don't think we can put in some versioning information here. If
> the kernel decides to increase the version then old userspace
> code would break?
> We rather need some mechanism so userpace can ask the kernel
> "do you support feature x?" and dependent on the answer some
> fields are used or unused.

You could do it the way that ext2 handles compatible and incompatible
features in the on-disk layout:

Assign a number of bits in the read-only part of the mapping to flags
that the user application can test. A bit in the compatible range mean
that a feature is available to the user application if it wants to
use it. A bit in the incompatible range means that the user space needs
to understand how to use a feature in order to run correctly.

> > > +struct sca_entry {
> > > +   atomic_t scn;
> > > +   __u64   reserved;
> > > +   __u64   sda;
> > > +   __u64   reserved2[2];
> > > +}__attribute__((packed));
> > 
> > Is this a hardware data structure? If not, you shouldn't pack it
> > because the first word is only 32 bits and consequently all following
> > ones are unaligned.
> 
> It's a hardware structure. Guess the unaligned u64s were generated when
> this structure was extended to support 64bit.
> But of course we shouldn't put an atomic_t in a hardware structure, since
> it's implementation might change.

Not sure about that point. If you need to do atomic operations on the
first 32 bits, you shouldn't need to invent your own abstractions for
those, and it's highly unlikely that the implementation of atomic_t changes.

Arnd <><

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-04-29 Thread Heiko Carstens
On Fri, Apr 27, 2007 at 06:53:40PM +0200, Arnd Bergmann wrote:
> On Friday 27 April 2007, Carsten Otte wrote:
> > Add interface which allows a process to start a virtual machine.
> 
> Looks pretty good to me already.
> 
> It seems a lot closer to lguest than to kvm at the moment concerning
> the kernel interfaces (guest real address == host process, state
> is attached to the thread_info, ...).

Guess I should take a look at lguest then :)

> > +static DEFINE_MUTEX(s390host_init_mutex);
> > +
> > +static void s390host_get_data(struct s390host_data *data)
> > +{
> > +   atomic_inc(&data->count);
> > +}
> > +
> > +void s390host_put_data(struct s390host_data *data)
> > +{
> > +   int cpu;
> > +
> > +   if (atomic_dec_return(&data->count))
> > +   return;
> 
> These should probably use a struct kref instead of an atomic_t
> to count the references.

That makes sense.

> 
> > +static unsigned long
> > +s390host_create_io_area(unsigned long addr, unsigned long flags,
> > +   unsigned long io_addr, struct s390host_data *data)
> > +{
[...]
> > +   down_write(&mm->mmap_sem);
> > +   ret = insert_vm_struct(mm, vma);
> > +   if (ret) {
> > +   kmem_cache_free(vm_area_cachep, vma);
> > +   goto out;
> > +   }
> > +   s390host_get_data(data);
> > +   mm->total_vm += 2;
> > +   vm_insert_page(vma, addr, virt_to_page(io_addr));
> > +
> > +   ret = split_vma(mm, vma, addr + PAGE_SIZE, 0);
> > +   if (ret)
> > +   goto out;
> 
> This open-coded mmap looks a bit scary. I can't see anything wrong in it,
> but it may conflict with a user application that tries to do other strange
> things with memory maps.

Well, there is at least already one bug in it. If the call to split_vma fails
there is no cleanup of the vm_insert_page that has already happened.

> Maybe you can either make this a real call to mmap(), or have it automatically
> mapped using the vdso mechanism.

So, yes we need to find something better, since this is the code that I
don't like at all.

> > +   __u16 cpu;
> 
> __u16 is a type for user space interfaces. In the kernel, use u16.

Yes.

> > +   if (current_thread_info()->sie_cpu != -1)
> > +   return -EINVAL;
> 
> -EBUSY?

Yes.

> > +   if (copy_from_user(&cpu, &sie_template->icpua, sizeof(u16)))
> > +   return -EFAULT;
> 
> get_user()?

Yes.

> > +   sca_block = data->sca_block;
> > +   if (sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))) {
> 
> __test_bit()?

Yes, but I think we should probably add some function to access the mcn
bits. Each access looks a bit complicated.
 
> > +   ret = -EINVAL;
> 
> -EBUSY?

Yes.

> > +int sys_s390host_remove_cpu(void)
> > +{
> > +   struct sca_block *sca_block;
> > +   int cpu;
> > +
> > +   cpu = current_thread_info()->sie_cpu;
> > +   if (cpu == -1)
> > +   return -EINVAL;
> > +
> > +   mutex_lock(&s390host_init_mutex);
> > +   sca_block = current_thread_info()->s390host_data->sca_block;
> > +   sca_block->mcn &= ~(1UL << (S390HOST_MAX_CPUS - 1 - cpu));
> > +   current_thread_info()->sie_cpu = -1;
> > +   mutex_unlock(&s390host_init_mutex);
> > +   return 0;
> > +}
> 
> Shouldn't this free the resources that were allocated by add_cpu?

Yes, it was supposed to do that. But since it was only a prototype I was
too lazy to code it up and so I decided to free things on exit.

> If not, this syscall seems pretty useless and I'd simply not
> do it at all -- just wait until exit() to clean up.

This reveals that we are not clearing the mcn bit on exit without
pior call to this syscall. BUG().

> > +   if (signal_pending(current)) {
> > +   ret = -EINTR;
> > +   goto out;
> > +   }
> 
> Can't you handle interrupted system calls? I think it would be nicer
> to return -ERESTARTSYS here so you don't have to go back to your
> user space after a signal.

That should work and it is better.

> > +struct sie_kernel {
> > +   struct sie_blocksie_block;
> > +   s390_fp_regshost_fpregs;
> > +   int host_acrs[NUM_ACRS];
> > +} __attribute__((packed,aligned(4096)));
> 
> Why packed?

Not needed.

> > +#define SIE_UPDATE_PSW (1UL << 0)
> > +#define SIE_FLUSH_TLB  (1UL << 1)
> > +#define SIE_ISKE   (1UL << 2)
> > +#define SIE_SSKE   (1UL << 3)
> > +#define SIE_BLOCK_UPDATE   (1UL << 4)
> > +#define SIE_VSMXM_LOCAL_UPDATE (1UL << 5)
> > +#define SIE_VSMXM_DIST_UPDATE   (1UL << 6)
> 
> I find it much more readable to define these like
> 
> #define SIE_UPDATE_PSW1
> #define SIE_FLUSH_TLB 2
> #define SIE_ISKE  4
> 
> especially when trying to interpret values in memory.

Hmm... don't know. Both make sense, but I have no preference on this.
 
> > +struct sie_user {
> > +   struct sie_blocksie_block;
> > +   psw_t   psw;
> > +   unsigned long   gprs[NUM_GPRS];
> > +   s390_fp_regsguest_fpregs;
> > +   int guest_acrs[NUM_ACRS];
> > 

Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-04-27 Thread Arnd Bergmann
On Friday 27 April 2007, Carsten Otte wrote:
> Add interface which allows a process to start a virtual machine.
 
Looks pretty good to me already.

It seems a lot closer to lguest than to kvm at the moment concerning
the kernel interfaces (guest real address == host process, state
is attached to the thread_info, ...).

Now for the nitpicking:

> +static DEFINE_MUTEX(s390host_init_mutex);
> +
> +static void s390host_get_data(struct s390host_data *data)
> +{
> + atomic_inc(&data->count);
> +}
> +
> +void s390host_put_data(struct s390host_data *data)
> +{
> + int cpu;
> +
> + if (atomic_dec_return(&data->count))
> + return;
> +
> + for (cpu = 0; cpu < S390HOST_MAX_CPUS; cpu++)
> + if (data->sie_io[cpu])
> + free_page((unsigned long)data->sie_io[cpu]);
> +
> + if (data->sca_block)
> + free_page((unsigned long)data->sca_block);
> +
> + kfree(data);
> +}

These should probably use a struct kref instead of an atomic_t
to count the references.

> +static unsigned long
> +s390host_create_io_area(unsigned long addr, unsigned long flags,
> + unsigned long io_addr, struct s390host_data *data)
> +{
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma;
> + unsigned long ret;
> +
> + flags &= MAP_FIXED;
> + addr = get_unmapped_area(NULL, addr, 2 * PAGE_SIZE, 0, flags);
> +
> + if (addr & ~PAGE_MASK)
> + return addr;
> +
> + vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
> +
> + if (!vma)
> + return -ENOMEM;
> +
> + vma->vm_mm = mm;
> + vma->vm_start = addr;
> + vma->vm_end = addr + 2 * PAGE_SIZE;
> + vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
> + vma->vm_flags |= VM_SHARED | VM_MAYSHARE | VM_DONTCOPY;
> +
> +#if 1/* FIXME: write access until sys_s390host_sie interface is 
> final */
> + vma->vm_flags |= VM_WRITE | VM_MAYWRITE;
> +#endif
> +
> + vma->vm_page_prot = protection_map[vma->vm_flags & 0xf];
> + vma->vm_private_data = data;
> + vma->vm_ops = &s390host_vmops;
> +
> + down_write(&mm->mmap_sem);
> + ret = insert_vm_struct(mm, vma);
> + if (ret) {
> + kmem_cache_free(vm_area_cachep, vma);
> + goto out;
> + }
> + s390host_get_data(data);
> + mm->total_vm += 2;
> + vm_insert_page(vma, addr, virt_to_page(io_addr));
> +
> + ret = split_vma(mm, vma, addr + PAGE_SIZE, 0);
> + if (ret)
> + goto out;
> + s390host_get_data(data);
> +
> + vma = find_vma(mm, addr + PAGE_SIZE);
> + vma->vm_flags |= VM_WRITE | VM_MAYWRITE;
> + vma->vm_page_prot = protection_map[vma->vm_flags & 0xf];
> + vm_insert_page(vma, addr + PAGE_SIZE,
> +virt_to_page(io_addr + PAGE_SIZE));
> + ret = addr;
> +out:
> + up_write(&mm->mmap_sem);
> + return ret;
> +}

This open-coded mmap looks a bit scary. I can't see anything wrong in it,
but it may conflict with a user application that tries to do other strange
things with memory maps.

Maybe you can either make this a real call to mmap(), or have it automatically
mapped using the vdso mechanism.

> +long sys_s390host_add_cpu(unsigned long addr, unsigned long flags,
> +   struct sie_block __user *sie_template)
> +{
> + struct sie_block *sie_block;
> + struct sie_io *sie_io;
> + struct sca_block *sca_block;
> + struct s390host_data *data = NULL;
> + unsigned long ret;
> + __u16 cpu;

__u16 is a type for user space interfaces. In the kernel, use u16.

> + if (current_thread_info()->sie_cpu != -1)
> + return -EINVAL;

-EBUSY?

> + if (copy_from_user(&cpu, &sie_template->icpua, sizeof(u16)))
> + return -EFAULT;

get_user()?

> + if (cpu >= S390HOST_MAX_CPUS)
> + return -EINVAL;
> +
> + mutex_lock(&s390host_init_mutex);
> +
> + data = get_s390host_context();
> + if (!data) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> +
> + sca_block = data->sca_block;
> + if (sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))) {

__test_bit()?

> + ret = -EINVAL;

-EBUSY?

> +int sys_s390host_remove_cpu(void)
> +{
> + struct sca_block *sca_block;
> + int cpu;
> +
> + cpu = current_thread_info()->sie_cpu;
> + if (cpu == -1)
> + return -EINVAL;
> +
> + mutex_lock(&s390host_init_mutex);
> + sca_block = current_thread_info()->s390host_data->sca_block;
> + sca_block->mcn &= ~(1UL << (S390HOST_MAX_CPUS - 1 - cpu));
> + current_thread_info()->sie_cpu = -1;
> + mutex_unlock(&s390host_init_mutex);
> + return 0;
> +}

Shouldn't this free the resources that were allocated by add_cpu?
If not, this syscall seems pretty useless and I'd simply not
do it at all -- just wait until exit() to clean up.

> + sie_io = current_thread_info()->s390host_data->sie_io[cpu];
> +
> + i

Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-04-27 Thread Carsten Otte

Anthony Liguori wrote:
> Can you provide a high-level description of how this virtualization 
> mechanism works? 
Very well. First of all, userspace allocates regular memory via 
malloc() or mmap()s a file. The user address space will equals the 
virtual machines address space, user space can define an address 
offset where the virtual machine starts and can define a maximum size 
of the virtual machine. Both is set in each of the SIE control blocks.
Userspace uses open/read/close to read a kernel image into its proper 
location.

Next, userspace launches a set of pthreads. Each pthread corresponds 
with a virtual CPU, and each pthread calls the sys_s390host_add_cpu 
system call.
Now the first CPU enters sys_sie to start interpretive execution via 
sys_s3900host_sie. The CPU will run the virtual cpu context for a 
while and exit again with an intercept code that indicates the reason 
for the exit. Now userspace deals with that, and enters 
sys_s390host_sie again. Later on, all other CPUs get started via an 
interprocessor signal SIGP. Userspace will also receive this as SIE 
intercept. Now all pthreads representing the different virtual CPUs 
will enter the virtual machine context simultaneously.
Interprocessor signals (need_resched and such) are handled by the 
hardware as long as the sender and receiver are both in virtual 
context. Otherwise, userspace will get control again.

In the end, when the machine is halted (or panics), all CPUs will 
return with an intercept code indicating that execution has ended for 
this CPU. Userspace will now call sys_s390host_remove_cpu to free 
things up, then userspace will exit().

> How does MMU virtualization work?
s390 does have hardware support for memory management for virutalized 
environments. The first patch of this series introduces an extension 
to the page table entry, that tracks dirty/reference bits for host and 
virtual machine seperately. For the guest, our CPU does transparently 
merge the real dirty and reference bits stored in a storage key and in 
the page table entry extension.  The virutal machine can set up its 
own page tables inside the box and use dynamic address translation. 
This does not require intervention by the host, except that it needs 
to do its part of dirty and reference tracking in the extended page 
table entry explicitly.

cheers,
Carsten

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH/RFC 2/2] s390 virtualization interface.

2007-04-27 Thread Anthony Liguori
Can you provide a high-level description of how this virtualization 
mechanism works?  How does MMU virtualization work?

Carsten Otte wrote:
> +
> +#ifndef _ASM_S390_SIE64_H
> +#define _ASM_S390_SIE64_H
> +
> +struct sie_block {
> + atomic_t cpuflags;  /* 0x */
> + __u32   prefix; /* 0x0004 */
> + __u32   :32;/* 0x0008 */
> + __u32   :32;/* 0x000c */
> + __u64   :64;/* 0x0010 */
> + __u64   :64;/* 0x0018 */
> + __u64   :64;/* 0x0020 */
> + __u64   cputm;  /* 0x0028 */
> + __u64   ckc;/* 0x0030 */
> + __u64   epoch;  /* 0x0038 */
> + __u8svcnn   :1, /* 0x0040 */
> + svc1c   :1,
> + svc2c   :1,
> + svc3c   :1,
> + :4;
>   

I think bitfields can have weird padding issues.

Regards,

Anthony Liguori

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel