Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-18 Thread David S. Ahern
[resend to new list].


David S. Ahern wrote:
> I was just digging through the sysstat history files, and I was not
> imagining it: I did have an excellent overnight run on 5/13-5/14 with
> your patch and the standard RHEL3U8 smp kernel in the guest. I have no
> idea why I cannot get anywhere close to that again. I have updated quite
> a few variables since then (such as going from 2.6.25-rc8 to 2.6.25.3
> kernel in the host), but backing them out (i.e., resetting the test to
> my recollection of all the details of 5/14) has not helped. Baffling
> and frustrating.
> 
> more in-line below.
> 
> 
> Avi Kivity wrote:
>> David S. Ahern wrote:
>>> Avi Kivity wrote:
>>>  
>>>> Okay, I committed the patch without the flood count == 5.
>>>>
>>>> 
>>> I've continued testing the RHEL3 guests with the flood count at 3, and I
>>> am right back to where I started. With the patch and the flood count at
>>> 3, I had 2 runs totaling around 24 hours that looked really good. Now, I
>>> am back to square one. I guess the short of it is that I am not sure if
>>> the patch resolves this issue or not.
>>>
>>>   
>> What about with the flood count at 5?  Does it reliably improve
>> performance?
>>
> 
> [dsa] No. I saw the same problem with the flood count at 5. The
> attachment in the last email shows kvm_stat data during a kscand event.
> The data was collected with the patch you posted. With the flood count
> at 3 the mmu cache/flood counters are in the 18,000/sec range, pte
> updates at ~50,000/sec, and writes at 70,000/sec. With the flood count
> at 5 mmu_cache/flood drops to 0 and pte updates and writes both hit
> 180,000+/second. In both cases these rates last for 30 seconds or more.
> I only included data for the onset as it's pretty flat during the kscand activity.
> 
>>> Also, in a prior e-mail I mentioned guest time advancing rapidly. I've
>>> noticed that with the -no-kvm-pit option the guest time is much better
>>> and typically stays within 3 seconds or so of the host, even through the
>>> high kscand activity, which is one instance where I've noticed time
>>> jumps with the kernel pit. Yes, this result has been repeatable through
>>> 6 or so runs. :-)
>>>   
>> Strange.  The in-kernel PIT was supposed to improve accuracy.
>>
> 
> [dsa] I started a run with the RHEL4 guest 8 hours ago and it is showing
> the same kind of success. With the in-kernel PIT, time in the guest
> advanced ~120 seconds over real time after just 2 days of up time. With
> the userspace PIT, time in the guest is behind real time by only 1
> second after 8 hours of uptime. Note that I am running the RHEL4.6
> kernel recompiled with HZ at 250 instead of the usual 1000.
> 
> david
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-19 Thread David S. Ahern
Does the fact that the hugemem kernel works just fine have any bearing
on your options? Or rather, is there something unique about the way
kscand works in the hugemem kernel such that its performance is OK?

I mentioned last month (so without your first patch) that running the
hugemem kernel showed a remarkable improvement in performance compared
to the standard smp kernel. Over the weekend I ran a test with your
first patch and with the flood detector at 3 (I have not run a case with
the detector at 5) and performance with the hugemem was even better in
the sense that 1-minute averages of guest system time show no noticeable
spikes.

In an earlier post I showed a diff in the config files for the standard
SMP and hugemem kernels. See:
http://article.gmane.org/gmane.comp.emulators.kvm.devel/16944/

david



Avi Kivity wrote:
> David S. Ahern wrote:
>>> [dsa] No. I saw the same problem with the flood count at 5. The
>>> attachment in the last email shows kvm_stat data during a kscand event.
>>> The data was collected with the patch you posted. With the flood count
>>> at 3 the mmu cache/flood counters are in the 18,000/sec range, pte
>>> updates at ~50,000/sec, and writes at 70,000/sec. With the flood count
>>> at 5 mmu_cache/flood drops to 0 and pte updates and writes both hit
>>> 180,000+/second. In both cases these rates last for 30 seconds or
>>> more. I only included data for the onset as it's pretty flat during
>>> the kscand activity.
>>> 
> 
> It makes sense.  We removed a flooding false positive, and introduced a
> false negative.
> 
> The guest access sequence is:
> - point kmap pte at page table
> - use the new pte to access the page table
> 
> Prior to the patch, the mmu didn't see the 'use' part, so it concluded
> the kmap pte would be better off unshadowed.  This shows up as a high
> flood count.
> 
> After the patch, this no longer happens, so the sequence can repeat for
> long periods.  However, the pte that is the result of the 'use' part is
> never accessed, so it should be detected as flooded!  But our flood
> detection mechanism looks at one page at a time (per vcpu), while there
> are two pages involved here.
> 
> There are (at least) three options available:
> - detect and special-case this scenario
> - change the flood detector to be per page table instead of per vcpu
> - change the flood detector to look at a list of recently used page
> tables instead of the last page table
> 
> I'm having a hard time trying to pick between the second and third options.
> 
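
As a purely illustrative sketch of what the third option might look like
-- a small ring of recently written page tables for the flood detector to
consult. The names here are invented (the thread's later patch calls its
knob KVM_MAX_PTE_HISTORY), and gfn_t is typedef'd only to keep the
fragment self-contained; this is not the actual kvm code:

typedef unsigned long long gfn_t;	/* guest frame number, as in kvm */

#define PTE_HISTORY 4	/* illustrative; cf. KVM_MAX_PTE_HISTORY below */

struct pte_history {
	gfn_t last_pt[PTE_HISTORY];	/* page tables with recent pte writes */
	int next;			/* ring-buffer cursor */
};

/* has this page table been written recently? */
static int seen_recently(struct pte_history *h, gfn_t gfn)
{
	int i;

	for (i = 0; i < PTE_HISTORY; i++)
		if (h->last_pt[i] == gfn)
			return 1;
	return 0;
}

/* record a pte write, evicting the oldest entry in round-robin order */
static void note_pte_write(struct pte_history *h, gfn_t gfn)
{
	if (!seen_recently(h, gfn)) {
		h->last_pt[h->next] = gfn;
		h->next = (h->next + 1) % PTE_HISTORY;
	}
}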
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-22 Thread David S. Ahern
The short answer is that I am still seeing large system time hiccups in
the guests due to kscand in the guest scanning its active lists. I do see
better response with a KVM_MAX_PTE_HISTORY of 3 than with 4. (For
completeness I also tried a history of 2, but it performed worse than 3,
which is no surprise given the meaning of it.)


I have been able to scratch out a simplistic program that stimulates
kscand activity similar to what is going on in my real guest (see
attached). The program requests a memory allocation, initializes it (to
get it backed) and then in a loop sweeps through the memory in chunks,
similar to a program using parts of its memory here and there but
eventually accessing all of it.

Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
using a fair amount of highmem. Start a couple of instances of the
attached. For example, I've been using these 2:

memuser 768M 120 5 300
memuser 384M 300 10 600

Together these instances take up 1GB of RAM and once initialized
consume very little CPU. On kvm they make kscand and kswapd go nuts
every 5-15 minutes. For comparison, I do not see the same behavior for
an identical setup running on esx 3.5.

david



Avi Kivity wrote:
> Avi Kivity wrote:
>>
>> There are (at least) three options available:
>> - detect and special-case this scenario
>> - change the flood detector to be per page table instead of per vcpu
>> - change the flood detector to look at a list of recently used page
>> tables instead of the last page table
>>
>> I'm having a hard time trying to pick between the second and third
>> options.
>>
> 
> The answer turns out to be "yes", so here's a patch that adds a pte
> access history table for each shadowed guest page-table.  Let me know if
> it helps.  Benchmarking a variety of workloads on all guests supported
> by kvm is left as an exercise for the reader, but I suspect the patch
> will either improve things all around, or can be modified to do so.
> 
/* simple program to malloc memory, initialize it, and
 * then repetitively use it to keep it active.
 */

#include <stdio.h>
#include <stdlib.h>

#include <string.h>
#include <unistd.h>
#include <libgen.h>
#include <time.h>
#include <sys/time.h>

/* goal is to sweep memory every T1 sec by accessing a
 * percentage at a time and sleeping T2 sec in between accesses.
 * Once all the memory has been accessed, sleep for T3 sec
 * before starting the cycle over.
 */
#define T1  180
#define T2  5
#define T3  300


const char *timestamp(void);

void usage(const char *prog) {
	fprintf(stderr, "\nusage: %s memlen{M|K} [t1 t2 t3]\n", prog);
}


int main(int argc, char *argv[])
{
	int len;
	char *endp;
	int factor, endp_len;
	int start, incr;
	int t1 = T1, t2 = T2, t3 = T3;
	char *mem;
	char c = 0;

	if (argc < 2) {
		usage(basename(argv[0]));
		return 1;
	}


	/*
	 * determine memory to request
	 */
	len = (int) strtol(argv[1], &endp, 0);
	factor = 1;
	endp_len = strlen(endp);
	if ((endp_len == 1) && ((*endp == 'M') || (*endp == 'm')))
		factor = 1024 * 1024;
	else if ((endp_len == 1) && ((*endp == 'K') || (*endp == 'k')))
		factor = 1024;
	else if (endp_len) {
		fprintf(stderr, "invalid memory len.\n");
		return 1;
	}
	len *= factor;

	if (len == 0) {
		fprintf(stderr, "memory len is 0.\n");
		return 1;
	}


	/*
	 * convert times if given
	 */
	if (argc > 2) {
		if (argc < 5) {
			usage(basename(argv[0]));
			return 1;
		}

		t1 = atoi(argv[2]);
		t2 = atoi(argv[3]);
		t3 = atoi(argv[4]);
	}



	/*
	 *  amount of memory to sweep at one time
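	 *  (e.g., 768M with t1=120 and t2=5 -> ~32MB touched every 5s)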
	 */
	if (t1 && t2)
		incr = len / t1 * t2;
	else
		incr = len;



	mem = (char *) malloc(len);
	if (mem == NULL) {
		fprintf(stderr, "malloc failed\n");
		return 1;
	}
	printf("memory allocated. initializing to 0\n");
	memset(mem, 0, len);

	start = 0;
	printf("%s starting memory update.\n", timestamp());
	while (1) {
		c++;
		if (c == 0x7f) c = 0;
		memset(mem + start, c, incr);
		start += incr;

		if ((start >= len) || ((start + incr) >= len)) {
			printf("%s scan complete. sleeping %d\n", 
  timestamp(), t3);
			start = 0;
			sleep(t3);
			printf("%s starting memory update.\n", timestamp());
		} else if (t2)
			sleep(t2);
	}

	return 0;
}

const char *timestamp(void)
{
	static char date[64];
	struct timeval now;
	struct tm ltime;

	memset(date, 0, sizeof(date));

	if (gettimeofday(&now, NULL) == 0) {
		/* the archive truncated the original here; presumably it
		 * converted tv_sec with localtime_r() and formatted the
		 * buffer with strftime(), roughly as follows
		 */
		if (localtime_r(&now.tv_sec, &ltime) != NULL)
			strftime(date, sizeof(date), "%H:%M:%S", &ltime);
	}

	return date;
}

Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-28 Thread David S. Ahern
Weird. Could it be something about the hosts?

I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.

I'll rebuild kvm-69 with your latest patch and try the test programs again.

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still seeing large system time hiccups
>> in the guests due to kscand in the guest scanning its active lists. I
>> do see better response with a KVM_MAX_PTE_HISTORY of 3 than with 4.
>> (For completeness I also tried a history of 2, but it performed worse
>> than 3, which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>> memuser 768M 120 5 300
>> memuser 384M 300 10 600
>>
>> Together these instances take up 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>   
> 
> I haven't been able to reproduce this:
> 
>> [EMAIL PROTECTED] root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root 7 1  1  75   0- 0 schedu 10:07 ?   
>> 00:00:26 [kscand]
>> 0 S root  1464 1  1  75   0- 196986 schedu 10:20 pts/0  
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root  1465 1  0  75   0- 98683 schedu 10:20 pts/0   
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root  2148  1293  0  75   0-   922 pipe_w 10:48 pts/0   
>> 00:00:00 grep -E memuser|kscand
> 
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant.  This is a 2GB guest running with my
> patch ported to kvm.git HEAD.  The guest has 2G of memory.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-28 Thread David S. Ahern
I've been instrumenting the guest kernel as well. It's the scanning of
the active lists that triggers a lot of calls to paging64_prefetch_page,
and, as you guys know, correlates with the number of direct pages in the
list. Earlier in this thread I traced the kvm cycles to
paging64_prefetch_page(). See

http://www.mail-archive.com/[EMAIL PROTECTED]/msg16332.html

In the guest I started capturing scans (kscand() loop) that took longer
than a jiffie. Here's an example for 1 trip through the active lists,
both anonymous and cache:

active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
36234, dj 225

active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct 1249, dj 3

active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
84829, dj 848

active_cache_scan: HighMem, age 12, count[age] 3397 -> 2640, direct 889,
dj 19

active_cache_scan: HighMem, age 8, count[age] 6105 -> 5884, direct 988,
dj 24

active_cache_scan: HighMem, age 4, count[age] 18923 -> 18400, direct
11141, dj 37

active_cache_scan: HighMem, age 0, count[age] 14283 -> 14283, direct 69,
dj 1


An explanation of the line (using the first one): it's a scan of the
anonymous list, age bucket of 4. Before the scan loop the bucket had
41863 pages and after the loop the bucket had 30194. Of the pages in the
bucket, 36234 were direct pages (i.e., PageDirect(page) was non-zero) and
for this bucket 225 jiffies passed while running scan_active_list().
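
(At HZ=100 for this 2.4 guest, those 225 jiffies amount to 2.25 seconds
of scanning for that one bucket alone.)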

On the host side the total times (sum of the dj's/100) in the output
above directly match the kvm_stat output spikes in pte_writes/updates.

Tracing the RHEL3 code I believe linux-2.4.21-rmap.patch is the patch
that brought in the code that is run during the active list scans for
direct pgaes. In and of itself each trip through the while loop in
scan_active_list does not take a lot of time, but when run say 84,829
times (see age 0 above) the cumulative time is high, 8.48 seconds per
the example above.

I'll pull down the git branch and give it a spin.

david


Avi Kivity wrote:
> Andrea Arcangeli wrote:
>> On Wed, May 28, 2008 at 08:13:44AM -0600, David S. Ahern wrote:
>>  
>>> Weird. Could it be something about the hosts?
>>> 
>>
>> Note that the VM itself will never make use of kmap. The VM is "data"
>> agnostic. The VM never has any idea about the data contained in the
>> pages. kmap/kmap_atomic/kunmap_atomic are only needed to access _data_.
>>
>>   
> 
> What about CONFIG_HIGHPTE?
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-28 Thread David S. Ahern
This is the code in the RHEL3.8 kernel:

static int scan_active_list(struct zone_struct * zone, int age,
			    struct list_head * list, int count)
{
	struct list_head *page_lru , *next;
	struct page * page;
	int over_rsslimit;

	count = count * kscand_work_percent / 100;
	/* Take the lock while messing with the list... */
	lru_lock(zone);
	while (count-- > 0 && !list_empty(list)) {
		page = list_entry(list->prev, struct page, lru);
		pte_chain_lock(page);
		if (page_referenced(page, &over_rsslimit)
		    && !over_rsslimit
		    && check_mapping_inuse(page))
			age_page_up_nolock(page, age);
		else {
			list_del(&page->lru);
			list_add(&page->lru, list);
		}
		pte_chain_unlock(page);
	}
	lru_unlock(zone);
	return 0;
}

My previous email shows examples of the number of pages in the list and
the scanning that happens.

david


Avi Kivity wrote:
> Andrea Arcangeli wrote:
>>
>> So I never found a relation to the symptom reported of VM kernel
>> threads going weird, with KVM optimal handling of kmap ptes.
>>   
> 
> 
> The problem is this code:
> 
> static int scan_active_list(struct zone_struct * zone, int age,
>			      struct list_head * list)
> {
>	struct list_head *page_lru , *next;
>	struct page * page;
>	int over_rsslimit;
> 
>	/* Take the lock while messing with the list... */
>	lru_lock(zone);
>	list_for_each_safe(page_lru, next, list) {
>		page = list_entry(page_lru, struct page, lru);
>		pte_chain_lock(page);
>		if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>			age_page_up_nolock(page, age);
>		pte_chain_unlock(page);
>	}
>	lru_unlock(zone);
>	return 0;
> }
> 
> If the pages in the list are in the same order as in the ptes (which is
> very likely), then we have the following access pattern
> 
> - set up kmap to point at pte
> - test_and_clear_bit(pte)
> - kunmap
> 
> From kvm's point of view this looks like
> 
> - several accesses to set up the kmap
>  - if these accesses trigger flooding, we will have to tear down the
> shadow for this page, only to set it up again soon
> - an access to the pte (emulated)
>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
> emulations.  The pte is worthless anyway since the accessed bit is clear
> (so we can't set up a shadow pte for it)
>- this bug was fixed
> - an access to tear down the kmap
> 
> [btw, am I reading this right? the entire list is scanned each time?
> 
> if you have 1G of active HIGHMEM, that's a quarter of a million pages,
> which would take at least a second no matter what we do.  VMware can
> probably special-case kmaps, but we can't]
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-28 Thread David S. Ahern
Yes, I've tried changing kscand_work_percent (values of 50 and 30).
Basically it makes kscand wake more often (i.e., MIN_AGING_INTERVAL
declines in proportion) but do less work each trip through the lists.

I have not seen a noticeable change in guest behavior.

david


Andrea Arcangeli wrote:
> On Wed, May 28, 2008 at 09:43:09AM -0600, David S. Ahern wrote:
>> This is the code in the RHEL3.8 kernel:
>>
>> static int scan_active_list(struct zone_struct * zone, int age,
>>  struct list_head * list, int count)
>> {
>>  struct list_head *page_lru , *next;
>>  struct page * page;
>>  int over_rsslimit;
>>
>>  count = count * kscand_work_percent / 100;
>>  /* Take the lock while messing with the list... */
>>  lru_lock(zone);
>>  while (count-- > 0 && !list_empty(list)) {
>>  page = list_entry(list->prev, struct page, lru);
>>  pte_chain_lock(page);
>>  if (page_referenced(page, &over_rsslimit)
>>  && !over_rsslimit
>>  && check_mapping_inuse(page))
>>  age_page_up_nolock(page, age);
>>  else {
>>  list_del(&page->lru);
>>  list_add(&page->lru, list);
>>  }
>>  pte_chain_unlock(page);
>>  }
>>  lru_unlock(zone);
>>  return 0;
>> }
>>
>> My previous email shows examples of the number of pages in the list and
>> the scanning that happens.
> 
> This code looks better than the one below, as a limit was introduced
> and the whole list isn't scanned anymore. If you decrease
> kscand_work_percent (I assume it's a sysctl even if it's missing the
> sysctl_ prefix) to say 1, you can limit the damage. Did you try it?
> 
>> Avi Kivity wrote:
>>> Andrea Arcangeli wrote:
>>>> So I never found a relation to the symptom reported of VM kernel
>>>> threads going weird, with KVM optimal handling of kmap ptes.
>>>>   
>>>
>>> The problem is this code:
>>>
>>> static int scan_active_list(struct zone_struct * zone, int age,
>>>				struct list_head * list)
>>> {
>>>	struct list_head *page_lru , *next;
>>>	struct page * page;
>>>	int over_rsslimit;
>>>
>>>	/* Take the lock while messing with the list... */
>>>	lru_lock(zone);
>>>	list_for_each_safe(page_lru, next, list) {
>>>		page = list_entry(page_lru, struct page, lru);
>>>		pte_chain_lock(page);
>>>		if (page_referenced(page, &over_rsslimit) && !over_rsslimit)
>>>			age_page_up_nolock(page, age);
>>>		pte_chain_unlock(page);
>>>	}
>>>	lru_unlock(zone);
>>>	return 0;
>>> }
>>> If the pages in the list are in the same order as in the ptes (which is
>>> very likely), then we have the following access pattern
> 
> Yes it is likely.
> 
>>> - set up kmap to point at pte
>>> - test_and_clear_bit(pte)
>>> - kunmap
>>>
>>> From kvm's point of view this looks like
>>>
>>> - several accesses to set up the kmap
> 
> Hmm, the kmap establishment takes a single guest operation in the
> fixmap area. That's a single write to the pte, writing a pte_t-sized
> region of 8/4 bytes (PAE/non-PAE). The same pte_t is then cleared and
> flushed out of the tlb with a cpu-local invlpg during kunmap_atomic.
> 
> I count 1 write here so far.
> 
>>>  - if these accesses trigger flooding, we will have to tear down the
>>> shadow for this page, only to set it up again soon
> 
> So the shadow mapping the fixmap area would be torn down by the
> flooding.
> 
> Or is it the shadow corresponding to the real user pte pointed to by
> the fixmap that is unshadowed by the flooding, or both/all?
> 
>>> - an access to the pte (emulated)
> 
> Here I count the second write and this isn't done on the fixmap area
> like the first write above, but this is a write to the real user pte,
> pointed by the fixmap. So if this is emulated it means the shadow of
> the user pte pointing to the real data page is still active.
> 
>>>  - if this access _doesn't_ trigger flooding, we will have 512 unneeded
>>> emulations.  The pte is worthless anyway since the accessed bit is clear
>>> (so we can't set up a shadow pte for it)

Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-28 Thread David S. Ahern

I have a clone of the kvm repository, but evidently I am not running the
right magic to see the changes in the per-page-pte-tracking branch.  I
ran the following:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm.git
git branch per-page-pte-tracking

[EMAIL PROTECTED] kvm]$ git branch
  master
* per-page-pte-tracking

But arch/x86/kvm/mmu.c does not show the changes for the
per-page-pte-history.patch.

What am I not doing correctly here?
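
(For what it's worth, 'git branch <name>' only creates a new local branch
at the current HEAD; picking up the remote branch would be something like
'git checkout -b per-page-pte-tracking origin/per-page-pte-tracking'.)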

david



Avi Kivity wrote:
> David S. Ahern wrote:
>> Weird. Could it be something about the hosts?
>>
>> I have been running these tests on a DL320G5 with a Xeon 3050 CPU, 2.13
>> GHz. Host OS is Fedora 8 with the 2.6.25.3 kernel.
>>
>> I'll rebuild kvm-69 with your latest patch and try the test programs
>> again.
>>   
> 
> I've pushed it into kvm.git, branch name per-page-pte-tracking.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-29 Thread David S. Ahern
This is 2.4/RHEL3, so HZ=100. 848 jiffies = 8.48 seconds -- and that's
just the one age bucket and this is just one example pulled randomly
(well after boot). During that time kscand does get scheduled out, but
ultimately guest time is at 100% during the scans.

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I've been instrumenting the guest kernel as well. It's the scanning of
>> the active lists that triggers a lot of calls to paging64_prefetch_page,
>> and, as you guys know, correlates with the number of direct pages in the
>> list. Earlier in this thread I traced the kvm cycles to
>> paging64_prefetch_page(). See
>>
>> http://www.mail-archive.com/[EMAIL PROTECTED]/msg16332.html
>>
>> In the guest I started capturing scans (kscand() loop) that took longer
>> than a jiffie. Here's an example for 1 trip through the active lists,
>> both anonymous and cache:
>>
>> active_anon_scan: HighMem, age 4, count[age] 41863 -> 30194, direct
>> 36234, dj 225
>>
>>   
> 
> HZ=512, so half a second.
> 
> 41K pages in 0.5s -> 80K pages/sec.  Considering we have _at_least_ two
> emulations per page, this is almost reasonable.
> 
>> active_anon_scan: HighMem, age 3, count[age] 1772 -> 1450, direct
>> 1249, dj 3
>>
>> active_anon_scan: HighMem, age 0, count[age] 104078 -> 101685, direct
>> 84829, dj 848
>>   
> 
> Here we scanned 100K pages in ~2 seconds.  50K pages/sec, not too good.
> 
>> I'll pull down the git branch and give it a spin.
>>   
> 
> I've rebased it again to include the prefetch_page optimization.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-29 Thread David S. Ahern

Andrea Arcangeli wrote:
> On Thu, May 29, 2008 at 01:01:06PM +0300, Avi Kivity wrote:
>> No, two:
>>
>> static inline void set_pte(pte_t *ptep, pte_t pte)
>> {
>>ptep->pte_high = pte.pte_high;
>>smp_wmb();
>>ptep->pte_low = pte.pte_low;
>> }
> 
> Right, that can be 2 or 1 depending on PAE/non-PAE; other 2.4
> enterprise distros with pte-highmem ship non-PAE kernels by default.

RHEL3U8 has CONFIG_X86_PAE set.



> - an access to tear down the kmap
>   
>>> Yep, pte_clear on the fixmap pte_t followed by an invlpg (if that
>>> matters).
>>>   
>> Looking at the code, that only happens if CONFIG_HIGHMEM_DEBUG is set.
> 
> 2.4, yes. 2.6 will do something similar to CONFIG_HIGHMEM_DEBUG.
> 
> 2.4 without HIGHMEM_DEBUG sets the pte and invlpg in kmap_atomic and
> does nothing in kunmap_atomic.
> 
> 2.6 sets the pte in kmap_atomic, and clears it+invlpg in kunmap_atomic.

CONFIG_DEBUG_HIGHMEM is set.



>> One possible optimization is that if we see the first part of the kmap 
>> instantiation, we emulate a few more instructions before returning to the 
>> guest.  Xen does this IIRC.
> 
> Surely this would avoid 1 wrprotect fault per kmap_atomic, but I'm not
> sure if 32bit PAE is that important to do this. Most 32bit enterprise
> kernels I worked on aren't compiled with PAE; only one, called bigsmp, is.

RHEL3 has a hugemem kernel which basically just enables the 4G/4G split.
My guest with the hugemem kernel runs much better than the standard smp
kernel.


If you care to download it, the RHEL3U8 kernel source is posted here:
ftp://ftp.redhat.com/pub/redhat/linux/updates/enterprise/3AS/en/os/SRPMS/kernel-2.4.21-47.EL.src.rpm

Red Hat does heavily patch kernels, so they will be dramatically
different from the kernel.org kernel with the same number.

david
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-05-29 Thread David S. Ahern


Avi Kivity wrote:
> David S. Ahern wrote:
>> The short answer is that I am still seeing large system time hiccups
>> in the guests due to kscand in the guest scanning its active lists. I
>> do see better response with a KVM_MAX_PTE_HISTORY of 3 than with 4.
>> (For completeness I also tried a history of 2, but it performed worse
>> than 3, which is no surprise given the meaning of it.)
>>
>>
>> I have been able to scratch out a simplistic program that stimulates
>> kscand activity similar to what is going on in my real guest (see
>> attached). The program requests a memory allocation, initializes it (to
>> get it backed) and then in a loop sweeps through the memory in chunks
>> similar to a program using parts of its memory here and there but
>> eventually accessing all of it.
>>
>> Start the RHEL3/CentOS 3 guest with *2GB* of RAM (or more). The key is
>> using a fair amount of highmem. Start a couple of instances of the
>> attached. For example, I've been using these 2:
>>
>> memuser 768M 120 5 300
>> memuser 384M 300 10 600
>>
>> Together these instances take up 1GB of RAM and once initialized
>> consume very little CPU. On kvm they make kscand and kswapd go nuts
>> every 5-15 minutes. For comparison, I do not see the same behavior for
>> an identical setup running on esx 3.5.
>>   
> 
> I haven't been able to reproduce this:
> 
>> [EMAIL PROTECTED] root]# ps -elf | grep -E 'memuser|kscand'
>> 1 S root 7 1  1  75   0- 0 schedu 10:07 ?   
>> 00:00:26 [kscand]
>> 0 S root  1464 1  1  75   0- 196986 schedu 10:20 pts/0  
>> 00:00:21 ./memuser 768M 120 5 300
>> 0 S root  1465 1  0  75   0- 98683 schedu 10:20 pts/0   
>> 00:00:10 ./memuser 384M 300 10 600
>> 0 S root  2148  1293  0  75   0-   922 pipe_w 10:48 pts/0   
>> 00:00:00 grep -E memuser|kscand
> 
> The workload has been running for about half an hour, and kswapd cpu
> usage doesn't seem significant.  This is a 2GB guest running with my
> patch ported to kvm.git HEAD.  The guest has 2G of memory.
> 

I'm running on the per-page-pte-tracking branch, and I am still seeing it. 

I doubt you want to sit and watch the screen for an hour, so install
sysstat if it isn't already, change the sample rate to 1 minute
(/etc/cron.d/sysstat), let the server run for a few hours and then run
'sar -u'. You'll see something like this:

10:12:11 AM   LINUX RESTART

10:13:03 AM   CPU %user %nice   %system   %iowait %idle
10:14:01 AM   all  0.08  0.00  2.08  0.35 97.49
10:15:03 AM   all  0.05  0.00  0.79  0.04 99.12
10:15:59 AM   all  0.15  0.00  1.52  0.06 98.27
10:17:01 AM   all  0.04  0.00  0.69  0.04 99.23
10:17:59 AM   all  0.01  0.00  0.39  0.00 99.60
10:18:59 AM   all  0.00  0.00  0.12  0.02 99.87
10:20:02 AM   all  0.18  0.00 14.62  0.09 85.10
10:21:01 AM   all  0.71  0.00 26.35  0.01 72.94
10:22:02 AM   all  0.67  0.00 10.61  0.00 88.72
10:22:59 AM   all  0.14  0.00  1.80  0.00 98.06
10:24:03 AM   all  0.13  0.00  0.50  0.00 99.37
10:24:59 AM   all  0.09  0.00 11.46  0.00 88.45
10:26:03 AM   all  0.16  0.00  0.69  0.03 99.12
10:26:59 AM   all  0.14  0.00 10.01  0.02 89.83
10:28:03 AM   all  0.57  0.00  2.20  0.03 97.20
Average:  all  0.21  0.00  5.55  0.05 94.20


Every one of those jumps in %system time directly correlates with kscand
activity. Without the memuser programs running, the guest %system time is
<1%. The point of this silly memuser program is just to use high memory --
let it age, then make it active again, sit idle, repeat. If you run
kvm_stat with -l in the host you'll see the jump in pte writes/updates.
An intern here added a timestamp to the kvm_stat output for me, which
helps to directly correlate guest/host data.


I also ran my real guest on the branch. Performance at boot through the
first 15 minutes was much better, but I'm still seeing recurring hits
every 5 minutes when kscand kicks in. Here's the data from the guest for
the first one, which happened after 15 minutes of uptime:

active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct 24845, dj 59

active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct 40868, dj 
103

active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct 45805, dj 1212

The kvm_stat data for this time period 

Re: qemu-kvm 0.12.4 hanging forever

2010-07-02 Thread David S. Ahern


On 07/02/10 11:28, Zach Carter wrote:
> On Thursday 01 July 2010 16:02:26 Brian Jackson wrote:

>>> I'm sure I could use the qemu-kvm that ships from CentOS with the
>>> corresponding kernel module, however that lacks certain essential
>>> features, including support for scsi disk drive emulation.
>>
>> There's a reason Redhat disables scsi support in their kvm... it's not
>> really  suggested to use it.
> 
> What specific reason is that?  The kvm.spec %changelog references RedHat 
> bugzilla 512837, however I am not authorized to access it.
> 
> scsi is a hard requirement for us, even if we have to stay on the old kernel 
> module.  
> 
> Any additional insight would be much appreciated.

I've used SCSI for RHEL3 and RHEL4 guests for years. Performance is
significantly better than IDE. I have yet to have a guest crash because
of it.

David


> 
> thanks,
> 
> -Zach
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 RESEND 4/4] Inter-VM shared memory PCI device

2010-07-08 Thread David S. Ahern

On 07/08/10 15:08, Cam Macdonell wrote:
> Resent (again): Some lines were over 80 characters and debugging is now off.
> 
> Support an inter-vm shared memory device that maps a shared-memory object as a
> PCI device in the guest.  This patch also supports interrupts between guests by
> communicating over a unix domain socket.  This patch applies to the qemu-kvm
> repository.
> 
> -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
> 
> Interrupts are supported between multiple VMs by using a shared memory server
> by using a chardev socket.
> 
> -device ivshmem,size=<size in format accepted by -m>[,shm=<shm name>]
>[,chardev=<id>][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
> -chardev socket,path=<path>,id=<id>
> 
> The shared memory server, sample programs and init scripts are in a git repo 
> here:
> 
> www.gitorious.org/nahanni
> 

This is an oft-requested feature that Cam's been working on for a while
now. I've tested the plain host-VM shared memory aspect and it works
quite nicely. Can this get committed soon?

David


> Signed-off-by: Cam Macdonell 
> ---
>  Makefile.target |3 +
>  hw/ivshmem.c|  842 
> +++
>  qemu-char.c |6 +
>  qemu-char.h |3 +
>  qemu-doc.texi   |   43 +++
>  5 files changed, 897 insertions(+), 0 deletions(-)
>  create mode 100644 hw/ivshmem.c
> 
> diff --git a/Makefile.target b/Makefile.target
> index a0e9747..1e99ec8 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -203,6 +203,9 @@ obj-$(CONFIG_USB_OHCI) += usb-ohci.o
>  obj-y += rtl8139.o
>  obj-y += e1000.o
>  
> +# Inter-VM PCI shared memory
> +obj-y += ivshmem.o
> +
>  # Hardware support
>  obj-i386-y += vga.o
>  obj-i386-y += mc146818rtc.o i8259.o pc.o
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> new file mode 100644
> index 000..763b9c2
> --- /dev/null
> +++ b/hw/ivshmem.c
> @@ -0,0 +1,842 @@
> +/*
> + * Inter-VM Shared Memory PCI device.
> + *
> + * Author:
> + *  Cam Macdonell 
> + *
> + * Based On: cirrus_vga.c
> + *  Copyright (c) 2004 Fabrice Bellard
> + *  Copyright (c) 2004 Makoto Suzuki (suzu)
> + *
> + *  and rtl8139.c
> + *  Copyright (c) 2006 Igor Kovalenko
> + *
> + * This code is licensed under the GNU GPL v2.
> + */
> +#include "hw.h"
> +#include "pc.h"
> +#include "pci.h"
> +#include "msix.h"
> +#include "kvm.h"
> +
> +#include 
> +#include 
> +
> +#define IVSHMEM_IRQFD   0
> +#define IVSHMEM_MSI 1
> +
> +#define IVSHMEM_PEER0
> +#define IVSHMEM_MASTER  1
> +
> +#define IVSHMEM_REG_BAR_SIZE 0x100
> +
> +//#define DEBUG_IVSHMEM
> +#ifdef DEBUG_IVSHMEM
> +#define IVSHMEM_DPRINTF(fmt, ...)\
> +do {printf("IVSHMEM: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define IVSHMEM_DPRINTF(fmt, ...)
> +#endif
> +
> +typedef struct Peer {
> +int nb_eventfds;
> +int *eventfds;
> +} Peer;
> +
> +typedef struct EventfdEntry {
> +PCIDevice *pdev;
> +int vector;
> +} EventfdEntry;
> +
> +typedef struct IVShmemState {
> +PCIDevice dev;
> +uint32_t intrmask;
> +uint32_t intrstatus;
> +uint32_t doorbell;
> +
> +CharDriverState **eventfd_chr;
> +CharDriverState *server_chr;
> +int ivshmem_mmio_io_addr;
> +
> +pcibus_t mmio_addr;
> +pcibus_t shm_pci_addr;
> +uint64_t ivshmem_offset;
> +uint64_t ivshmem_size; /* size of shared memory region */
> +int shm_fd; /* shared memory file descriptor */
> +
> +Peer *peers;
> +int nb_peers; /* how many guests we have space for */
> +int max_peer; /* maximum numbered peer */
> +
> +int vm_id;
> +uint32_t vectors;
> +uint32_t features;
> +EventfdEntry *eventfd_table;
> +
> +char * shmobj;
> +char * sizearg;
> +char * role;
> +int role_val;   /* scalar to avoid multiple string comparisons */
> +} IVShmemState;
> +
> +/* registers for the Inter-VM shared memory device */
> +enum ivshmem_registers {
> +INTRMASK = 0,
> +INTRSTATUS = 4,
> +IVPOSITION = 8,
> +DOORBELL = 12,
> +};
> +
> +static inline uint32_t ivshmem_has_feature(IVShmemState *ivs,
> +unsigned int feature) {
> +return (ivs->features & (1 << feature));
> +}
> +
> +static inline bool is_power_of_two(uint64_t x) {
> +return (x & (x - 1)) == 0;
> +}
> +
> +static void ivshmem_map(PCIDevice *pci_dev, int region_num,
> +pcibus_t addr, pcibus_t size, int type)
> +{
> +IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
> +
> +s->shm_pci_addr = addr;
> +
> +if (s->ivshmem_offset > 0) {
> +cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
> +
> s->ivshmem_offset);
> +}
> +
> +IVSHMEM_DPRINTF("guest pci addr = %" FMT_PCIBUS ", guest h/w addr = %"
> +PRIu64 ", size = %" FMT_PCIBUS "\n", addr, s->ivshmem_offset, size);
> +
> +}
> +
> +/* accessing registers - based on rtl8139 */
> +static void ivshmem_update_irq(IVShmemState *s, int 

Re: [PATCH 09/18] Robust TSC compensation

2010-07-13 Thread David S. Ahern


On 07/13/10 15:15, Zachary Amsden wrote:

>> What prevents a vcpu from seeing its TSC go backwards, in case the first
>> write in the 5 second window is smaller than the victim vcpu's last
>> visible TSC value ?
>>
> 
> Nothing, unfortunately.  However, the TSC would already have to be out
> of sync in order for the problem to occur.  It can never happen in
> normal circumstances on a stable hardware TSC except in one case;
> migration.  During the CPU state transfer phase of migration, however,

What about across processor sockets? Aren't CPUs brought up at different
points such that their TSCs start at different times?

David

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM call minutes for July 20

2010-07-20 Thread David S. Ahern


On 07/20/10 08:45, Chris Wright wrote:
> 0.13
> - rc RSN (hopefully this week, top priority for anthony)

Can Cam's inter-vm shared memory device get committed for 0.13? It's
been stagnant on the list for a while now waiting for inclusion (or NAK
comments).

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


enabling X86_FEATURE_ARCH_PERFMON in guest

2010-07-30 Thread David S. Ahern

How do I get X86_FEATURE_ARCH_PERFMON enabled for a guest?

I've tried "-cpu host,+perfmon" and "-cpu host,+arch_perfmon", but both
get rejected with an error: CPU feature perfmon not found


Host processor:

model name  : Intel(R) Xeon(R) CPU   E5540  @ 2.53GHz
stepping: 5
...

flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16
xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi
flexpriority ept vpid


Guest side:
model name  : Intel(R) Xeon(R) CPU   E5540  @ 2.53GHz
stepping: 5
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc pni
ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
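
(For reference, arch_perfmon is not an ordinary leaf-1 feature bit; Linux
derives X86_FEATURE_ARCH_PERFMON from CPUID leaf 0xA, whose EAX bits 7:0
report the architectural PMU version. A quick userspace sketch --
illustrative only, not from this thread -- to see what the guest is given:

#include <stdio.h>

static void cpuid(unsigned leaf, unsigned *a, unsigned *b,
                  unsigned *c, unsigned *d)
{
        __asm__ volatile("cpuid"
                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
                         : "a"(leaf), "c"(0));
}

int main(void)
{
        unsigned a, b, c, d;

        cpuid(0xa, &a, &b, &c, &d);
        printf("arch perfmon version %u, %u gp counters\n",
               a & 0xff, (a >> 8) & 0xff);
        return 0;
}

A version of 0 would mean the leaf -- and hence arch_perfmon -- is not
being exposed to the guest.)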


David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Anyone seeing huge slowdown launching qemu with Linux 2.6.35?

2010-08-04 Thread David S. Ahern


On 08/03/10 12:43, Avi Kivity wrote:
> libguestfs does not depend on an x86 architectural feature. 
> qemu-system-x86_64 emulates a PC, and PCs don't have -kernel.  We should
> discourage people from depending on this interface for production use.

That is a feature of qemu - and an important one to me as well. Why
should it be discouraged? You end up at the same place -- a running
kernel and in-ram filesystem; why require going through a bootloader
just because the hardware case needs it?

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Anyone seeing huge slowdown launching qemu with Linux 2.6.35?

2010-08-04 Thread David S. Ahern


On 08/04/10 11:34, Avi Kivity wrote:

>> And it's awesome for fast prototyping. Of course, once that fast
>> becomes dog slow, it's not useful anymore.
> 
> For the Nth time, it's only slow with 100MB initrds.

100MB is really not that large for an initrd.

Consider the deployment of stateless nodes - something that
virtualization allows the rapid deployment of. 1 kernel, 1 initrd with
the various binaries to be run. Create nodes as needed by launching a
shell command - be it for more capacity, isolation, etc. Why require an
iso or disk wrapper for a binary blob that is all to be run out of
memory? The -append argument allows boot parameters to be specified at
launch. That is a very powerful and simple design option.
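
A node launch then stays a one-liner, along these lines (paths and
parameters illustrative only):

qemu-kvm -m 512 -kernel /path/to/node-bzImage \
    -initrd /path/to/node-initrd.img \
    -append "console=ttyS0 nodeid=42" \
    -net nic -net tap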

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Credit-Based CPU Scheduling & Modifying VM Disk Size

2010-08-09 Thread David S. Ahern


On 08/09/10 15:51, Daniel P. Berrange wrote:
>> - Modify VM's Disk Size
> 
> qemu-img can copy disks, resizing as it does it. For raw disks
> just dd extra sace onto the end of it (with a suitable seek= param
> to avoid killing your existing data :-)
> 
> Daniel

New versions of qemu-img have a 'resize' command to change the disk size.
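For example, something like "qemu-img resize vdisk.img +10G" grows an
image by 10GB (exact syntax from memory).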

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL 4.5 guest virtual network performace

2010-08-16 Thread David S. Ahern


On 08/16/10 17:09, Alex Rixhardson wrote:
> Thank you a lot for the tip - you were right. The 5.5 guest is using
> virtio, but 4.5 is not. So, this is the reason.
> 
> Adding <model type='virtio'/> to the config file unfortunately
> doesn't help - the network card is not recognized by the guest. Do I
> need to install something extra on the guest RHEL 4.5?

RHEL4.8 is the first RHEL4 version to support virtio devices.

David


> 
> Regards,
> Alex
> 
> On Tue, Aug 17, 2010 at 12:05 AM, Alex Rixhardson
>  wrote:
>> virtio...I think :-).
>>
>> How could I confirm that?
>>
>> Regards,
>> Alex
>>
>> On Mon, Aug 16, 2010 at 11:56 PM, Dor Laor  wrote:
>>> On 08/17/2010 12:51 AM, Alex Rixhardson wrote:

 I tried with 'notsc divider=10' (since it's 64 bit guest), but the
 results are the still same :-(. The guest is idle at the time of
 testing. It has 2 CPU and 1024 MB RAM available.
>>>
>>> Hmm, are you using e1000 or virtio for the 4.5 guest?
>>> e1000 should be slow since it's less suitable for virtualization (3
>>> mmio/packet)
>>>
>>>

 On Mon, Aug 16, 2010 at 11:35 PM, Dor Laor  wrote:
>
> On 08/17/2010 12:22 AM, Alex Rixhardson wrote:
>>
>> Thanks for the suggestion.
>>
>> I tried with the netperf. I ran netserver on host and netperf on RHEL
>> 5.5 and RHEL 4.5 guests. This are the results of 60 seconds long
>> tests:
>>
>> RHEL 4.5 guest:
>> Throughput (10^6bits/sec) = 145.80
>
> At least it bought you another 5Mb/s over iperf ...
>
> It might be time related, 5.5 has kvmclock but rhel4 does not.
> If it's 64 bit guest add this to the 4.5 guest cmdline  'notsc
> divider=10'.
> If it's 32 use 'clock=pmtmr divider=10'.
> The divider is probably new and is in rhel4.8 only, it's ok w/o it too.
>
> What's the host load for the 4.5 guest?
>
>>
>> RHEL 5.5 guest:
>> Throughput (10^6bits/sec) = 3760.24
>>
>> The results are really bad on RHEL 4.5 guest. What could be wrong?
>>
>> Regards,
>> Alex
>>
>> On Mon, Aug 16, 2010 at 9:49 PM, Dor Laor wrote:
>>>
>>> On 08/16/2010 10:00 PM, Alex Rixhardson wrote:

 Hi guys,

 I have the following configuration:

 1. host is RHEL 5.5, 64bit with KVM (version that comes out of the box
 with RHEL 5.5)
 2. two guests:
 2a: RHEL 5.5, 32bit,
 2b: RHEL 4.5, 64bit

 If I run iperf between host RHEL 5.5 and guest RHEL 5.5 inside the
 virtual network subnet I get great results (>  4Gbit/sec). But if
 I
 run
 iperf between guest RHEL 4.5 and either of the two RHELs 5.5 I get bad
 network performance (around 140Mbit/sec).
>>>
>>> Please try netperf; iperf is known to be buggy and might consume cpu
>>> w/o real justification
>>>

 The configuration was made thru virtual-manager utility, nothing
 special. I just added virtual network device to both guests.

 Could you guys give me some tips on what should I check?

 Regards,
 Alex
 --
 To unsubscribe from this list: send the line "unsubscribe kvm" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
 --
 To unsubscribe from this list: send the line "unsubscribe kvm" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM timekeeping and TSC virtualization

2010-08-20 Thread David S. Ahern


On 08/20/10 02:07, Zachary Amsden wrote:
> This patch set implements full TSC virtualization, with both
> trapping and passthrough modes, and intelligent mode switching.
> As a result, TSC will never go backwards, we are stable against
> guest re-calibration attempts, VM reset, and migration.  For guests
> which require it, the TSC khz can even be preserved on migration
> to a new host.
> 
> The TSC will never be trapped on UP systems unless the host TSC
> actually runs faster than the guest; other conditions, including
> bad hardware and changing speeds are accomodated by using catchup
> mode to keep the guest passthrough TSC in line with the host clock.

What's the overhead of trapping TSC reads for Nehalem-type processors?

gettimeofday() in guests is the biggest performance problem with KVM for
me, especially for older OSes like RHEL4 which is a supported OS for
another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
get the VM to run reliably:

http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5
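
(The test behind that link boils down to timing back-to-back
gettimeofday() calls; a minimal sketch of that kind of loop, not the
exact program:

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
        struct timeval start, end, tv;
        long i, n = 1000000;
        double secs;

        gettimeofday(&start, NULL);
        for (i = 0; i < n; i++)              /* the call under test */
                gettimeofday(&tv, NULL);
        gettimeofday(&end, NULL);

        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
        printf("%ld calls in %.3f sec -> %.0f ns/call\n",
               n, secs, secs / n * 1e9);
        return 0;
}

Run against each clocksource, the per-call cost makes the differences
between them obvious.)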

David


> 
> What is still needed on top of this is a way to force TSC
> trapping, or disable it entirely, for benchmarking purposes.
> I refrained from adding that last bit because it wasn't clear
> whether the best thing to do is a global 'force TSC trapping' /
> 'force TSC passthrough' / 'intelligent choice', or if this control
> should be on a per-VM level, via an ioctl(), module parameter,
> or sysfs.
> 
> John and Thomas I have cc'd on this because it may be relevant to
> their interests and I always appreciate feedback, especially on
> a change set as large and complex as this.
> 
> Enjoy.  This time, there are no howler monkeys.  I've included
> all the feedback I got from previous rounds of this and more.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM timekeeping and TSC virtualization

2010-08-21 Thread David S. Ahern


On 08/20/10 17:24, Zachary Amsden wrote:
> On 08/20/2010 03:26 AM, David S. Ahern wrote:
>>
>> On 08/20/10 02:07, Zachary Amsden wrote:
>>   
>>> This patch set implements full TSC virtualization, with both
>>> trapping and passthrough modes, and intelligent mode switching.
>>> As a result, TSC will never go backwards, we are stable against
>>> guest re-calibration attempts, VM reset, and migration.  For guests
>>> which require it, the TSC khz can even be preserved on migration
>>> to a new host.
>>>
>>> The TSC will never be trapped on UP systems unless the host TSC
>>> actually runs faster than the guest; other conditions, including
>>> bad hardware and changing speeds are accomodated by using catchup
>>> mode to keep the guest passthrough TSC in line with the host clock.
>>>  
>> What's the overhead of trapping TSC reads for Nehalem-type processors?
>>
>> gettimeofday() in guests is the biggest performance problem with KVM for
>> me, especially for older OSes like RHEL4 which is a supported OS for
>> another 2 years. Even with RHEL5, 32-bit, I had to force kvmclock off to
>> get the VM to run reliably:
>>
>> http://article.gmane.org/gmane.comp.emulators.kvm.devel/51017/match=kvmclock+rhel5.5
>>
>>
> 
> Correctness is the biggest timekeeping problem with KVM for me.  The
> fact that you had to force kvmclock off is evidence of that.  Slightly
> slower applications are fine.  Broken ones are not acceptable.

I have been concerned with speed and correctness for a while:

http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

> 
> TSC will not be trapped with kvmclock, and the bug you hit with RHEL5
> kvmclock has since been fixed.  As you can see, it is not a simple and
> straightforward issue to get all the issues sorted out.

kvmclock is for guests running RHEL5.5 plus some update, or guests
running a very recent Linux kernel. There are a lot of products running
on OSes older than that.

> 
> Also, TSC will not be trapped with UP VMs, only SMP.  If you seriously
> believe RHEL4 will perform better as an SMP guest than several instances
> of coordinated UP guests, you would worry about this issue.  I don't. 
> The amount of upstream scalability and performance work done since that
> timeframe is enormous, to the point that it's entirely plausible that
> KVM governed UP RHEL4 guests as a cluster are faster than a RHEL4 SMP host.

Products built on RHEL3, RHEL4 or earlier RHEL5 were developed in the
past, and performance expectations were set for those versions based on
SMP - be it bare metal or virtual. You can't expect a product to be
redesigned to run on KVM.

> 
> So the answer is - it depends.  Hardware is always getting faster, and
> trap / exit cost is going down.   Right now, it is anywhere from a few
> hundred to multiple thousands of cycles, depending on your hardware.  I
> don't have an exact benchmark number I can quote, although in a couple
> of hours, I probably will.  I'll guess 3,000 cycles.
> 
> I agree, gettimeofday is a huge issue, for poorly written applications. 

I understand it is not a simple problem, and "poorly written
applications" is a bit of a reach, don't you think? There are a number
of workloads that depend on time stamps; that does not make them poorly
designed.

> Not that this means we won't speed it up, in fact, I have already done
> quite a bit of work on ways to reduce the exit cost.  Let's, however,
> get things correct before trying to make them aggressively fast.
> 
> Zach

I have also looked at timekeeping and performance of gettimeofday on a
certain proprietary hypervisor. KVM lags severely here, and workloads
dependent on timestamps are dramatically impacted. Evaluations and
decisions are made today based on current designs - both KVM and
product. Severe performance deltas raise a lot of flags.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM timekeeping and TSC virtualization

2010-08-23 Thread David S. Ahern


On 08/23/10 19:44, Zachary Amsden wrote:
>> I have also looked at timekeeping and performance of gettimeofday on a
>> certain proprietary hypervisor. KVM lags severely here, and workloads
>> dependent on timestamps are dramatically impacted. Evaluations and
>> decisions are made today based on current designs - both KVM and
>> product. Severe performance deltas raise a lot of flags.
>>
> 
> This is laughably incorrect.

Uh, right.

> 
> Gettimeofday is faster on KVM than anything else using TSC based clock
> because it passes the TSC through directly.   VMware traps the TSC and
> is actually slower.

Yes, it does trap the TSC to ensure it is increasing. My question
regarding trapping on KVM was about what to expect in terms of
overhead. Furthermore, if you add trapping on KVM, are TSC reads still
faster on KVM?
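
(For context, the guest-side read in question is a single instruction;
roughly this, sketched from memory:

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;

        __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
}

Untrapped it runs at hardware speed; trapped, every call becomes a
vmexit, which is where the overhead question comes from.)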

> 
> Can you please define your "severe performance delta" and tell us your
> benchmark methodology?  I'd like to help you figure out how it is flawed.

I sent you the link in the last response. Here it is again:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

TSC - fast, but has horrible time drifts

PIT - horribly slow

ACPI PM - horribly slow

HPET - did not exist in Nov. 2008, and since then has not been reliable
in my tests with RHEL4 and RHEL5

kvmclock - does not exist for RHEL4 and not usable on RHEL5 until the
update of 5.5 with the fix (I have not retried RHEL5 with the latest
maintenance kernel to verify it is stable in my use cases).

Take the program from the link above. Run it in a RHEL4 & RHEL5 guest
running on VMware for all the clock sources. Somewhere I have the data
for these comparisons -- KVM, VMware and bare metal. Same hardware, same
OS. The PIT and acpi-PM clock sources are faster on VMware than bare metal.


My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
RHEL5.Y (whenever it proves reliable). What about the present?  What
about products based on other distributions newer than RHEL5 but
pre-kvmclock?

There are a lot of moving windows of what to use as a clock source, not
just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
stability on RHEL4 -- e.g.,
https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
maintenance releases (kvmclock requiring RHEL5.5+). That is not very
friendly to a product making a transition to virtualization - and with
the same code base running bare metal or in a VM.

David


> 
> Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM timekeeping and TSC virtualization

2010-08-24 Thread David S. Ahern


On 08/23/10 23:47, Zachary Amsden wrote:
> I've heard the rumor that TSC is orders of magnitude faster under VMware
> than under KVM from three people now, and I thought you were part of
> that camp.
> 
> Needless to say, they are either laughably incorrect, or possess some
> great secret knowledge of how to make things under virtualization go
> faster than bare metal.
> 
> I also have a magical talking unicorn, which, btw, is invisible. 
> Extraordinary claims require extraordinary proof (the proof of my
> unicorn is too complex to fit in the margin of this e-mail, however, I
> assure you he is real).

I have put in a lot of time over the past 3 years to understand how the
'magic' of virtualization works; please don't lump me into camps until I
raise my hand as being part of one.


>> My point is that kvmclock is Red Hat's answer for the future -- RHEL6,
>> RHEL5.Y (whenever it proves reliable). What about the present?  What
>> about products based on other distributions newer than RHEL5 but
>> pre-kvmclock?
>>
> 
> It should be obvious from this patchset... PIT or TSC.
> 
> KVM did not have an in-kernel PIT implementation circa 2008, so this
> data is quite old.  It's much faster now and will continue to get faster
> as exit cost goes down and the emulation gets further optimized.

KVM has had an in-kernel PIT since early 2008 (kernel git entry):

commit 7837699fa6d7adf81f26ab73a5f6897ea1ab9d6a
Author: Sheng Yang 
Date:   Mon Jan 28 05:10:22 2008 +0800

KVM: In kernel PIT model


> 
> Plus, now we have an error-free TSC.
> 
>> There are a lot of moving windows of what to use as a clock source, not
>> just per major number (RHEL4, RHEL5) but minor number (e.g., TSC
>> stability on RHEL4 -- e.g.,
>> https://bugzilla.redhat.com/show_bug.cgi?id=491154) and further
>> maintenance releases (kvmclock requiring RHEL5.5+). That is not very
>> friendly to a product making a transition to virtualization - and with
>> the same code base running bare metal or in a VM.
>>
> 
> If you have old software running on broken hardware you do not get
> hardware performance and error-free time virtualization.  With any
> vendor.  Period.

Sucks to be old *and* broken. But old with fancy new wheels, er hardware
-- like commodity x86 servers running Nehalem-based processors -- is a
different story.

> 
> With this patchset, KVM now has a much stronger guarantee: If you have
> old guest software running on broken hardware, using SMP virtual
> machines, you do not get hardware performance and error-free time
> virtualization.  However, if you have new guest software, non-broken
> hardware, or can simply run UP guests instead of SMP, you can have
> hardware performance, and it is now error free.  Alternatively, you can
> sacrifice some accuracy and have hardware performance, even for SMP
> guests, if you can tolerate some minor cross-CPU TSC variation.  No
> other vendor I know of can make that guarantee.
> 
> Zach

If the processor has a stable TSC why trap it? I realize you are trying
to cover a gamut of hardware and guests, so maybe a nerd knob is needed
to disable the trapping.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Write to read-only msr MSR_IA32_PERF_STATUS is harmless, ignore it!

2010-08-31 Thread David S. Ahern


On 08/31/10 11:04, Jes Sorensen wrote:
>> Just grep for the msr name in a guest kernel source that's known to
>> trigger the message.
> 
> Been there, done that! This happens with an F13 kernel during reboot.
> Ran the search on the expanded 2.6.32.8-149 tree and found no reference
> to anything trying to write it, except for KVM backing up the flag, but
> that shouldn't happen in the guest.

I've been seeing the messages with a 32-bit Fedora 10 guest running a
2.6.27 variant. The wrmsr messages are generated by the time the grub
menu appears, i.e., pre-OS.

David

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm and initrd

2010-09-14 Thread David S. Ahern


On 09/14/10 00:35, Nirmal Guhan wrote:
> Hi,
> 
> Getting an error while booting my guest with -initrd option as in :
> 
> qemu-kvm -net nic,macaddr=$macaddress -net tap,script=/etc/qemu-ifup
> -m 512 -hda /root/kvm/x86/vdisk.img -kernel /root/mvroot/bzImage
> -initrd /root/kvm/mv/ramdisk.img -append "root=/dev/ram0"
> 
> No filesystem could mount root, tried: ext3 ext2 ext4 vfat msdos iso9660
> Kernel panic
> 
> #file ramdisk.img
> #ramdisk.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean)

What's the size of ramdisk.img?

David


>
> I tried with both above initrd and gzipped initrd but same error.
> 
> If I try to mount the same file and do a -append  "ip=dhcp
> root=/dev/nfs rw nfsroot=:/root/kvm/mv/mnt" instead of -initrd
> option, it works  fine. So am guessing this is initrd related.
> 
> Any help would be much appreciated.
> 
> Thanks,
> Nirmal
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: high load with usb device

2010-09-14 Thread David S. Ahern
What's your clock source on the host?
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

With the usb tablet device the host clock source is read 2-3 times more
frequently, which for acpi_pm and hpet jacks up the CPU load.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: high load with usb device

2010-09-14 Thread David S. Ahern


On 09/14/10 10:00, Michael Tokarev wrote:
> 14.09.2010 19:51, David S. Ahern пишет:
>> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> It's tsc (AthlonII CPU).  Also available are hpet and acpi_pm.
> Switching to hpet or acpi_pm does not have visible effect, at
> least not while the guest is running.


acpi_pm takes more cycles to read. On my laptop switching from hpet to
acpi_pm caused the winxp VM to jump up in CPU usage. For both time
sources 'perf top -p <pid>' shows timer reads as the top
function for qemu-kvm (e.g., read_hpet).

On a Nehalem box the clock source is TSC. 'perf top -p <pid>' for a
winxp VM does not show clock reads at all.

David


> 
> Thanks!
> 
> /mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: high load with usb device

2010-09-14 Thread David S. Ahern


On 09/14/10 10:29, Michael Tokarev wrote:

> For comparison, here's the same strace stats without -usbdevice:
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  97.70    0.080237          22      3584           select
>   1.09    0.000895           0      6018      3584 read
>   0.33    0.000271           0      9670           clock_gettime
>   0.31    0.000254           0      6086           gettimeofday
>   0.26    0.000210           0      2432           rt_sigaction
>   0.17    0.000137           0      3653           timer_gettime
>   0.15    0.000122           0      2778           timer_settime
>   0.00    0.000000           0         1           ioctl
>   0.00    0.000000           0         1           rt_sigpending
>   0.00    0.000000           0         1         1 rt_sigtimedwait
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.082126                 34224      3585 total
> 
> Yes, it is still doing lots of unnecessary stuff, but the load
> is <1%.

Without a USB device attached the controller is turned off. See the call
to qemu_del_timer() in uhci_frame_timer(). As soon as you add the tablet
device the polling starts (see qemu_mod_timer in uhci_ioport_writew) and
the CPU load starts.
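
To make that concrete, here is a hedged sketch of the re-arming
frame-timer pattern; the stub types and API names below are stand-ins
approximating qemu's hw/usb-uhci.c of that era, not the actual code:

#include <stdint.h>

typedef struct UHCIState {
    uint16_t cmd;              /* UHCI command register */
    void *frame_timer;         /* qemu timer handle */
    int64_t expire_time;
} UHCIState;

#define UHCI_CMD_RS       (1 << 0)  /* Run/Stop bit */
#define FRAME_TIMER_FREQ  1000      /* USB schedules 1 ms frames */

/* Stubs standing in for qemu's timer API of that era. */
extern void qemu_del_timer(void *t);
extern void qemu_mod_timer(void *t, int64_t when);
extern int64_t ticks_per_sec;

static void uhci_frame_timer(void *opaque)
{
    UHCIState *s = opaque;

    if (!(s->cmd & UHCI_CMD_RS)) {
        /* Controller stopped: cancel the timer, so an idle
         * controller costs nothing. */
        qemu_del_timer(s->frame_timer);
        return;
    }

    /* ... process one 1 ms frame of the schedule ... */

    /* Re-arm for the next frame; each shot involves reading the host
     * clock, which is what shows up as read_hpet/acpi_pm load. */
    s->expire_time += ticks_per_sec / FRAME_TIMER_FREQ;
    qemu_mod_timer(s->frame_timer, s->expire_time);
}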

David


> 
> (This is without host_alarm_handler() in qemu_notify_event())
> 
> /mjt
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm and initrd

2010-09-14 Thread David S. Ahern


On 09/14/10 13:38, Nirmal Guhan wrote:
> On Tue, Sep 14, 2010 at 8:38 AM, David S. Ahern  wrote:
>>
>>
>> On 09/14/10 00:35, Nirmal Guhan wrote:
>>> Hi,
>>>
>>> Getting an error while booting my guest with -initrd option as in :
>>>
>>> qemu-kvm -net nic,macaddr=$macaddress -net tap,script=/etc/qemu-ifup
>>> -m 512 -hda /root/kvm/x86/vdisk.img -kernel /root/mvroot/bzImage
>>> -initrd /root/kvm/mv/ramdisk.img -append "root=/dev/ram0"
>>>
>>> No filesystem could mount root, tried: ext3 ext2 ext4 vfat msdos iso9660
>>> Kernel panic
>>>
>>> #file ramdisk.img
>>> #ramdisk.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean)
> 
> I tried with both initrd and initramfs. Sizes are 42314699 and
> 71271136 respectively. Sizes do seem larger but I created them from
> the nfsroot created as part of the build (the nfsroot works
> apparently).

See if you can drop the image size as a test. I had to do that recently
to get the kernel/initrd/append option to work. As I recall I was
getting the same error message until I dropped the initrd size.

David

> 
>>
>> What's the size of ramdisk.img?
>>
>> David
>>
>>
>>>
>>> I tried with both above initrd and gzipped initrd but same error.
>>>
>>> If I try to mount the same file and do a -append  "ip=dhcp
>>> root=/dev/nfs rw nfsroot=:/root/kvm/mv/mnt" instead of -initrd
>>> option, it works  fine. So am guessing this is initrd related.
>>>
>>> Any help would be much appreciated.
>>>
>>> Thanks,
>>> Nirmal
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: .img on nfs, relative on ram, consuming mass ram

2010-09-16 Thread David S. Ahern


On 09/16/10 03:09, Andre Przywara wrote:
> Which is only natural, as tmpfs is promising to never swap. So it will

Pages in tmpfs can swap. That's the difference between ramfs and tmpfs. From
Documentation/filesystems/tmpfs.txt:

"tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
unneeded pages out to swap space."

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 08:29, Anthony Liguori wrote:

>> I recently used it to investigate the performance benefit. In a Linux
>> guest, I was running a program that calls gettimeofday() 'n' times
>> in a loop (the PM Timer register is read during each call). With
>> in-kernel PM Timer, I observed a significant reduction of program
>> execution time.
>>
> 
> I've played with this in the past.  Can you post real numbers,
> preferably, with a real work load?

2 years ago I posted relative comparisons of the time sources for older
RHEL guests:
http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

What's the relative speed of the in-kernel pmtimer compared to the PIT?

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 12:49, Anthony Liguori wrote:
> But that doesn't tell you what the impact is in real world workloads. 
> Before we start pushing all device emulation into the kernel, we need to
> quantify how often gettimeofday() is really called in real workloads.

The workload that inspired that example program calls gtod upwards of
1000 times per second at its current max load. The overhead of
gettimeofday was the biggest factor when comparing performance to bare
metal and ESX. That's why I wrote the test program --- it boils a
complex product/program down to a single system call.

David

> 
> Regards,
> 
> Anthony Liguori
> 
>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>
>> David
>>
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/4] KVM in-kernel PM Timer implementation

2010-12-14 Thread David S. Ahern


On 12/14/10 14:46, Anthony Liguori wrote:
> On 12/14/2010 01:54 PM, David S. Ahern wrote:
>>
>> On 12/14/10 12:49, Anthony Liguori wrote:
>>   
>>> But that doesn't tell you what the impact is in real world workloads.
>>> Before we start pushing all device emulation into the kernel, we need to
>>> quantify how often gettimeofday() is really called in real workloads.
>>>  
>> The workload that inspired that example program at its current max load
>> calls gtod upwards of 1000 times per second. The overhead of
>> gettimeofday was the biggest factor when comparing performance to bare
>> metal and esx. That's why I wrote the test program --- boils a complex
>> product/program to a single system call.
>>
> 
> So the absolute performance impact was on the order of what?

At the time I did the investigations (18-24 months ago) KVM was on the
order of 15-20% worse for a RHEL4 based workload and the overhead
appeared to be due to the PIT or PM timer as the clock source. Switching
the clock to the TSC brought the performance on par with bare metal, but
that route has other issues.

> 
> The difference in CPU time of a light weight vs. heavy weight exit
> should be something like 2-3us.  That would mean 2-3ms of CPU time at a
> rate of 1000 per second.

The PIT causes 3 VMEXITs for each gettimeofday (get_offset_pit in RHEL4):

/* timer count may underflow right here */
outb_p(0x00, PIT_MODE); /* latch the count ASAP */
...
count = inb_p(PIT_CH0); /* read the latched count */
...
count |= inb_p(PIT_CH0) << 8;
...
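/* each outb_p/inb_p above is a trapped port access -> one VMEXIT apiece */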


David


> 
> That should be pretty much in the noise.
> 
> There are possibly second order effects that might make a large impact
> such as contention with the qemu_mutex.  It's worth doing
> experimentation to see if a non-mutex acquiring fast path in userspace
> also resulted in a significant performance boost.
> 
> Regards,
> 
> Anthony Liguori
> 
>> David
>>
>>   
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>> 
>>>> What's the relative speed of the in-kernel pmtimer compared to the PIT?
>>>>
>>>> David
>>>>
>>>>
>>>  
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iPhone with KVM?

2010-10-12 Thread David S. Ahern


On 10/12/10 11:32, Jan Kiszka wrote:
> Am 12.10.2010 10:32, Kenni Lund wrote:
>> 2010/10/12 Jun Koi :
>>> hi,
>>>
>>> i have a guest Windows on KVM, with iTunes installed on that. then i
>>> let my guest to have direct access to my iPhone connecting to my
>>> physical USB port, using "usb_add" command.
>>> but i got a serious problem: "usb_add" doesnt seem to work, as my
>>> guest Windows never sees my iPhone.
>>
>> The USB emulation in qemu/qemu-kvm doesn't support USB 2.0 at the
>> moment, only USB 1.1.
>>
>>> so it seems that giving guest Windows the direct access to USB port is
>>> not enough. any idea why this happens? any solution?
>>
>> Wait for USB 2.0 support to arrive or try to do PCI Passthrough of a
>> USB card/controller. I have yet to come across a USB card/controller
>> that actually works with PCI passthrough, though :/
> 
> You could also give the ehci branch a try:
> 
> git://git.kiszka.org/qemu.git ehci
> 
> I merge current qemu master earlier today, but it's not tested yet.
> 
> Jan
> 

I tried a few months ago and it did not work with the iphone.

David

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


using ftrace with kvm

2010-04-22 Thread David S. Ahern
I have a VM that is spinning (both vcpus at 100%). As I recall kvm_stat
has been deprecated in favor of ftrace. Is there a wiki page or document
that gives suggestions on this?

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern
After a few days of debugging I think kvmclock is the source of lockups
for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
locks up on another.

Server 1 - VM locks up repeatedly
-- DL580 G5
-- 4 quad-core X7350 processors at 2.93GHz
-- 48GB RAM

Server 2 - VM works just fine
-- DL380 G6
-- 2 quad-core E5540 processors at 2.53GHz
-- 24GB RAM

Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
kernel. I have tried various versions of qemu-kvm -- the version in
FC-12 and the version for FC-12 in virt-preview. In both cases the
qemu-kvm command line is identical.

VM
- RHEL5.5, PAE kernel (also tried standard 32-bit)
- 2 vcpus
- 3GB RAM
- virtio network and disk

When the VM locks up both vcpu threads are spinning at 100%. Changing
the clocksource to jiffies appears to have addressed the problem.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
so often. See attached.

David


On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
> I'm having a problem with a virtual machine running under RHEL 5.4
> 64-bit.  I take out the CD / insert a new and the main machine sees the
> new cd and makes it available.  However, the virtual machines still see
> the old CD.  I've tried mounting the new CD, but it just keeps mounting
> what it "thinks" is in there - the old one.
> 
> Any ideas?
> 
> 
> Matt Burkhardt
> Impari Systems, Inc.
> 
> m...@imparisystems.com
> http://www.imparisystems.com 
> http://www.linkedin.com/in/mlburkhardt 
> http://www.twitter.com/matthewboh
> 502 Fairview Avenue
> Frederick, MD  21701
> work (301) 682-7901
> cell   (301) 802-3235
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--- qemu/block-raw-posix.c.orig 2010-01-06 22:27:56.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 22:29:51.0 -0700
@@ -193,20 +193,40 @@
 static int raw_pread_aligned(BlockDriverState *bs, int64_t offset,
  uint8_t *buf, int count)
 {
 BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0)
 return ret;
 
+/* media changes are only detected at the host layer when
+ * something reopens the cdrom device. Without an event 
+ * notice, we need a heuristic. Try the following which mimics
+ * what is done for floppy drives. Here we reopen the cdrom
+ * after 3 seconds of elapsed time - this should be short
+ * enough to cover a user inserting a new disk and then accessing
+ * it via the CLI/GUI.
+ */
+if (bs->type == BDRV_TYPE_CDROM) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last) > 3000)
+ret = cdrom_reopen(bs);
+else
+   ret = 0;
+last = now;
+if (ret < 0)
+   return ret;
+}
+
 if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) {
 ++(s->lseek_err_cnt);
 if(s->lseek_err_cnt <= 10) {
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] lseek failed : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, errno, strerror(errno));
 }
 return -1;
 }
--- qemu/hw/ide.c.orig  2010-01-06 22:28:02.0 -0700
+++ qemu/hw/ide.c   2010-01-06 22:30:45.0 -0700
@@ -1456,20 +1456,28 @@
 s->cd_sector_size = sector_size;
 
 /* XXX: check if BUSY_STAT should be set */
 s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
 ide_dma_start(s, ide_atapi_cmd_read_dma_cb);
 }
 
 static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
int sector_size)
 {
+if (s->is_cdrom) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last) > 3000)
+(void) cdrom_reopen(s->bs);
+last = now;
+}
+
 #ifdef DEBUG_IDE_ATAPI
 printf("read %s: LBA=%d nb_sectors=%d\n", s->atapi_dma ? "dma" : "pio",
lba, nb_sectors);
 #endif
 if (s->atapi_dma) {
 ide_atapi_cmd_read_dma(s, lba, nb_sectors, sector_size);
 } else {
 ide_atapi_cmd_read_pio(s, lba, nb_sectors, sector_size);
 }
 }


Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
oops. the previous patch rides on top of this one.

David


On 04/23/2010 12:18 PM, David S. Ahern wrote:
> I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
> so often. See attached.
> 
> David
> 
> 
> On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
>> I'm having a problem with a virtual machine running under RHEL 5.4
>> 64-bit.  I take out the CD / insert a new and the main machine sees the
>> new cd and makes it available.  However, the virtual machines still see
>> the old CD.  I've tried mounting the new CD, but it just keeps mounting
>> what it "thinks" is in there - the old one.
>>
>> Any ideas?
>>
>>
>> Matt Burkhardt
>> Impari Systems, Inc.
>>
>> m...@imparisystems.com
>> http://www.imparisystems.com 
>> http://www.linkedin.com/in/mlburkhardt 
>> http://www.twitter.com/matthewboh
>> 502 Fairview Avenue
>> Frederick, MD  21701
>> work (301) 682-7901
>> cell   (301) 802-3235
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--- qemu/block-raw-posix.c.orig 2010-01-06 21:46:31.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 21:54:22.0 -0700
@@ -107,20 +107,24 @@
 int fd_got_error;
 int fd_media_changed;
 #endif
 uint8_t* aligned_buf;
 } BDRVRawState;
 
 static int posix_aio_init(void);
 
 static int fd_open(BlockDriverState *bs);
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
 {
 BDRVRawState *s = bs->opaque;
 int fd, open_flags, ret;
 
 posix_aio_init();
 
 s->lseek_err_cnt = 0;
 
 open_flags = O_BINARY;
@@ -212,29 +216,32 @@
 if (ret == count)
 goto label__raw_read__success;
 
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] read failed %d : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, ret, errno, strerror(errno));
 
 /* Try harder for CDrom. */
 if (bs->type == BDRV_TYPE_CDROM) {
-lseek(s->fd, offset, SEEK_SET);
-ret = read(s->fd, buf, count);
-if (ret == count)
-goto label__raw_read__success;
-lseek(s->fd, offset, SEEK_SET);
-ret = read(s->fd, buf, count);
-if (ret == count)
+int i;
+for (i = 0; i < 2; ++i) {
+#if defined(__linux__)
+ret = cdrom_reopen(bs);
+if (ret < 0)
 goto label__raw_read__success;
-
+#endif
+lseek(s->fd, offset, SEEK_SET);
+ret = read(s->fd, buf, count);
+if (ret == count)
+goto label__raw_read__success;
+}
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] retry read failed %d : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, ret, errno, strerror(errno));
 }
 
 label__raw_read__success:
 
 return ret;
 }
@@ -1025,20 +1032,27 @@
 printf("Floppy opened\n");
 #endif
 }
 if (!last_media_present)
 s->fd_media_changed = 1;
 s->fd_open_time = qemu_get_clock(rt_clock);
 s->fd_got_error = 0;
 return 0;
 }
 
+int cdrom_reopen(BlockDriverState *bs)
+{
+/* mimics a 'change' monitor command - without the eject */
+bdrv_close(bs);
+return bdrv_open2(bs, bs->filename, 0, bs->drv);
+}
+
 static int raw_is_inserted(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
 int ret;
 
 switch(s->type) {
 case FTYPE_CD:
 ret = ioctl(s->fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
 if (ret == CDS_DISC_OK)
 return 1;
--- qemu/hw/ide.c.orig  2010-01-06 21:54:33.0 -0700
+++ qemu/hw/ide.c   2010-01-06 21:56:16.0 -0700
@@ -29,20 +29,24 @@
 #include "pcmcia.h"
 #include "block.h"
 #include "block_int.h"
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "ppc_mac.h"
 #include "sh.h"
 #include 
 #include 
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 /* debug IDE devices */
 //#define DEBUG_IDE
 //#define DEBUG_IDE_ATAPI
 //#define DEBUG_AIO
 #define USE_DMA_CDROM
 
 /* Bits of HD_STATUS */
 #define ERR_STAT   0x01
 #define INDEX_STAT 0x02
 #define ECC_STAT   0x04/* Corrected error */
@@ -1363,20 +1

Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern


On 04/23/2010 03:39 PM, Zachary Amsden wrote:
> On 04/23/2010 10:39 AM, Brian Jackson wrote:
>> On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
>>   
>>> After a few days of debugging I think kvmclock is the source of lockups
>>> for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
>>> locks up on another.
>>>
>>> Server 1 - VM locks up repeatedly
>>> -- DL580 G5
>>> -- 4 quad-core X7350 processors at 2.93GHz
>>> -- 48GB RAM
>>>
>>> Server 2 - VM works just fine
>>> -- DL380 G6
>>> -- 2 quad-core E5540 processors at 2.53GHz
>>> -- 24GB RAM
>>>
>>> Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
>>> kernel. I have tried various versions of qemu-kvm -- the version in
>>> FC-12 and the version for FC-12 in virt-preview. In both cases the
>>> qemu-kvm command line is identical.
>>>
>>> VM
>>> - RHEL5.5, PAE kernel (also tried standard 32-bit)
>>> - 2 vcpus
>>> - 3GB RAM
>>> - virtio network and disk
>>>
>>> When the VM locks up both vcpu threads are spinning at 100%. Changing
>>> the clocksource to jiffies appears to have addressed the problem.
>>>  
>>
>> Does changing the guest to -smp 1 help?
>>
>>
> 
> Based on our current understanding of the problem, it should help, but
> it may not prevent the problem entirely.
> 
> There are three issues with kvmclock due to sampling:
> 
> 1) smp clock alignment may be slightly off due to timing conditions
> 2) kvmclock is resampled at each switch of vcpu to another pcpu
> 3) kvmclock granularity exceeds that of kernel timespec, which means
> sampling errors may show even on UP
> 
> Recommend using a different clocksource (tsc is great if you have stable
> TSC and don't migrate across different-speed machines) until we have all
> the fixes in place.

That's my plan for now. As I recall jiffies was the default in early
RHEL5 versions. Not sure what that means hardware-wise.

The biggest problem for me is that RHEL5.5 defaults to kvmclock; I'll
find some workaround for it.

David

> 
> Zach
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern


On 04/23/2010 04:21 PM, BRUNO CESAR RIBAS wrote:
> 
> Could you try hpet? I had similar problem with multicore and multiCPU (per
> mother board) [even with constant_tsc].
> 
> Since I changed the guest to hpet i had no more problems.

It's stable in the sense of no lockups yet, but it is a much slower time
source from a gettimeofday perspective compared to tsc and jiffies
(based on its speed, jiffies appears to be tsc-based).

David

> 
>>
>> David
>>
>>>
>>> Zach
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mount and unmount CD Bug reporting

2010-04-25 Thread David S. Ahern
It's been a while since I dug into this; as I recall, the media change
needs to be detected to flush cached data. I believe mounting and
unmounting the DVD in the host will work, as will dropping the cache on
the host. I needed an event mechanism rather than polling for a
once-in-a-blue-moon change, hence the patch.

It is not seen with newer OS versions (e.g., Fedora 11, 12) because of
the media polling (hal service I think).

David


On 04/25/2010 09:32 AM, Matt Burkhardt wrote:
> Thanks - but I don't feel comfortable hacking this - I wanted to try and
> get this reported as a bug, but the directions for submitting a bug are
> too restrictive for me to do this.  For example, it says that you have
> to install the latest version of kvm, run and compile it with different
> flags.  If I had a development box, I would be happy to do this, but
> without one, I can't come close to the requirements for submitting a
> bug.  Could someone else do this?  Since two of us are having the same
> issue with two different versions of RHEL, it seems like it would point
> to a bug....
> 
> On Fri, 2010-04-23 at 12:52 -0600, David S. Ahern wrote:
>> oops. the previous patch rides on top of this one.
>>
>> David
>>
>>
>> On 04/23/2010 12:18 PM, David S. Ahern wrote:
>> > I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
>> > so often. See attached.
>> > 
>> > David
>> > 
>> > 
>> > On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
>> >> I'm having a problem with a virtual machine running under RHEL 5.4
>> >> 64-bit.  I take out the CD / insert a new and the main machine sees the
>> >> new cd and makes it available.  However, the virtual machines still see
>> >> the old CD.  I've tried mounting the new CD, but it just keeps mounting
>> >> what it "thinks" is in there - the old one.
>> >>
>> >> Any ideas?
>> >>
>> >>
>> >> Matt Burkhardt
>> >> Impari Systems, Inc.
>> >>
>> >> m...@imparisystems.com <mailto:m...@imparisystems.com>
>> >> http://www.imparisystems.com 
>> >> http://www.linkedin.com/in/mlburkhardt 
>> >> http://www.twitter.com/matthewboh
>> >> 502 Fairview Avenue
>> >> Frederick, MD  21701
>> >> work (301) 682-7901
>> >> cell   (301) 802-3235
>> >>
>> >>
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> the body of a message to majord...@vger.kernel.org 
>> >> <mailto:majord...@vger.kernel.org>
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
> 
> 
> Matt Burkhardt
> Impari Systems, Inc.
> 
> m...@imparisystems.com <mailto:m...@imparisystems.com>
> http://www.imparisystems.com
> http://www.linkedin.com/in/mlburkhardt
> http://www.twitter.com/matthewboh
> 502 Fairview Avenue
> Frederick, MD  21701
> work (301) 682-7901
> cell   (301) 802-3235
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM and the OOM-Killer

2010-05-13 Thread David S. Ahern

>> Not sure I like the idea of running a 64bit user space kernel on top
>> of a 32bit host, prefer to re-install.
>>
>> Can I just replace my kernel with a 64bit one, or do I have to
>> re-install the host O/S ?
> 
> You can run 32-bit userspace with a 64-bit kernel, or reinstall,
> whichever you prefer.
> 
> I once upgraded a 32-bit Fedora install to 64-bit, but that takes some
> tinkering.
> 

You can just install a 64-bit kernel. For rpm-based systems you have to
"unpack" the rpm using rpm2cpio. The modules.dep file cannot be updated
-- you need to generate that elsewhere -- and mkinitrd needs to be
modified to not try to strip modules (s,strip,true,).

That's all I had to do to plop a 64-bit kernel onto a 32-bit install.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-07 Thread David S. Ahern


On 06/07/10 09:26, Avi Kivity wrote:

> The original motivation for moving the PIC and IOAPIC into the kernel
> was performance, especially for assigned devices.  Both devices are high
> interaction since they deal with interrupts; practically after every
> interrupt there is either a PIC ioport write, or an APIC bus message,
> both signalling an EOI operation.  Moving the PIT into the kernel
> allowed us to catch up with missed timer interrupt injections, and
> speeded up guests which read the PIT counters (e.g. tickless guests).
> 
> However, modern guests running on modern qemu use MSI extensively; both
> virtio and assigned devices now have MSI support; and the planned VFIO
> only supports kernel delivery via MSI anyway; line based interrupts will
> need to be mediated by userspace.

The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
RHEL3) use the PIT for timekeeping and will still be around for a while.
RHEL4 and RHEL5 will be around for a long time to come. Not sure how
those fit within the "modern" label, though I see my RHEL4 guest is
using the pit as a timesource.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Moving the kvm ioapic, pic, and pit back to userspace

2010-06-07 Thread David S. Ahern


On 06/07/10 12:46, Avi Kivity wrote:
> On 06/07/2010 07:31 PM, David S. Ahern wrote:
>>
>> On 06/07/10 09:26, Avi Kivity wrote:
>>
>>   
>>> The original motivation for moving the PIC and IOAPIC into the kernel
>>> was performance, especially for assigned devices.  Both devices are high
>>> interaction since they deal with interrupts; practically after every
>>> interrupt there is either a PIC ioport write, or an APIC bus message,
>>> both signalling an EOI operation.  Moving the PIT into the kernel
>>> allowed us to catch up with missed timer interrupt injections, and
>>> speeded up guests which read the PIT counters (e.g. tickless guests).
>>>
>>> However, modern guests running on modern qemu use MSI extensively; both
>>> virtio and assigned devices now have MSI support; and the planned VFIO
>>> only supports kernel delivery via MSI anyway; line based interrupts will
>>> need to be mediated by userspace.
>>>  
>> The "modern" guest comment is a bit concerning. 2.4 kernels (e.g.,
>> RHEL3) use the PIT for timekeeping and will still be around for a while.
>> RHEL4 and RHEL5 will be around for a long time to come. Not sure how
>> those fit within the "modern" label, though I see my RHEL4 guest is
>> using the pit as a timesource.
>>
> 
> First of all, the existing code will remain for a long while (several
> years).  We still have to support existing userspace.
> 
> But, that's not a satisfactory answer.  I don't want users to choose
> which device model to use according to their guest.  As far as I'm
> concerned all guests are triple-boot with the guest rebooting to a
> different OS every half hour.
> 
> So it's important to know how often your RHEL3/4 guest queries the PIT
> (not just receives interrupts, actually reads the counter) under a
> realistic load.  If you have such a number (in reads/sec) that would be
> a good input to this discussion.
> 

Apps that invoke gettimeofday a lot. As I recall RHEL3 uses the TSC
between timer interrupts, but RHEL4 samples counters on each
gettimeofday call:

http://www.mail-archive.com/kvm@vger.kernel.org/msg07231.html

Because of that, the performance of applications that timestamp log
entries (like a certain product I work on) takes a hit on KVM unless the
TSC is the clock source.
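
The structural difference, as a hedged sketch -- every name below is an
assumed stand-in, not the actual RHEL code:

extern unsigned long wall_usec_at_last_tick;   /* updated at each timer tick */
extern unsigned long long tsc_at_last_tick;
extern unsigned long long rdtsc(void);
extern unsigned long tsc_to_usec(unsigned long long cycles);
extern unsigned int read_pit_count(void);      /* 3 trapped port accesses */
extern unsigned long pit_to_usec(unsigned int count);

/* RHEL3-style: extrapolate from the last timer tick using the TSC;
 * no emulated device access, so it stays cheap in a VM. */
unsigned long gtod_tsc_style(void)
{
    return wall_usec_at_last_tick +
           tsc_to_usec(rdtsc() - tsc_at_last_tick);
}

/* RHEL4-style with the PIT clock source: latch and read the PIT
 * counter on every call; in a VM each port access traps to the
 * hypervisor and is emulated (see get_offset_pit earlier in this
 * archive). */
unsigned long gtod_pit_style(void)
{
    return wall_usec_at_last_tick +
           pit_to_usec(read_pit_count());
}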

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm networking part last

2010-06-28 Thread David S. Ahern


On 06/28/10 16:26, SuNeEl wrote:
> I have been trying desperately to achieve virtual networking with kvm,
> but somehow I failed each time.. rather a lot of unclear tutorials using
> different methods to achieve common goals made me confused -- bridging,
> vmnet, tun, etc etc, routing, iptables forwarding, everything in one
> pipe ... but before I give up I just throw this question before you guys
> 
> 
> Host ------- guest1 ------ guest2 ------ guest3
> 192.168.1.1  192.168.1.3   192.168.1.4   192.168.1.5
> eth0
> 
> I wanted to use host eth0 to ssh to all guests + don't want to lose
> connectivity to the host as well.
> 
> tell me if this is a dream in one shot so I give up looking for more
> positiveness in virtual networking


I use both direct connect and host-only networking setups. In both cases
qemu is configured to use tap devices (-net tap).

VM's directly connected to LAN:

 .--.   .--.   .--.
 | VM 1 |   | VM 2 |  ...  | VM N |
 '--'   '--'   '--'
 |  |  |
 .--.   .--.   .--.
 | tap  |   | tap  |  ...  | tap  |
 '--'   '--'   '--'
 |  |  |
 '--
  |
  .---.
  |  br0  |
  '---'
  |
  .---.
  | eth0  |
  '---'
  |  LAN
 <-->


Host-side configuration:

/etc/sysconfig/network-scripts/ifcfg-eth0:
DEVICE=eth0
ONBOOT=yes
BRIDGE=mainbr0

/etc/sysconfig/network-scripts/ifcfg-mainbr0
DEVICE=mainbr0
ONBOOT=yes
BOOTPROTO=dhcp

In this case the VMs show up on the LAN just like any other node.


I also have the option to connect VM's to a host-only network:

 .--.   .--.   .--.
 | VM 1 |   | VM 2 |  ...  | VM N |
 '--'   '--'   '--'
 |  |  |
 .--.   .--.   .--.
 | tap  |   | tap  |  ...  | tap  |
 '--'   '--'   '--'
 |  |  |
 '--
  |
  .---..--.
  |  br1  |<---| iptables |
  '---''--'
 |
 |
 v
 .---.
 | eth0  |
 '---'
LAN  |
 <-->

For br1, I chose to manually create it at boot time using an rc-script:

brctl addbr hostbr1
ifconfig hostbr1 <ip> netmask <netmask> up

VM access to off-box resources is handled through iptables:
iptables -t nat -A PREROUTING -i hostbr1 -j ACCEPT

Direct access to a VM is handled by port redirection:
iptables -t nat -A PREROUTING -p tcp --dport <port> \
-j DNAT --to-destination <guest ip>:22

iptables -t nat -A PREROUTING -p tcp --dport 2022 \
-j DNAT --to-destination 169.254.1.2:22

i.e., ssh -p 2022 user@host is redirected to port 22 for the VM with the
IP 169.254.1.2.

Which networking setup (or both in some cases) I use for specific VM
depends on the purpose of the VM.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-06-02 Thread David S. Ahern

Avi Kivity wrote:
> David S. Ahern wrote:
>>> I haven't been able to reproduce this:
>>>
>>>
>>>> [EMAIL PROTECTED] root]# ps -elf | grep -E 'memuser|kscand'
>>>> 1 S root 7 1  1  75   0- 0 schedu 10:07 ?  
>>>> 00:00:26 [kscand]
>>>> 0 S root  1464 1  1  75   0- 196986 schedu 10:20 pts/0 
>>>> 00:00:21 ./memuser 768M 120 5 300
>>>> 0 S root  1465 1  0  75   0- 98683 schedu 10:20 pts/0  
>>>> 00:00:10 ./memuser 384M 300 10 600
>>>> 0 S root  2148  1293  0  75   0-   922 pipe_w 10:48 pts/0  
>>>> 00:00:00 grep -E memuser|kscand
>>>>   
>>> The workload has been running for about half an hour, and kswapd cpu
>>> usage doesn't seem significant.  This is a 2GB guest running with my
>>> patch ported to kvm.git HEAD.  Guest is has 2G of memory.
>>>
>>> 
>>
>> I'm running on the per-page-pte-tracking branch, and I am still seeing
>> it.
>> I doubt you want to sit and watch the screen for an hour, so install
>> sysstat if not already, change the sample rate to 1 minute
>> (/etc/cron.d/sysstat), let the server run for a few hours and then run
>> 'sar -u'. You'll see something like this:
>>
>> 10:12:11 AM   LINUX RESTART
>>
>> 10:13:03 AM   CPU %user %nice   %system   %iowait %idle
>> 10:14:01 AM   all  0.08  0.00  2.08  0.35 97.49
>> 10:15:03 AM   all  0.05  0.00  0.79  0.04 99.12
>> 10:15:59 AM   all  0.15  0.00  1.52  0.06 98.27
>> 10:17:01 AM   all  0.04  0.00  0.69  0.04 99.23
>> 10:17:59 AM   all  0.01  0.00  0.39  0.00 99.60
>> 10:18:59 AM   all  0.00  0.00  0.12  0.02 99.87
>> 10:20:02 AM   all  0.18  0.00 14.62  0.09 85.10
>> 10:21:01 AM   all  0.71  0.00 26.35  0.01 72.94
>> 10:22:02 AM   all  0.67  0.00 10.61  0.00 88.72
>> 10:22:59 AM   all  0.14  0.00  1.80  0.00 98.06
>> 10:24:03 AM   all  0.13  0.00  0.50  0.00 99.37
>> 10:24:59 AM   all  0.09  0.00 11.46  0.00 88.45
>> 10:26:03 AM   all  0.16  0.00  0.69  0.03 99.12
>> 10:26:59 AM   all  0.14  0.00 10.01  0.02 89.83
>> 10:28:03 AM   all  0.57  0.00  2.20  0.03 97.20
>> Average:  all  0.21  0.00  5.55  0.05 94.20
>>
>>
>> every one of those jumps in %system time directly correlates to kscand
>> activity. Without the memuser programs running the guest %system time
>> is <1%. The point of this silly memuser program is just to use high
>> memory -- let it age, then make it active again, sit idle, repeat. If
>> you run kvm_stat with -l in the host you'll see the jump in pte
>> writes/updates. An intern here added a timestamp to the kvm_stat
>> output for me which helps to directly correlate guest/host data.
>>
>>
>> I also ran my real guest on the branch. Performance at boot through
>> the first 15 minutes was much better, but I'm still seeing recurring
>> hits every 5 minutes when kscand kicks in. Here's the data from the
>> guest for the first one which happened after 15 minutes of uptime:
>>
>> active_anon_scan: HighMem, age 11, count[age] 24886 -> 5796, direct
>> 24845, dj 59
>>
>> active_anon_scan: HighMem, age 7, count[age] 47772 -> 21289, direct
>> 40868, dj 103
>>
>> active_anon_scan: HighMem, age 3, count[age] 91007 -> 329, direct
>> 45805, dj 1212
>>
>>   
> 
> We touched 90,000 ptes in 12 seconds.  That's 8,000 ptes per second. 
> Yet we see 180,000 page faults per second in the trace.
> 
> Oh!  Only 45K pages were direct, so the other 45K were shared, with
> perhaps many ptes.  We shoud count ptes, not pages.
> 
> Can you modify page_referenced() to count the numbers of ptes mapped (1
> for direct pages, nr_chains for indirect pages) and print the total
> deltas in active_anon_scan?
> 

Here you go. I've shortened the line lengths to get them to squeeze into
80 columns:

anon_scan, all HighMem zone, 187,910 active pages at loop start:
  count[12] 21462 -> 230,   direct 20469, chains 3479,   dj 58
  count[11] 1338  -> 1162,  direct 227,   chains 26144,  dj 59
  count[8] 29397  -> 5410,  direct 26115, chains 27617,  dj 117
  count[4] 35804  -> 25556, direct 31508, chains 82929,  dj 256
  count[3] 2738   -> 2207,  direct 2680,  chains 58, dj 7
  count[0] 92580  -> 89509, direct 75024, chains 262834, dj 726
(age number is the index in [])

Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-06-05 Thread David S. Ahern


Avi Kivity wrote:
> David S. Ahern wrote:
>>> Oh!  Only 45K pages were direct, so the other 45K were shared, with
>>> perhaps many ptes.  We shoud count ptes, not pages.
>>>
>>> Can you modify page_referenced() to count the numbers of ptes mapped (1
>>> for direct pages, nr_chains for indirect pages) and print the total
>>> deltas in active_anon_scan?
>>>
>>> 
>>
>> Here you go. I've shortened the line lengths to get them to squeeze into
>> 80 columns:
>>
>> anon_scan, all HighMem zone, 187,910 active pages at loop start:
>>   count[12] 21462 -> 230,   direct 20469, chains 3479,   dj 58
>>   count[11] 1338  -> 1162,  direct 227,   chains 26144,  dj 59
>>   count[8] 29397  -> 5410,  direct 26115, chains 27617,  dj 117
>>   count[4] 35804  -> 25556, direct 31508, chains 82929,  dj 256
>>   count[3] 2738   -> 2207,  direct 2680,  chains 58, dj 7
>>   count[0] 92580  -> 89509, direct 75024, chains 262834, dj 726
>> (age number is the index in [])
>>
>>   
> 
> Where do all those ptes come from?  that's 180K pages (most of highmem),
> but with 550K ptes.
> 
> The memuser workload doesn't use fork(), so there shouldn't be any
> indirect ptes.
> 
> We might try to unshadow the fixmap page; that means we don't have to do
> 4 fixmap pte accesses per pte scanned.
> 
> The kernel uses two methods for clearing the accessed bit:
> 
> For direct pages:
> 
>if (pte_young(*pte) && ptep_test_and_clear_young(pte))
>referenced++;
> 
> (two accesses)
> 
> For indirect pages:
> 
>if (ptep_test_and_clear_young(pte))
>referenced++;
> 
> (one access)
> 
> Which have to be emulated if we don't shadow the fixmap.  With the data
> above, that translates to 700K emulations with your numbers above, vs
> 2200K emulations, a 3X improvement.  I'm not sure it will be sufficient
> given that we're reducing a 10-second kscand scan into a 3-second scan.
> 

A 3-second scan is much better, and in comparison to where kvm was when
I started this e-mail thread (as high as 30 seconds for a scan) it's a
10-fold improvement.

I gave a shot at implementing your suggestion, but evidently I am still
not understanding the shadow implementation. Can you suggest a patch to
try this out?

david
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-06-18 Thread David S. Ahern
Avi:

We did not get a chance to do this at the Forum. I'd be interested in
whatever options you have for reducing the scan time further (e.g., try
to get scan time down to 1-2 seconds).

thanks,

david


Avi Kivity wrote:
> David S. Ahern wrote:
>> I gave a shot at implementing your suggestion, but evidently I am still
>> not understanding the shadow implementation. Can you suggest a patch to
>> try this out?
>>   
> 
> We can have a hacking session in kvm forum.  Bring a guest on your laptop.
> 
> It isn't going to be easy to both fix the problem and also not introduce
> a regression somewhere else.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-06-23 Thread David S. Ahern
Avi Kivity wrote:
> David S. Ahern wrote:
>> Avi:
>>
>> We did not get a chance to do this at the Forum. I'd be interested in
>> whatever options you have for reducing the scan time further (e.g., try
>> to get scan time down to 1-2 seconds).
>>
>>   
> 
> I'm unlikely to get time to do this properly for at least a week, as
> this will be quite difficult and I'm already horribly backlogged. 
> However there's an alternative option, modifying the source and getting
> it upstreamed, as I think RHEL 3 is still maintained.
> 
> The attached patch (untested) should give a 3X boost for kmap_atomics,
> by folding the two accesses to set the pte into one, and by dropping the
> access that clears the pte.  Unfortunately it breaks the ABI, since
> external modules will inline the original kmap_atomic() which expects
> the pte to be cleared.
> 
> This can be worked around by allocating new fixmap slots for kmap_atomic
> with the new behavior, and keeping the old slots with the old behavior,
> but we should first see if the maintainers are open to performance
> optimizations targeting kvm.
> 
RHEL3 is in Maintenance mode (for an explanation see
http://www.redhat.com/security/updates/errata/) which means performance
enhancement patches will not make it in.

Also, I'm going to be out of the office for a couple of weeks in July,
so I will need to put this aside until mid-August or so. I'll reevaluate
options then.

david
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Stable kvm version ?

2008-07-08 Thread David S. Ahern
There's a bug opened for the network lockups -- see
http://sourceforge.net/tracker/index.php?func=detail&aid=1802082&group_id=180599&atid=893831

Based on my testing I've found that the e1000 has the lowest overhead
(e.g., lowest irq and softirq times in the guest). I have not seen any
lockups with the network using the e1000 nic, and a couple of months ago
I was able to run a reasonably intensive network load continuously for
several days.

However, the duration tests I've run were with a modified BIOS. Months
ago when I was digging into the network lockups I was comparing
interrupt allocations to a DL320G3 running a RHEL3/4 load natively. I
noticed no interrupts were shared on bare hardware, while in my RHEL3/4
based kvm guests I was seeing interrupt sharing. So, I patched the bios
(see attached) to get a different usage.

I have not had time to do the due diligence to see if the stability was
due to kvm updates or my bios change. If you have the time I'd be
interested in knowing how the bios change works for you -- if you still
see lockups.

david


Freddie Cash wrote:
> On Fri, Jul 4, 2008 at 6:57 PM, David Mair <[EMAIL PROTECTED]> wrote:
>> Slohm Gadaburi wrote:
>>> I found out I can't use Ubuntu's kvm package because it doesn't
>>> support vm snapshots.
>>>
>>> I am going to use a vanilla kvm and was wondering which version do you
>>> recommend me to use
>>> (my biggest concern is stability) ?
>> I have no stability problems with a mix of Windows and Linux guests using
>> kvm-70 on a x86_64 kernel 2.6.22.18. I've had one Linux guest up all of the
>> past week while testing something. YMMV.
> 
> I have no stability issues with kvm-69 on 64-bit Debian Lenny with
> kernel 2.6.24, using the kvm-amd module from the kernel package, when
> using the rtl8139 NIC.
> 
> I can lock up any of my VMs when using the e1000 NIC and doing massive
> data transfers (rsync, scp, wget), in Debian (Etch/Lenny), Windows XP
> (SP2/SP3), or FreeBSD (6.3/7.0) guests.  And also when using the
> virtio NIC or block drivers in Debian Lenny guests.  Haven't tracked
> down what causes the problem, or how to reliably cause it to happen
> (sometimes right away, sometimes it's fine for a week), which is why I
> haven't posted any bug reports on it as yet.
> 
> For now, all my VMs are using emulated NICs and block devices.
--- bios/rombios32.c.orig	2008-06-17 07:36:35.0 -0600
+++ bios/rombios32.c	2008-06-17 07:37:02.0 -0600
@@ -619,21 +619,21 @@
 
 typedef struct PCIDevice {
 int bus;
 int devfn;
 } PCIDevice;
 
 static uint32_t pci_bios_io_addr;
 static uint32_t pci_bios_mem_addr;
 static uint32_t pci_bios_bigmem_addr;
 /* host irqs corresponding to PCI irqs A-D */
-static uint8_t pci_irqs[4] = { 10, 10, 11, 11 };
+static uint8_t pci_irqs[4] = { 10, 11, 7, 3 };
 static PCIDevice i440_pcidev;
 
 static void pci_config_writel(PCIDevice *d, uint32_t addr, uint32_t val)
 {
 outl(0xcf8, 0x8000 | (d->bus << 16) | (d->devfn << 8) | (addr & 0xfc));
 outl(0xcfc, val);
 }
 
 static void pci_config_writew(PCIDevice *d, uint32_t addr, uint32_t val)
 {


Re: kvm guest loops_per_jiffy miscalibration under host load

2008-07-11 Thread David S. Ahern
What's the status with this for full virt guests?

I am still seeing systematic time drifts in RHEL 3 and RHEL4 guests
which I've been digging into for the past few days. In the course of it I
have been launching guests with boosted priority (both nice -20 and
realtime priority (RR 1)) on a nearly 100% idle host.

One host is a PowerEdge 2950 running RHEL5.2 with kvm-70. With the
realtime priority boot I have routinely seen bogomips in the guest which
do not make sense. e.g.,

ksyms.2:bogomips: 4639.94
ksyms.2:bogomips: 4653.05
ksyms.2:bogomips: 4653.05
ksyms.2:bogomips: 24.52

and

ksyms.3:bogomips: 4639.94
ksyms.3:bogomips: 4653.05
ksyms.3:bogomips: 16.33
ksyms.3:bogomips: 12.87


Also, if I launch qemu with the "-no-kvm-pit -tdf" option the guest
panics with the message Marcelo posted at the start of the thread:



Calibrating delay loop... 4653.05 BogoMIPS

CPU: L2 cache: 2048K

Intel machine check reporting enabled on CPU#2.

CPU2: Intel QEMU Virtual CPU version 0.9.1 stepping 03

Booting processor 3/3 eip 2000

Initializing CPU#3

masked ExtINT on CPU#3

ESR value before enabling vector: 

ESR value after enabling vector: 

Calibrating delay loop... 19.60 BogoMIPS

CPU: L2 cache: 2048K

Intel machine check reporting enabled on CPU#3.

CPU3: Intel QEMU Virtual CPU version 0.9.1 stepping 03

Total of 4 processors activated (14031.20 BogoMIPS).

ENABLING IO-APIC IRQs

Setting 4 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 4 ... ok.
..TIMER: vector=0x31 pin1=0 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ...  failed.
...trying to set up timer as Virtual Wire IRQ... failed.
...trying to set up timer as ExtINT IRQ... failed :(.
Kernel panic: IO-APIC + timer doesn't work! pester [EMAIL PROTECTED]



I'm just looking for stable guest times. I'm not planning to keep the
boosted guest priority, just using it to ensure the guest is not
interrupted as I try to understand why the guest systematically drifts.

david


Glauber Costa wrote:
> Glauber Costa wrote:
>> On Mon, Jul 7, 2008 at 4:21 PM, Anthony Liguori <[EMAIL PROTECTED]>
>> wrote:
>>> Marcelo Tosatti wrote:
 On Mon, Jul 07, 2008 at 03:27:16PM -0300, Glauber Costa wrote:

>> I agree.  A paravirt solution solves the problem.
>>
> Please, look at the patch I've attached.
>
> It does  __delay with host help. This may have the nice effect of not
> busy waiting for long-enough delays, and may well.
>
> It is _completely_ PoC, just to show the idea. It's ugly, broken,
> obviously have to go through pv-ops, etc.
>
> Also, I intend to add a lpj field in the kvm clock memory area. We
> could do just this later, do both, etc.
>
> If we agree this is a viable solution, I'll start working on a patch
>
 This stops interrupts from being processed during the delay. And also
 there are cases like this recently introduced break:

/* Allow RT tasks to run */
preempt_enable();
rep_nop();
preempt_disable();

 I think it would be better to just pass the lpj value via paravirt and
 let the guest busy-loop as usual.

>>> I agree.  VMI and Xen already pass a cpu_khz paravirt value.  Something
>>> similar would probably do the trick.
>>
>> yeah, there is a pv-op for this, so I won't have to mess with the
>> clock interface. I'll draft a patch for it, and sent it.
>>
>>> It may be worthwhile having udelay() or spinlocks call into KVM if
>>> they've
>>> been spinning long enough but I think that's a separate discussion.
>>
>> I think it is, but I'd have to back it up with numbers. measurements
>> are on the way.
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>>
>>
>>
>>
> How about this? RFC only for now
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm guest loops_per_jiffy miscalibration under host load

2008-07-12 Thread David S. Ahern

Marcelo Tosatti wrote:
> Hi David,
> 
> On Fri, Jul 11, 2008 at 03:18:54PM -0600, David S. Ahern wrote:
>> What's the status with this for full virt guests?
> 
> The consensus seems to be that fullvirt guests need assistance from the
> management app (libvirt) to have boosted priority during their boot
> stage, so loops_per_jiffy calibration can be performed safely. As Daniel
> pointed out this is tricky because you can't know for sure how long the
> boot up will take, if for example PXE is used.

I boosted the priority of the guest to investigate an idea that maybe
some startup calibration in the guest was off slightly leading to
systematic drifting. I was on vacation last week and I am still catching
up with traffic on this list. I just happened to see your first message
with the panic which aligned with one of my tests.

> 
> Glauber is working on some paravirt patches to remedy the situation.
> 
> But loops_per_jiffy is not directly related to clock drifts, so this
> is a separate problem.
> 
>> I am still seeing systematic time drifts in RHEL 3 and RHEL4 guests
>> which I've been digging into it the past few days. 
> 
> All time drift issues we were aware of are fixed in kvm-70. Can you
> please provide more details on how you see the time drifting with
> RHEL3/4 guests? It slowly but continually drifts or there are large
> drifts at once? Are they using TSC or ACPIPM as clocksource?

The attached file shows one example of the drift I am seeing. It's for a
4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
pinned to a physical cpu using taskset. The only activity on the host is
this one single guest; the guest is relatively idle -- about 4% activity
(~1% user, ~3% system time). Host is synchronized to an ntp server; the
guest is not. The guest is started with the -localtime parameter.  From
the file you can see the guest gains about 1-2 seconds every 5 minutes.

Since it's a RHEL3 guest I believe the PIT is the only choice (how to
confirm?), though it does read the TSC (i.e., use_tsc is 1).

> 
> Also, most issues we've seen could only be replicated with dyntick
> guests.
> 
> I'll try to reproduce it locally.
> 
>> In the course of it I have been launching guests with boosted priority
>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>> host.
> 
> Can you also see wacked bogomips without boosting the guest priority?

The wacked bogomips only shows up when started with real-time priority.
With the 'nice -20' it's sane and close to what the host shows.

As another data point I restarted the RHEL3 guest using the -no-kvm-pit
and -tdf options (nice -20 priority boost). After 22 hours of uptime,
the guest is 29 seconds *behind* the host. Using the in-kernel pit the
guest time is always fast compared to the host.

I've seen similar drifting in RHEL4 guests, but I have not spent as much
time investigating it yet. On ESX adding clock=pit to the boot
parameters for RHEL4 guests helps immensely.

david
host-dt   host time    guest time   guest-host-diff
300       1215748151   1215748262   111
301       1215748452   1215748563   111
300       1215748752   1215748865   113
300       1215749052   1215749165   113
300       1215749352   1215749466   114
301       1215749653   1215749768   115
300       1215749953   1215750069   116
300       1215750253   1215750369   116
300       1215750553   1215750671   118
301       1215750854   1215750972   118
300       1215751154   1215751273   119
300       1215751454   1215751575   121
300       1215751754   1215751875   121
301       1215752055   1215752176   121
300       1215752355   1215752477   122
300       1215752655   1215752780   125
300       1215752955   1215753083   128
301       1215753256   1215753385   129
300       1215753556   1215753686   130
300       1215753856   1215753988   132
300       1215754156   1215754289   133
301       1215754457   1215754592   135
300       1215754757   1215754894   137
300       1215755057   1215755198   141
300       1215755357   1215755499   142
301       1215755658   1215755799   141
300       1215755958   1215756101   143
300       1215756258   1215756402   144
300       1215756558   1215756702   144
301       1215756859   1215757005   146
300       1215757159   1215757307   148
300       1215757459   1215757609   150
301       1215757760   1215757910   150
300       1215758060   1215758211   151
300       1215758360   1215758515   155
300       1215758660   1215758816   156
301       1215758961   1215759118   157
300       1215759261   1215759418   157
300       1215759561   1215759720   159
300       1215759861   1215760022   161
301       1215760162   1215760323   161
300       1215760462   1215760625   163
300       1215760762   1215760927   165

Re: kvm guest loops_per_jiffy miscalibration under host load

2008-07-22 Thread David S. Ahern
I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
short of it is that all of them keep time quite well with 1 vcpu. In the
case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
smp kernels, again with only 1 vcpu (there's no up/smp distinction in
the kernels for RHEL5).

As soon as the number of vcpus is >1, time drifts systematically with
the guest *leading* the host. I see this on unloaded guests and hosts
(i.e., cpu usage on the host below ~5%). The drift is averaging around
0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
wall time).

This is very reproducible. All I am doing is installing stock RHEL3.8, 4.4
and 5.2, i386 versions, starting them and watching the drift with no
time servers. In all of these recent cases the results are for in-kernel
pit.

more in-line below.


Marcelo Tosatti wrote:
> On Sat, Jul 12, 2008 at 01:28:13PM -0600, David S. Ahern wrote:
>>> All time drift issues we were aware of are fixed in kvm-70. Can you
>>> please provide more details on how you see the time drifting with
>>> RHEL3/4 guests? It slowly but continually drifts or there are large
>>> drifts at once? Are they using TSC or ACPIPM as clocksource?
>> The attached file shows one example of the drift I am seeing. It's for a
>> 4-way RHEL3 guest started with 'nice -20'. After startup each vcpu was
>> pinned to a physical cpu using taskset. The only activity on the host is
>> this one single guest; the guest is relatively idle -- about 4% activity
>> (~1% user, ~3% system time). Host is synchronized to an ntp server; the
>> guest is not. The guest is started with the -localtime parameter.  From
>> the file you can see the guest gains about 1-2 seconds every 5 minutes.
>>
>> Since it's a RHEL3 guest I believe the PIT is the only choice (how to
>> confirm?), though it does read the TSC (i.e., use_tsc is 1).
> 
> Since its an SMP guest I believe its using PIT to generate periodic
> timers and ACPI pmtimer as a clock source.

Since my last post, I've been reading up on timekeeping and going
through the kernel code -- focusing on RHEL3 at the moment. AFAICT the
PIT is used for timekeeping, and the local APIC timer interrupts are
used as well (supposedly just for per-cpu system accounting, though I
have not gone through all of the code yet). I do not see references in
dmesg data regarding pmtimer; I thought RHEL3 was not ACPI aware.

> 
>>> Also, most issues we've seen could only be replicated with dyntick
>>> guests.
>>>
>>> I'll try to reproduce it locally.
>>>
>>>> In the course of it I have been launching guests with boosted priority
>>>> (both nice -20 and realtime priority (RR 1)) on a nearly 100% idle
>>>> host.
>>> Can you also see wacked bogomips without boosting the guest priority?
>> The wacked bogomips only shows up when started with real-time priority.
>> With the 'nice -20' it's sane and close to what the host shows.
>>
>> As another data point I restarted the RHEL3 guest using the -no-kvm-pit
>> and -tdf options (nice -20 priority boost). After 22 hours of uptime,
>> the guest is 29 seconds *behind* the host. Using the in-kernel pit the
>> guest time is always fast compared to the host.
>>
>> I've seen similar drifting in RHEL4 guests, but I have not spent as much
>> time investigating it yet. On ESX adding clock=pit to the boot
>> parameters for RHEL4 guests helps immensely.
> 
> The problem with clock=pmtmr and clock=tsc on older 2.6 kernels is lost
> tick and irq latency adjustments, as mentioned in the VMWare paper
> (http://www.vmware.com/pdf/vmware_timekeeping.pdf). They try to detect
> this and compensate by advancing the clock. But the delay between the
> host timer firing, injection of the guest irq and actual count read (either
> tsc or pmtimer) fools these adjustments. clock=pit has no such lost tick
> detection, so is susceptible to lost ticks under load (in theory).

I have read that document quite a few times; clock=pit is required on
esx for rhel4 guests to be sane.

> 
> The fact that qemu emulation is less susceptible to guest clock running
> faster than it should is because the emulated PIT timer is rearmed
> relative to alarm processing (next_expiration = current_time + count).
> But that also means it is susceptible to host load, i.e. the frequency is
> virtual.
> 
> The in-kernel PIT rearms relative to host clock, so the frequency is
> more reliable (next_expiration = prev_expiration + count).
> 
> So for RHEL4, clock=pit along with the following patch seems stable for
> me, no drift either direction, even under guest/host load. Can you give
> it a try with RHEL3 ? I'll be doing that shortly.

Re: [PATCH 2/2] Remove -tdf

2008-07-22 Thread David S. Ahern


Anthony Liguori wrote:
> Dor Laor wrote:
>> Anthony Liguori wrote:
>>> The last time I posted the KVM patch series to qemu-devel, the -tdf
>>> patch met with
>>> some opposition.  Since today we implement timer catch-up in the
>>> in-kernel PIT and
>>> the in-kernel PIT is used by default, it doesn't seem all that
>>> valuable to have
>>> timer catch-up in userspace too.
>>>
>>> Removing it will reduce our divergence from QEMU.
>>>
>>>   
>> IMHO the in-kernel PIT should go away; there is no reason to keep it
>> except that the userspace PIT drifts.
> 
> I agree fully :-)  But there's certainly no reason to keep -tdf and the
> in-kernel PIT.  Since we're using the in-kernel PIT right now, I'd like
> to get rid of -tdf.
> 
>> Currently both the in-kernel PIT and even the in-kernel irqchips are not
>> 100% bullet proof.
>> Of course this code is a hack; Gleb Natapov has sent a better fix for
>> PIT/RTC to the qemu list.
>> Can you look into them:
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg01181.html
> 
> Paul Brook's initial feedback is still valid.  It causes quite a lot of
> churn and may not jive well with a virtual time base.  An advantage to
> the current -tdf patch is that it's more contained.  I don't think
> either approach is going to get past Paul in its current form.
> 
> I'd still like to see some harder evidence of the benefits of tdf.  For
> a specific guest, with a specific configuration, how much better is the
> drift with this series.  The answer shouldn't be "movies play better" :-)
>

I for one see better timekeeping with RHEL3 guests -- especially when
they get busy (e.g., kscand doing its thing). You see the "time drift"
message every time it kicks in (the message does need to be
throttled/not displayed by the way; it can overwhelm a stderr capture as
the guest runs for months).

> Also, it's important that this is reproducible in upstream QEMU and not
> just in KVM.  If we can make a compelling case for the importance of
> this, we can possibly work out a compromise.

I don't have an opinion on this particular implementation, only that
something is needed to keep the guest from losing time. Right now with
the kernel-pit my RHEL guests are always *ahead* of the host; with the
qemu pit the guest is behind the host (which makes more sense if ticks
are lost). Either way I'd like the guest not to drift "noticeably",
and when it does, for ntpd to be adequate to keep it in sync (I've noticed
oddities with it too).

david


> 
> Regards,
> 
> Anthony Liguori
> 


Re: kvm guest loops_per_jiffy miscalibration under host load

2008-07-22 Thread David S. Ahern


David S. Ahern wrote:
>> The in-kernel PIT rearms relative to host clock, so the frequency is
>> more reliable (next_expiration = prev_expiration + count).
>>
>> So for RHEL4, clock=pit along with the following patch seems stable for
>> me, no drift either direction, even under guest/host load. Can you give
>> it a try with RHEL3 ? I'll be doing that shortly.
> 
> I'll give it a shot and let you know.

After 6:46 of uptime, my RHEL4 guest is only 7 seconds ahead of the
host. The RHEL3 guest is 17 seconds ahead. Both are dramatic
improvements with the patch.

david

>>
>> --
>>
>> Set the count load time to when the count is actually "loaded", not when
>> IRQ is injected.
>>
>> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
>> index c0f7872..b39b141 100644
>> --- a/arch/x86/kvm/i8254.c
>> +++ b/arch/x86/kvm/i8254.c
>> @@ -207,6 +207,7 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>>  
>>  pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>>  pt->scheduled = ktime_to_ns(pt->timer.expires);
>> +ps->channels[0].count_load_time = pt->timer.expires;
>>  
>>  return (pt->period == 0 ? 0 : 1);
>>  }
>> @@ -622,7 +623,6 @@ void kvm_pit_timer_intr_post(struct kvm_vcpu *vcpu, int vec)
>>arch->vioapic->redirtbl[0].fields.mask != 1))) {
>>  ps->inject_pending = 1;
>>  atomic_dec(&ps->pit_timer.pending);
>> -ps->channels[0].count_load_time = ktime_get();
>>  }
>>  }
>>  }
>>




Re: [patch 0/3] fix PIT injection

2008-07-27 Thread David S. Ahern
Hi Marcelo:

With kvm-72 + this patch set, timekeeping in RHEL3, RHEL4 and RHEL5
guests with 2 vcpus is much better. Approaching 5 hours of uptime and
all 3 guests are within 2 seconds of the host (and part of that delta
is due to how it's measured). I'll let all 3 run overnight and then turn on ntp
tomorrow.

Thanks for working on this,

david


Marcelo Tosatti wrote:
> The in-kernel PIT emulation can either inject too many or too few
> interrupts.
> 


PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load]

2008-07-29 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>> short of it is that all of them keep time quite well with 1 vcpu. In the
>> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
>> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
>> the kernels for RHEL5).
>>
>> As soon as the number of vcpus is >1, time drifts systematically with
>> the guest *leading* the host. I see this on unloaded guests and hosts
>> (i.e., cpu usage on the host below ~5%). The drift is averaging around
>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
>> wall time).
>>
>> This is very reproducible. All I am doing is installing stock RHEL3.8, 4.4
>> and 5.2, i386 versions, starting them and watching the drift with no
>> time servers. In all of these recent cases the results are for in-kernel
>> pit.
> 
> David,
> 
> You mentioned earlier problems with ntpd syncing the guest time? Can you
> provide more details?
> 

It would lose sync often, and 'ntpq -c pe' would show a '*' indicating
a sync when in fact time in the guest was off by 5-10 seconds. It may
very well be a side effect of the drift due to repeated injection of
timer interrupts / lost interrupts.

With your PIT injection patches:

1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
to adjust time after the initial startup tweak of 1.004620 sec by
ntpdate. After 40 hours it has maintained time very well with no
adjustments. Of course the guest is relatively idle -- it is only
keeping time.

2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
guest is running on the same host as the RHEL4 guest and using the same
time server. This guest has been around for a few weeks and has been
subjected to various tests -- like running with the -no-kvm-pit and -tdf
options. In light of 3. below I'll re-create this guest and see if the
problem goes away.

3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
synchronize just fine. We are running ntpd with different arguments;
however using the same syntax on the stock rhel3 guest did not help.

As far as time updates go, over 21+ hours of uptime there have been 20 time
resets -- adjustments ranging from -1.01 seconds to +0.75 seconds. This
is a remarkable improvement. Before this PIT patch set I was seeing time
resets of 3-5 seconds every 15 minutes. This is a 2 vcpu guest running a
modest load (disk + network) that pushes cpu usage to ~25%. Point being
that the guest is keeping time reasonably well while doing something
useful. :-)

I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today
and again with modest loads to see how it holds up.

> I find it _necessary_ to use the RR scheduling policy for any Linux
> guest running at static 1000Hz (no dynticks), otherwise timer interrupts
> will invariably be missed. And reinjection plus lost tick adjustment is
> always problematic (will drift either way, depending which version of
> Linux). With the standard batch scheduling policy _idle_ guests can wait
> to run up to 6/7 ms in my testing (thus 6/7 lost timer events). Which
> also means latency can be horrible.
> 

Noted. I'd prefer not to start priority escalations, but if it's needed

What about the RHEL4.7 kernel running at 250 HZ? I understand that
with 4.7 you can pass a command-line divider option to run the clock at a
slower rate. In the past I've recompiled RHEL4 kernels to run at 250 HZ
which was a trade-off between too fast (overhead of timer interrupts)
and too slow (need for better scheduling latency).
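
For reference, my understanding is that the divider option just gets
appended to the kernel line in grub.conf; an illustrative entry (kernel
version made up):

# divider=4 gives 1000 HZ / 4 = 250 HZ
kernel /vmlinuz-2.6.9-78.EL ro root=LABEL=/ divider=4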


david


Re: PIT/ntp/timekeeping [was Re: kvm guest loops_per_jiffy miscalibration under host load]

2008-07-29 Thread David S. Ahern
Forgot to add something in my last response: Another time-based oddity
I'm seeing in multi-processor guests is the microseconds value changing
as the process is moved between the vcpus.

The attached code exemplifies what I mean. In a RHEL3 VM with 2 vcpus,
start the program with an argument of 999999 (to get a wakeup every ~1
sec). Once started, lock it to a vcpu. You'll see nice consistent output like:

1217351975.261974
1217351976.262292
1217351977.262608
1217351978.262929
1217351979.263243
1217351980.263563
1217351981.263940

Then switch the affinity to the other vcpu. The microseconds value jumps:

1217351982.796132
1217351983.797411
1217351984.797719
1217351985.798041
1217351986.798368
1217351987.798788
1217351988.799025

Toggling the affinity or letting the process roam between the 2
processors causes the microseconds to jump. This means that data logged
using the microseconds value will show time jumps back and forth.

As I understand it the root cause is the TSC-based updates to what is
returned by gettimeofday, so the fact that the values toggle means the 2
vcpus see different tsc counts. Is there any way to make the counts
coherent as processes roam vcpus?
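
For what it's worth, the offsets can also be seen without gettimeofday()
in the middle by reading the TSC directly on each vcpu. A minimal sketch
(illustrative only; assumes x86 and the 3-argument sched_setaffinity(),
i.e., not the RHEL3 variant):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

/* read the raw cycle counter on the current cpu */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < 4; cpu++) {
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (sched_setaffinity(0, sizeof(set), &set) != 0) {
			perror("sched_setaffinity");
			return 1;
		}
		/* any per-vcpu offset shows up directly in these values */
		printf("cpu %d: tsc %llu\n", cpu,
		       (unsigned long long)rdtsc());
	}
	return 0;
}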

david


David S. Ahern wrote:
> 
> Marcelo Tosatti wrote:
>> On Tue, Jul 22, 2008 at 01:56:12PM -0600, David S. Ahern wrote:
>>> I've been running a series of tests on RHEL3, RHEL4, and RHEL5. The
>>> short of it is that all of them keep time quite well with 1 vcpu. In the
>>> case of RHEL3 and RHEL4 time is stable for *both* the uniprocessor and
>>> smp kernels, again with only 1 vcpu (there's no up/smp distinction in
>>> the kernels for RHEL5).
>>>
>>> As soon as the number of vcpus is >1, time drifts systematically with
>>> the guest *leading* the host. I see this on unloaded guests and hosts
>>> (i.e., cpu usage on the host below ~5%). The drift is averaging around
>>> 0.5%-0.6% (i.e., 5 seconds gained in the guest per 1000 seconds of real
>>> wall time).
>>>
>>> This is very reproducible. All I am doing is installing stock RHEL3.8, 4.4
>>> and 5.2, i386 versions, starting them and watching the drift with no
>>> time servers. In all of these recent cases the results are for in-kernel
>>> pit.
>> David,
>>
>> You mentioned earlier problems with ntpd syncing the guest time? Can you
>> provide more details?
>>
> 
> It would lose sync often, and 'ntpq -c pe' would show a '*' indicating
> a sync when in fact time in the guest was off by 5-10 seconds. It may
> very well be a side effect of the drift due to repeated injection of
> timer interrupts / lost interrupts.
> 
> With your PIT injection patches:
> 
> 1. For a stock RHEL4.4 guest, ntpd synchronized quickly and saw no need
> to adjust time after the initial startup tweak of 1.004620 sec by
> ntpdate. After 40 hours it has maintained time very well with no
> adjustments. Of course the guest is relatively idle -- it is only
> keeping time.
> 
> 2. For a stock RHEL3.8 guest, I cannot get ntpd to do anything. This
> guest is running on the same host as the RHEL4 guest and using the same
> time server. This guest has been around for a few weeks and has been
> subjected to various tests -- like running with the -no-kvm-pit and -tdf
> options. In light of 3. below I'll re-create this guest and see if the
> problem goes away.
> 
> 3. For a RHEL3.8 guest running a Cisco product, ntpd was able to
> synchronize just fine. We are running ntpd with different arguments;
> however using the same syntax on the stock rhel3 guest did not help.
> 
> As far as time updates go, over 21+ hours of uptime there have been 20 time
> resets -- adjustments ranging from -1.01 seconds to +0.75 seconds. This
> is a remarkable improvement. Before this PIT patch set I was seeing time
> resets of 3-5 seconds every 15 minutes. This is a 2 vcpu guest running a
> modest load (disk + network) that pushes cpu usage to ~25%. Point being
> that the guest is keeping time reasonably well while doing something
> useful. :-)
> 
> I am planning to install 4 vcpu guests for both RHEL3 and RHEL4 today
> and again with modest loads to see how it holds up.
> 
>> I find it _necessary_ to use the RR scheduling policy for any Linux
>> guest running at static 1000Hz (no dynticks), otherwise timer interrupts
>> will invariably be missed. And reinjection plus lost tick adjustment is
>> always problematic (will drift either way, depending which version of
>> Linux). With the standard batch scheduling policy _idle_ guests can wait
>> to run up to 6/7 ms in my testing (thus 6/7 lost timer events). Which
>> also means latency can be horrible.
>>
> 
> Noted. I'd 

Re: [patch 3/3] KVM: PIT: fix injection logic and count

2008-08-12 Thread David S. Ahern
Hi Marcelo:

I am seeing erroneous accounting data in RHEL3 guests which I believe I
have traced to this patch. The easiest way to see this is to run 'mpstat
1': intr/s is in the 50's (e.g., for a nearly idle guest with negligible
disk/network). This is wrong. At a minimum it should be 100 -- 100 timer
interrupts per second.

Once this caught my eye, I took a look at /proc/stat. If you take
samples 1 second apart, the difference of the sums for the 'cpu' line
should be HZ * ncpus and for each individual cpu entry (e.g., cpu0,
cpu1, etc) the result should be HZ.  In code:

function cpu_stat {
    awk -v cpu="$1" '{
        if ($1 == cpu)
        {
            sum=0
            for (i = 1; i <= NF; ++i)
                sum += $i
            print sum
        }
    }' /proc/stat
}

cpu=${1:-"cpu"}
d1=$(cpu_stat $cpu)
echo "have first sample. sleeping"
usleep 999999
d2=$(cpu_stat $cpu)

echo "delta: $(($d2 - $d1))"

I am seeing a result of 2*HZ. So for a 4 vcpu guest the delta for the
cpu line is > 800, and each cpu# line is >200.

You see the effect with the SMP kernel regardless of the number of vcpus
(1 or more) -- it's always twice what it should be and the interrupts
are always half what they should be. The accounting is fine with the
uniprocessor kernel. This suggests the problem is that lapic timer
interrupts are coming in twice as fast (or more) than they should.
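
A complementary check is the LOC (local APIC timer) line of
/proc/interrupts: its delta over one second should be about HZ * ncpus,
so a doubled rate there would confirm the lapic timer theory directly.
A quick sketch of that check (illustrative only; it assumes the usual
'LOC:' line layout):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* sum the per-cpu counts on the LOC line of /proc/interrupts */
static long loc_count(void)
{
	char line[512];
	long sum = 0;
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "LOC:", 4) == 0) {
			char *p = line + 4;
			long v;

			/* consume one count per loop iteration */
			while (sscanf(p, "%ld", &v) == 1) {
				sum += v;
				while (*p == ' ')
					p++;
				while (*p && *p != ' ')
					p++;
			}
			break;
		}
	}
	fclose(f);
	return sum;
}

int main(void)
{
	long d1 = loc_count();

	sleep(1);
	printf("LOC delta over 1 sec: %ld\n", loc_count() - d1);
	return 0;
}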

Interestingly, I only see this for RHEL5.2 as the host OS; the
accounting is fine for Fedora 9 as the host OS. In both cases it's
kvm-72 with just this patch set.

Any ideas on where the problem could be?

david


Marcelo Tosatti wrote:
> The PIT injection logic is problematic under the following cases:
> 
> 1) If there is a higher priority vector to be delivered by the time
> kvm_pit_timer_intr_post is invoked ps->inject_pending won't be set. 
> This opens the possibility for missing many PIT event injections (say if
> guest executes hlt at this point).
> 
> 2) ps->inject_pending is racy with more than two vcpus. Since there's no 
> locking
> around read/dec of pt->pending, two vcpu's can inject two interrupts for a 
> single
> pt->pending count.
> 
> Fix 1 by using an irq ack notifier: only reinject when the previous irq 
> has been acked. Fix 2 with appropriate locking around manipulation of 
> pending count and irq_ack by the injection / ack paths.
> 
> Also, count_load_time should be set at the time the count is reloaded,
> not when the interrupt is injected (BTW, LAPIC uses the same apparently
> broken scheme, could someone explain what was the reasoning behind
> that? kvm_apic_timer_intr_post).
> 
> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
> 
> Index: kvm/arch/x86/kvm/i8254.c
> ===
> --- kvm.orig/arch/x86/kvm/i8254.c
> +++ kvm/arch/x86/kvm/i8254.c
> @@ -207,6 +207,8 @@ static int __pit_timer_fn(struct kvm_kpi
>  
>   pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
>   pt->scheduled = ktime_to_ns(pt->timer.expires);
> + if (pt->period)
> + ps->channels[0].count_load_time = pt->timer.expires;
>  
>   return (pt->period == 0 ? 0 : 1);
>  }
> @@ -215,12 +217,22 @@ int pit_has_pending_timer(struct kvm_vcp
>  {
>   struct kvm_pit *pit = vcpu->kvm->arch.vpit;
>  
> - if (pit && vcpu->vcpu_id == 0 && pit->pit_state.inject_pending)
> + if (pit && vcpu->vcpu_id == 0 && pit->pit_state.irq_ack)
>   return atomic_read(&pit->pit_state.pit_timer.pending);
> -
>   return 0;
>  }
>  
> +void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
> +{
> + struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
> +  irq_ack_notifier);
> + spin_lock(&ps->inject_lock);
> + if (atomic_dec_return(&ps->pit_timer.pending) < 0)
> + WARN_ON(1);
> + ps->irq_ack = 1;
> + spin_unlock(&ps->inject_lock);
> +}
> +
>  static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
>  {
>   struct kvm_kpit_state *ps;
> @@ -255,8 +267,9 @@ static void destroy_pit_timer(struct kvm
>   hrtimer_cancel(&pt->timer);
>  }
>  
> -static void create_pit_timer(struct kvm_kpit_timer *pt, u32 val, int 
> is_period)
> +static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int 
> is_period)
>  {
> + struct kvm_kpit_timer *pt = &ps->pit_timer;
>   s64 interval;
>  
>   interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);
> @@ -268,6 +281,7 @@ static void create_pit_timer(struct kvm_
>   pt->period = (is_period == 0) ? 0 : interval;
>   pt->timer.function = pit_timer_fn;
>   atomic_set(&pt->pending, 0);
> + ps->irq_ack = 1;
>  
>   hrtimer_start(&pt->timer, ktime_add_ns(ktime_get(), interval),
> HRTIMER_MODE_ABS);
> @@ -302,11 +316,11 @@ static void pit_load_count(struct kvm *k
>   case 1:
>  /* FIXME: enhance mode 4 precision */
>   case 4:
> - create_pit_timer(

RHEL3 guests and kscand

2008-08-12 Thread David S. Ahern
Hi Avi:

Any chance you'll be able to spend time soon on the RHEL3/kscand problem?

david


microsecond time shifts as vcpu changes

2008-08-20 Thread David S. Ahern
Hi Marcelo:

Some time ago I posted a message about time shifts. Since then I changed
my sample code to basically roam the vcpus in sequence, setting affinity
and dumping time. This version better illustrates what I mean by time
shifts.

The following is from within a RHEL3 guest. The numbers are
tv_sec.tv_usec as returned from gettimeofday(), with no sleeps or rests
between affinity/gettimeofday calls:

# ./showtime 4 1

cpu 0: 1219292400.275439
cpu 1: 1219292400.667516
cpu 2: 1219292400.381351
cpu 3: 1219292401.942548

cpu 0: 1219292401.288373
cpu 1: 1219292401.678309
cpu 2: 1219292401.392143
cpu 3: 1219292402.953051

cpu 0: 1219292402.296987
cpu 1: 1219292402.686787
cpu 2: 1219292402.400609
cpu 3: 1219292403.961588

cpu 0: 1219292403.305919
cpu 1: 1219292403.695727
cpu 2: 1219292403.409547
cpu 3: 1219292404.970496

cpu 0: 1219292404.315970
cpu 1: 1219292404.705853
cpu 2: 1219292404.419674
cpu 3: 1219292405.980646

...

There are a couple of concerning things here -- the fact that time can
go backward as you shift vcpus, and also the fact that the process has
affinity set to vcpu3, sleeps for 1 second, sets affinity to vcpu0, and
the time delta between vcpu3 and the next vcpu0 reading is not 1 second.
I'm guessing both are artifacts of the same problem.

This particular host is a DL380 G5 running Fedora 9
(2.6.25.11-97.fc9.x86_64) and kvm-73 unmodified. I have a RHEL4 guest
running on a PowerEdge 2950 and do not see the shifting:

# ./showtime 4 1

cpu 0: 1219292953.320635
cpu 1: 1219292953.321332
cpu 2: 1219292953.321542
cpu 3: 1219292953.334045

cpu 0: 1219292954.336818
cpu 1: 1219292954.336896
cpu 2: 1219292954.337142
cpu 3: 1219292954.337303

cpu 0: 1219292955.339645
cpu 1: 1219292955.345344
cpu 2: 1219292955.345557
cpu 3: 1219292955.345625

cpu 0: 1219292956.348426
cpu 1: 1219292956.348507
cpu 2: 1219292956.348700
cpu 3: 1219292956.348762


Unfortunately I do not have a RHEL4 guest on the DL380 right now. I'll
copy one over when I get the time.


david


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sched.h>
#include <libgen.h>
#include <sys/time.h>
#include <time.h>

int set_affinity(unsigned int cpu)
{
	int rc = 0;
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

#ifdef MAKE_RHEL3
	if (sched_setaffinity(0, &set) != 0)
#else
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
#endif
	{
		rc = 1;
		fprintf(stderr,
			"failed to set CPU mask to %x: %s\n",
			cpu, strerror(errno));
	}

	return rc;
}


int main(int argc, char *argv[])
{
	unsigned int tsleep = 0;
	unsigned int i;
	int max_cpu;
	struct timeval tv;

	if (argc < 2) {
		printf("usage: %s ncpus [sleeptime]\n", basename(argv[0]));
		return 1;
	}

	max_cpu = atoi(argv[1]);
	if (max_cpu == 0)
		return 2;

	if (argc > 2)
		tsleep = atoi(argv[2]);

	while (1) {

		printf("\n");
		for (i = 0; i < max_cpu; ++i) {

			if (set_affinity(i) != 0)
				break;

			if (gettimeofday(&tv, NULL) != 0)
				printf("gettimeofday failed\n");
			else
				printf("cpu %d: %ld.%06ld\n", i, tv.tv_sec, tv.tv_usec);
		}

		if (!tsleep)
			break;

		sleep(tsleep);
	}

	return 0;
}


Re: time command in vm

2008-09-02 Thread David S. Ahern


Terry wrote:
> Hi All,
> 
> When we use time command in vm, we can get 'elapsed time', 'user time'
> and 'system time'.  How to explain these three times in detail?  For
> example, when we have a shadow page fault, we exit from guest to host
> for handling the fault.  So, this handling time should be considered in??
> 

I believe the time spent within kvm handling faults and such for the
guest shows up as system time to the guest.

david


> Thanks,
> Terry
> 


Re: RFC: VMX: initialize TSC offset relative to vm creation time

2008-09-10 Thread David S. Ahern
Hi Marcelo:

Dramatic improvement. The following is an example with kvm-75 and this
patch.  Without cpu affinity from a kvm perspective (vcpu-to-pcpu):

cpu 0: 1221107886.020298
cpu 1: 1221107886.020290 *
cpu 2: 1221107886.020555
cpu 3: 1221107886.020549 *

cpu 0: 1221107887.030244
cpu 1: 1221107887.030236 *
cpu 2: 1221107887.030498
cpu 3: 1221107887.030493 *

cpu 0: 1221107888.040248
cpu 1: 1221107888.040262
cpu 2: 1221107888.040314
cpu 3: 1221107888.040470

cpu 0: 1221107889.050305
cpu 1: 1221107889.050300 *
cpu 2: 1221107889.050354
cpu 3: 1221107889.050394

cpu 0: 1221107890.060384
cpu 1: 1221107890.060489
cpu 2: 1221107890.060753
cpu 3: 1221107890.060918

cpu 0: 1221107891.083559
cpu 1: 1221107891.083558 *
cpu 2: 1221107891.083614
cpu 3: 1221107891.083613 *

cpu 0: 1221107892.091705
cpu 1: 1221107892.091699 *
cpu 2: 1221107892.092998
cpu 3: 1221107892.093011

With vcpu-pcpu affinity set well after guest startup, tracking is a bit
better (fewer of the backward time steps marked '*' above).

I do not believe there's a way to set affinity as the kvm/qemu threads
are spawned (short of modifying qemu).
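
Pinning after the fact is easy enough once the thread ids are known,
along the lines of (illustrative; thread ids are made up, taken in
practice from 'ps -C kvm -L'):

taskset -pc 0 12345   # vcpu0 thread
taskset -pc 1 12346   # vcpu1 thread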

As before, RHEL3 guest. DL380G5 host.

david


Marcelo Tosatti wrote:
> VMX initializes the TSC offset for each vcpu at different times, and
> also reinitializes it for vcpus other than 0 on APIC SIPI message.
> 
> This bug causes the TSC's to appear unsynchronized in the guest, even if
> the host is good.
> 
> Older Linux kernels don't handle the situation very well, so
> gettimeofday is likely to go backwards in time:
> 
> http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
> http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831
> 
> Fix it by initializing the offset of each vcpu relative to vm creation
> time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
> APIC MP init path.
> 
> 
> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
> 
> 
> Index: kvm.tip/arch/x86/kvm/vmx.c
> ===
> --- kvm.tip.orig/arch/x86/kvm/vmx.c
> +++ kvm.tip/arch/x86/kvm/vmx.c
> @@ -850,11 +850,8 @@ static u64 guest_read_tsc(void)
>   * writes 'guest_tsc' into guest's timestamp counter "register"
>   * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
>   */
> -static void guest_write_tsc(u64 guest_tsc)
> +static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
>  {
> - u64 host_tsc;
> -
> - rdtscll(host_tsc);
>   vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>  }
>  
> @@ -918,6 +915,7 @@ static int vmx_set_msr(struct kvm_vcpu *
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   struct kvm_msr_entry *msr;
> + u64 host_tsc;
>   int ret = 0;
>  
>   switch (msr_index) {
> @@ -943,7 +941,8 @@ static int vmx_set_msr(struct kvm_vcpu *
>   vmcs_writel(GUEST_SYSENTER_ESP, data);
>   break;
>   case MSR_IA32_TIME_STAMP_COUNTER:
> - guest_write_tsc(data);
> + rdtscll(host_tsc);
> + guest_write_tsc(data, host_tsc);
>   break;
>   case MSR_P6_PERFCTR0:
>   case MSR_P6_PERFCTR1:
> @@ -2202,6 +2201,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
>   vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
>   vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK);
>  
> + guest_write_tsc(0, vmx->vcpu.kvm->arch.vm_init_tsc);
>  
>   return 0;
>  }
> @@ -2292,8 +2292,6 @@ static int vmx_vcpu_reset(struct kvm_vcp
>   vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
>   vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
>  
> - guest_write_tsc(0);
> -
>   /* Special registers */
>   vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
>  
> Index: kvm.tip/arch/x86/kvm/x86.c
> ===
> --- kvm.tip.orig/arch/x86/kvm/x86.c
> +++ kvm.tip/arch/x86/kvm/x86.c
> @@ -4250,6 +4250,8 @@ struct  kvm *kvm_arch_create_vm(void)
>   INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>   INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
>  
> + rdtscll(kvm->arch.vm_init_tsc);
> +
>   return kvm;
>  }
>  
> Index: kvm.tip/include/asm-x86/kvm_host.h
> ===
> --- kvm.tip.orig/include/asm-x86/kvm_host.h
> +++ kvm.tip/include/asm-x86/kvm_host.h
> @@ -377,6 +377,7 @@ struct kvm_arch{
>  
>   struct page *ept_identity_pagetable;
>   bool ept_identity_pagetable_done;
> + u64 vm_init_tsc;
>  };
>  
>  struct kvm_vm_stat {


Re: [patch 00/13] RFC: out of sync shadow

2008-09-11 Thread David S. Ahern
Hi Marcelo:

This patchset causes my RHEL3 guest to hang during boot at one of the
early sym53c8xx messages:

sym53c8xx: at PCI bus 0, device 5, functions 0

Using ide instead of scsi the guest proceeds farther, but inevitably
hangs as well. I've tried dropping the amount of ram to 1024 and varied
the number of vcpus as well (including 1 vcpu).

When it hangs kvm on the host is spinning on one of the cpus, and
kvm/qemu appears to be 1 thread short. For the kvm process I expect to
see 2 + Nvcpus threads (ps -C kvm -L). With this patchset I see 2 +
Nvcpus - 1. (e.g., I usually run with 4 vcpus, so there should be 6
threads. I see only 5).

I'm using kvm-git tip from a couple of days ago + this patch set. kvm
userspace comes from kvm-75. Resetting to kvm-git and the guest starts
up just fine.

david


Marcelo Tosatti wrote:
> Keep shadow pages temporarily out of sync, allowing more efficient guest
> PTE updates in comparison to trap-emulate + unprotect heuristics. Stolen
> from Xen :)
> 
> This version only allows leaf pagetables to go out of sync, for
> simplicity, but can be enhanced.
> 
> VMX "bypass_guest_pf" feature on prefetch_page breaks it (since new
> PTE writes need no TLB flush, I assume). Not sure if its worthwhile to
> convert notrap_nonpresent -> trap_nonpresent on unshadow or just go 
> for unconditional nonpaging_prefetch_page.
> 
> * Kernel builds on 4-way 64-bit guest improve 10% (+ 3.7% for
> get_user_pages_fast). 
> 
> * lmbench's "lat_proc fork" microbenchmark latency is 40% lower (a
> shadow worst scenario test).
> 
> * The RHEL3 highpte kscand hangs go from 5+ seconds to < 1 second.
> 
> * Windows 2003 Server, 32-bit PAE, DDK build (build -cPzM 3):
> 
> Windows 2003 Checked 64 Bit Build Environment, 256M RAM
> 1-vcpu:
> vanilla + gup_fast    oos
> 0:04:37.375           0:03:28.047  (- 25%)
> 
> 2-vcpus:
> vanilla + gup_fast    oos
> 0:02:32.000           0:01:56.031  (- 23%)
> 
> 
> Windows 2003 Checked Build Environment, 1GB RAM
> 2-vcpus:
> vanilla + fast_gup    oos
> 0:02:26.078           0:01:50.110  (- 24%)
> 
> 4-vcpus:
> vanilla + fast_gup    oos
> 0:01:59.266           0:01:29.625  (- 25%)
> 
> And I think other optimizations are possible now, for example the guest
> can be responsible for remote TLB flushing on kvm_mmu_pte_write().
> 
> Please review.
> 
> 



Re: [patch 00/13] RFC: out of sync shadow

2008-09-12 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Thu, Sep 11, 2008 at 10:05:12PM -0600, David S. Ahern wrote:
> 
> David,
> 
> Please reload the kvm-intel module with "bypass_guest_pf=0" option.
> 
> 

DOH. You mentioned that in your description, and I forgot to disable it.
Works fine now.

Guest behavior is amazing. After 30 min of uptime, kscand shows only
17 secs of cpu usage. Before, kscand was hitting over a minute after
about 10 minutes of uptime.

david


Re: [patch 00/13] RFC: out of sync shadow

2008-09-12 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Fri, Sep 12, 2008 at 09:12:03AM -0600, David S. Ahern wrote:
>>
>> Marcelo Tosatti wrote:
>>> On Thu, Sep 11, 2008 at 10:05:12PM -0600, David S. Ahern wrote:
>>>
>>> David,
>>>
>>> Please reload the kvm-intel module with "bypass_guest_pf=0" option.
>>>
>>>
>> DOH. You mentioned that in your description, and I forgot to disable it.
>>  Works fine now.
>>
>> Guest behavior is amazing. After 30 min of uptime, kscand shows only
>> 17secs of cpu usage. Before, kscand was hitting over a minute after
>> about 10 minutes of uptime.
> 
> Great. Note that its not fully optimized for the RHEL3 highpte kscand
> case, which calls invlpg for present pagetable entries.
> 
> The current patchset simply invalidates the shadow on invlpg, meaning
> that the next access will cause a pagefault exit.
> 
> It can instead read the guest pte and prefault, saving one exit per
> test-and-clear-accessed operation.
> 

Meaning performance will get even better? Sweet.

david


Re: [Qemu-devel] 8139cp problems - steps to reproduce

2008-09-15 Thread David S. Ahern
Last February I dug into where it was getting stuck. See:

http://article.gmane.org/gmane.comp.emulators.kvm.devel/13537/match=pci%5fset%5firq

and follow up posts.

For the past 5-6 months I've been using the e1000 nic in rhel3 and rhel4
guests without a problem -- and without the need for guest hacks like
noapic.
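
Switching models is just a command line change; an illustrative
invocation (adjust the -net options to your setup):

qemu-system-x86_64 -m 1024 -smp 2 rhel4.img \
    -net nic,model=e1000 -net tap,ifname=tap0,script=/etc/qemu-ifup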

david



xming wrote:
> hi,
> 
> I am running kvm-74 and it's getting worse for me (compared to 73, or
> 70/71, which had no issues). I tried virtio, rtl8139 and e1000;
> the network will stall. With 8139 it happens very quickly (a few MB via
> nfs), and I noticed that I can bring the network back up
> by repeatedly toggling the stalled NIC (in the guest) between promisc
> and -promisc.
> 
> I can now perfectly reproduced the stall and un-stall.
> 
> When it's stalled I noticed on the host that the nic is not totally
> gone; arp broadcasts still get out.
> 
> Any ideas?


Re: KVM: PIC: enhance IPI avoidance

2008-09-23 Thread David S. Ahern
This patch worked very nicely for me -- about an 8% performance
improvement for my workload.

david


Marcelo Tosatti wrote:
> KVM: PIC: enhance IPI avoidance
> 
> The PIC code makes little effort to avoid kvm_vcpu_kick(), resulting in
> unnecessary guest exits in some conditions.
> 
> For example, if the timer interrupt is routed through the IOAPIC, IRR
> for IRQ 0 will get set but not cleared, since the APIC is handling the
> acks.
> 
> This means that every time an interrupt < 16 is triggered, the priority
> logic will find IRQ0 pending and send an IPI to vcpu0 (in case IRQ0 is
> not masked, which is Linux's case).
> 
> Introduce a new variable isr_ack to represent the IRQ's for which the
> guest has been signalled / cleared the ISR. Use it to avoid more than
> one IPI per trigger-ack cycle, in addition to the avoidance when ISR is
> set in get_priority().
> 
> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
> 
> 
> Index: kvm/arch/x86/kvm/i8259.c
> ===
> --- kvm.orig/arch/x86/kvm/i8259.c
> +++ kvm/arch/x86/kvm/i8259.c
> @@ -33,6 +33,7 @@
>  static void pic_clear_isr(struct kvm_kpic_state *s, int irq)
>  {
>   s->isr &= ~(1 << irq);
> + s->isr_ack |= (1 << irq);
>  }
>  
>  /*
> @@ -213,6 +214,7 @@ void kvm_pic_reset(struct kvm_kpic_state
>   s->irr = 0;
>   s->imr = 0;
>   s->isr = 0;
> + s->isr_ack = 0xff;
>   s->priority_add = 0;
>   s->irq_base = 0;
>   s->read_reg_select = 0;
> @@ -444,10 +446,14 @@ static void pic_irq_request(void *opaque
>  {
>   struct kvm *kvm = opaque;
>   struct kvm_vcpu *vcpu = kvm->vcpus[0];
> + struct kvm_pic *s = pic_irqchip(kvm);
> + int irq = pic_get_irq(&s->pics[0]);
>  
> - pic_irqchip(kvm)->output = level;
> - if (vcpu)
> + s->output = level;
> + if (vcpu && level && (s->pics[0].isr_ack & (1 << irq))) {
> + s->pics[0].isr_ack &= ~(1 << irq);
>   kvm_vcpu_kick(vcpu);
> + }
>  }
>  
>  struct kvm_pic *kvm_create_pic(struct kvm *kvm)
> Index: kvm/arch/x86/kvm/irq.h
> ===
> --- kvm.orig/arch/x86/kvm/irq.h
> +++ kvm/arch/x86/kvm/irq.h
> @@ -42,6 +42,7 @@ struct kvm_kpic_state {
>   u8 irr; /* interrupt request register */
>   u8 imr; /* interrupt mask register */
>   u8 isr; /* interrupt service register */
> + u8 isr_ack; /* interrupt ack detection */
>   u8 priority_add;/* highest irq priority */
>   u8 irq_base;
>   u8 read_reg_select;


Re: RFC: VMX: initialize TSC offset relative to vm creation time

2008-10-13 Thread David S. Ahern
Marcelo:

Do you have a similar patch/idea for AMD?

Same program as before. Sets affinity to run on vcpu 0, calls
gettimeofday(). Repeat for vcpu 1. ... Repeat for vcpu max. sleep(1).
Repeat the sequence.

So in the following example output the process calls sleep with affinity
set to vcpu3, and on wake sets it to vcpu0 and then calls gettimeofday.
The result is a backward jump in time going from vcpu3 to vcpu0 and then
a forward jump from vcpu0 to vcpu1:

cpu 0: 1223902798.704804 *
cpu 1: 1223902799.824095
cpu 2: 1223902799.824139
cpu 3: 1223902799.824198

(sleep 1)

cpu 0: 1223902799.714804 *
cpu 1: 1223902800.834148
cpu 2: 1223902800.834190
cpu 3: 1223902800.834231

(sleep 1)

cpu 0: 1223902800.724863 *
cpu 1: 1223902801.844156
cpu 2: 1223902801.844234
cpu 3: 1223902801.844278

...

david

Marcelo Tosatti wrote:
> VMX initializes the TSC offset for each vcpu at different times, and
> also reinitializes it for vcpus other than 0 on APIC SIPI message.
> 
> This bug causes the TSC's to appear unsynchronized in the guest, even if
> the host is good.
> 
> Older Linux kernels don't handle the situation very well, so
> gettimeofday is likely to go backwards in time:
> 
> http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
> http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831
> 
> Fix it by initializing the offset of each vcpu relative to vm creation
> time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
> APIC MP init path.
> 
> 
> Signed-off-by: Marcelo Tosatti <[EMAIL PROTECTED]>
> 
> 
> Index: kvm.tip/arch/x86/kvm/vmx.c
> ===
> --- kvm.tip.orig/arch/x86/kvm/vmx.c
> +++ kvm.tip/arch/x86/kvm/vmx.c
> @@ -850,11 +850,8 @@ static u64 guest_read_tsc(void)
>   * writes 'guest_tsc' into guest's timestamp counter "register"
>   * guest_tsc = host_tsc + tsc_offset ==> tsc_offset = guest_tsc - host_tsc
>   */
> -static void guest_write_tsc(u64 guest_tsc)
> +static void guest_write_tsc(u64 guest_tsc, u64 host_tsc)
>  {
> - u64 host_tsc;
> -
> - rdtscll(host_tsc);
>   vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>  }
>  
> @@ -918,6 +915,7 @@ static int vmx_set_msr(struct kvm_vcpu *
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   struct kvm_msr_entry *msr;
> + u64 host_tsc;
>   int ret = 0;
>  
>   switch (msr_index) {
> @@ -943,7 +941,8 @@ static int vmx_set_msr(struct kvm_vcpu *
>   vmcs_writel(GUEST_SYSENTER_ESP, data);
>   break;
>   case MSR_IA32_TIME_STAMP_COUNTER:
> - guest_write_tsc(data);
> + rdtscll(host_tsc);
> + guest_write_tsc(data, host_tsc);
>   break;
>   case MSR_P6_PERFCTR0:
>   case MSR_P6_PERFCTR1:
> @@ -2202,6 +2201,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
>   vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
>   vmcs_writel(CR4_GUEST_HOST_MASK, KVM_GUEST_CR4_MASK);
>  
> + guest_write_tsc(0, vmx->vcpu.kvm->arch.vm_init_tsc);
>  
>   return 0;
>  }
> @@ -2292,8 +2292,6 @@ static int vmx_vcpu_reset(struct kvm_vcp
>   vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
>   vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
>  
> - guest_write_tsc(0);
> -
>   /* Special registers */
>   vmcs_write64(GUEST_IA32_DEBUGCTL, 0);
>  
> Index: kvm.tip/arch/x86/kvm/x86.c
> ===
> --- kvm.tip.orig/arch/x86/kvm/x86.c
> +++ kvm.tip/arch/x86/kvm/x86.c
> @@ -4250,6 +4250,8 @@ struct  kvm *kvm_arch_create_vm(void)
>   INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>   INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
>  
> + rdtscll(kvm->arch.vm_init_tsc);
> +
>   return kvm;
>  }
>  
> Index: kvm.tip/include/asm-x86/kvm_host.h
> ===
> --- kvm.tip.orig/include/asm-x86/kvm_host.h
> +++ kvm.tip/include/asm-x86/kvm_host.h
> @@ -377,6 +377,7 @@ struct kvm_arch{
>  
>   struct page *ept_identity_pagetable;
>   bool ept_identity_pagetable_done;
> + u64 vm_init_tsc;
>  };
>  
>  struct kvm_vm_stat {


[PATCH] audio streaming from usb devices

2010-02-03 Thread David S. Ahern

I have streaming audio devices working within qemu-kvm. This is a port
of the changes to qemu.

Streaming audio generates a series of isochronous requests that are
repetitive and time sensitive. The URBs need to be submitted in
consecutive USB frames and responses need to be handled in a timely manner.

Summary of the changes for isochronous requests:

1. The initial 'valid' value is increased to 32. It needs to be higher
than its current value of 10 since the host adds a 10 frame delay to the
scheduling of the first request; if valid is set to 10 the first
isochronous request times out and qemu cancels it. 32 was chosen as a
nice round number, and it is used in the path where a TD-async pairing
already exists.

2. The token field in the TD is *not* unique for isochronous requests,
so it is not a good choice for finding a matching async request. The
buffer (where to write the guest data) is unique, so use that value instead.

3. TD's for isochronous request need to be completed in the async
completion handler so that data is pushed to the guest as soon as it is
available. The uhci code currently attempts to process complete
isochronous TDs the next time the UHCI frame with the request is
processed. The results in lost data since the async requests will have
long since timed out based on the valid parameter. Increasing the valid
value is not acceptable as it introduces a 1+ second delay in the data
getting pushed to the guest.

4. The frame timer needs to run at 1 msec intervals. Currently, the
expire time for processing the next frame is computed after the
processing of each frame. This regularly causes the scheduling of frames
to shift in time. When this happens the periodic scheduling of the
requests is broken and the subsequent request is seen as a new request
by the host resulting in a 10 msec delay (first isochronous request is
scheduled for 10 frames from when the URB is submitted).
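
The rearm logic point 4 implies looks roughly like this (a sketch for
clarity, not the exact hunk; expire_time is the field added to
UHCIState in the patch below):

/* Advance the deadline from the previous deadline, not from the
 * current time, so frame scheduling cannot drift with processing
 * latency.  (Illustrative reconstruction, not the committed code.) */
static void uhci_rearm_frame_timer(UHCIState *s)
{
    s->expire_time += get_ticks_per_sec() / FRAME_TIMER_FREQ;
    qemu_mod_timer(s->frame_timer, s->expire_time);
}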


[ For what it's worth, a small change is needed to the guest driver to have
more outstanding URBs (at least 4 URBs with 5 packets per URB).]

Signed-off-by: David Ahern 


diff --git a/hw/usb-uhci.c b/hw/usb-uhci.c
index fdbb4d1..19b4ce6 100644
--- a/hw/usb-uhci.c
+++ b/hw/usb-uhci.c
@@ -112,6 +112,7 @@ typedef struct UHCIAsync {
 uint32_t  td;
 uint32_t  token;
 int8_tvalid;
+uint8_t   isoc;
 uint8_t   done;
 uint8_t   buffer[2048];
 } UHCIAsync;
@@ -131,6 +132,7 @@ typedef struct UHCIState {
 uint32_t fl_base_addr; /* frame list base address */
 uint8_t sof_timing;
 uint8_t status2; /* bit 0 and 1 are used to generate UHCI_STS_USBINT */
+int64_t expire_time;
 QEMUTimer *frame_timer;
 UHCIPort ports[NB_PORTS];
 
@@ -164,6 +166,7 @@ static UHCIAsync *uhci_async_alloc(UHCIState *s)
 async->td= 0;
 async->token = 0;
 async->done  = 0;
+async->isoc  = 0;
 async->next  = NULL;
 
 return async;
@@ -762,13 +765,25 @@ static int uhci_handle_td(UHCIState *s, uint32_t addr, UHCI_TD *td, uint32_t *int_mask)
 {
 UHCIAsync *async;
 int len = 0, max_len;
-uint8_t pid;
+uint8_t pid, isoc;
+uint32_t token;
 
 /* Is active ? */
 if (!(td->ctrl & TD_CTRL_ACTIVE))
 return 1;
 
-async = uhci_async_find_td(s, addr, td->token);
+/* token field is not unique for isochronous requests,
+ * so use the destination buffer 
+ */
+if (td->ctrl & TD_CTRL_IOS) {
+token = td->buffer;
+isoc = 1;
+} else {
+token = td->token;
+isoc = 0;
+}
+
+async = uhci_async_find_td(s, addr, token);
 if (async) {
 /* Already submitted */
 async->valid = 32;
@@ -785,9 +800,13 @@ static int uhci_handle_td(UHCIState *s, uint32_t addr, UHCI_TD *td, uint32_t *int_mask)
 if (!async)
 return 1;
 
-async->valid = 10;
+/* valid needs to be large enough to handle 10 frame delay
+ * for initial isochronous requests
+ */
+async->valid = 32;
 async->td= addr;
-async->token = td->token;
+async->token = token;
+async->isoc  = isoc;
 
 max_len = ((td->token >> 21) + 1) & 0x7ff;
 pid = td->token & 0xff;
@@ -841,9 +860,31 @@ static void uhci_async_complete(USBPacket *packet, void *opaque)
 
dprintf("uhci: async complete. td 0x%x token 0x%x\n", async->td, async->token);
 
-async->done = 1;
+if (async->isoc) {
+UHCI_TD td;
+uint32_t link = async->td;
+uint32_t int_mask = 0, val;
+int len;
+ 
+cpu_physical_memory_read(link & ~0xf, (uint8_t *) &td, sizeof(td));
+le32_to_cpus(&td.link);
+le32_to_cpus(&td.ctrl);
+le32_to_cpus(&td.token);
+le32_to_cpus(&td.buffer);
+
+uhci_async_unlink(s, async);
+len = uhci_complete_td(s, &td, async, &int_mask);
+s->pending_int_mask |= int_mask;
 
-uhci_process_frame(s);
+/* update the status bits of the TD */
+val = cpu_to_le32(td.ctrl);
+cpu_physical_memory_write((link & 

[PATCH] segfault due to buffer overrun in usb-serial

2010-02-03 Thread David S. Ahern
This fixes a segfault due to buffer overrun in the usb-serial device.
The memcpy was incrementing the start location by recv_used, yet the
computation of first_size (how much to write at the end of the buffer
before wrapping to the front) was not accounting for it. This causes the
next element after the receive buffer (recv_ptr) to get overwritten with
random data.

Signed-off-by: David Ahern 

diff --git a/hw/usb-serial.c b/hw/usb-serial.c
index 37293ea..c3f3401 100644
--- a/hw/usb-serial.c
+++ b/hw/usb-serial.c
@@ -497,12 +497,28 @@ static int usb_serial_can_read(void *opaque)
 static void usb_serial_read(void *opaque, const uint8_t *buf, int size)
 {
 USBSerialState *s = opaque;
-int first_size = RECV_BUF - s->recv_ptr;
-if (first_size > size)
-first_size = size;
-memcpy(s->recv_buf + s->recv_ptr + s->recv_used, buf, first_size);
-if (size > first_size)
-memcpy(s->recv_buf, buf + first_size, size - first_size);
+int first_size, start;
+
+/* room in the buffer? */
+if (size > (RECV_BUF - s->recv_used))
+size = RECV_BUF - s->recv_used;
+
+start = s->recv_ptr + s->recv_used;
+if (start < RECV_BUF) {
+/* copy data to end of buffer */
+first_size = RECV_BUF - start;
+if (first_size > size)
+first_size = size;
+
+memcpy(s->recv_buf + start, buf, first_size);
+
+/* wrap around to front if needed */
+if (size > first_size)
+memcpy(s->recv_buf, buf + first_size, size - first_size);
+} else {
+start -= RECV_BUF;
+memcpy(s->recv_buf + start, buf, size);
+}
 s->recv_used += size;
 }
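
As a sanity check of the arithmetic, here is the same write logic
extracted into a tiny standalone harness (RECV_BUF shrunk to 8 so the
wraparound is visible; hypothetical test code, not part of the patch):

#include <stdio.h>
#include <string.h>

#define RECV_BUF 8

static char recv_buf[RECV_BUF];
static int recv_ptr, recv_used;

/* same logic as the patched usb_serial_read() above */
static void buf_write(const char *buf, int size)
{
    int first_size, start;

    /* room in the buffer? */
    if (size > RECV_BUF - recv_used)
        size = RECV_BUF - recv_used;

    start = recv_ptr + recv_used;
    if (start < RECV_BUF) {
        /* copy to end of buffer, then wrap to the front if needed */
        first_size = RECV_BUF - start;
        if (first_size > size)
            first_size = size;
        memcpy(recv_buf + start, buf, first_size);
        if (size > first_size)
            memcpy(recv_buf, buf + first_size, size - first_size);
    } else {
        memcpy(recv_buf + start - RECV_BUF, buf, size);
    }
    recv_used += size;
}

int main(void)
{
    recv_ptr = 6;          /* reader is near the end of the buffer */
    buf_write("abcd", 4);  /* 'ab' lands at [6..7], 'cd' wraps to [0..1] */
    printf("tail=%.2s head=%.2s used=%d\n",
           recv_buf + 6, recv_buf, recv_used);
    return 0;
}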




[PATCH] add close callback for tty-based char device

2010-02-03 Thread David S. Ahern
Add a tty close callback. Right now if a guest device that is connected
to a tty-based chardev in the host is removed, the tty is not closed.
With this patch it is closed.

Example use case is connecting an emulated USB serial cable in the guest
to tty0 of the host using the monitor command:

usb_add serial::/dev/tty0

and then removing the device with:

usb_del serial::/dev/tty0

Signed-off-by: David Ahern 

diff --git a/qemu-char.c b/qemu-char.c
index 800ee6c..ecd84ec 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -1173,6 +1173,20 @@ static int tty_serial_ioctl(CharDriverState *chr, int cmd, void *arg)
 return 0;
 }

+static void qemu_chr_close_tty(CharDriverState *chr)
+{
+FDCharDriver *s = chr->opaque;
+int fd = -1;
+
+if (s)
+fd = s->fd_in;
+
+fd_chr_close(chr);
+
+if (fd >= 0)
+close(fd);
+}
+
 static CharDriverState *qemu_chr_open_tty(QemuOpts *opts)
 {
 const char *filename = qemu_opt_get(opts, "path");
@@ -1187,6 +1201,7 @@ static CharDriverState *qemu_chr_open_tty(QemuOpts *opts)
 return NULL;
 }
 chr->chr_ioctl = tty_serial_ioctl;
+chr->chr_close = qemu_chr_close_tty;
 return chr;
 }
 #else  /* ! __linux__ && ! __sun__ */



Re: [PATCH] segfault due to buffer overrun in usb-serial

2010-02-09 Thread David S. Ahern
I have not seen a response to this. If there are no objections please apply.

Thanks,

David Ahern


On 02/03/2010 09:00 AM, David S. Ahern wrote:
> This fixes a segfault due to buffer overrun in the usb-serial device.
> The memcpy was incrementing the start location by recv_used, yet the
> computation of first_size (how much to write at the end of the buffer
> before wrapping to the front) was not accounting for it. This causes the
> next element after the receive buffer (recv_ptr) to get overwritten with
> random data.
> 
> Signed-off-by: David Ahern 
> 
> diff --git a/hw/usb-serial.c b/hw/usb-serial.c
> index 37293ea..c3f3401 100644
> --- a/hw/usb-serial.c
> +++ b/hw/usb-serial.c
> @@ -497,12 +497,28 @@ static int usb_serial_can_read(void *opaque)
>  static void usb_serial_read(void *opaque, const uint8_t *buf, int size)
>  {
>  USBSerialState *s = opaque;
> -int first_size = RECV_BUF - s->recv_ptr;
> -if (first_size > size)
> -first_size = size;
> -memcpy(s->recv_buf + s->recv_ptr + s->recv_used, buf, first_size);
> -if (size > first_size)
> -memcpy(s->recv_buf, buf + first_size, size - first_size);
> +int first_size, start;
> +
> +/* room in the buffer? */
> +if (size > (RECV_BUF - s->recv_used))
> +size = RECV_BUF - s->recv_used;
> +
> +start = s->recv_ptr + s->recv_used;
> +if (start < RECV_BUF) {
> +/* copy data to end of buffer */
> +first_size = RECV_BUF - start;
> +if (first_size > size)
> +first_size = size;
> +
> +memcpy(s->recv_buf + start, buf, first_size);
> +
> +/* wrap around to front if needed */
> +if (size > first_size)
> +memcpy(s->recv_buf, buf + first_size, size - first_size);
> +} else {
> +start -= RECV_BUF;
> +memcpy(s->recv_buf + start, buf, size);
> +}
>  s->recv_used += size;
>  }
> 
> 


Re: [Qemu-devel] [PATCH] audio streaming from usb devices

2010-02-09 Thread David S. Ahern
I have not seen a response to this. If there are no objections please apply.

Thanks,

David Ahern


On 02/03/2010 08:49 AM, David S. Ahern wrote:
> 
> I have streaming audio devices working within qemu-kvm. This is a port
> of the changes to qemu.
> 
> Streaming audio generates a series of isochronous requests that are
> repetitive and time sensitive. The URBs need to be submitted in
> consecutive USB frames and responses need to be handled in a timely manner.
> 
> Summary of the changes for isochronous requests:
> 
> 1. The initial 'valid' value is increased to 32. It needs to be higher
> than its current value of 10 since the host adds a 10 frame delay to the
> scheduling of the first request; if valid is set to 10 the first
> isochronous request times out and qemu cancels it. 32 was chosen as a
> nice round number, and it is used in the path where a TD-async pairing
> already exists.
> 
> 2. The token field in the TD is *not* unique for isochronous requests,
> so it is not a good choice for finding a matching async request. The
> buffer (where to write the guest data) is unique, so use that value instead.
> 
> 3. TDs for isochronous requests need to be completed in the async
> completion handler so that data is pushed to the guest as soon as it is
> available. The uhci code currently attempts to process complete
> isochronous TDs the next time the UHCI frame with the request is
> processed. This results in lost data since the async requests will have
> long since timed out based on the valid parameter. Increasing the valid
> value is not acceptable as it introduces a 1+ second delay in the data
> getting pushed to the guest.
> 
> 4. The frame timer needs to be run on 1 msec intervals. Currently, the
> expire time for processing the next frame is computed after the
> processing of each frame. This regularly causes the scheduling of frames
> to shift in time. When this happens the periodic scheduling of the
> requests is broken and the subsequent request is seen as a new request
> by the host, resulting in a 10 msec delay (the first isochronous request
> is scheduled for 10 frames from when the URB is submitted); see the
> sketch after this message.
> 
> 
> [ For what's worth a small change is needed to the guest driver to have
> more outstanding URBs (at least 4 URBs with 5 packets per URB).]
> 
> Signed-off-by: David Ahern 
> 
> 
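Point 4 above is the standard fixed-grid pattern for periodic timers. A
minimal sketch of the difference using qemu's timer API (names such as
frame_interval are illustrative, not the exact uhci source):

    /* drifting: the next deadline is measured from "now", so any time
     * spent processing the current frame delays all later frames */
    expire_time = qemu_get_clock(vm_clock) + frame_interval;
    qemu_mod_timer(s->frame_timer, expire_time);

    /* fixed grid: the next deadline is derived from the previous one,
     * so frames stay on 1 msec boundaries even if processing is slow */
    expire_time += frame_interval;
    qemu_mod_timer(s->frame_timer, expire_time);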


Re: [PATCH] add close callback for tty-based char device

2010-02-09 Thread David S. Ahern
I have not seen a response to this. If there are no objections please apply.

Thanks,

David Ahern


On 02/03/2010 09:18 AM, David S. Ahern wrote:
> Add a tty close callback. Right now if a guest device that is connected
> to a tty-based chardev in the host is removed, the tty is not closed.
> With this patch it is closed.
> 
> Example use case is connecting an emulated USB serial cable in the guest
> to tty0 of the host using the monitor command:
> 
> usb_add serial::/dev/tty0
> 
> and then removing the device with:
> 
> usb_del serial::/dev/tty0
> 
> Signed-off-by: David Ahern 
> 
> diff --git a/qemu-char.c b/qemu-char.c
> index 800ee6c..ecd84ec 100644
> --- a/qemu-char.c
> +++ b/qemu-char.c
> @@ -1173,6 +1173,20 @@ static int tty_serial_ioctl(CharDriverState *chr, int cmd
>      return 0;
>  }
>  
> +static void qemu_chr_close_tty(CharDriverState *chr)
> +{
> +    FDCharDriver *s = chr->opaque;
> +    int fd = -1;
> +
> +    if (s)
> +        fd = s->fd_in;
> +
> +    fd_chr_close(chr);
> +
> +    if (fd >= 0)
> +        close(fd);
> +}
> +
>  static CharDriverState *qemu_chr_open_tty(QemuOpts *opts)
>  {
>      const char *filename = qemu_opt_get(opts, "path");
> @@ -1187,6 +1201,7 @@ static CharDriverState *qemu_chr_open_tty(QemuOpts *opts)
>          return NULL;
>      }
>      chr->chr_ioctl = tty_serial_ioctl;
> +    chr->chr_close = qemu_chr_close_tty;
>      return chr;
>  }
>  #else  /* ! __linux__ && ! __sun__ */
> 


segfault at start with latest qemu-kvm.git

2010-03-03 Thread David S. Ahern

With latest qemu-kvm.git I am getting a segfault at start:

/tmp/qemu-kvm-test/bin/qemu-system-x86_64 -m 1024 -smp 2 \
  -drive file=/images/f12-x86_64.img,if=virtio,cache=none,boot=on

kvm_create_vcpu: Invalid argument
Segmentation fault (core dumped)


git bisect points to:

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[52b03dd70261934688cb00768c4b1e404716a337] qemu-kvm: Move
kvm_set_boot_cpu_id


$ git show
commit 7811d4e8ec057d25db68f900be1f09a142faca49
Author: Marcelo Tosatti 
Date:   Mon Mar 1 21:36:31 2010 -0300


If I manually back out the patch it will boot fine.

-- 
David


Re: segfault at start with latest qemu-kvm.git

2010-03-03 Thread David S. Ahern




On 03/03/2010 04:08 PM, Jan Kiszka wrote:
> David S. Ahern wrote:
>> With latest qemu-kvm.git I am getting a segfault at start:
>>
>> /tmp/qemu-kvm-test/bin/qemu-system-x86_64 -m 1024 -smp 2 \
>>   -drive file=/images/f12-x86_64.img,if=virtio,cache=none,boot=on
>>
>> kvm_create_vcpu: Invalid argument
>> Segmentation fault (core dumped)
>>
>>
>> git bisect points to:
>>
>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>> [52b03dd70261934688cb00768c4b1e404716a337] qemu-kvm: Move
>> kvm_set_boot_cpu_id
>>
>>
>> $ git show
>> commit 7811d4e8ec057d25db68f900be1f09a142faca49
>> Author: Marcelo Tosatti 
>> Date:   Mon Mar 1 21:36:31 2010 -0300
>>
>>
>> If I manually back out the patch it will boot fine.
>>
> 
> Problem persists after removing the build directory and doing a fresh
> configure && make? I'm asking before taking the bug (which would be
> mine, likely) as I recently spent some hours "debugging" a volatile
> build system issue.
> 
> Jan
> 

Before sending the email I pulled a fresh clone in a completely
different directory (/tmp) to determine if it was something I
introduced. I then went back to my usual location, unapplied the patch
and it worked fine.

David


Re: segfault at start with latest qemu-kvm.git

2010-03-03 Thread David S. Ahern

On 03/03/2010 04:20 PM, Jan Kiszka wrote:
> David S. Ahern wrote:
>>
>>
>>
>> On 03/03/2010 04:08 PM, Jan Kiszka wrote:
>>> David S. Ahern wrote:
>>>> With latest qemu-kvm.git I am getting a segfault at start:
>>>>
>>>> /tmp/qemu-kvm-test/bin/qemu-system-x86_64 -m 1024 -smp 2 \
>>>>   -drive file=/images/f12-x86_64.img,if=virtio,cache=none,boot=on
>>>>
>>>> kvm_create_vcpu: Invalid argument
>>>> Segmentation fault (core dumped)
>>>>
>>>>
>>>> git bisect points to:
>>>>
>>>> Bisecting: 0 revisions left to test after this (roughly 0 steps)
>>>> [52b03dd70261934688cb00768c4b1e404716a337] qemu-kvm: Move
>>>> kvm_set_boot_cpu_id
>>>>
>>>>
>>>> $ git show
>>>> commit 7811d4e8ec057d25db68f900be1f09a142faca49
>>>> Author: Marcelo Tosatti 
>>>> Date:   Mon Mar 1 21:36:31 2010 -0300
>>>>
>>>>
>>>> If I manually back out the patch it will boot fine.
>>>>
>>> Problem persists after removing the build directory and doing a fresh
>>> configure && make? I'm asking before taking the bug (which would be
>>> mine, likely) as I recently spent some hours "debugging" a volatile
>>> build system issue.
>>>
>>> Jan
>>>
>>
>> Before sending the email I pulled a fresh clone in a completely
>> different directory (/tmp) to determine if it was something I
>> introduced. I then went back to my usual location, unapplied the patch
>> and it worked fine.
> 
> OK, that reason can be excluded. What's your host kernel kvm version?
> 
> (Of course, the issue does not show up here. But virtio currently does
> not boot for me - independent of my patch.)
> 
> Jan
> 

Fedora Core 12,

Linux daahern-lx 2.6.31.12-174.2.22.fc12.x86_64 #1 SMP Fri Feb 19
18:55:03 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

David


Re: [PATCH] qemu-kvm: Fix boot CPU setup for the case it is unsupported

2010-03-04 Thread David S. Ahern




On 03/04/2010 02:00 AM, Jan Kiszka wrote:
> Commit 52b03dd702 incorrectly failed KVM initialization in case the
> kernel did not support KVM_CAP_SET_BOOT_CPU_ID. Fix this, and also
> improve error propagation of kvm_create_context at this chance.
> 
> Signed-off-by: Jan Kiszka 
> ---
> 
> OK, it really was me. :)
> 
>  qemu-kvm-x86.c |    9 +++++++--
>  qemu-kvm.c     |    4 +++-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
> index 7a5925a..7d42fdc 100644
> --- a/qemu-kvm-x86.c
> +++ b/qemu-kvm-x86.c
> @@ -672,7 +672,7 @@ static const VMStateDescription vmstate_kvmclock= {
>  
>  int kvm_arch_qemu_create_context(void)
>  {
> -    int i;
> +    int i, r;
>      struct utsname utsname;
>  
>      uname(&utsname);
> @@ -696,7 +696,12 @@ int kvm_arch_qemu_create_context(void)
>      vmstate_register(0, &vmstate_kvmclock, &kvmclock_data);
>  #endif
>  
> -    return kvm_set_boot_cpu_id(0);
> +    r = kvm_set_boot_cpu_id(0);
> +    if (r < 0 && r != -ENOSYS) {
> +        return r;
> +    }
> +
> +    return 0;
>  }
>  
>  static void set_msr_entry(struct kvm_msr_entry *entry, uint32_t index,
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index 222ca97..e417f21 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -2091,8 +2091,10 @@ static int kvm_create_context(void)
>          return -1;
>      }
>      r = kvm_arch_qemu_create_context();
> -    if (r < 0)
> +    if (r < 0) {
>          kvm_finalize(kvm_state);
> +        return -1;
> +    }
>      if (kvm_pit && !kvm_pit_reinject) {
>          if (kvm_reinject_control(kvm_context, 0)) {
>              fprintf(stderr, "failure to disable in-kernel PIT reinjection\n");
> 

Works for me: FC12 host, FC12 guest.

David


Re: Streaming Audio from Virtual Machine

2010-03-21 Thread David S. Ahern


On 03/21/2010 01:12 PM, Gus Zernial wrote:
> I'm using Kubuntu 9.10 32-bit on a quad-core Phenom II with 
> Gigabit ethernet. I want to stream audio from MLB.com from a 
> WinXP client thru a Linksys WMB54G wireless music bridge. Note 
> that there are drivers for the WMB54G only for WinXP and Vista.
> 
> If I stream the audio thru a native WinXP box thru the WMB54G,
> all is well and the audio sounds fine. When I try to stream thru a 
> WinXP virtual machine on Kubuntu 9.10, the audio is poor quality
> and subject to gaps and dropping the stream altogether. So far
> I've tried KVM/QEMU and VirtualBox, same result.
> 
> Regards KVM/QEMU, I note AMD-V is activated in the BIOS, and I have a 
> custom 2.6.32.7 kernel, and QEMU 0.11.0. The kvm and kvm_amd modules are
> compiled in and loaded. I've been using bridged networking. I think it's set up
> correctly but I confess I'm no networking expert. My start command for the 
> WinXP virtual machine is:
> 
> sudo /usr/bin/qemu -m 1024 -boot c 
> -net nic,vlan=0,macaddr=00:d0:13:b0:2d:32,model=rtl8139 -net 
> tap,vlan=0,ifname=tap0,script=/etc/qemu-ifup -localtime -soundhw ac97 -smp 4 
> -fda /dev/fd0 -vga std -usb /home/rbroman/windows.img
> 
> I also tried model=virtio but that didn't help. 
> 
> I suspect this is a virtual machine networking problem but I'm
> not sure. So my questions are:
> 
> -What's the best/fastest networking option and how do I set it up?
> Pointers to step-by-step instructions appreciated.
> 
> -Is it possible I have a problem other than networking? Configuration
> problem with KVM/QEMU? Or could there be a problem with the WMB54G driver 
> when used thru a virtual machine?
> 
> -Is there a better virtual machine solution than KVM/QEMU for what 
> I'm trying to do?

[dsa] I have been able to stream audio and video in a KVM-hosted winxp
VM, and I have even watched a netflix-based movie. My laptop has a Core-2 duo
cpu, T9550, with 4 GB of RAM. Networking at home is through a wireless-N
router, and I use bridged networking and NAT for VMs.

Host activity definitely has an impact. When streaming I make sure I am
not doing any heavy activity in the host layer, and if I notice jitter
the first thing I do is up the priority of the VM threads using chrt.

David

> 
> Recommendations appreciated - Gus
> 
> 
> 
> 
> 
>   


Re: Networkconfiguration with KVM

2010-04-05 Thread David S. Ahern


On 04/05/2010 12:04 PM, Dan Johansson wrote:
> Must I specify an IP for the br-eth3 interface?

You do not have to specify an IP address for the bridge.

In my case:

mainbr0   Link encap:Ethernet  HWaddr 
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:7328933 errors:0 dropped:0 overruns:0 frame:0
  TX packets:6992 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:481076877 (458.7 MiB)  TX bytes:629184 (614.4 KiB)

tap0  Link encap:Ethernet  HWaddr 
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:139390 errors:0 dropped:0 overruns:0 frame:0
  TX packets:7460821 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:500
  RX bytes:13554808 (12.9 MiB)  TX bytes:601113602 (573.2 MiB)


# brctl show
bridge name bridge id   STP enabled interfaces
mainbr0 8000.  no  tap0
eth0

eth0 is the interface connected to the physical LAN. mainbr0 ties the
VM's tap to eth0.
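
For reference, a bridge like this can be assembled by hand roughly as
follows (illustrative commands; device names match the output above,
and tunctl comes from the uml-utilities package):

    brctl addbr mainbr0
    brctl addif mainbr0 eth0
    tunctl -t tap0
    brctl addif mainbr0 tap0
    ifconfig tap0 up
    ifconfig mainbr0 up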

David

> 
> Regards,


latest git - main thread spinning

2010-04-11 Thread David S. Ahern
With the latest qemu-kvm.git (fresh pull today, 11-April-2010) the main
qemu thread is spinning.

It looks like the recent sync with qemu.git is the culprit --
specifically, d6f4ade214a9f74dca9495b83a24ff9c113e4f9a from Paolo on
March 10 changed the semantics of main_loop_wait from a timeout value to
a nonblocking argument. kvm_main_loop() still invokes it with the
argument of 1000 which means the timeout for select is set to 0.
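
A minimal sketch of the mismatch (the old/new meanings are from the
commit; the snippet is illustrative, not the exact source):

    /* old: main_loop_wait(int timeout)     -- block for up to 'timeout' ms */
    /* new: main_loop_wait(int nonblocking) -- any nonzero argument means
     *      do not block, i.e. the select() timeout is forced to 0 */
    main_loop_wait(1000);  /* caller in kvm_main_loop(): used to sleep,
                              now returns immediately and busy-loops */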

David


Re: [Autotest] Autotest: Unattended_install testcase always fail with rhel3.9-32 guest

2010-04-16 Thread David S. Ahern


On 04/14/2010 08:01 AM, Lucas Meneghel Rodrigues wrote:
> On Wed, Apr 14, 2010 at 10:26 AM, Amos Kong  wrote:
>> Hi Lucas,
>>
>> When I execute unattended_install testcases on RHEL-5.5, it always fails
>> when using a rhel3.9-32 guest.
>> I found it blocked after package installation. Is it related to the fact
>> that the rhel39-32 guest doesn't support acpi?
> 
> I've hit this problem before, it is what I believe to be an anaconda
> bug on that particular RHEL version. I tried a *lot* to work around
> the problem, spent a lot of time with it, but in the end I just gave
> up.
> 
> The problem happens because it's simply not possible to bring the
> network up at post install stage so the install can communicate with
> the host to respond that its installation finished. If anyone can help
> to work around the problem that'd be great...

What commands are you running to configure the network and what command
is stalling? I've done unattended installs with RHEL3.8, 32-bit guests
with networking enabled.

David


> 
>>
>> Regards,
>> Amos
>>
> 
> 
> 


Re: [Autotest] Autotest: Unattended_install testcase always fail with rhel3.9-32 guest

2010-04-16 Thread David S. Ahern


On 04/16/2010 09:36 AM, Lucas Meneghel Rodrigues wrote:
> On Fri, 2010-04-16 at 08:03 -0600, David S. Ahern wrote:
>>
>> On 04/14/2010 08:01 AM, Lucas Meneghel Rodrigues wrote:
>>> On Wed, Apr 14, 2010 at 10:26 AM, Amos Kong  wrote:
>>>> Hi Lucas,
>>>>
>>>> When I execute unattended_install testcases on RHEL-5.5, it always fails
>>>> when using a rhel3.9-32 guest.
>>>> I found it blocked after package installation. Is it related to the fact
>>>> that the rhel39-32 guest doesn't support acpi?
>>>
>>> I've hit this problem before, it is what I believe to be an anaconda
>>> bug on that particular RHEL version. I tried a *lot* to work around
>>> the problem, spent a lot of time with it, but in the end I just gave
>>> up.
>>>
>>> The problem happens because it's simply not possible to bring the
>>> network up at post install stage so the install can communicate with
>>> the host to respond that its installation finished. If anyone can help
>>> to work around the problem that'd be great...
>>
>> What commands are you running to configure the network and what command
>> is stalling? I've done unattended installs with RHEL3.8, 32-bit guests
>> with networking enabled.
> 
> To add some background to the discussion, RHEL3.9 64 bit works just
> fine. The kickstart file that installs pretty much all RH based systems
> tries to configure the network by calling 'dhclient eth0'. In order to
> work, some networking kernel modules need to be loaded.
> 
> While debugging the problem, I discovered that it wasn't possible to
> load some of the iptables kernel modules (I don't remember exactly which
> ones). So I tried many strategies, loading the modules specifying paths,
> etc... nothing worked. It seems like those essential networking modules
> in the install kernel for 32 bit are missing due to some build problem.

Ok, so it's firewall related. RHEL3 uses a BOOT kernel and only a subset
of the kernel modules are included in the modules.cgz. It should contain
all of the drivers for the NICs, so networking alone should be fine. Why
are the iptables rules needed to tell the host that the install has
completed?
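
As a quick sanity check, the module list in the BOOT image can be
inspected directly; modules.cgz is just a gzipped cpio archive (the
path varies by media layout -- it lives inside the boot initrd):

    zcat modules.cgz | cpio -t | sort | less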

David


> 
> Sure, once the install finishes the system will boot on a functional
> kernel, but the kernel used by the install system just can't load the
> modules, rendering our unattended install system useless, since the host
> needs to be able to verify whether the guest finished the install through
> socket communication.
> 
>> David
>>
>>
>>>
>>>>
>>>> Regards,
>>>> Amos
>>>>
>>>
>>>
>>>
> 
> 


Re: [Autotest] Autotest: Unattended_install testcase always fail with rhel3.9-32 guest

2010-04-17 Thread David S. Ahern


On 04/17/2010 10:09 PM, Amos Kong wrote:
> %post --interpreter /usr/bin/python
> import socket, os
> os.system('dhclient')
> os.system('chkconfig sshd on')
> os.system('iptables -F')
> os.system('echo 0 > /selinux/enforce')
> server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> server.bind(('', 12323))
> server.listen(1)
> (client, addr) = server.accept()
> client.send("done")
> client.close()

So, effectively after the install completes use dhclient to configure a
network address, start a server on a known port and when a client
connects send the message "done". I would expect that to work just fine.

What part is not working? Have you used anaconda's root shell (alt-f2)
to confirm each step and if so which one is not setup as expected?

David


Re: [Autotest] Autotest: Unattended_install testcase always fail with rhel3.9-32 guest

2010-04-18 Thread David S. Ahern


On 04/18/2010 12:26 PM, Lucas Meneghel Rodrigues wrote:
> On Sat, 2010-04-17 at 22:55 -0600, David S. Ahern wrote:
>>
>> On 04/17/2010 10:09 PM, Amos Kong wrote:
>>> %post --interpreter /usr/bin/python
>>> import socket, os
>>> os.system('dhclient')
>>> os.system('chkconfig sshd on')
>>> os.system('iptables -F')
>>> os.system('echo 0 > /selinux/enforce')
>>> server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>> server.bind(('', 12323))
>>> server.listen(1)
>>> (client, addr) = server.accept()
>>> client.send("done")
>>> client.close()
>>
>> So, effectively after the install completes use dhclient to configure a
>> network address, start a server on a known port and when a client
>> connects send the message "done". I would expect that to work just fine.
> 
> Me too, it has been working for RHEL 4.X, 5.X 32/64 bit and 3.X 64 bit.
> The problem has been effectively 3.9 32 bit.

I fired up a 3.9 guest with your ks.cfg. The problem is due to the
limited functionality in the RHEL3 BOOT kernel for i386. Specifically,
dhclient is failing at:

setsockopt(6, SOL_SOCKET, SO_ATTACH_FILTER, "\v\0\6\10\240Y\n\10", 8) =
-1 ENOPROTOOPT

So dhclient is out. But you can still configure and use
networking via ifconfig if static addressing is an option for you. I was
able to use that command to configure eth0 and push an strace output
file for dhclient.
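
For example (addresses illustrative; substitute your own network):

    ifconfig eth0 192.168.1.50 netmask 255.255.255.0 up
    route add default gw 192.168.1.1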

Also, a couple of comments on this use case:
- SELinux is not applicable
- 32 GB of RAM is way beyond what the RHEL3 i386 kernel can detect and use
- 12 vcpus seems high as well.

David


> 
>> What part is not working? Have you used anaconda's root shell (alt-f2)
>> to confirm each step and if so which one is not setup as expected?
> 
> dhclient. It fails saying "module IP_... could not be loaded.
> 
>> David
> 
> 


Re: RFC: VMX: initialize TSC offset relative to vm creation time

2008-10-28 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Sat, Sep 13, 2008 at 07:55:02AM +0300, Avi Kivity wrote:
>> Marcelo Tosatti wrote:
>>> VMX initializes the TSC offset for each vcpu at different times, and
>>> also reinitializes it for vcpus other than 0 on APIC SIPI message.
>>>
>>> This bug causes the TSC's to appear unsynchronized in the guest, even if
>>> the host is good.
>>>
>>> Older Linux kernels don't handle the situation very well, so
>>> gettimeofday is likely to go backwards in time:
>>>
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg02955.html
>>> http://sourceforge.net/tracker/index.php?func=detail&aid=2025534&group_id=180599&atid=893831
>>>
>>> Fix it by initializating the offset of each vcpu relative to vm creation
>>> time, and moving it from vmx_vcpu_reset to vmx_vcpu_setup, out of the
>>> APIC MP init path.
>>>
>>>
>>>   
>> This is good in principle, but we need to detect if we're on a multiple
>> board host (or a host with unsynced tscs) and do something else in that
>> case.
> 
> I think this is a separate, and difficult, problem. For instance older
> Linux guests that correct the TSC across CPU's are broken at the moment
> in the unsynced TSC case.
> 
> That is, the fact that KVM does not handle unsynced TSC's on the host is
> not an argument against this patch which clearly fixes a bug.
> 
> Take commit 019960ae9933161c2809fa4ee608ba30d9639fd2 for example.
> 

Has anything changed "recently" with the TSC code? Recently here being
the past 2 months since you first crafted the patch. I ask because in
the past few runs based on kvm.git trees (e.g., as recently as a pull on
10/26), this tsc offset patch no longer fixes the problem.

The following one does fix the problem with kvm.git pulled on 10/26/08:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 64e2439..d5da717 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -860,7 +860,7 @@ static void guest_write_tsc(u64 guest_tsc)
         u64 host_tsc;
 
         rdtscll(host_tsc);
-        vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
+        vmcs_write64(TSC_OFFSET, 0);
 }

 /*

This is the vmx counterpart (or at least to my understanding) to a
suggestion Ben had for the svm code.

david


Re: RFC: VMX: initialize TSC offset relative to vm creation time

2008-10-30 Thread David S. Ahern


Marcelo Tosatti wrote:
> On Tue, Oct 28, 2008 at 12:36:14PM -0600, David S. Ahern wrote:
>>> That is, the fact that KVM does not handle unsynced TSC's on the host is
>>> not an argument against this patch which clearly fixes a bug.
>>>
>>> Take commit 019960ae9933161c2809fa4ee608ba30d9639fd2 for example.
>>>
>> Has anything changed "recently" with the TSC code? Recently here being
>> the past 2 months since you first crafted the patch. I ask because in
>> the past few runs based on kvm.git trees (e.g., as recently as a pull on
>> 10/26), this tsc offset patch no longer fixes the problem.
> 
> Hi David,
> 
> Can you share showtime output? Works for me.
> 

Hi Marcelo:

I pulled kvm.git this morning and ran three cases:
1. kvm.git with no patches,
2. kvm.git with your TSC offset patch from September 10th,
3. kvm.git with TSC offset set to 0.

In all cases the host is a DL380G5, Fedora 9 OS, kvm-77 userspace. Guest
is running RHEL3U8. 3 samples for each case:


1. kvm.git, no patches:

cpu 0: 1225374376.351910 *
cpu 1: 1225374376.598833
cpu 2: 1225374378.154530
cpu 3: 1225374377.874563 *

sleeping 1 with affinity set to 0x8

cpu 0: 1225374377.361762 *
cpu 1: 1225374377.608669
cpu 2: 1225374379.164366
cpu 3: 1225374378.884393 *

sleeping 1 with affinity set to 0x8

cpu 0: 1225374378.371607 *
cpu 1: 1225374378.618517
cpu 2: 1225374380.174213
cpu 3: 1225374379.894246 *



2. kvm.git, Marcelo patch

cpu 0: 1225374671.069711
cpu 1: 1225374671.069711
cpu 2: 1225374671.069804
cpu 3: 1225374671.069761 *

sleeping 1 with affinity set to 0x8

cpu 0: 1225374672.079221
cpu 1: 1225374672.079220 *
cpu 2: 1225374672.079309
cpu 3: 1225374672.079267 *

sleeping 1 with affinity set to 0x8

cpu 0: 1225374673.088703
cpu 1: 1225374673.088701 *
cpu 2: 1225374673.088802
cpu 3: 1225374673.088763 *



3. tsc offset 0

cpu 0: 1225374910.953226
cpu 1: 1225374910.953307
cpu 2: 1225374910.953355
cpu 3: 1225374910.953446

sleeping 1 with affinity set to 0x8

cpu 0: 1225374911.962735
cpu 1: 1225374911.962808
cpu 2: 1225374911.962857
cpu 3: 1225374911.962949

sleeping 1 with affinity set to 0x8

cpu 0: 1225374912.972211
cpu 1: 1225374912.972284
cpu 2: 1225374912.972333
cpu 3: 1225374912.972425


I'll repeat the test later on a PowerEdge 2950 with a similar setup, but
it has the same processor as the DL380G5.

david


>> The following one does fix the problem with kvm.git pulled on 10/26/08:
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 64e2439..d5da717 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -860,7 +860,7 @@ static void guest_write_tsc(u64 guest_tsc)
>>          u64 host_tsc;
>>  
>>          rdtscll(host_tsc);
>> -        vmcs_write64(TSC_OFFSET, guest_tsc - host_tsc);
>> +        vmcs_write64(TSC_OFFSET, 0);
>>  }
>>
>>  /*
>>
>> This is the vmx counterpart (or at least to my understanding) to a
>> suggestion Ben had for the svm code.
>>
>> david


Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer

2008-11-04 Thread David S. Ahern


Mark McLoughlin wrote:

> Note also that when tuning for a specific workload, which CPU
> the I/O thread is pinned to is important.
> 

Hi Mark:

Can you give an example of when that has a noticeable effect?

For example, if the guest handles network interrupts on vcpu0 and it is
pinned to pcpu0 where should the IO thread be pinned for best performance?

thanks,

david


Re: [PATCH 5/6] kvm: qemu: virtio-net: handle all tx in I/O thread without timer

2008-11-06 Thread David S. Ahern


Mark McLoughlin wrote:
> On Tue, 2008-11-04 at 08:23 -0700, David S. Ahern wrote:
>> Mark McLoughlin wrote:
>>
>>> Note also that when tuning for a specific workload, which CPU
>>> the I/O thread is pinned to is important.
>>>
>> Hi Mark:
>>
>> Can you give an example of when that has a noticeable affect?
>>
>> For example, if the guest handles network interrupts on vcpu0 and it is
>> pinned to pcpu0 where should the IO thread be pinned for best performance?
> 
> Basically, the I/O thread is where packets are copied to and from host
> kernel space at the moment.
> 
> If there are other copies of the packets anywhere, you want those to
> copy from a cache.
> 
> With my netperf guest->host benchmark, you actually have four copies
> going on:
> 
>   1) netperf process in guest copying to guest kernel space
> 
>   2) qemu process in the host copying between internal buffers
> 
>   3) qemu process in the host copying to host kernel space
> 
>   4) netserver process in the host copying into its buffers
> 
> My machine has four CPUs, with two 6Mb L2 caches - each cache is shared
> between two of the CPUs, so I set things up as follows:
> 
>   pcpu#3 - netserver, I/O thread, vcpu#0
>   pcpu#4 - vcpu#1, virtio_net irq, netperf
> 
> which (hopefully) ensures that we're only doing one copy using RAM and
> the rest are using the L1/L2 caches.

So, in other words, the guest vcpu that handles the net irq does not
necessarily need to be on the same pcpu as the IO thread, but it should
help to keep them within the same processor cache. Correct?

david

> 
> Cheers,
> Mark.
> 


Re: KVM performance

2008-11-14 Thread David S. Ahern
See if boosting the priority of the VM (see man chrt) and locking it to
a core (see man taskset) helps. You'll want to do that for the vcpu
thread(s) (in the qemu monitor, run the 'info cpus' command).
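
For example (priority and core number are illustrative; use the thread
IDs that 'info cpus' reports):

    chrt -f -p 10 <vcpu-thread-id>    # SCHED_FIFO, priority 10
    taskset -cp 1 <vcpu-thread-id>    # pin to core 1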

david


Randy Broman wrote:
> I am using Intel Core2 Duo E6600, Kubuntu 8.04 with kernel
> 2.6.24-21-generic,
> kvm (as in "QEMU PC emulator version 0.9.1 (kvm-62)") and a WinXP SP3
> guest,
> with bridged networking. My start command is:
> 
> sudo kvm -m 1024 -cdrom /dev/cdrom -boot c -net
> nic,macaddr=00:d0:13:b0:2d:32,
> model=rtl8139 -net tap -soundhw all -localtime /home/rbroman/windows.img
> 
> All this is stable and generally works well, except that internet-based
> video and
> audio performance is poor (choppy, skips) in comparison with performance
> under
> WinXP running native on the same machine (it's a dual-boot). I would
> appreciate
> recommendations to improve video and audio performance, and have the
> following
> specific questions:
> 
> -I've tried both the default Cirrus adapter and the "-std-vga" option.
> Which is better?
> I saw reference to another VMware-based adapter, but I can't figure out
> how to implement
> it - would that be better?
> 
> -I notice we're up to kvm-79 vs my kvm-62. Should I move to the newer
> version? Do I
> have to custom-compile my kernel to do so, and if so what kernel version
> and what
> specific kernel options should I use?
> 
> -Are there other tuning steps I could take?
> 
> Please copy me directly as I'm not on this list. Thankyou
> 
> 
> 
> 
> 


gettimeofday "slow" in RHEL4 guests

2008-11-24 Thread David S. Ahern

I noticed that gettimeofday in RHEL4.6 guests is taking much longer than
with RHEL3.8 guests. I wrote a simple program (see below) to call
gettimeofday in a loop 1,000,000 times and then used time to measure how
long it took.


For the RHEL3.8 guest:
time -p ./timeofday_bench
real 0.99
user 0.12
sys 0.24

For the RHEL4.6 guest with the default clock source (pmtmr):
time -p ./timeofday_bench
real 15.65
user 0.18
sys 15.46

and RHEL4.6 guest with PIT as the clock source (clock=pit kernel parameter):
time -p ./timeofday_bench
real 13.67
user 0.21
sys 13.45

So, basically gettimeofday() takes about 50 times as long on a RHEL4 guest.

Host is a DL380G5, 2 dual-core Xeon 5140 processors, 4 GB of RAM. It's
running kvm.git tree as of 11/18/08 with kvm-75 userspace. Guest in both
RHEL3 and RHEL4 cases has 4 vcpus, 3.5GB of RAM.

david

--

timeofday_bench.c:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    int rc = 0, n;
    struct timeval tv;
    int iter = 1000000;  /* number of times to call gettimeofday */

    if (argc > 1)
        iter = atoi(argv[1]);

    if (iter == 0) {
        fprintf(stderr, "invalid number of iterations\n");
        return 1;
    }

    printf("starting ");
    for (n = 0; n < iter; ++n) {
        if (gettimeofday(&tv, NULL) != 0) {
            fprintf(stderr, "\ngettimeofday failed\n");
            rc = 1;
            break;
        }
    }

    if (!rc)
        printf("done\n");

    return rc;
}


Re: gettimeofday "slow" in RHEL4 guests

2008-11-24 Thread David S. Ahern
Some more data on this overhead.

RHEL3 (which is based on the 2.4.21 kernel) gets microsecond resolution
by reading the TSC. Reading the TSC from within a guest is very fast on kvm.

RHEL4 (which is based on the 2.6.9 kernel) allows multiple time sources:
pmtmr (the ACPI power management timer, which is the default), pit, hpet
and TSC.

The pmtmr and pit both do ioport reads to get microsecond resolution
(see read_pmtmr and get_offset_pit, respectively). With the TSC as the
timer source, gettimeofday is *very* lightweight, but time drifts badly
and ntpd cannot acquire a sync. I believe someone is working on HPET
support for guests, and I know from bare-metal performance that it is a
much lighter-weight time source, but with RHEL4 the HPET breaks the
ability to use the RTC. So I'm running out of options for reliable and
lightweight time sources.
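
To put a finer point on "ioport reads": read_pmtmr() boils down to a
single inl() of the ACPI PM timer port. A sketch of the 2.6.9-era logic
(simplified, not the exact kernel source):

    static inline u32 read_pmtmr(void)
    {
            /* the PM timer is a 24-bit counter */
            return inl(pmtmr_ioport) & 0xffffff;
    }

On bare metal that is one slow port access per call; in a guest every
port access additionally traps to the hypervisor and back, which is
where the extra sys time in the benchmark goes. A TSC read, by
contrast, stays entirely in the guest.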

Any chance the pit or pmtmr options can be optimized a bit?

thanks,

david

PS. yes, I did try the userspace pit and its performance is worse than
the in-kernel PIT.


David S. Ahern wrote:
> I noticed that gettimeofday in RHEL4.6 guests is taking much longer than
> with RHEL3.8 guests. I wrote a simple program (see below) to call
> gettimeofday in a loop 1,000,000 times and then used time to measure how
> long it took.
> 
> 
> For the RHEL3.8 guest:
> time -p ./timeofday_bench
> real 0.99
> user 0.12
> sys 0.24
> 
> For the RHEL4.6 guest with the default clock source (pmtmr):
> time -p ./timeofday_bench
> real 15.65
> user 0.18
> sys 15.46
> 
> and RHEL4.6 guest with PIT as the clock source (clock=pit kernel parameter):
> time -p ./timeofday_bench
> real 13.67
> user 0.21
> sys 13.45
> 
> So, basically gettimeofday() takes about 50 times as long on a RHEL4 guest.
> 
> Host is a DL380G5, 2 dual-core Xeon 5140 processors, 4 GB of RAM. It's
> running kvm.git tree as of 11/18/08 with kvm-75 userspace. Guest in both
> RHEL3 and RHEL4 cases has 4 vcpus, 3.5GB of RAM.
> 
> david
> 
> --
> 
> timeofday_bench.c:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/time.h>
> 
> int main(int argc, char *argv[])
> {
>   int rc = 0, n;
>   struct timeval tv;
>   int iter = 1000000;  /* number of times to call gettimeofday */
> 
>   if (argc > 1)
>   iter = atoi(argv[1]);
> 
>   if (iter == 0) {
>   fprintf(stderr, "invalid number of iterations\n");
>   return 1;
>   }
> 
>   printf("starting ");
>   for (n = 0; n < iter; ++n) {
>   if (gettimeofday(&tv, NULL) != 0) {
>   fprintf(stderr, "\ngettimeofday failed\n");
>   rc = 1;
>   break;
>   }
>   }
> 
>   if (!rc)
>   printf("done\n");
> 
>   return rc;
> }
> 


Re: gettimeofday "slow" in RHEL4 guests

2008-11-25 Thread David S. Ahern


Hollis Blanchard wrote:
> On Mon, 2008-11-24 at 21:41 -0700, David S. Ahern wrote:
>> RHEL3 (which is based on the 2.4.21 kernel) gets microsecond
>> resolution
>> by reading the TSC. Reading the TSC from within a guest is very fast
>> on kvm.
>>
>> RHEL4 (which is based on the 2.6.9 kernel) allows multiple time
>> sources:
>> pmtmr (ACPI power management timer which is the default), pit, hpet
>> and TSC.
>>
>> The pmtmr and pit both do ioport reads to get microsecond resolutions
>> (see read_pmtmr and get_offset_pit, respectively). With the TSC as the
>> timer source, gettimeofday is *very* lightweight, but time drifts badly
>> and ntpd cannot acquire a sync.
> 
> Why aren't you seeing severe time drift when using RHEL3 guests with the
> TSC time source?
> 

With RHEL3 it's a PIT time source, and the PIT counter is only read on
interrupts. For gettimeofday requests only the tsc is read; the
algorithm for microsecond resolution uses the pit count and its tsc
timestamp from the last interrupt.

In RHEL4, the PIT counter is read for each gettimeofday request when it
is the timer source. That's the cause of the extra overhead, and
consequently, worse performance.
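
A self-contained sketch of the 2.4-era scheme (simplified; rdtsc() and
the calibration constant stand in for the real kernel primitives):

    #include <stdint.h>
    #include <sys/time.h>

    static uint64_t tsc_at_last_tick;       /* latched in the PIT interrupt */
    static struct timeval tv_at_last_tick;  /* wall time at that tick */
    static uint64_t cycles_per_usec = 3000; /* assumes a 3 GHz TSC */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* RHEL3-style gettimeofday: one cheap TSC read, no ioport access */
    void fast_gettimeofday(struct timeval *tv)
    {
        uint64_t delta = rdtsc() - tsc_at_last_tick;

        *tv = tv_at_last_tick;
        tv->tv_usec += delta / cycles_per_usec;
        while (tv->tv_usec >= 1000000) {    /* carry into seconds */
            tv->tv_sec++;
            tv->tv_usec -= 1000000;
        }
    }

The RHEL4 pit/pmtmr paths instead read the hardware counter on every
call, which is exactly the expensive operation in a guest.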

david


qemu spinning on serial port writes

2008-12-22 Thread David S. Ahern

I am trying to redirect a guest's boot output through the host's serial
port.  Shortly after launching qemu, the main thread is spinning on:

write(9, "0", 1)   = -1 EAGAIN (Resource temporarily unavailable)

fd 9 is the serial port, ttyS0.


The backtrace for the thread is:

#0  0x2ac3433f8c0b in write () from /lib64/libpthread.so.0
#1  0x00475df9 in send_all (fd=9, buf=,
len1=1) at qemu-char.c:477
#2  0x0043a102 in serial_xmit (opaque=) at
/root/kvm-81/qemu/hw/serial.c:311
#3  0x0043a591 in serial_ioport_write (opaque=0x14971790,
addr=, val=48)
at /root/kvm-81/qemu/hw/serial.c:366
#4  0x410eeedc in ?? ()
#5  0x00129000 in ?? ()
#6  0x14821fa0 in ?? ()
#7  0x0007 in ?? ()
#8  0x004a54c5 in tlb_set_page_exec (env=0x10ab4,
vaddr=46912496956816, paddr=1, prot=-1, mmu_idx=0, is_softmmu=1)
at /root/kvm-81/qemu/exec.c:388
#9  0x00512f3b in tlb_fill (addr=345446292, is_write=1,
mmu_idx=-1, retaddr=0x0)
at /root/kvm-81/qemu/target-i386/op_helper.c:4690
#10 0x004a6bd2 in __ldb_cmmu (addr=9, mmu_idx=0) at
/root/kvm-81/qemu/softmmu_template.h:135
#11 0x004a879b in cpu_x86_exec (env1=) at
/root/kvm-81/qemu/cpu-exec.c:628
#12 0x0040ba29 in main (argc=12, argv=0x7fff67f7a398) at
/root/kvm-81/qemu/vl.c:3816

send_all() invokes unix_write(), which by design does not break out on
EAGAIN.
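
The shape of the loop (simplified from qemu-char.c, with error handling
trimmed):

    static int unix_write(int fd, const uint8_t *buf, int len1)
    {
        int ret, len = len1;

        while (len > 0) {
            ret = write(fd, buf, len);
            if (ret < 0) {
                if (errno != EINTR && errno != EAGAIN)
                    return -1;
                /* EAGAIN: retry immediately.  With a non-blocking tty
                 * whose output buffer stays full, this spins the main
                 * loop at 100% CPU. */
            } else if (ret == 0) {
                break;
            } else {
                buf += ret;
                len -= ret;
            }
        }
        return len1 - len;
    }

Since nothing poll()s for writability before retrying, a slow or
flow-controlled serial line turns the EAGAIN path into a busy loop.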

The following command is enough to show the problem:

qemu-system-x86_64 -m 256 -smp 1 -no-kvm \
-drive file=/dev/cciss/c0d0,if=scsi,cache=off,boot=on \
-vnc :1 -serial /dev/ttyS0


The guest is running RHEL3 with the parameter 'console=ttyS0' added to
grub.conf; the problem appears to be with qemu, so I would expect it to
show with any linux guest. This particular host is running RHEL5.2 with
kvm-81, but I have also seen the problem with Fedora-9 as the host OS.

Yes, the serial port of the server is connected to another system via a
null modem. If I change the serial argument to '-serial udp::4555' and
use 'nc -u -l localhost 4555 > /dev/ttyS0' I see the guest's boot
output show up on the second system as expected. I'd prefer to be able
to use the serial port connection directly without nc as a proxy.
Suggestions?

david


Re: [patch 0/3] synchronized TSC between vcpu's on SMP guests

2008-12-22 Thread David S. Ahern
Hi Marcelo:

I just found time to try out this patch set. I applied it to kvm-81, and
it fixes the time shifts I've observed on DL380-G5s.

david


Marcelo Tosatti wrote:
> Most Intel hosts are supposed to have their TSC's synchronized. This
> patchset attempts to fix the sites which overwrite the TSC making them
> appear unsynchronized to the guest.
> 


Re: [PATCH] Add microcode patch level dummy

2009-01-05 Thread David S. Ahern


Alexander Graf wrote:
> Anthony Liguori wrote:
>> Alexander Graf wrote:
>>> VMware ESX checks if the microcode level is correct when using a
>>> Barcelona CPU, in
>>> order to see if it actually can use SVM. Let's tell it we're on the
>>> safe side...
>>>   
>> Sounds like you're able to boot ESX?  Are you able to run a guest yet?
> 
> The moment you wrote the mail I just managed to get ReactOS running in
> ESX. It does not use SVM yet though. I wonder when/if ESX actually does
> use SVM.
> 

Is the guest 64-bit? I am suspecting ESX 3.5 series does not use VT/SVM
for 32-bit guests, but have not been able to positively confirm it.

david


> Alex
> 


Re: kvm: make --mem-path memory allocation depend on mmu notifiers

2009-01-13 Thread David S. Ahern


Marcelo Tosatti wrote:
> Without mmu notifiers usage of hugepages to back guest memory can cause
> memory corruption.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> 
> diff --git a/qemu/vl.c b/qemu/vl.c
> index d0660ab..49cf066 100644
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -4664,6 +4664,11 @@ void *alloc_mem_area(size_t memory, unsigned long *len, const char *path)
>      void *area;
>      int fd;
>  
> +    if (!kvm_has_sync_mmu()) {
> +        fprintf(stderr, "host lacks mmu notifiers, disabling --mem-path\n");
> +        return NULL;
> +    }
> +
>      if (asprintf(&filename, "%s/kvm.XXXXXX", path) == -1)
>          return NULL;
>  

That means you can't use hugepages with RHEL5 as the host OS. That's not
good for me. I've exclusively used hugepages for the past 6 months or
so, the past 2 months with RHEL5 as the host OS, without a problem. How
likely is the corruption to occur (theoretically possible, or hit at
random in practice)?

david




Re: [PATCH 5/5] Fix kdump under KVM.

2009-10-30 Thread David S. Ahern

On 10/30/2009 09:28 AM, Chris Lalancette wrote:
>>
>> For VMX, the guests TSC's are now usually synchronized (which was not
>> the case before, when this patch was written). For AMD, they start out
>> of sync.
>>
>> Migration also causes them to be out of sync. So i'd prefer to postpone
>> the removal of this limitation until there is a guarantee the TSCs 
>> are synchronized across vcpus.
> 
> OK.  I'll try to get on an unsynchronized TSC machine and see if I can 
> reproduce
> there.

Perhaps Marcelo was referring to his patchset from December 2008
targeting TSC synchronization?

http://www.mail-archive.com/kvm@vger.kernel.org/msg08071.html

David


  1   2   >