HEADSUP: Call for FreeBSD Status Reports - 3Q/2010

2010-09-30 Thread Daniel Gerzo

Dear all,

I would like to remind you that the next round of status reports
covering the third quarter of 2010 is due on October 15th, 2010. This
initiative is very welcome in our community. Therefore, I would like to
ask you to submit your status reports soon, so that we can compile the 
report on time.


Do not hesitate and write us a few lines - a short  description about 
what you are working on, what are your plans and goals, so we can inform 
our community about your great work! Check out the reports from the past 
to get some inspiration of what your submission should look like.


If you know about a project that should be included in the status
report, please let us know as well, so we can poke the responsible
people to provide us with something useful. Updates to submissions from
the last report are welcome too.

Note that the submissions are accepted from anyone involved with the
FreeBSD community, you do not have to be a FreeBSD committer.
Submissions about anything related to FreeBSD are very welcome!

Please email us the filled-in XML template which can be found at
http://www.freebsd.org/news/status/report-sample.xml to
mont...@freebsd.org, or alternatively use our web based form located at
http://www.freebsd.org/cgi/monthly.cgi.

For more information, please visit http://www.freebsd.org/news/status/.

We are looking forward to see your submissions!

--
S pozdravom / Best regards
  Daniel Gerzo, FreeBSD committer
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Adding a V=R mapping for amd64?

2010-09-30 Thread Kostik Belousov
On Wed, Sep 29, 2010 at 01:22:56PM -0700, Matthew Fleming wrote:
 On Wed, Sep 29, 2010 at 1:12 PM, Kostik Belousov kostik...@gmail.com wrote:
  On Wed, Sep 29, 2010 at 12:40:57PM -0700, Matthew Fleming wrote:
  I'm hacking around with making a fast reboot that puts a copy of the
  MBR from disk into address 0x7c00 and, after disabling various
  translation bits and stopping other CPUs, branches to it, to skip the
  hardware self test that normally happens on boot.
 
  I haven't gotten to the point of attempting to run the code at 0x7c00
  because I'm first hitting a different error.  Despite my attempts to
  enter a translation into the hardware page table, I get a panic trying
  to write to address 0x7000, where I intended to put the trampoline
  code that turns off translation.
 
  Rebooting...
  Attempt to reset to MBR...
  XXX attempting pmap_kenter()...
  XXX copying bootstrap code...
  panic @ time 1285760103.939, thread 0xff000775d960: Fatal trap 12:
  page fault while in kernel mode
 
  cpuid = 0
  Panic occurred in module kernel loaded at 0x8010:
 
  Stack: --
  kernel:trap_fatal+0xac
  kernel:trap_pfault+0x24c
  kernel:trap+0x42e
  kernel:bcopy+0x16
  kernel:shutdown_reset+0x48
  kernel:boot+0x317
  kernel:reboot+0x60
  kernel:ia32_syscall+0x1cd
  --
  cpuid = 0; apic id = 00
  fault virtual address   = 0x7000
  fault code              = supervisor write data, page not present
  stack pointer           = 0x10:0xff8059e07670
  frame pointer           = 0x10:0xff8059e07780
 
  Here's what I think is the relevant snippets of code.  Note that I
  reserved the vm_page_t for physical page 7 as mbr_page early in boot,
  so I know the memory is free.
 
  void
  pmap_kenter_VR(vm_paddr_t pa)
  {
        pmap_t pmap = kernel_pmap;
        vm_page_t mpte;
        pd_entry_t *pde;
        pt_entry_t *pte;
 
        vm_page_lock_queues();
        PMAP_LOCK(pmap);
        mpte = pmap_allocpte(pmap, pa, M_WAITOK);
 
        pde = pmap_pde(pmap, pa);
        if (pde == NULL || (*pde  PG_V) == 0)
                panic(%s: invalid page directory va=%#lx, __func__, pa);
        if ((*pde  PG_PS) != 0)
                panic(%s: attempted pmap_enter on 2MB page, __func__);
        pte = pmap_pde_to_pte(pde, pa);
        if (pte == NULL)
                panic(%s: no pte va=%#lx, __func__, pa);
 
        if (*pte != 0) {
                /* Remove extra pte reference. */
                mpte-wire_count--;
        }
          pte_store(pte, pa | PG_RW | PG_V | PG_G | pg_nx);
 
        vm_page_unlock_queues();
        PMAP_UNLOCK(pmap);
  }
 
  Then in cpu_reset():
 
        /*
         * Establish a V=R mapping for the MBR page, and copy a
         * reasonable guess at the size of the bootstrap code into the
         * beginning of the page.
         */
        printf(XXX attempting pmap_kenter()...\n);
        pmap_kenter_VR(trunc_page(mbaddr));
        printf(XXX copying bootstrap code...\n);
        to_copy = (uintptr_t)xxx_reset_end - (uintptr_t)xxx_reset_real;
        if (to_copy  mbaddr - trunc_page(mbaddr))
                to_copy = mbaddr - trunc_page(mbaddr);
        bcopy(xxx_reset_real, (void *)trunc_page(mbaddr), to_copy);  /* die 
  here */
        printf(XXX attempting to turn off xlation and re-run MBR...\n);
        xxx_reset_real(mbaddr);
 
 
  My first attempt was a call to
        pmap_kenter(trunc_page(0x7c00), trunc_page(0x7c00));
  which failed trying to dereference the non-existent PDE.
 
  My second attempt called
        pmap_enter(kernel_pmap, trunc_page(0x7c00), VM_PROT_WRITE, mbr_page,
            VM_PROT_ALL, 0);
  That failed with the same crash as the attempt using pmap_kenter_VR().
 
  So... any thoughts as to why, after an apparently successful
  installation of an xlation, I still get a panic as though there were
  no xlation?
 
  Weird formatting of backtrace. Is this some proprietary code ?
 
 Yeah, there's some Isilon local modifications.
 
  Why do you try to create 1-1 mapping at all ? The MBR code should be
  executing in real mode anyway, and the mapping address at the moment of
  bcopy does not matter at all. I think that the use of PHYS_TO_DMAP()
  should give you direct mapping.
 
 I assumed I need trampoline code.  The moment I turn off the xlation
 bits, if the instruction pointer I'm running from is the normal kernel
 addresses, won't I die horribly trying to access 0x8000 or
 so in real mode?
Then, there is more then identity mapping. You need to gradually turn
off protected mode, in particular, load segment registers with 64KB
limited descriptors. After turning off paging, for far jumps to work,
you would need also identity-mapped GDT.

The procedure is documented in Intel IA32 arch manual, see vol 3A 9.9.3
Switching Back to Real-Address Mode.

In fact, there is some magic location in the RAM that causes BIOS to
skip initialization and 

Re: page table fault, which should map kernel virtual address space

2010-09-30 Thread Svatopluk Kraus
On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:
 On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote:
 Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
 pager_map,...) exist as result of 'kmem_suballoc' function call.
 When this submaps are used (for example 'kmem_alloc_nofault'
 function) and its virtual address subspace is at the end of
 used kernel virtual address space at the moment (and above 'NKPT'
 preallocation), then missing page tables are not allocated
 and double fault can happen.


 No, the page tables are allocated.  If you create a submap X of the kernel
 map using kmem_suballoc(), then a vm_map_findspace() is performed by
 vm_map_find() on the kernel map to find space for the submap X.  As you note
 above, the call to vm_map_findspace() on the kernel map will call
 pmap_growkernel() if needed to extend the kernel page table.

 If you create another submap X' of X, then that submap X' can only map
 addresses that fall within the range for X.  So, any necessary page table
 pages were allocated when X was created.

You are right. Mea culpa. I was focused on a solution and made
too quick conclusion. The page table fault hitted in 'pager_map',
which is submap of 'clean_map' and when I debugged the problem
I didn't see a submap stuff as a whole.

 That said, there may actually be a problem with the implementation of the
 superpage_align parameter to kmem_suballoc().  If a submap is created with
 superpage_align equal to TRUE, but the submap's size is not a multiple of
 the superpage size, then vm_map_find() may not allocate a page table page
 for the last megabyte or so of the submap.

 There are only a few places where kmem_suballoc() is called with
 superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
 way to test this hypothesis.

Yes, it helps.

My story is that the problem started up when I updated a project
('coldfire' port)
based on FreeBSD 8.0. to FreeBSD current version. In the current version
the 'clean_map' submap is created with superpage_align set to TRUE.

I have looked at vm_map_find() and debugged the page table fault once again.
IMO, it looks that a do-while loop does not work in the function as intended.
A vm_map_findspace() finds a space and calls pmap_growkernel() if needed.
A pmap_align_superpage() arrange the space but never call pmap_growkernel().
A vm_map_insert() inserts the aligned space into a map without error
and never call pmap_growkernel() and does not invoke loop iteration.

I don't know too much about an idea how a virtual memory model is implemented
and used in other modules. But it seems that it could be more reasonable to
align address space in vm_map_findspace() internally and not to loop externally.

I have tried to add a check in vm_map_insert() that checks the 'end' parameter
against 'kernel_vm_end' variable and returns KERN_NO_SPACE error if needed.
In this case the loop in vm_map_find() works and I have no problem with
the page table fault. But 'kernel_vm_end' variable must be initializated
properly before first use of vm_map_insert(). The 'kernel_vm_end' variable
can be self-initializated in pmap_growkernel() in FreeBSD 8.0 (it is too late),
but it was changed in current version ('i386' port).

Thanks for your answer, but I'm still looking for permanent
and approved solution.

 Regards, Svata
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann
 necessary as they easily can be done
with standard RB tree accessors.  Our standard userspace jemalloc
also uses RB trees for its internal housekeeping.  RB tree details:
 http://en.wikipedia.org/wiki/Red-black_tree

I say hypothesis because I haven't measured the difference to an
RB tree implementation yet.  I've hacked up a crude and somewhat
mechanical patch to convert the vmmap and page VM structures to
use RB trees, the vmmap part is not stable yet.  The page part
seems to work fine though.

This is what I've hacked together so far:
 http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
 http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

The diffs are still in their early stages and do not make use of
any code simplifications becoming possible by using RB trees instead
of the current splay trees.

Comments on the VM issue and splay vs. RB tree hypothesis welcome.

--
Andre
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Matthew Fleming
 on this though.  In addition a few of
 the additional linked lists that currently accompany the vmmap and
 page structures are no longer necessary as they easily can be done
 with standard RB tree accessors.  Our standard userspace jemalloc
 also uses RB trees for its internal housekeeping.  RB tree details:
  http://en.wikipedia.org/wiki/Red-black_tree

 I say hypothesis because I haven't measured the difference to an
 RB tree implementation yet.  I've hacked up a crude and somewhat
 mechanical patch to convert the vmmap and page VM structures to
 use RB trees, the vmmap part is not stable yet.  The page part
 seems to work fine though.

 This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

 The diffs are still in their early stages and do not make use of
 any code simplifications becoming possible by using RB trees instead
 of the current splay trees.

 Comments on the VM issue and splay vs. RB tree hypothesis welcome.

Do you have actual performance numbers from a real workload?

I implemented an AA tree locally (basically a 2,3 tree that uses
coloring to represent the 3 nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA
code.  I suspect that the overhead of an RB tree would similarly wash
out the benefits of O(log_2 n) time, but this is theory.  The facts
would be measuring several different workloads an looking at the
system/real times for them between the two solutions.

We've talked internally at $work about using a B+-tree with maybe
branching factor 5-7; whatever makes sense for the size of a cache
line.  This seems likely to be better for performance than an RB-tree
but does require a lot more changes, since separate memory is needed
for the tree's nodes outside the vm_page structure.  There just hasn't
been time to implement it and try it out.

Unfortunately I won't have the time to experiment at $work for a few
months on this problem.

Thanks,
matthew
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


sysctl for querying kmem_map-size

2010-09-30 Thread Andriy Gapon

Here's a patch that adds a sysctl for querying kmem_map-size, which may be 
useful
for system state/resources monitoring:
http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff

I am quite unsure about sizeof(kmem_map-size) == sizeof(int) hack, but I 
couldn't
think of other way to decide whether to use SYSCTL_ADD_UINT or SYSCTL_ADD_ULONG
depending on real type behind vm_size_t.

Comments, suggestions?

-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 18:37, Andre Oppermann wrote:

Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.


Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.

--
Andre
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Roman Divacky
 that a splay tree
 is suboptimal for the normal case and quickly horrible even on small
 deviations from the normal case in the vmmap.  For the page table it
 is always horrible.  Not to mention the locking complications due to
 modifying the tree on lookups.
 
 Based on an examination of other binary trees I put forward the
 hypothesis that a Red-Black tree is a much better, if not the optimal,
 fit for the vmmap and page table.  RB trees are balanced binary trees
 and O(log n) in all cases.  The big advantage in this context is that
 lookups are pure reads and do not cause CPU cache invalidations on
 other CPU's and always only require a read lock without the worst
 case properties of the unbalanced splay tree.  The high cache locality
 of the vmmap lookups can be used with the RB tree as well by simply
 adding a pointer to the least recently found node.  To prevent write 
 locking this can be done lazily.  More profiling is required to make
 a non-speculative statement on this though.  In addition a few of
 the additional linked lists that currently accompany the vmmap and
 page structures are no longer necessary as they easily can be done
 with standard RB tree accessors.  Our standard userspace jemalloc
 also uses RB trees for its internal housekeeping.  RB tree details:
  http://en.wikipedia.org/wiki/Red-black_tree
 
 I say hypothesis because I haven't measured the difference to an
 RB tree implementation yet.  I've hacked up a crude and somewhat
 mechanical patch to convert the vmmap and page VM structures to
 use RB trees, the vmmap part is not stable yet.  The page part
 seems to work fine though.
 
 This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
 
 The diffs are still in their early stages and do not make use of
 any code simplifications becoming possible by using RB trees instead
 of the current splay trees.
 
 Comments on the VM issue and splay vs. RB tree hypothesis welcome.
 
 -- 
 Andre
 ___
 freebsd-curr...@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-current
 To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann
 on this though.  In addition a few of
the additional linked lists that currently accompany the vmmap and
page structures are no longer necessary as they easily can be done
with standard RB tree accessors.  Our standard userspace jemalloc
also uses RB trees for its internal housekeeping.  RB tree details:
  http://en.wikipedia.org/wiki/Red-black_tree

I say hypothesis because I haven't measured the difference to an
RB tree implementation yet.  I've hacked up a crude and somewhat
mechanical patch to convert the vmmap and page VM structures to
use RB trees, the vmmap part is not stable yet.  The page part
seems to work fine though.

This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff

The diffs are still in their early stages and do not make use of
any code simplifications becoming possible by using RB trees instead
of the current splay trees.

Comments on the VM issue and splay vs. RB tree hypothesis welcome.


Do you have actual performance numbers from a real workload?


No, not yet.  I obviously would have posted those numbers as well.
What do you consider a representative real workload?


I implemented an AA tree locally (basically a 2,3 tree that uses
coloring to represent the 3 nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA
code.  I suspect that the overhead of an RB tree would similarly wash
out the benefits of O(log_2 n) time, but this is theory.  The facts
would be measuring several different workloads an looking at the
system/real times for them between the two solutions.


I think there are two pronounced effects to consider:
 a) the algorithmic complexity
 b) the real world implications of said complexity and operations
on today's CPU's multi-hierarchy caches and associated SMP side
effects.


We've talked internally at $work about using a B+-tree with maybe
branching factor 5-7; whatever makes sense for the size of a cache
line.  This seems likely to be better for performance than an RB-tree
but does require a lot more changes, since separate memory is needed
for the tree's nodes outside the vm_page structure.  There just hasn't
been time to implement it and try it out.



Unfortunately I won't have the time to experiment at $work for a few
months on this problem.


--
Andre
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alan Cox
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann an...@freebsd.org wrote:

 On 30.09.2010 18:37, Andre Oppermann wrote:

 Just for the kick of it I decided to take a closer look at the use of
 splay trees (inherited from Mach if I read the history correctly) in
 the FreeBSD VM system suspecting an interesting journey.


 Correcting myself regarding the history: The splay tree for vmmap was
 done about 8 years ago by alc@ to replace a simple linked list and was
 a huge improvement.  The change in vmpage from a hash to the same splay
 tree as in vmmap was committed by dillon@ about 7.5 years ago with some
 involvement of a...@.
 ss


Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.

I teach class shortly.  I'll provide more details later.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:01, Alan Cox wrote:
 On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann an...@freebsd.org wrote:
 
 On 30.09.2010 18:37, Andre Oppermann wrote:

 Just for the kick of it I decided to take a closer look at the use of
 splay trees (inherited from Mach if I read the history correctly) in
 the FreeBSD VM system suspecting an interesting journey.


 Correcting myself regarding the history: The splay tree for vmmap was
 done about 8 years ago by alc@ to replace a simple linked list and was
 a huge improvement.  The change in vmpage from a hash to the same splay
 tree as in vmmap was committed by dillon@ about 7.5 years ago with some
 involvement of a...@.
 ss

 Yes, and there is a substantial difference in the degree of locality of
 access to these different structures, and thus the effectiveness of a splay
 tree.  When I did the last round of changes to the locking on the vm map, I
 made some measurements of the splay tree's performance on a JVM running a
 moderately large bioinformatics application.  The upshot was that the
 average number of map entries visited on an access to the vm map's splay
 tree was less than the expected depth of a node in a perfectly balanced
 tree.

Sorry, I'm not sure how to parse that - are you saying that the splaying
helped, making the number of nodes visited during tree search lesser
than the average node depth in a balanced tree?

Even if so, wouldn't the excessive bandwidth lost in the splaying ops
(and worse - write bandwidth) make it unsuitable today?


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: 8.1-R - Marvell 88SX6081 SATA controller via mvs = lots of errors

2010-09-30 Thread Alexander Motin
Hi.

Karl Pielorz wrote:
 I just switched my 8.1-R/amd64 (dual Opteron) system from ATA over to
 the new mvs driver, and started seeing a whole bunch of errors (which
 appear to have hosed one of my zfs volumes during a scrub) - anyone know
 what the following errors actually mean?
 
 The machine has 2 * 88SX6081's in it:
 
 
 Sep 28 19:58:49 kernel: mvs0: Marvell 88SX6081 SATA controller port
 0x3000-0x30ff mem 0xd010-0xd01f,0xd040-0xd07f irq 24 at
 device 4.0 on pci17
 Sep 28 19:58:49 kernel: mvs0: Gen-II, 8 3Gbps ports, Port Multiplier
 ...
 Sep 28 19:58:49 kernel: mvs1: Marvell 88SX6081 SATA controller port
 0x4000-0x40ff mem 0xd0c0-0xd0cf,0xd080-0xd0bf irq 28 at
 device 4.0 on pci18
 Sep 28 19:58:49 kernel: mvs1: Gen-II, 8 3Gbps ports, Port Multiplier
 supported
 
 
 Under 7.2 they ran fine, with the ATA driver. I use ZFS on this machine
 - and both pools were scrubbed before the upgrade (and backed up
 fortunately!).
 
 
 With the mvs driver, during a scrub of the main volume, I see:
 
 
 Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 6 (-14) 1 4000
 Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 7 (-14) 0 4000
 Sep 29 08:56:13 kernel: mvsch12: EMPTY CRPB 8 (-14) 2 4000
 
 
 [repeated a lot - interspersed with zfs reporting problems with files,
 on all the devices in the pool]
 
 I then also get a whole bunch of:
 
 
 Sep 29 08:56:56 kernel: mvsch0: Timeout on slot 1
 Sep 29 08:56:56 kernel: mvsch0: iec 0200 sstat 0123 serr
  edma_s 1020 dma_c  dma_s  rs 0006 statu
 s 40
 Sep 29 08:56:56 kernel: mvsch0:  ... waiting for slots 0004
 Sep 29 08:56:56 kernel: mvsch12: Timeout on slot 5
 Sep 29 08:56:56 kernel: mvsch12: iec 0200 sstat 0123 serr
  edma_s 1121 dma_c  dma_s  rs 0028 stat
 us 40
 

EMPTY CRPB error means that controller reported completion for command
slot that driver counted as empty at the moment. Can't say if it is
hardware or driver issue. Timeouts could be related but I am not sure
what is the reason and what is consequence here. It could help if you
send me full log of those messages to create full picture.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alfred Perlstein
 is a much better, if not the optimal,
 fit for the vmmap and page table.  RB trees are balanced binary trees
 and O(log n) in all cases.  The big advantage in this context is that
 lookups are pure reads and do not cause CPU cache invalidations on
 other CPU's and always only require a read lock without the worst
 case properties of the unbalanced splay tree.  The high cache locality
 of the vmmap lookups can be used with the RB tree as well by simply
 adding a pointer to the least recently found node.  To prevent write 
 locking this can be done lazily.  More profiling is required to make
 a non-speculative statement on this though.  In addition a few of
 the additional linked lists that currently accompany the vmmap and
 page structures are no longer necessary as they easily can be done
 with standard RB tree accessors.  Our standard userspace jemalloc
 also uses RB trees for its internal housekeeping.  RB tree details:
  http://en.wikipedia.org/wiki/Red-black_tree
 
 I say hypothesis because I haven't measured the difference to an
 RB tree implementation yet.  I've hacked up a crude and somewhat
 mechanical patch to convert the vmmap and page VM structures to
 use RB trees, the vmmap part is not stable yet.  The page part
 seems to work fine though.
 
 This is what I've hacked together so far:
  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
 
 The diffs are still in their early stages and do not make use of
 any code simplifications becoming possible by using RB trees instead
 of the current splay trees.
 
 Comments on the VM issue and splay vs. RB tree hypothesis welcome.
 
 -- 
 Andre
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org

-- 
- Alfred Perlstein
.- VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sysctl for querying kmem_map-size

2010-09-30 Thread Garrett Cooper
On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon a...@icyb.net.ua wrote:

 Here's a patch that adds a sysctl for querying kmem_map-size, which may be 
 useful
 for system state/resources monitoring:
 http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff

 I am quite unsure about sizeof(kmem_map-size) == sizeof(int) hack, but I 
 couldn't
 think of other way to decide whether to use SYSCTL_ADD_UINT or 
 SYSCTL_ADD_ULONG
 depending on real type behind vm_size_t.

Is the base value of the field size_t? If so, then it's ulong on
64-bit archs and uint on 32-bit archs. Maybe it's a good time then to
actually get the sysctl and tunables work that I started on into base.
I have a functioning and tested copy of the tunables work, but I'll
need to do similar for the sysctls as well (des@ and I kind of got out
of sync a few months back).
Thanks,
-Garrett
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Ivan Voras
On 09/30/10 20:38, Alfred Perlstein wrote:
 Andre,
 
 Your observations on the effectiveness of the splay tree
 mirror the concerns I have with it when I read about it.
 
 I have always wondered though if the splay-tree algorithm
 was modified to only perform rotations when a lookup required
 more than N traversals to reach a node.
 
 This would self-balance the tree and maintain cache without 
 the expense of writes for nearly all lookups.
 
 I'm wondering if you have the time/interest in rerunning
 your tests, but modifying the algorithm to only rebalance
 the splay if a lookup requires more than let's say 3, 5, 7
 or 10 traversals.

I see two possible problems with this:

1: the validity of this heuristics, since the splay is not meant to help
the current lookup but future lookups, and if you now require e.g.
5-deep traversal, (barring external information about the structures -
meybe some inner relationship of the nodes can be exploitet) it is I
think about the same probability that the next lookup will hit that
rotated node or the former root node.

2: rotating only on the N'th lookup would have to go like this:

   1. take a read-only lock
   2. make the lookup, count the depth
   3. if depth  N:
  1. relock for write (lock upgrade will not always work)
  2. recheck if the tree is still the same; bail if it isn't
  3. do the splay
   4. unlock

i.e. suspiciously complicated. That is, if you want to take advantage of
read paralelism; if the tree is write-locked all the time it's simpler
but only inefficient.

Of course, real-world measurements trump theory :)

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sysctl for querying kmem_map-size

2010-09-30 Thread Andriy Gapon
on 30/09/2010 21:52 Garrett Cooper said the following:
 On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon a...@icyb.net.ua wrote:

 Here's a patch that adds a sysctl for querying kmem_map-size, which may be 
 useful
 for system state/resources monitoring:
 http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff

 I am quite unsure about sizeof(kmem_map-size) == sizeof(int) hack, but I 
 couldn't
 think of other way to decide whether to use SYSCTL_ADD_UINT or 
 SYSCTL_ADD_ULONG
 depending on real type behind vm_size_t.
 
 Is the base value of the field size_t? If so, then it's ulong on
 64-bit archs and uint on 32-bit archs. Maybe it's a good time then to

No, it's vm_size_t, but it's defined similarly to size_t I guess:
vm_size_t - __vm_size_t - {__uint32_t or __uint64_t depending on arch}

 actually get the sysctl and tunables work that I started on into base.
 I have a functioning and tested copy of the tunables work, but I'll
 need to do similar for the sysctls as well (des@ and I kind of got out
 of sync a few months back).

I believe that this is the first time I hear about this project.

-- 
Andriy Gapon
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:01, Alan Cox wrote:

On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermannan...@freebsd.org  wrote:


On 30.09.2010 18:37, Andre Oppermann wrote:


Just for the kick of it I decided to take a closer look at the use of
splay trees (inherited from Mach if I read the history correctly) in
the FreeBSD VM system suspecting an interesting journey.



Correcting myself regarding the history: The splay tree for vmmap was
done about 8 years ago by alc@ to replace a simple linked list and was
a huge improvement.  The change in vmpage from a hash to the same splay
tree as in vmmap was committed by dillon@ about 7.5 years ago with some
involvement of a...@.
ss



Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.


This is indeed an expected property of the splay tree.  For the vmmap
the root node hit rate, that is the least recently accessed node, is very
high.  Other nodes that are frequently accessed as well should be high
up too.  Nonetheless the churn rate on the nodes in tree rotations is
considerable.

The question now is whether we can get the same positive effect but with
less negative side effects.  Of particular concern is the CPU cache
dirtying and thrashing by the splay rotations.  Especially with today's
and tomorrows many core, many socket, many threads/processes machines.
Another concern is worst case behavior when the root node hit rate is
low.  Currently this is the case in vmpage for certain.  It may also
become a concern in vmmap with the use of memguard (fragmenting the
address space) or an adversary trying to exploit worst case behavior
on a shared host by mapping only every other page.

This concern is validated when accepting the measurement by the 2008 SoC
student Mayur (thanks to rdivacky@ for digging up his page):
 http://wiki.freebsd.org/MayurShardul/VM_Algorithm_Improvement

In the measurements of his alternative radix trie implementation the
lookup time is *230 times less* than for the splay tree (as measured by
with the TSC).  The description of the performed benchmark is extremely
sparse and it is not clear what access pattern he simulated.  Probably
a random lookup pattern which is of course very bad for the splay tree.
Any balanced tree will be better for random lookups.

Whether his measured improvement is due to using a radix trie or if it
would perform equally to a normal balanced binary tree I can't say (yet).


I teach class shortly.  I'll provide more details later.


--
Andre
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Andre Oppermann

On 30.09.2010 20:38, Alfred Perlstein wrote:

Andre,

Your observations on the effectiveness of the splay tree
mirror the concerns I have with it when I read about it.

I have always wondered though if the splay-tree algorithm
was modified to only perform rotations when a lookup required
more than N traversals to reach a node.

This would self-balance the tree and maintain cache without
the expense of writes for nearly all lookups.


This would break the underlying assumption of the splay tree.
A splay tree is not a balanced tree.  With this approach you
can easily get stuck in a sub-optimal situation with many
lookups traversing N-1 nodes and not getting the cache benefit.
When N nodes are traversed for an outlier, and the rotate happens,
you rotate again on the next lookup or you get stuck in another
sub-optimal situation.  In effect you get the worst of an splay
tree while not being able to gain on the amortization effect
when the root node hit rate is high.


I'm wondering if you have the time/interest in rerunning
your tests, but modifying the algorithm to only rebalance
the splay if a lookup requires more than let's say 3, 5, 7
or 10 traversals.


As the assumption of the algorithm is broken I don't think it's
useful to spend time on this.  I'd rather try and add a last-match
pointer to the RB tree to take home the locality effect in vmmap.

--
Andre
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: sysctl for querying kmem_map-size

2010-09-30 Thread Garrett Cooper
On Thu, Sep 30, 2010 at 1:26 PM, Andriy Gapon a...@icyb.net.ua wrote:
 on 30/09/2010 21:52 Garrett Cooper said the following:
 On Thu, Sep 30, 2010 at 10:17 AM, Andriy Gapon a...@icyb.net.ua wrote:

 Here's a patch that adds a sysctl for querying kmem_map-size, which may be 
 useful
 for system state/resources monitoring:
 http://people.freebsd.org/~avg/sysctl-kmem_map_size.diff

 I am quite unsure about sizeof(kmem_map-size) == sizeof(int) hack, but I 
 couldn't
 think of other way to decide whether to use SYSCTL_ADD_UINT or 
 SYSCTL_ADD_ULONG
 depending on real type behind vm_size_t.

     Is the base value of the field size_t? If so, then it's ulong on
 64-bit archs and uint on 32-bit archs. Maybe it's a good time then to

 No, it's vm_size_t, but it's defined similarly to size_t I guess:
 vm_size_t - __vm_size_t - {__uint32_t or __uint64_t depending on arch}

Well... that's basically size_t right? At first inspection it
seems silly to define it as __uint32_t or __uint64_t if it's really
the same thing (in theory) as __size_t . Of course the vmem folks
(like alc@) might have more context as to why things were done this
way.

 actually get the sysctl and tunables work that I started on into base.
 I have a functioning and tested copy of the tunables work, but I'll
 need to do similar for the sysctls as well (des@ and I kind of got out
 of sync a few months back).

 I believe that this is the first time I hear about this project.

It wasn't an official project. It was more or less an email thread
that sort of turned into a challenge, which turned into a project :) :
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/hackers/2010-08/msg00054.html
.
As with all things though, I'm not a committer by any means, so
unless someone works with me to review and commit my code, it will
exist purely in the ether for all eternity. Part of the drawbacks of
being in GSoC purgatory.
Thanks!
-Garrett
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org