Re: How would I make this work in RELENG_7

2007-12-22 Thread Alexander Motin

Hi.

Sam Fourman Jr. wrote:

I have a PDA smart phone that I would like to use as a wireless modem
on my laptop

someone from OpenBSD helped me get it committed to OpenBSD 's Tree

would someone help me with a similar patch for FreeBSD

here is an old post that I made
http://www.nabble.com/Alltel-PPC6700-Wireless-Modem-td12491547.html


Some time ago I added my HTC Prophet PDA to the uipaq driver, and it 
seems to work, except that the palm/synce version present in ports has 
no WM6 support, so I had to build a recent version myself.


I have made a patch adding your device ID to the uipaq driver, similar 
to the OpenBSD one. But I am not sure whether this device should be 
supported by the uipaq or the umodem driver. The umodem driver looks 
much more powerful, but I have nothing to test it with, as my WM6 device 
does not provide USB modem support. Could you try it as well?


--
Alexander Motin
--- usbdevs.prev2007-12-11 08:41:38.0 +0200
+++ usbdevs 2007-12-23 01:13:13.0 +0200
@@ -1382,6 +1382,7 @@
 product HP2 C500   0x6002  PhotoSmart C500
 
 /* HTC products */
+product HTC MODEM  0x00cf  USB Modem 
 product HTC SMARTPHONE 0x0a51  SmartPhone USB Sync
 
 /* HUAWEI products */
--- uipaq.c.prev2007-10-22 11:28:24.0 +0300
+++ uipaq.c 2007-12-23 01:12:51.0 +0200
@@ -123,6 +123,7 @@
{{ USB_VENDOR_HP, USB_PRODUCT_HP_2215 }, 0 },
{{ USB_VENDOR_HP, USB_PRODUCT_HP_568J }, 0},
{{ USB_VENDOR_HTC, USB_PRODUCT_HTC_SMARTPHONE }, 0},
+   {{ USB_VENDOR_HTC, USB_PRODUCT_HTC_MODEM }, 0},
{{ USB_VENDOR_COMPAQ, USB_PRODUCT_COMPAQ_IPAQPOCKETPC } , 0},
{{ USB_VENDOR_CASIO, USB_PRODUCT_CASIO_BE300 } , 0},
{{ USB_VENDOR_SHARP, USB_PRODUCT_SHARP_WZERO3ES }, 0},

kstackusage() patch request for comments

2008-01-27 Thread Alexander Motin

Hi.

I have made a patch
http://www.mavhome.dp.ua/kstackusage.patch
that implements a machine-dependent function returning the current kernel 
thread's stack usage and uses it in the netgraph subsystem to get the 
maximum benefit from direct function calls with minimum queueing while 
keeping the stack protected. As I have never developed machine-dependent 
code before, I would like to hear any comments about it.


The main question I have is about the source files and headers I should 
use for this purpose. Is it correct to declare the function in a 
machine-independent header but implement it in machdep.c, or should I 
declare it in the machine-dependent headers?


I would also be grateful for help with implementations of this function 
for architectures other than i386/amd64.
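For illustration, on i386/amd64 I imagine something along these lines (an
untested sketch, not the actual patch; it relies on the td_kstack and
td_kstack_pages fields of struct thread and assumes the stack grows down):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>

/*
 * Untested sketch: estimate how much of the current kernel thread's
 * stack is in use, measuring down from the top of the region described
 * by td_kstack/td_kstack_pages to the address of a local variable.
 */
static size_t
kstackusage_sketch(void)
{
	struct thread *td = curthread;
	vm_offset_t top, sp;

	top = td->td_kstack + td->td_kstack_pages * PAGE_SIZE;
	sp = (vm_offset_t)&td;		/* lives on the current stack */
	return (top - sp);
}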


Thanks.

--
Alexander Motin


Memory allocation performance

2008-01-31 Thread Alexander Motin

Hi.

While profiling netgraph operation on a UP HEAD router I have found that 
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]     14.1    0.56    0.21  545292      uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]     15.2    0.60    0.23  545266      uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all the allocation calls I could, and those that 
are left are practically unavoidable. But even after this, kgmon tells me 
that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, why is UMA so slow? In some places I have tried to 
replace it with a preallocated TAILQ of the required memory blocks 
protected by a mutex, and according to the profiler I got _much_ better 
results (see the sketch below). Would it be good practice to replace 
relatively small UMA zones with such a preallocated queue to avoid part 
of the UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but can it 
really cost 30% of the CPU time on a UP router?
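To show what I mean in 2), here is a simplified, hypothetical sketch of
such a preallocated queue (the item type, payload size and names are made
up; the pool itself would be filled at boot):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct qitem {
	TAILQ_ENTRY(qitem)	qi_link;
	char			qi_data[128];	/* payload, size is arbitrary here */
};

static TAILQ_HEAD(, qitem) qi_pool = TAILQ_HEAD_INITIALIZER(qi_pool);
static struct mtx qi_mtx;
MTX_SYSINIT(qi_mtx, &qi_mtx, "qi pool", MTX_DEF);

/* Take a preallocated item from the pool instead of uma_zalloc(). */
static struct qitem *
qi_alloc(void)
{
	struct qitem *it;

	mtx_lock(&qi_mtx);
	it = TAILQ_FIRST(&qi_pool);
	if (it != NULL)
		TAILQ_REMOVE(&qi_pool, it, qi_link);
	mtx_unlock(&qi_mtx);
	if (it != NULL)
		bzero(it, sizeof(*it));
	return (it);			/* NULL means the pool is empty */
}

/* Return an item to the pool instead of uma_zfree(). */
static void
qi_free(struct qitem *it)
{
	mtx_lock(&qi_mtx);
	TAILQ_INSERT_HEAD(&qi_pool, it, qi_link);
	mtx_unlock(&qi_mtx);
}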


Thanks!

--
Alexander Motin


Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Alexander Motin wrote:
While profiling netgraph operation on UP HEAD router I have found that 
huge amount of time it spent on memory allocation/deallocation:


I forgot to mention that it was mostly a GENERIC kernel, just built 
without INVARIANTS, WITNESS and SMP but with 'profile 2'.


--
Alexander Motin


Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Kris Kennaway wrote:

Alexander Motin wrote:

Alexander Motin wrote:
While profiling netgraph operation on UP HEAD router I have found 
that huge amount of time it spent on memory allocation/deallocation:


I have forgotten to tell that it was mostly GENERIC kernel just built 
without INVARIANTS, WITNESS and SMP but with 'profile 2'.


What is 'profile 2'?


I thought it was the high resolution profiling support. Isn't it?

--
Alexander Motin


Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Julian Elischer wrote:

Alexander Motin wrote:

Hi.

While profiling netgraph operation on UP HEAD router I have found that 
huge amount of time it spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]     14.1    0.56    0.21  545292      uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]     15.2    0.60    0.23  545266      uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls and those that 
left are practically unavoidable. But even after this kgmon tells that 
30% of CPU time consumed by memory management.


So I have some questions:
1) Is it real situation or just profiler mistake?
2) If it is real then why UMA is so slow? I have tried to replace it 
in some places with preallocated TAILQ of required memory blocks 
protected by mutex and according to profiler I have got _much_ better 
results. Will it be a good practice to replace relatively small UMA 
zones with preallocated queue to avoid part of UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but does 
it cost so much that it costs 30% CPU time on UP router?


given this information, I would add an 'item cache' in ng_base.c
(hmm do I already have one?)


That was actually my second question. As there are only 512 items by 
default and they are small, I can easily preallocate them all at boot. 
But is that a good way? Why can't UMA do just the same when I have 
created a zone with a specified element size and maximum number of 
objects? What is the principal difference?
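For reference, what I mean is roughly the following (illustration only;
the zone name, item size and the uma_prealloc() call are my guesses, not
the actual ng_base.c code):

#include <sys/param.h>
#include <sys/kernel.h>
#include <vm/uma.h>

static uma_zone_t ng_item_zone;

static void
ng_item_zone_init(void)
{
	/* Fixed-size zone, capped at the default of 512 items. */
	ng_item_zone = uma_zcreate("ng_item_sketch", 128,
	    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
	uma_zone_set_max(ng_item_zone, 512);
	uma_prealloc(ng_item_zone, 512);	/* fill it up front at boot */
}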


--
Alexander Motin


Re: Memory allocation performance

2008-02-01 Thread Alexander Motin

Hi.

Robert Watson wrote:
It would be very helpful if you could try doing some analysis with hwpmc 
-- "high resolution profiling" is of increasingly limited utility with 
modern CPUs, where even a high frequency timer won't run very often.  
It's also quite subject to cycle events that align with other timers in 
the system.


I have tried hwpmc but I am still not completely comfortable with it. The 
whole picture is somewhat similar to kgmon's, but it looks very noisy. Is 
there some know-how about how to use it better?


I have tried it for measuring the number of instructions. But I doubt 
that the instruction count is a correct metric for performance 
measurement, as different instructions may have very different execution 
times for many reasons, such as cache misses and current memory traffic. 
I tried to use tsc to count CPU cycles, but got this error:

# pmcstat -n 1 -S "tsc" -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification 
"tsc": Operation not supported

What have I missed?

I am now using a Pentium 4 Prescott CPU with HTT enabled in the BIOS, but 
a kernel built without SMP to simplify profiling. What counters can you 
recommend for regular time profiling on it?
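For reference, the kind of commands involved here are roughly the
following (the counter is just one of those I tried; the file names are
examples, and the exact names of the .gmon files produced by -g may
differ):

pmcstat -S p4-instr-retired -O /tmp/sample.out
pmcstat -R /tmp/sample.out -k /boot/kernel -g

followed by gprof(1) over the generated .gmon files against the kernel
image.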


Thanks for reply.

--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
I guess the question is: where are the cycles going?  Are we suffering 
excessive cache misses in managing the slabs?  Are you effectively 
"cycling through" objects rather than using a smaller set that fits 
better in the cache?


In my test setup usually only a few objects from the zone are allocated 
at the same time, but they are allocated twice per packet.


To check the UMA dependency I have made a trivial one-element cache 
which in my test case avoids two of the four allocations per packet.

.alloc.
-   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   mtx_lock_spin(&itemcachemtx);
+   item = itemcache;
+   itemcache = NULL;
+   mtx_unlock_spin(&itemcachemtx);
+   if (item == NULL)
+   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   else
+   bzero(item, sizeof(*item));
.free.
-   uma_zfree(ng_qzone, item);
+   mtx_lock_spin(&itemcachemtx);
+   if (itemcache == NULL) {
+   itemcache = item;
+   item = NULL;
+   }
+   mtx_unlock_spin(&itemcachemtx);
+   if (item)
+   uma_zfree(ng_qzone, item);
...

To be sure that the test system is CPU-bound I have throttled it with 
sysctl to 1044MHz. With this patch my test PPPoE-to-PPPoE router 
throughput has grown from 17 to 21 Mbytes/s. The profiling results I 
sent earlier promised similar gains.


Is some bit of debugging enabled that shouldn't 
be, perhaps due to a failure of ifdefs?


I have commented out all INVARIANTS and WITNESS options in the GENERIC 
kernel config. What else should I check?


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Joseph Koshy wrote:

You cannot sample with the TSC since the TSC does not interrupt the CPU.
For CPU cycles you would probably want to use "p4-global-power-events";
see pmc(3).


Thanks, I have already found this. The only problem was that by default 
it counts cycles only when both logical cores are active, while one of 
my cores was halted.
Sampling on this event, the profiler shows results close to the usual 
profiling, but they look more random:


  175.97  1.49     1/64     ip_input [49]
  175.97  1.49     1/64     g_alloc_bio [81]
  175.97  1.49     1/64     ng_package_data [18]
 1055.81  8.93     6/64     em_handle_rxtx [4]
 2639.53 22.32    15/64     em_get_buf [19]
 3343.41 28.27    19/64     ng_getqblk [17]
 3695.34 31.25    21/64     ip_forward [14]
[9]     21.6  11262.00    95.23      64     uma_zalloc_arg [9]
   35.45 13.03     5/22     critical_exit [75]
   26.86  0.00    22/77     critical_enter [99]
   19.89  0.00    18/19     mb_ctor_mbuf [141]


   31.87  0.24     4/1324   ng_ether_rcvdata [13]
   31.87  0.24     4/1324   ip_forward [14]
   95.60  0.73    12/1324   ng_iface_rcvdata [16]
  103.57  0.79    13/1324   m_freem [25]
  876.34  6.71   110/1324   mb_free_ext [30]
 9408.75 72.01  1181/1324   ng_free_item [11]
[10]    20.2  10548.00    80.73    1324     uma_zfree_arg [10]
   26.86  0.00    22/77     critical_enter [99]
   15.00 11.59     7/7      mb_dtor_mbuf [134]
   19.00  6.62     4/4      mb_dtor_pack [136]
    1.66  0.00     1/32     m_tag_delete_chain [114]


 21.4   11262.00 11262.00    64 175968.75 177456.76  uma_zalloc_arg [9]
 20.1   21810.00 10548.00  1324   7966.77   8027.74  uma_zfree_arg [10]
  5.6   24773.00  2963.00  1591   1862.35   2640.07  ng_snd_item [15]
  3.5   26599.00  1826.00    33  55333.33  55333.33  ng_address_hook [20]
  2.4   27834.00  1235.00   319   3871.47   3871.47  ng_acquire_read [28]

To make the statistics better I need to record sampling data with a 
smaller period, but too much data creates additional overhead, including 
disk operations, and breaks the statistics. Is there any way to make it 
more precise? What sampling parameters should I use for better results?


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Peter Jeremy wrote:

On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
To check UMA dependency I have made a trivial one-element cache which in my 
test case allows to avoid two for four allocations per packet.


You should be able to implement this lockless using atomic(9).  I haven't
verified it, but the following should work.


I have tried this, but the atomic(9) man page says:

The atomic_readandclear() functions are not implemented for the types
``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any 
variants with memory barriers at this time.
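Just to illustrate the direction, an untested lockless variant of the
same cache using atomic_cmpset_ptr(9) (which is implemented for pointers)
could look roughly like this, reusing item/itemcache/ng_qzone from the
patch above and assuming itemcache is declared volatile:

.alloc.
	item = itemcache;
	if (item == NULL ||
	    !atomic_cmpset_ptr((volatile uintptr_t *)&itemcache,
	    (uintptr_t)item, (uintptr_t)NULL))
		item = uma_zalloc(ng_qzone, wait | M_ZERO);
	else
		bzero(item, sizeof(*item));
.free.
	if (itemcache != NULL ||
	    !atomic_cmpset_ptr((volatile uintptr_t *)&itemcache,
	    (uintptr_t)NULL, (uintptr_t)item))
		uma_zfree(ng_qzone, item);

A failed compare-and-set simply falls back to UMA, so correctness does
not depend on winning the race.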


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
Hence my request for drilling down a bit on profiling -- the question 
I'm asking is whether profiling shows things running or taking time that 
shouldn't be.


I have not yet understood why it happens, but hwpmc shows a huge number 
of "p4-resource-stall"s in UMA functions:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.2   2303.00   2303.00        0  100.00%           uma_zfree_arg [1]
 41.2   4402.00   2099.00        0  100.00%           uma_zalloc_arg [2]
  1.4   4472.00     70.00        0  100.00%           uma_zone_exhausted_nolock [3]
  0.9   4520.00     48.00        0  100.00%           ng_snd_item [4]
  0.8   4562.00     42.00        0  100.00%           __qdivrem [5]
  0.8   4603.00     41.00        0  100.00%           ether_input [6]
  0.6   4633.00     30.00        0  100.00%           ng_ppp_prepend [7]

That probably explains why "p4-global-power-events" shows so many hits in them
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 20.0  37984.00  37984.00        0  100.00%           uma_zfree_arg [1]
 17.8  71818.00  33834.00        0  100.00%           uma_zalloc_arg [2]
  4.0  79483.00   7665.00        0  100.00%           ng_snd_item [3]
  3.0  85256.00   5773.00        0  100.00%           __mcount [4]
  2.3  89677.00   4421.00        0  100.00%           bcmp [5]
  2.2  93853.00   4176.00        0  100.00%           generic_bcopy [6]

, while "p4-instr-retired" does not.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 11.1   5351.00   5351.00        0  100.00%           ng_apply_item [1]
  7.9   9178.00   3827.00        0  100.00%           legacy_pcib_alloc_msi [2]
  4.1  11182.00   2004.00        0  100.00%           init386 [3]
  4.0  13108.00   1926.00        0  100.00%           rn_match [4]
  3.5  14811.00   1703.00        0  100.00%           uma_zalloc_arg [5]
  2.6  16046.00   1235.00        0  100.00%           SHA256_Transform [6]
  2.2  17130.00   1084.00        0  100.00%           ng_add_hook [7]
  2.0  18111.00    981.00        0  100.00%           ng_rmhook_self [8]
  2.0  19054.00    943.00        0  100.00%           em_encap [9]

So far I have invented two possible explanations. One is that due to 
UMA's cyclic block allocation order it does not fit the CPU caches; the 
other is that it is somehow related to critical_exit(), which can 
possibly cause a context switch. Does anybody have a better explanation 
of how such a small and simple (in this part) function can cause such 
results?


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
Basically, the goal would be 
to make the pcpu cache FIFO as much as possible as that maximizes the 
chances that the newly allocated object already has lines in the cache.  


Why FIFO? I think LIFO (a stack) would be better for this goal, as the 
most recently freed object has a better chance of still being present in 
the cache.


--
Alexander Motin


Re: Memory allocation performance

2008-02-03 Thread Alexander Motin

Kris Kennaway wrote:
You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out 
exactly where in those functions the events are occurring.  This will 
often help to track down the precise causes.


Thanks for the hint; it was interesting hunting, but it showed nothing. 
The hits land on very simple lines like:

bucket = cache->uc_freebucket;
cache->uc_allocs++;
if (zone->uz_ctor != NULL) {
cache->uc_frees++;
and so on.
There are no loops, no inlines or macros. Nothing! And the only hint 
about it is the huge number of "p4-resource-stall"s on those lines. I 
have no idea what exactly that means, why it happens mostly here, or how 
to fight it.
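(For what it is worth, the mapping itself was nothing fancy, roughly:

addr2line -f -e /boot/kernel/kernel.symbols 0xc0a1b2c3

with a made-up address here; kernel.debug from the kernel build directory
can be used instead when kernel.symbols is not installed.)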


I would probably agree that it might be some profiler fluctuation, but 
the performance benefits I have got from my self-made caching of UMA 
calls look very real. :(


Robert Watson wrote:
> There was, FYI, a report a few years ago that there was a measurable
> improvement from allocating off the free bucket rather than maintaining
> separate alloc and free buckets.  It sounded good at the time but I was
> never able to reproduce the benefits in my test environment.  Now might
> be a good time to try to revalidate that.  Basically, the goal would be
> to make the pcpu cache FIFO as much as possible as that maximizes the
> chances that the newly allocated object already has lines in the cache.
> It's a fairly trivial tweak to the UMA allocation code.

I have tried this, but have not found a difference. Maybe it gives some 
benefit, but not in this situation. Here profiling shows delays in the 
allocator itself, and since the allocator does not touch the data 
objects themselves, it probably says more about caching of the 
management structures' memory than about caching of the objects.


I have got one more crazy idea: the memory containing the zones may have 
some special hardware or configuration attributes, like "noncached" or 
something similar. That could explain the slowdown in accessing it. But 
as I can't prove it, it is just one more crazy theory. :(


--
Alexander Motin


Re: Memory allocation performance

2008-03-05 Thread Alexander Motin

Bruce Evans wrote:

Try profiling it one another type of CPU, to get different performance
counters but hopefully not very different stalls.  If the other CPU doesn't
stall at all, put another black mark against P4 and delete your copies of
it :-).


I have tried to profile the same system with the same load on different 
hardware:

 - before: a Pentium 4 2.8 on an ASUS MB based on the i875G chipset,
 - now: a Pentium D 3.0 on a Supermicro PDSMi board based on the E7230 chipset.

The results are completely different. The problem is gone:
0.03 0.04  538550/2154375 ip_forward  [11]
0.03 0.04  538562/2154375 em_get_buf [32]
0.07 0.08 1077100/2154375 ng_package_data [26]
[15]     1.8    0.14    0.15  2154375     uma_zalloc_arg [15]
0.06 0.00 1077151/3232111 generic_bzero [22]
0.03 0.00  538555/538555  mb_ctor_mbuf [60]
0.03 0.00 2154375/4421407 critical_exit  [63]

0.02 0.01  538554/2154376 m_freem [42]
0.02 0.01  538563/2154376 mb_free_ext [54]
0.04 0.03 1077100/2154376 ng_free_item [48]
[30]     0.9    0.08    0.06  2154376     uma_zfree_arg [30]
0.03 0.00 2154376/4421407 critical_exit  [63]
0.00 0.01  538563/538563  mb_dtor_pack [82]
0.01 0.00 2154376/4421971 critical_enter [69]

So it probably was some hardware-related problem. The first MB has video 
integrated into the chipset without any dedicated memory; possibly that 
affected memory performance in some way. On the first system there were 
messages like these on boot:

Mar  3 23:01:20 swamp kernel: acpi0: reservation of 0, a (3) failed
Mar  3 23:01:20 swamp kernel: acpi0: reservation of 10, 3fdf (3) 
failed
Mar  3 23:01:20 swamp kernel: agp0: controller> on vgapci0

Mar  3 23:01:20 swamp kernel: agp0: detected 892k stolen memory
Mar  3 23:01:20 swamp kernel: agp0: aperture size is 128M
Can they be related?

--
Alexander Motin


soclose() & so->so_upcall() = race?

2008-03-06 Thread Alexander Motin

Hi.

As far as I can see, the so_upcall() callback is called with SOCKBUF_MTX 
unlocked. That means the SB_UPCALL flag can be removed during the call 
and the socket can be closed and deallocated with soclose() while the 
callback is running. Am I right, or have I missed something? In that 
situation, how is the socket pointer protected from being used after 
free?


--
Alexander Motin


Multiple netgraph threads

2008-03-30 Thread Alexander Motin

Hi.

I have implemented a patch (for HEAD) making netgraph use several 
threads of its own for event queue processing instead of using the 
single swinet thread. It should significantly improve netgraph SMP 
scalability on complicated workloads that require queueing due to the 
implementation (PPTP/L2TP) or stack size limitations. It works perfectly 
on my UP system, showing results close to the original or even a bit 
better. I have no real SMP test server to measure real scalability, but 
a test on an HTT CPU works fine, utilizing both virtual cores at the 
predicted level.


Reviews and feedback are welcome.

URL: http://people.freebsd.org/~mav/netgraph.threads.patch

--
Alexander Motin


Re: Multiple netgraph threads

2008-03-30 Thread Alexander Motin

Hans Petter Selasky wrote:
Have you thought about nodes that lock the same mutex must be run on the same 
thread else for example one thread will run while another will just waits for 
a mutex ?


Usually different nodes do not share data, keeping it inside node/hook 
private data, so I don't see a problem here. It is possible that two 
messages will be queued to the same node at the same time, but to fulfil 
the serialization requirements each node's queue is processed by only 
one thread at a time, so there is no room for contention.


You can achieve this by grouping nodes into a tree, and the node at the top of 
the tree decides on which thread the nodes in the tree should be run.


Netgraph graphs are usually not linear, and it is hard to impossible to 
predict which way execution will go and which locks may be contended. 
Given the usually large number of nodes and the usually short lock hold 
times, the contention probability IMHO looks insignificant compared to 
the overhead of additional logic. If some node/hook needs to hold a lock 
for a long time, it may instead be declared FORCE_WRITER to enqueue 
contended messages instead of blocking on them (that is how all existing 
ppp compression/encryption nodes are now implemented); see the example 
below.
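For example, a node can request that with the standard macro from
netgraph(4), typically in its constructor (the node type here is
hypothetical):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <netgraph/ng_message.h>
#include <netgraph/netgraph.h>

static int
ng_xxx_constructor(node_p node)
{
	/*
	 * Serialize all traffic through this node, so contended
	 * messages get queued instead of blocking the caller.
	 */
	NG_NODE_FORCE_WRITER(node);
	return (0);
}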


--
Alexander Motin


Re: Multiple netgraph threads

2008-03-30 Thread Alexander Motin

Robert Watson wrote:
> FYI, you might be interested in some similar work I've been doing
> in the rwatson_netisr branch in Perforce, which:
> Adds per-CPU netisr threads

Thanks. Netgraph from the beginning uses the concept of direct function 
calls, where the level of parallelism is limited by the data source. At 
that point multiple netisr threads will give benefits.


My initial leaning would be that we would like to avoid adding too many 
more threads that will do per-packet work, as that leads to excessive 
context switching.


Netgraph uses queueing only as a last resort, when a direct call is not 
possible due to locking or stack limitations. For example, while working 
with kernel socket (*upcall)()s I have had many issues which make it 
impossible to freely use the received data without queueing, as the 
upcall() caller holds some locks, leading to unexpected LORs in the 
socket/TCP/UDP code.
In the case of such forced queueing, the node becomes an independent 
data source which can be pinned to and processed by whatever specialized 
thread or netisr is able to do it more effectively.


--
Alexander Motin


Re: Spin down HDD after disk sync or before power off

2009-03-05 Thread Alexander Motin
Daniel Thiele wrote:
> Oliver Fromme wrote:
> | Octavian Covalschi wrote:
> |  > I'm looking a way to spin down HDD just right before power off. Why?
> |  >
> |  > Because currently when I call "shutdown -p now", HDD is powered off
> at it's
> |  > full speed (7200.4) and as a result
> |  > I hear a noise of stopping/spinning down of HDD, and _this_
> concerns me as
> |  > I'm afraid it can damage HDD.

I am not sure that there is any problem. For the last 10 years drives
have been using electromagnetic head positioning, which mechanically
parks the heads on power down.

> |  [...]
> |  You can't do anything from
> | userland at this point.  If you want to insert a spin-down
> | for your disks, you will have to modify the kernel.
> 
> That is what I did and am still doing successfully since 2006.
> See
> http://lists.freebsd.org/pipermail/freebsd-acpi/2006-January/002375.html
> for my initial problem description and
> http://lists.freebsd.org/pipermail/freebsd-acpi/2006-February/002566.html
> for the "solution". Note that back then David Tolpin
> (d...@davidashen.net) suggested to use
> " ... & (ATA_SUPPORT_APM|ATA_SUPPORT_STANDBY)"
> instead.
> 
> I don't know if that is the way it should be done, but for me it worked
> across 3 hard disks and two notebooks so far. I am aware that 3 disks
> and 2 notebooks provide very limited test results, but maybe this work
> around solves your problem, too.
> 
> It would still be great, though, if a proper solution for this could be
> permanently implemented into FreeBSD. That is, if the current behaviour
> really is not that healthy to hard drives, as Joerg suggested.

I have thought about doing that on device detach, to prepare the drive
for mechanical shocks in case the drive is physically removed. But to
work properly it requires some changes in the ATA core to be made first,
to protect against submitting commands to an already physically removed
drive.

I can agree with doing that on suspend if ACPI does not do it
automatically.

But on a system shutdown that really means a reboot, I think commanding
the drive IDLE will just lead to additional mechanical and power stress
for the drive and PSU when the drives spin up again just a few seconds
after spinning down.

-- 
Alexander Motin


Re: hot-attach SATA drive

2009-03-30 Thread Alexander Motin
Andriy Gapon wrote:
> Recently I tried to hot-attach a SATA drive to a running system.
> Controller is ICH9 in AHCI mode. Physically/electronically everything went
> smoothly, the drive spun-up. Then I tried to detach and re-attach all channels
> with no devices on them using atacontrol. I did it 3 times to be sure, but no 
> new
> disk showed up. Then I finally rebooted, the disk showed up OK.
> 
> Question: was hot-attach expected to work? Is there a limitation in hardware 
> or in
> our driver?
> 
> Note: I attached the drive to a regular SATA port, not eSATA.

Which system version do you use? With recent CURRENT I have successfully
tested inserting/removing SATA drives on ICH8, ICH8M and JMB363 AHCI
controllers using channel attach/detach. Theoretically it is possible to
insert/remove SATA drives even without channel attach/detach. Removal
works fine, but such truly hot insertion functionality is not implemented
properly yet and so is blocked.

-- 
Alexander Motin


Re: hot-attach SATA drive

2009-03-30 Thread Alexander Motin
Andriy Gapon wrote:
> on 30/03/2009 14:14 Alexander Motin said the following:
>> Andriy Gapon wrote:
>>> Recently I tried to hot-attach a SATA drive to a running system.
>>> Controller is ICH9 in AHCI mode. Physically/electronically everything went
>>> smoothly, the drive spun-up. Then I tried to detach and re-attach all 
>>> channels
>>> with no devices on them using atacontrol. I did it 3 times to be sure, but 
>>> no new
>>> disk showed up. Then I finally rebooted, the disk showed up OK.
>>>
>>> Question: was hot-attach expected to work? Is there a limitation in 
>>> hardware or in
>>> our driver?
>>>
>>> Note: I attached the drive to a regular SATA port, not eSATA.
>> Which system version do you use? With recent CURRENT I have successfully
>> tested insert/remove SATA drives with ICH8, ICH8M and JMB363 AHCI
>> controllers channel attach/detach. Theoretically it is possible to
>> insert/remove SATA drives even without channel attach/detach. Remove
>> works fine, but such really hot insertion functionality is not
>> implemented properly now and so blocked.
> 
> It was stable/7, amd64.
> Maybe there is a small subset of the changes in current that I could try in 
> stable/7?

There is a significant difference in the sources due to the
modularization work done in CURRENT, so it is not easy to directly
compare sources or backport something. I haven't actually looked at or
tested 7-STABLE much.

-- 
Alexander Motin


Re: Video memory as swap under FreeBSD

2007-10-14 Thread Alexander Motin

Dag-Erling Smørgrav wrote:

Arne Schwabe <[EMAIL PROTECTED]> writes:

VIdeo RAM may also not be as stable as your main RAM. I mean nobody if a
bit flips in video ram.


That may have been true fifteen years ago, but not today.


Has anybody ever seen ECC video RAM? Video RAM usually works at higher 
frequencies than main RAM, and IMHO that must affect stability. For 
video RAM some percentage of errors really is less important; I have 
seen that myself with my previous video card until it died completely.


--
Alexander Motin


Re: Summary: Re: Spin down HDD after disk sync or before power off

2010-11-16 Thread Alexander Motin
Alexander Best wrote:
> On Tue Nov 16 10, Bruce Cran wrote:
>> On Fri, 22 Oct 2010 10:03:09 +
>> Alexander Best  wrote:
>>
>>> so how about olivers patch? it will only apply to ata devices so it's
>>> garanteed not to break any other CAM devices (i'm thinking about the
>>> aac controller issue). you could revert your previous shutdown work
>>> and plug olivers patch into CAM. you might want to replace the
>>> combination of flush/standby immediate with sleep.
>> One problem with the code that's been committed is that the shutdown
>> event handler doesn't get run during a suspend operation so an
>> emergency unload still gets done when running "acpiconf -s3".
> 
> unfortunately i don't think a can help you on that one. acpi never worked for
> me! even 'acpiconf -s1' will hopelessly crash my system. :(

It is not necessary to have fully working suspend to work on this.
Bounce mode should be enough. If bounce is also not working for you, that
definitely should be the first thing to fix.

On the other hand, at the moment I see no good approach to this, since
CAM devices are not present on NewBus to handle suspend events, unless
SIM drivers expose those events to CAM in some way.

-- 
Alexander Motin


Re: Summary: Re: Spin down HDD after disk sync or before power off

2010-11-16 Thread Alexander Motin
Alexander Best wrote:
> On Wed Nov 17 10, Alexander Motin wrote:
>> Alexander Best wrote:
>>> On Tue Nov 16 10, Bruce Cran wrote:
>>>> On Fri, 22 Oct 2010 10:03:09 +
>>>> Alexander Best  wrote:
>>>>
>>>>> so how about olivers patch? it will only apply to ata devices so it's
>>>>> garanteed not to break any other CAM devices (i'm thinking about the
>>>>> aac controller issue). you could revert your previous shutdown work
>>>>> and plug olivers patch into CAM. you might want to replace the
>>>>> combination of flush/standby immediate with sleep.
>>>> One problem with the code that's been committed is that the shutdown
>>>> event handler doesn't get run during a suspend operation so an
>>>> emergency unload still gets done when running "acpiconf -s3".
>>> unfortunately i don't think a can help you on that one. acpi never worked 
>>> for
>>> me! even 'acpiconf -s1' will hopelessly crash my system. :(
>> It is not necessary to have fully working suspend to work on this.
>> Bounce mode should be enough. If bounce is also not working for you - it
>> definitely should be the first thing to fix.
> 
> bounce mode? sorry i'm lost.

sysctl debug.acpi.suspend_bounce=1

It will make the system wake up again immediately after suspending all
devices, instead of transitioning to the requested S-state.

-- 
Alexander Motin


Re: Summary: Re: Spin down HDD after disk sync or before power off

2010-11-19 Thread Alexander Motin
Bruce Cran wrote:
> On Tue, 16 Nov 2010 20:40:00 +
> Bruce Cran  wrote:
> 
>> One problem with the code that's been committed is that the shutdown
>> event handler doesn't get run during a suspend operation so an
>> emergency unload still gets done when running "acpiconf -s3".
> 
> Something else I noticed today: I've just got a new disk that supports
> NCQ and found the kern.cam.ada.ada_send_ordered sysctl that appears to
> enable/disable its use (?).  

ada_send_ordered controls the periodic insertion of non-queued commands,
to avoid possible infinite command starvation and the resulting timeouts.
NCQ can't be disabled right now.

> But the shutdown handler that spins
> the disk down only gets initialized if ada_send_ordered is enabled. I
> was wondering what the reason for this is?

Interesting question. That code came as-is from the "da" driver and I
can't explain it. I have a feeling that it's wrong.

-- 
Alexander Motin


Re: statclock(n)

2010-11-19 Thread Alexander Motin
Hi.

Andriy Gapon wrote:
> I wonder if instead of calling statclock() multiple times (after an idle 
> period)
> we couldn't call it just with an appropriate N parameter.
> So some stats like e.g. cp_time[] could do +=N instead of ++.
> Other stats ru_ixrss need to be updated only once.
> Similarly, N could be passed further down to sched_clock() and handled there 
> too.

I think yes, it is reasonable. Initially hardclock() was also called in
a loop. It was just rewritten first because it is called more often,
goes to the hardware to sync time, and required changes anyway to work
properly.
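A minimal illustration of the idea (simplified; this is not the actual
statclock() code, and the real per-CPU accounting is omitted):

#include <sys/param.h>
#include <sys/resource.h>

extern long cp_time[CPUSTATES];		/* global CPU state tick counters */

/* Account 'cnt' statclock periods at once instead of looping 'cnt' times. */
static void
statclock_cnt_sketch(int cnt, int usermode)
{
	cp_time[usermode ? CP_USER : CP_SYS] += cnt;
	/* one-shot stats such as ru_ixrss would still be sampled only once */
}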

-- 
Alexander Motin


Re: Another small error. Re: Logical vs. bitwise AND in sbin/routed/parms.c

2010-11-22 Thread Alexander Motin

On 23.11.2010 00:07, Artem Belevich wrote:

While it's not directly related to hunting for '&'/'&&' typos, here's
another seemingly wrong place in the code:

--- a/sys/dev/ahci/ahci.c
+++ b/sys/dev/ahci/ahci.c
@@ -852,7 +852,7 @@ ahci_ch_attach(device_t dev)
 ch->caps = ctlr->caps;
 ch->caps2 = ctlr->caps2;
 ch->quirks = ctlr->quirks;
-   ch->numslots = ((ch->caps & AHCI_CAP_NCS) >> AHCI_CAP_NCS_SHIFT) + 1,
+   ch->numslots = ((ch->caps & AHCI_CAP_NCS) >> AHCI_CAP_NCS_SHIFT) + 1;
 mtx_init(&ch->mtx, "AHCI channel lock", NULL, MTX_DEF);
 resource_int_value(device_get_name(dev),
 device_get_unit(dev), "pm_level", &ch->pm_level);

I did mention it on freebsd-current@ some time back:
http://lists.freebsd.org/pipermail/freebsd-current/2009-November/013645.html


Fixed at r215725. Thanks.

--
Alexander Motin


Re: Where userland read/write requests, which are larger than MAXPHYS, are split?

2010-12-10 Thread Alexander Motin
Lev Serebryakov wrote:
>   I'm digging through the GEOM/IO code and can not find the place where
>   requests from userland to read more than MAXPHYS bytes are split into
>   several "struct bio"s.
> 
>   It seems that these child requests are issued one-by-one, not in
>   parallel, am I right? Why? It breaks down parallelism when the
>   underlying GEOM can process several requests simultaneously.

AFAIK requests from userland are first broken into MAXPHYS-sized pieces
by physio() before entering GEOM. The requests are indeed serialized
there, I suppose to limit the KVA a thread can harvest, but IMHO it
could be reconsidered.

One more split happens (when needed) in the geom_disk module, to honor
the disk driver's maximal I/O size. There is no serialization there.
Most ATA/SATA drivers in 8-STABLE support I/O of at least
min(512K, MAXPHYS), 128K by default. Many SCSI drivers are still limited
to DFLTPHYS (64K).
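For example, a disk driver reports that limit in its struct disk
(illustrative fragment; the value is just an example, real drivers derive
it from the hardware capabilities):

#include <sys/param.h>
#include <geom/geom_disk.h>

/* geom_disk will split any BIO larger than d_maxsize. */
static void
set_maxio_sketch(struct disk *dp)
{
	dp->d_maxsize = 512 * 1024;	/* e.g. 512K */
}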

-- 
Alexander Motin


Re: Where userland read/write requests, which are larger than MAXPHYS, are split?

2010-12-10 Thread Alexander Motin
Andriy Gapon wrote:
> on 10/12/2010 16:45 Alexander Motin said the following:
>> by default. Many SCSI drivers still limited by DFLTPHYS - 64K.
> 
> Including the cases where MAXBSIZE is abused because it historically has the 
> same
> value.

DFLTPHYS is automatically assumed by CAM for all SIMs not reporting
their maximal I/O size. All drivers using MAXBSIZE will most likely fall
into this category, because this reporting functionality was added only
in 8.0.
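For example, a SIM reports its limit in its XPT_PATH_INQ handler roughly
like this (fragment for illustration; maxio is the field added in 8.0):

#include <sys/param.h>
#include <cam/cam.h>
#include <cam/cam_ccb.h>

/* Called from the SIM action routine for the XPT_PATH_INQ case. */
static void
report_maxio_sketch(union ccb *ccb)
{
	struct ccb_pathinq *cpi = &ccb->cpi;

	cpi->maxio = MAXPHYS;	/* without this CAM assumes DFLTPHYS */
}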

-- 
Alexander Motin


Re: PCI IDE Controller Base Address Register setting

2011-01-03 Thread Alexander Motin

On Saturday, January 01, 2011 2:58:12 pm Darmawan Salihun wrote:

So, I found out that it seems the
allocation of I/O ports for the IDE controller is just fine.
However, the primary IDE channel is shared between
an IDE interface  and a CF card. Moreover, Linux detects
DMA bug, because all drives connected to the interface would be
in PIO mode :-/
If all drives on the primary channel are "forced" to PIO mode, then
shouldn't the "IDE PCI bus master register" (offset 20h per SFF-8038i)
along with the command register (offset 4h), are set to indicate the
controller doesn't support bus mastering?


I don't think the BIOS should change controller capabilities depending 
on the attached drives, except maybe to work around some known 
bugs/incompatibilities. Otherwise it will just make hot-plugging tricky 
and unpredictable.


>> Anyway, is it possible for devices on _the same_ channel to use
>> different setting in FreeBSD? For example, the primary slave
>> is using UDMA66 while the primary master is using PIO-4.
>> Or such configuration is considered invalid.

Yes, it is possible. If automatic negotiation doesn't succeed for some 
reason, you may limit the initial mode for each specific device using 
the hint.ata.X.devY.mode loader tunables, added not so long ago. After 
boot it can also be tuned per-device via the atacontrol or camcontrol 
tools, depending on the ATA stack used.
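For example (the channel/device numbers and the mode are placeholders),
in /boot/loader.conf:

hint.ata.0.dev0.mode="PIO4"

or after boot, with the ata(4) stack, roughly:

atacontrol mode ad0 PIO4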


>> The AMDLX800-CS5536 board I'm working with has different connectors
>> for the primary master and primary slave. Moreover, the chipset
>> supports different setting in primary master and primary slave.

There are a few other controllers not supporting such configurations, 
but that is handled by their specific drivers.


--
Alexander Motin


Re: PCI IDE Controller Base Address Register setting

2011-01-05 Thread Alexander Motin
Darmawan Salihun wrote:
> I get the following log message upon booting with "boot -Dv":
> ==
> ata0:  on atapci0
> ata0: reset tp1 mask=03 ostat0=50 ostat1=50
> ata0: stat0=0x80 err=0x00 lsb=0x00 msb=0x00
> ata0: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
> ata0: stat1=0x50 err=0x01 lsb=0x00 msb=0x00
> ata0: reset tp2 stat0=50 stat1=50 devices=0x3
> ...
> ata0: Identifying devices: 0003
> ata0: New devices: 0003
> ...
> ata0-slave: pio=PIO4 wdma=WDMA2 udma=UDMA100 cable=80 wire
> ata0-master: pio=PIO1 wdma=UNSUPPORTED udma=UNSUPPORTED cable=40 wire
> ...
> ad0: FAILURE setting PIO1 on CS5536 chip
> ad0: 488MB  at ata0-master BIOSPIO
> ...
> GEOM: newdisk ad0
> ad0: Adaptec check1 failed
> ad0: LSI(v3) check1 failed
> ad0: FAILURE - READ status=51 
> error=c4 LBA=0
> ...
> ad1: setting PIO4 on CS5536 chip
> ad1: setting UDMA100 on CS5536 chip
> ad1: 38150MB  at ata0-slave UDMA100
> ...
> GEOM: newdisk ad1
> ...
> ad1: FAILURE - READ_DMA status=51 error=84 
> LBA=78132575
> ad1: FAILURE - READ_DMA status=51 error=84 
> LBA=78132591
> ...
> ==
> I have several questions: 
> 1. How FreeBSD sets the PIO mode on the target IDE controller? 
> what could've caused it to fail like the message above?

Looking at your messages I would guess you are running something like
FreeBSD 8.0. At that time the controller-specific method first set the
mode on the device and then programmed the chip. Most likely this error
was returned by the device. Some very old devices not supporting more
than PIO3 may not support the mode setting command.

The mode setting code was significantly rewritten between 8.0 and 8.1. I
would recommend you take a newer version of FreeBSD for experiments.

> 2. It seems to me that setting the UDMA100 in the 
> AMD CS5536 IDE controller went just fine (in the log above). 
> But, FreeBSD fails when it tries to read something from the drive. 
> Does it mean the UDMA100 "mode" failed to be set correctly 
> in the IDE controller? 

It can be. For UDMA the transfer rate is driven by the transmitting side
(for reading, by the device), but there is always a chance of doing
something wrong. :) I don't have a CS5536 board, so I can't be
completely sure how correct the code is.

> 3. As I'm currently trying to fix the bug in the BIOS for the particular 
> board used to boot FreeBSD, what would you suggest to fix it? 

Try the latest FreeBSD -- 8.2 is now in the RC state.
Try disconnecting the devices one by one.
Try limiting the initial mode via loader tunables (note that some of them
were added not so long ago and may be missing on 8.0).

-- 
Alexander Motin


Re: PCI IDE Controller Base Address Register setting

2011-01-05 Thread Alexander Motin

On 05.01.2011 19:25, Darmawan Salihun wrote:

--- On Wed, 1/5/11, Alexander Motin  wrote:


From: Alexander Motin
Subject: Re: PCI IDE Controller Base Address Register setting
To: "Darmawan Salihun"
Cc: "John Baldwin", freebsd-hackers@freebsd.org
Date: Wednesday, January 5, 2011, 9:56 AM
Darmawan Salihun wrote:

I get the following log message upon booting with

"boot -Dv":

==
ata0:  on atapci0
ata0: reset tp1 mask=03 ostat0=50 ostat1=50
ata0: stat0=0x80 err=0x00 lsb=0x00 msb=0x00
ata0: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
ata0: stat1=0x50 err=0x01 lsb=0x00 msb=0x00
ata0: reset tp2 stat0=50 stat1=50 devices=0x3
...
ata0: Identifying devices: 0003
ata0: New devices: 0003
...
ata0-slave: pio=PIO4 wdma=WDMA2 udma=UDMA100 cable=80

wire

ata0-master: pio=PIO1 wdma=UNSUPPORTED

udma=UNSUPPORTED cable=40 wire

...
ad0: FAILURE setting PIO1 on CS5536 chip
ad0: 488MB  at

ata0-master BIOSPIO

...
GEOM: newdisk ad0
ad0: Adaptec check1 failed
ad0: LSI(v3) check1 failed
ad0: FAILURE - READ status=51

error=c4  LBA=0

...
ad1: setting PIO4 on CS5536 chip
ad1: setting UDMA100 on CS5536 chip
ad1: 38150MB  at

ata0-slave UDMA100

...
GEOM: newdisk ad1
...
ad1: FAILURE - READ_DMA

status=51
error=84  LBA=78132575

ad1: FAILURE - READ_DMA

status=51
error=84  LBA=78132591

...
==
I have several questions:
1. How FreeBSD sets the PIO mode on the target IDE

controller?

what could've caused it to fail like the message

above?

Looking to your messages I would suggest you are running
something like
FreeBSD 8.0. At that time controller-specific method first
set mode on
device and then programmed the chip. Most likely this error
returned by
device. Some very old devices not supporting more then PIO3
may not
support mode setting command.

Mode setting code was significantly rewritten between 8.0
and 8.1. I
would recommend you to take newer version of FreeBSD for
experiments.


The device is a CF-card. Do I need to add some sort of CFA-specific
initialization code to the BIOS?


Some CF devices AFAIR may want a power-up command before they are able 
to access the media, but I have never seen such ones; I suppose that was 
applicable only to some old microdrives. AFAIR in all other respects the 
CF specification only extends ATA without additional requirements.



I'm using FreeBSD 8.0 as the test bed for the log message above.
I have FreeBSD 8.1 DVD to do further tests. Will report later.


OK.


2. It seems to me that setting the UDMA100 in the
AMD CS5536 IDE controller went just fine (in the log

above).

But, FreeBSD fails when it tries to read something

from the drive.

Does it mean the UDMA100 "mode" failed to be set

correctly

in the IDE controller?


It can be. For UDMA transfer rate is driven by transmitting
side (for
reading - by device), but there is always a chance to do
something
wrong. :) I don't have CS5536 board, so can't be completely
sure how
correct is the code.


Does it require chipset-specific support code on the OS
(say a device driver) or setting via PCI Bus Master registers
is enough?


There is no standard for setting the I/O mode on ATA controllers. Most 
vendors have their own ways of setting it. Most controllers have some 
additional registers accessible via PCI configuration space, so for most 
controllers FreeBSD has specific sub-drivers inside ata(4). If no 
matching sub-driver is found, the controller is handled as "Generic" and 
mode setting is assumed to have been done by the BIOS, but that is a 
last resort.



3. As I'm currently trying to fix the bug in the BIOS

for the particular

board used to boot FreeBSD, what would you suggest to

fix it?

Try latest FreeBSD -- 8.2 is now in RC state.
Try to disconnect devices one by one.
Try to limit initial mode via loader tunables (note that
some of them
were added not so long ago and may be missing on 8.0).


A question about the loader tunable: is it enough to pass it through
the "boot" command, similar to the "-Dv" in "boot -Dv"?


You can use the `set ...` command at the same loader command line before 
typing `boot ...`. To make it permanent, you can add the wanted options 
to the /boot/loader.conf file.
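For example, at the loader prompt (using the tunable discussed earlier,
with placeholder numbers):

set hint.ata.0.dev0.mode="PIO4"
boot -Dv

or put the same hint.ata.0.dev0.mode="PIO4" line into /boot/loader.conf
to make it permanent.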


--
Alexander Motin


Re: Need an alternative to DELAY()

2011-04-12 Thread Alexander Motin
Alexander Motin wrote:
> Warner Losh wrote:
>> I don't suppose that your driver could cause the hardware to interrupt after 
>> a little time?  That would be more resource friendly...  Otherwise, 1ms is 
>> long enough that a msleep or tsleep would likely work quite nicely.
> 
> It's not his driver, it's mine. Actually, unlike AHCI, this hardware
> even has interrupt for ready transition (second, biggest of sleeps). But
> it is not used in present situation.
> 
>> On Apr 11, 2011, at 1:43 PM, dieter...@engineer.com wrote:
>>>>> FreeBSD 8.2  amd64  uniprocessor
>>>>>
>>>>> kernel: siisch1: DISCONNECT requested
>>>>> kernel: siisch1: SIIS reset...
>>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>>> last message repeated 59 times
>>>>> kernel: siisch1: SATA connect time=60ms status=0123
>>>>> kernel: siisch1: SIIS reset done: devices=0001
>>>>> kernel: siisch1: DISCONNECT requested
>>>>> kernel: siisch1: SIIS reset...
>>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>>> last message repeated 58 times
>>>>> kernel: siisch1: SATA connect time=59ms status=0123
>>>>> ...
>>>>> kernel: siisch0: siis_wait_ready() calling DELAY(1000)
>>>>> last message repeated 1300 times
>>>>> kernel: siisch0: port is not ready (timeout 1ms) status = 
>>> 001f2000
>>>>> Meanwhile, *everything* comes to a screeching halt.  Device
>>>>> drivers are locked out, and thus incoming data is lost.
>>>>> Losing incoming data is unacceptable.
>>>>>
>>>>> Need an alternative to DELAY() that does not lock out
>>>>> other device drivers.  There must be a way to reset one
>>>>> bit of hardware without locking down the entire machine.
>>> Hans Petter Selasky writes:
>>>> An alternative to DELAY() is the simplest solution. You probably need
>>>> to do some redesign in the SCSI layer to find a better solution.
>>> I keep coming back to the idea that a device driver for one
>>> controller should not have to lock out *all* the hardware.
>>> RS-232 locks out Ethernet.  Disk drivers lock out Ethernet.
>>> And so on.  Why?  Is there some fundamental reason that this
>>> *has* to be?  I thought the conversion from spl() to mutex()
>>> was supposed to fix this?
>>>
>>> I'm making progress on my project converting printf(9) calls
>>> to log(9), and fixing some bugs along the way.  Eventually I'll
>>> have patches to submit.  But this is really a workaround, not
>>> a fix to the underlying problem.
>>>
>>> Redesigning the SCSI layer sounds like a job for someone who took
>>> a lot more CS classes than I did.  /dev/brain returns ENOCLUE.  :-(
> 
> CAM is not completely innocent in this situation indeed. CAM defines
> XPT_RESET_BUS request as synchronous. It is not queued, and called under
> the SIM mutex lock. I don't think lock can be safely dropped in the
> middle there.

Thinking again, I was unfair to CAM: SCSI (SPI) just doesn't have this
ready status to wait for, so the waiting was always done asynchronously
there. I'll try to emulate that.

> Now I think that I could try to move readiness waiting out of the
> siis_reset() to do it asynchronously. I'll think about it.

-- 
Alexander Motin


Re: Need an alternative to DELAY()

2011-04-12 Thread Alexander Motin
Warner Losh wrote:
> I don't suppose that your driver could cause the hardware to interrupt after 
> a little time?  That would be more resource friendly...  Otherwise, 1ms is 
> long enough that a msleep or tsleep would likely work quite nicely.

It's not his driver, it's mine. Actually, unlike AHCI, this hardware
even has an interrupt for the ready transition (the second, biggest of
the sleeps). But it is not used in the present situation.

> On Apr 11, 2011, at 1:43 PM, dieter...@engineer.com wrote:
>>>> FreeBSD 8.2  amd64  uniprocessor
>>>>
>>>> kernel: siisch1: DISCONNECT requested
>>>> kernel: siisch1: SIIS reset...
>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>> last message repeated 59 times
>>>> kernel: siisch1: SATA connect time=60ms status=0123
>>>> kernel: siisch1: SIIS reset done: devices=0001
>>>> kernel: siisch1: DISCONNECT requested
>>>> kernel: siisch1: SIIS reset...
>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>> last message repeated 58 times
>>>> kernel: siisch1: SATA connect time=59ms status=0123
>>>> ...
>>>> kernel: siisch0: siis_wait_ready() calling DELAY(1000)
>>>> last message repeated 1300 times
>>>> kernel: siisch0: port is not ready (timeout 1ms) status = 
>> 001f2000
>>>> Meanwhile, *everything* comes to a screeching halt.  Device
>>>> drivers are locked out, and thus incoming data is lost.
>>>> Losing incoming data is unacceptable.
>>>>
>>>> Need an alternative to DELAY() that does not lock out
>>>> other device drivers.  There must be a way to reset one
>>>> bit of hardware without locking down the entire machine.
>> Hans Petter Selasky writes:
>>> An alternative to DELAY() is the simplest solution. You probably need
>>> to do some redesign in the SCSI layer to find a better solution.
>> I keep coming back to the idea that a device driver for one
>> controller should not have to lock out *all* the hardware.
>> RS-232 locks out Ethernet.  Disk drivers lock out Ethernet.
>> And so on.  Why?  Is there some fundamental reason that this
>> *has* to be?  I thought the conversion from spl() to mutex()
>> was supposed to fix this?
>>
>> I'm making progress on my project converting printf(9) calls
>> to log(9), and fixing some bugs along the way.  Eventually I'll
>> have patches to submit.  But this is really a workaround, not
>> a fix to the underlying problem.
>>
>> Redesigning the SCSI layer sounds like a job for someone who took
>> a lot more CS classes than I did.  /dev/brain returns ENOCLUE.  :-(

CAM is indeed not completely innocent in this situation. CAM defines the
XPT_RESET_BUS request as synchronous: it is not queued, and it is called
under the SIM mutex lock. I don't think the lock can be safely dropped
in the middle there.

Now I think that I could try to move the readiness waiting out of
siis_reset() to do it asynchronously. I'll think about it.

-- 
Alexander Motin


Re: Need an alternative to DELAY()

2011-04-12 Thread Alexander Motin
Alexander Motin wrote:
> Warner Losh wrote:
>> I don't suppose that your driver could cause the hardware to interrupt after 
>> a little time?  That would be more resource friendly...  Otherwise, 1ms is 
>> long enough that a msleep or tsleep would likely work quite nicely.
> 
> It's not his driver, it's mine. Actually, unlike AHCI, this hardware
> even has interrupt for ready transition (second, biggest of sleeps). But
> it is not used in present situation.
> 
>> On Apr 11, 2011, at 1:43 PM, dieter...@engineer.com wrote:
>>>>> FreeBSD 8.2  amd64  uniprocessor
>>>>>
>>>>> kernel: siisch1: DISCONNECT requested
>>>>> kernel: siisch1: SIIS reset...
>>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>>> last message repeated 59 times
>>>>> kernel: siisch1: SATA connect time=60ms status=0123
>>>>> kernel: siisch1: SIIS reset done: devices=0001
>>>>> kernel: siisch1: DISCONNECT requested
>>>>> kernel: siisch1: SIIS reset...
>>>>> kernel: siisch1: siis_sata_connect() calling DELAY(1000)
>>>>> last message repeated 58 times
>>>>> kernel: siisch1: SATA connect time=59ms status=0123
>>>>> ...
>>>>> kernel: siisch0: siis_wait_ready() calling DELAY(1000)
>>>>> last message repeated 1300 times
>>>>> kernel: siisch0: port is not ready (timeout 1ms) status = 
>>> 001f2000
>>>>> Meanwhile, *everything* comes to a screeching halt.  Device
>>>>> drivers are locked out, and thus incoming data is lost.
>>>>> Losing incoming data is unacceptable.
>>>>>
>>>>> Need an alternative to DELAY() that does not lock out
>>>>> other device drivers.  There must be a way to reset one
>>>>> bit of hardware without locking down the entire machine.
>>> Hans Petter Selasky writes:
>>>> An alternative to DELAY() is the simplest solution. You probably need
>>>> to do some redesign in the SCSI layer to find a better solution.
>>> I keep coming back to the idea that a device driver for one
>>> controller should not have to lock out *all* the hardware.
>>> RS-232 locks out Ethernet.  Disk drivers lock out Ethernet.
>>> And so on.  Why?  Is there some fundamental reason that this
>>> *has* to be?  I thought the conversion from spl() to mutex()
>>> was supposed to fix this?
>>>
>>> I'm making progress on my project converting printf(9) calls
>>> to log(9), and fixing some bugs along the way.  Eventually I'll
>>> have patches to submit.  But this is really a workaround, not
>>> a fix to the underlying problem.
>>>
>>> Redesigning the SCSI layer sounds like a job for someone who took
>>> a lot more CS classes than I did.  /dev/brain returns ENOCLUE.  :-(
> 
> CAM is indeed not completely innocent in this situation. CAM defines the
> XPT_RESET_BUS request as synchronous. It is not queued, and it is called
> under the SIM mutex lock. I don't think the lock can be safely dropped in
> the middle there.
> 
> Now I think that I could try to move the readiness waiting out of
> siis_reset() and do it asynchronously. I'll think about it.

I've fixed this problem for ahci(4) in HEAD; there should be no sleeps
longer than 100ms now (typically 1-2ms).

With siis(4) the situation is different. There, by default, there should be
no sleeps longer than 100ms (typically 1-2ms). A longer sleep means that
either the controller is not responding, or it can't establish a link to a
device it sees. I've reduced the waiting timeout from 10s to 1s. That should
improve the situation a bit, but I would look for the original cause of the
problem. Have you done anything specific to trigger it? Are your
drive/cables OK?

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Look of boot2, on HDD

2011-04-30 Thread Alexander Motin
Garrett Cooper wrote:
> 2011/4/29  :
>> /boot/boot2STAGE 2 bootstrap file
>> Understands the FreeBSD file system enough, to find files on it, and can 
>> provide a simple interface to choose the kernel or loader to run.
>>
>> Once sys is fully booted, HDD is 'ada0'.
>> However, STAGE 2, sees it, as a 'ad4', at boot process, which is same seen, 
>> by booted sys, when I turn off AHCI.
>>
>> So, here is the riddle ...
>> On fully booted sys, how do I query STAGE 2, to tell me, how it'll see, my 
>> 'ada0' HDD?
> 
> This is a very interesting catch:
> 
> /usr/src/sys/boot/pc98/boot2/boot2.c:static const char *const
> dev_nm[NDEV] = {"ad", "da", "fd"};
> /usr/src/sys/boot/i386/boot2/boot2.c:static const char *const
> dev_nm[NDEV] = {"ad", "da", "fd"};
> 
> It probably will be a no-op soon because of some of the
> compatibility changes Alex made, but still a potential point of
> confusion nonetheless.

Pardon my ignorance, but could somebody shed some light for me on this
list of names? Why does the much more sophisticated loader(8) operate disks
as disk0/1/..., while boot2 tries to mimic something it has no idea about,
using very limited information from random sources? Are these names
important for anything?

Even with the old ATA driver the names didn't match on my laptop: boot2
reports ad0, while the system reports ad4. Also we have a lot of drivers
whose disk names don't fit into this set of ad, da and fd.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Look of boot2, on HDD

2011-04-30 Thread Alexander Motin

On 30.04.2011 20:39, rank1see...@gmail.com wrote:

Garrett Cooper wrote:

2011/4/29:

/boot/boot2STAGE 2 bootstrap file
Understands the FreeBSD file system enough, to find files on it, and can 
provide a simple interface to choose the kernel or loader to run.

Once sys is fully booted, HDD is 'ada0'.
However, STAGE 2, sees it, as a 'ad4', at boot process, which is same seen, by 
booted sys, when I turn off AHCI.

So, here is the riddle ...
On fully booted sys, how do I query STAGE 2, to tell me, how it'll see, my 
'ada0' HDD?


 This is a very interesting catch:

/usr/src/sys/boot/pc98/boot2/boot2.c:static const char *const
dev_nm[NDEV] = {"ad", "da", "fd"};
/usr/src/sys/boot/i386/boot2/boot2.c:static const char *const
dev_nm[NDEV] = {"ad", "da", "fd"};

 It probably will be a no-op soon because of some of the
compatibility changes Alex made, but still a potential point of
confusion nonetheless.


Pardon my ignorance, but could somebody shed some light for me on this
list of names? Why does the much more sophisticated loader(8) operate disks
as disk0/1/..., while boot2 tries to mimic something it has no idea about,
using very limited information from random sources? Are these names
important for anything?

Even with the old ATA driver the names didn't match on my laptop: boot2
reports ad0, while the system reports ad4. Also we have a lot of drivers
whose disk names don't fit into this set of ad, da and fd.


Well ..., ATM, I say lets NOT touch/edit boot2 nor loader.
Let them continue to see devices, the way they "like" ...


League for the robots rights? :)


NOW, all I would like is to find a way of ASKING them how they will see the 
"$target" device at their boot step/time.
"Asking" is done on a fully booted sys, and I am interested in asking STAGE 2 
(boot2).


I think it may be impossible. It is up to each controller's BIOS to 
report the device or not. And some controllers may just have no BIOS, or a 
disabled one, to report anything. The artificial separation between ad and 
da in boot2 also doesn't make things easier.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Kernel timers infrastructure

2011-07-25 Thread Alexander Motin

Hi.

On 25.07.2011 17:13, Filippo Sironi wrote:

I'm working on a university project that's based on FreeBSD and I'm currently 
hacking the kernel... but I'm a complete newbie.
My question is: what if I have to call a certain function 10 times per second?
I've seen a bit of code regarding callout_* functions but I can't get through 
them. Is there anyone who can help me?


Have you read the callout(9) manual page? That API is the right one if you 
need to call some function periodically. Also, in some cases (if you need 
to make your kernel thread wait for something) you may use the sleep(9) API.
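
For example, a minimal sketch (all names here are hypothetical) of a
self-rearming callout that fires roughly 10 times per second:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/callout.h>

static struct callout my_callout;

/* Called roughly every hz / 10 ticks (~100 ms) once armed. */
static void
my_tick(void *arg)
{
        /* ... do the periodic work here ... */

        /* Re-arm for the next shot. */
        callout_reset(&my_callout, hz / 10, my_tick, arg);
}

static void
my_start(void *arg)
{
        callout_init(&my_callout, CALLOUT_MPSAFE);
        callout_reset(&my_callout, hz / 10, my_tick, arg);
}

static void
my_stop(void)
{
        /* Disarm and wait for a possibly running handler to finish. */
        callout_drain(&my_callout);
}

callout_stop() can be used instead of callout_drain() if you don't need to
wait for a handler that is already running.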


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-08 Thread Alexander Motin

On 05.08.2011 11:11, Eygene Ryabinkin wrote:

What I don't understand is why the /2000


It gives (timeout_in_ticks)/2.  The code in ahci_timeout does the following:
{{{
    /* Check if slot was not being executed last time we checked. */
    if (slot->state < AHCI_SLOT_EXECUTING) {
        /* Check if slot started executing. */
        sstatus = ATA_INL(ch->r_mem, AHCI_P_SACT);
        ccs = (ATA_INL(ch->r_mem, AHCI_P_CMD) & AHCI_P_CMD_CCS_MASK)
            >> AHCI_P_CMD_CCS_SHIFT;
        if ((sstatus & (1 << slot->slot)) != 0 || ccs == slot->slot ||
            ch->fbs_enabled)
            slot->state = AHCI_SLOT_EXECUTING;

        callout_reset(&slot->timeout,
            (int)slot->ccb->ccb_h.timeout * hz / 2000,
            (timeout_t*)ahci_timeout, slot);
        return;
    }
}}}

So, my theory is that the first half of the timeout time is devoted
to the transition from AHCI_SLOT_RUNNING -> AHCI_SLOT_EXECUTING and
the second one to the transition from AHCI_SLOT_RUNNING -> TIMEOUT,
to give the whole process the duration of a full timeout.  However,
judging by the code, if the slot doesn't start executing at the first
invocation of ahci_timeout that was spawned by the callout armed in
ahci_execute_transaction, we can have timeouts longer than the
specified amount of time.  And if the slot never starts executing, the
callout will spin forever, unless I am missing something important here.

May be Alexander can shed some light into this?


Your understanding is right. Some command may never trigger a timeout if 
some other command executes indefinitely. My goal was to find the commands 
that are really executing and may really cause delays. It would not be 
fair if commands depend on each other and a short command's timeout reset 
the device while a long command tries to do something big. The code 
implemented in ahci(4) is supposed to avoid such false timeouts. Unluckily, 
I've found a case where that algorithm may indeed fail. A patch fixing that 
was committed and merged down recently.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: cam / ata timeout limited to 2147 due to overflow bug?

2011-08-12 Thread Alexander Motin
Eygene Ryabinkin wrote:
> Fri, Aug 05, 2011 at 10:59:43AM +0100, Steven Hartland wrote:
>> I've tried the patch and it had a few cut and paste errors, which I've fixed,
> 
> Thanks for spotting that!
> 
>> and confirmed it works as expected, so thanks for that :)
>>
>> There's also a load more drivers with the same issue so I've gone through
>> and fixed all the occurances I can find. Here's the updated patch:-
>> http://blog.multiplay.co.uk/dropzone/freebsd/ccb_timeout.patch
> 
> I had found a couple of missed drivers, fixed overlong lines and fixed
> the missing 10 in the sys/dev/hptrr/hptrr_os_bsd.c.  Also changed ciss
> to have u_int32_t timeouts instead of int ones: this should not harm
> anything, because all passed timeouts are explicit numbers that are
> not larger than 10.  And I had also renamed
> CAM_HDR_TIMEOUT_TO_TICKS to the base CAM_TIMEOUT_TO_TICKS, because it
> seems that every CAM timeout is 32-bit long.  The new patch lives at
>   
> http://codelabs.ru/fbsd/patches/cam/CAM-properly-convert-timeout-to-ticks.diff
> 
> But there are some cases where the argument to the
> CAM_TIMEOUT_TO_TICKS is int and not u_int32_t.  It should be mostly
> harmless for now, since the values do not exceed 2^32, but my current
> feeling about timeouts that are counted in milliseconds that there
> should be an in-kernel type for this stuff.  Seems like 32-bit wide
> unsigned value is good for it: maximal value is around 46 days that
> should be fine for the millisecond-precision timeout.
> 
> Through my grep session for the kernel sources I had seen other
> (t * hz / 1000) constructs, so I feel that the fix should be
> extended to cover these cases as well.
> 
> I am interested in the other's opinions on this.

First of all, not so many people really need millisecond precision for
timeouts combined with a large timeout range. I would prefer to see
seconds in CAM. At the same time, a 64-bit mul/div pair for every call may
cost something, especially on low-end 32-bit archs. We can't change the
existing CAM API, but a global mechanical replace through the kernel is
not a good idea IMHO.

Personally, I would not touch the argument types in ciss -- it is almost a
null change, but it may break the applicability of some patches. While
using uint32_t in CAM structures could be a benefit from a compatibility
point of view, it is not that important for function arguments.
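
As a side illustration of the overflow this thread is about (a plain
userland program, not the CAM macro; the numbers are arbitrary): with
hz = 1000 the naive 32-bit ms * hz / 1000 conversion wraps, while a 64-bit
intermediate does not. The kernel expression does the multiplication in a
signed int, which is why the practical ceiling is about 2147 seconds.

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
        const uint32_t hz = 1000;               /* typical kernel tick rate */
        const uint32_t timeout_ms = 5000000;    /* 5000 s, beyond the limit */

        /* 32-bit intermediate: 5e9 wraps modulo 2^32 before the division. */
        uint32_t naive = timeout_ms * hz / 1000;
        /* 64-bit intermediate keeps the intended value. */
        uint32_t safe = (uint32_t)((uint64_t)timeout_ms * hz / 1000);

        printf("naive: %u ticks, 64-bit: %u ticks\n", naive, safe);
        return (0);
}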

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Clock stalls on Sabertooth 990FX

2011-08-15 Thread Alexander Motin

On 15.08.2011 23:57, Joe Schaefer wrote:

On Mon, Aug 15, 2011 at 4:35 PM, Alexander Motin  wrote:

On 15.08.2011 22:18, Joe Schaefer wrote:


On Mon, Aug 15, 2011 at 9:31 AM, Joe Schaeferwrote:


On Mon, Aug 15, 2011 at 8:32 AM, Andriy Gaponwrote:


on 13/08/2011 20:16 Joe Schaefer said the following:


Brand new machine with a Phenom II X6 1100T and under chronic load
the clock will stop running periodically until the machine eventually
completely
freezes.  Note: during these stalls the kernel is still running, the
machine is still
mostly responsive, it's just that the clock is frozen in time.

I've disabled Turbo mode in the bios and toyed with just about every
other setting but nothing seems to resolve this problem.  Based on the
behavior
of the machine (just making buildworld will eventually kill it, upping
the -j flag
just kills it faster), I'm guessing it has something to do with the
Digi+ VRM features
but again nothing I've tried modifying in the bios seems to help.

I've tried both 8.2-RELEASE and FreeBSD 9 (head).  Running head now
with
a dtrace enabled kernel.

Suggestions?


On head, start with checking what source is used for driving clocks:
sysctl kern.eventtimer


% sysctl kern.eventtimer  [master]
kern.eventtimer.choice: HPET(450) HPET1(450) HPET2(450) LAPIC(400)
i8254(100) RTC(0)
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.et.LAPIC.frequency: 0
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 450
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 450
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.periodic: 0
kern.eventtimer.timer: HPET



Changing this to "i8254" seems to have resolved the stalls.
I'm running buildworld -j12 without issue.  More than willing
to test out a patch or two against head if anyone's still
interested, otherwise I've thrown the change into loader.conf
and will move along quietly.


The 8.2-RELEASE you've mentioned doesn't have the event timers subsystem and
the HPET timer driver. That makes me think it is strange, at least. Can you
also try the LAPIC timer and do similar experiments with kern.timecounter?


My problems with 8.2-RELEASE may have been network based.  I don't recall
precisely if the clock was stalling there, my guess is no based on
what you wrote.

I'll test LAPIC next ... so far so good.  Just so I'm clear, you'd
like me to tweak
kern.timecounter.hardware as well?  (Currently it's HPET).


Yes. Instead. A ticking clock depends on both the timecounter and the eventtimer.


Also, please check whether the kern.timecounter.tc.X.counter value changes for
the selected timecounter and whether you are receiving timer interrupts in
`vmstat -i`.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Clock stalls on Sabertooth 990FX

2011-08-15 Thread Alexander Motin

On 15.08.2011 22:18, Joe Schaefer wrote:

On Mon, Aug 15, 2011 at 9:31 AM, Joe Schaefer  wrote:

On Mon, Aug 15, 2011 at 8:32 AM, Andriy Gapon  wrote:

on 13/08/2011 20:16 Joe Schaefer said the following:

Brand new machine with a Phenom II X6 1100T and under chronic load
the clock will stop running periodically until the machine eventually completely
freezes.  Note: during these stalls the kernel is still running, the
machine is still
mostly responsive, it's just that the clock is frozen in time.

I've disabled Turbo mode in the bios and toyed with just about every
other setting but nothing seems to resolve this problem.  Based on the behavior
of the machine (just making buildworld will eventually kill it, upping
the -j flag
just kills it faster), I'm guessing it has something to do with the
Digi+ VRM features
but again nothing I've tried modifying in the bios seems to help.

I've tried both 8.2-RELEASE and FreeBSD 9 (head).  Running head now with
a dtrace enabled kernel.

Suggestions?


On head, start with checking what source is used for driving clocks:
sysctl kern.eventtimer


% sysctl kern.eventtimer  [master]
kern.eventtimer.choice: HPET(450) HPET1(450) HPET2(450) LAPIC(400)
i8254(100) RTC(0)
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.et.LAPIC.frequency: 0
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 450
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 450
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.periodic: 0
kern.eventtimer.timer: HPET


Changing this to "i8254" seems to have resolved the stalls.
I'm running buildworld -j12 without issue.  More than willing
to test out a patch or two against head if anyone's still
interested, otherwise I've thrown the change into loader.conf
and will move along quietly.


The 8.2-RELEASE you've mentioned doesn't have the event timers subsystem and 
the HPET timer driver. That makes me think it is strange, at least. Can you 
also try the LAPIC timer and do similar experiments with kern.timecounter?


Also, please check whether the kern.timecounter.tc.X.counter value changes 
for the selected timecounter and whether you are receiving timer 
interrupts in `vmstat -i`.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Clock stalls on Sabertooth 990FX

2011-08-15 Thread Alexander Motin

On 16.08.2011 00:13, Joe Schaefer wrote:

On Mon, Aug 15, 2011 at 5:00 PM, Alexander Motin  wrote:

On 15.08.2011 23:57, Joe Schaefer wrote:


On Mon, Aug 15, 2011 at 4:35 PM, Alexander Motinwrote:


On 15.08.2011 22:18, Joe Schaefer wrote:


On Mon, Aug 15, 2011 at 9:31 AM, Joe Schaefer
  wrote:


On Mon, Aug 15, 2011 at 8:32 AM, Andriy Gapon
  wrote:


on 13/08/2011 20:16 Joe Schaefer said the following:


Brand new machine with a Phenom II X6 1100T and under chronic load
the clock will stop running periodically until the machine eventually
completely
freezes.  Note: during these stalls the kernel is still running, the
machine is still
mostly responsive, it's just that the clock is frozen in time.

I've disabled Turbo mode in the bios and toyed with just about every
other setting but nothing seems to resolve this problem.  Based on
the
behavior
of the machine (just making buildworld will eventually kill it,
upping
the -j flag
just kills it faster), I'm guessing it has something to do with the
Digi+ VRM features
but again nothing I've tried modifying in the bios seems to help.

I've tried both 8.2-RELEASE and FreeBSD 9 (head).  Running head now
with
a dtrace enabled kernel.

Suggestions?


On head, start with checking what source is used for driving clocks:
sysctl kern.eventtimer


% sysctl kern.eventtimer  [master]
kern.eventtimer.choice: HPET(450) HPET1(450) HPET2(450) LAPIC(400)
i8254(100) RTC(0)
kern.eventtimer.et.LAPIC.flags: 15
kern.eventtimer.et.LAPIC.frequency: 0
kern.eventtimer.et.LAPIC.quality: 400
kern.eventtimer.et.HPET.flags: 3
kern.eventtimer.et.HPET.frequency: 14318180
kern.eventtimer.et.HPET.quality: 450
kern.eventtimer.et.HPET1.flags: 3
kern.eventtimer.et.HPET1.frequency: 14318180
kern.eventtimer.et.HPET1.quality: 450
kern.eventtimer.et.HPET2.flags: 3
kern.eventtimer.et.HPET2.frequency: 14318180
kern.eventtimer.et.HPET2.quality: 450
kern.eventtimer.et.i8254.flags: 1
kern.eventtimer.et.i8254.frequency: 1193182
kern.eventtimer.et.i8254.quality: 100
kern.eventtimer.et.RTC.flags: 17
kern.eventtimer.et.RTC.frequency: 32768
kern.eventtimer.et.RTC.quality: 0
kern.eventtimer.periodic: 0
kern.eventtimer.timer: HPET



Changing this to "i8254" seems to have resolved the stalls.
I'm running buildworld -j12 without issue.  More than willing
to test out a patch or two against head if anyone's still
interested, otherwise I've thrown the change into loader.conf
and will move along quietly.


The 8.2-RELEASE you've mentioned doesn't have the event timers subsystem and
the HPET timer driver. That makes me think it is strange, at least. Can you
also try the LAPIC timer and do similar experiments with kern.timecounter?


My problems with 8.2-RELEASE may have been network based.  I don't recall
precisely if the clock was stalling there, my guess is no based on
what you wrote.

I'll test LAPIC next ... so far so good.  Just so I'm clear, you'd
like me to tweak
kern.timecounter.hardware as well?  (Currently it's HPET).


Yes. Instead. A ticking clock depends on both the timecounter and the eventtimer.


Haven't found a combination that hangs my machine other than with the
eventtimer at HPET.


I mean trying the HPET eventtimer with different timecounters.

If changing the timecounter doesn't help, please try this patch:

--- acpi_hpet.c.prev2010-12-25 11:28:45.0 +0200
+++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
@@ -190,7 +190,7 @@ restart:
bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num),
t->next);
}
-   if (fdiv < 5000) {
+   if (1 || fdiv < 5000) {
bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);

--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Clock stalls on Sabertooth 990FX

2011-08-16 Thread Alexander Motin
Joe Schaefer wrote:
>>>>>> If changing timecounter won't help, try please this patch:
>>>>>>
>>>>>> --- acpi_hpet.c.prev2010-12-25 11:28:45.0 +0200
>>>>>> +++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
>>>>>> @@ -190,7 +190,7 @@ restart:
>>>>>>bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num),
>>>>>>t->next);
>>>>>>}
>>>>>> -   if (fdiv < 5000) {
>>>>>> +   if (1 || fdiv < 5000) {
>>>>>>bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
>>>>>>now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);
>>>>>>
>>>>>> --
>>>>>> Alexander Motin
>>>>> Will do next.
>>>>>
>>>> Patch applied. Running with HPET eventtimer and no stalls during
>>>> make buildworld -j12.
>>>>
>> it may help; I once came across a bug on Linux with regard to HPET:
>> some northbridge chipsets (maybe AMD's, where the HPET sits) have a
>> problem that writes to the comparator regs will not take effect
>> immediately -- you need a read of the reg to flush it;
> 
> So far the patch performs flawlessly for me.  I'm tempted to reenable turbo
> mode just for kicks (someday, not today- delighted with the uptime!)

I am going to commit the following patch:
http://people.freebsd.org/~mav/hpet.paranoid.patch
It uses the same assumptions as Linux. Please try it.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: SW_WATCHDOG vs new eventtimer code

2011-09-20 Thread Alexander Motin
Hi.

On 20.09.2011 22:19, Andriy Gapon wrote:
> just want to check with you first if the following makes sense.
> I use SW_WATCHDOG on one of the test machines, which was recently updated to
> from stable/8 to head.  Now it seems to get seemingly random watchdog events.
> My theory is that this is because of the eventtimer logic.
> If during idle period we accumulate enough timer ticks and then run all those
> ticks very rapidly, then the SW_WATCHDOG code may get an impression that it 
> was
> not patted for many real ticks.
> Not sure what would be the best way to make SW_WATCHDOG happier/smarter.

The eventtimer code is now set to generate interrupts at least 4 times per
second for each CPU. Since SW_WATCHDOG only handles periods of more than
one second, I would say it should not be hurt. I would try to add some
debugging there to see what's going on (how big the tick bursts are).
I'll try to do it tomorrow.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: SW_WATCHDOG vs new eventtimer code

2011-09-21 Thread Alexander Motin
Andriy Gapon wrote:
> on 20/09/2011 23:04 Alexander Motin said the following:
>> On 20.09.2011 22:19, Andriy Gapon wrote:
>>> just want to check with you first if the following makes sense.
>>> I use SW_WATCHDOG on one of the test machines, which was recently updated to
>>> from stable/8 to head.  Now it seems to get seemingly random watchdog 
>>> events.
>>> My theory is that this is because of the eventtimer logic.
>>> If during idle period we accumulate enough timer ticks and then run all 
>>> those
>>> ticks very rapidly, then the SW_WATCHDOG code may get an impression that it 
>>> was
>>> not patted for many real ticks.
>>> Not sure what would be the best way to make SW_WATCHDOG happier/smarter.
>> Eventtimer code now set to generate interrupts at least 4 times per
>> second for each CPU. As soon as SW_WATCHDOG only handles periods more
>> then one second, I would say it should not be hurt. I would try to add
>> some debug there to see what's going on (how big the tick busts are).
>> I'll try it to do it tomorrow.

I've built a kernel with SW_WATCHDOG and run watchdogd with the tightest
parameters (-s 1 -t 2), but have observed no problems so far.

> Just in case, here is a debugging snippet from a panic that I've got:
> #14 0x80660ae5 in handleevents (now=0xff80e3e0b8b0, fake=0) at
> /usr/src/sys/kern/kern_clocksource.c:209
> 209 while (bintime_cmp(now, &state->nextstat, >=)) {
> (kgdb) list
> 204 }
> 205 if (runs && fake < 2) {
> 206 hardclock_anycpu(runs, usermode);
> 207 done = 1;
> 208 }
> 209 while (bintime_cmp(now, &state->nextstat, >=)) {
> 210 if (fake < 2)
> 211 statclock(usermode);
> 212 bintime_add(&state->nextstat, &statperiod);
> 213 done = 1;
> (kgdb) p state->nextstat
> $1 = {sec = 90, frac = 15986939599958264124}
> (kgdb) p *now
> $3 = {sec = 106, frac = 11494276814354478452}
> (kgdb) p statperiod
> $4 = {sec = 0, frac = 145249953336295682}
> 
> (kgdb) fr 13
> #13 0x8042603e in hardclock_anycpu (cnt=15761, usermode=Variable
> "usermode" is not available.
> ) at atomic.h:183
> 183 atomic.h: No such file or directory.
> in atomic.h
> (kgdb) p cnt
> $5 = 15761
> (kgdb) p newticks
> $6 = 15000
> (kgdb) p watchdog_ticks
> $7 = 16000
> 
> Watchdog timeout was set to ~16 seconds.

It looks like your system was out for about 15 seconds, or for some
reason the system uptime jumped 15 seconds forward. Had you done anything
special at that moment, or have you seen anything strange in the system
behavior? What timecounter are you using? I see you are using the HPET
eventtimer, but on what hardware (is it per-CPU or global)?

Building the kernel with KTR_SPARE2 ktrace enabled should help to collect
valuable info about timer behavior before the crash.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: sizeof(size_t) and other "semantic" types on 32 bit systems?

2011-09-30 Thread Alexander Motin
On 30.09.2011 23:30, Lev Serebryakov wrote:
>   I was surprised when I discovered that size_t is 32 bits wide on a
>  32-bit (i386) system. Which "semantic" type should I use, for
>  example, for storing a GEOM size in bytes in a system-independent way? I
>  could use "uint64_t", of course, but I don't like this solution, as
>  it is very low-level (OK, not as low-level as "unsigned long long", to
>  be honest).

GEOM uses 64bit off_t for media size and many other things.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: sizeof(size_t) and other "semantic" types on 32 bit systems?

2011-09-30 Thread Alexander Motin
On 30.09.2011 23:53, Lev Serebryakov wrote:
> Hello, Alexander.
> You wrote on 1 October 2011, 0:51:00:
> 
>> GEOM uses 64bit off_t for media size and many other things.
>   off_t is signed! It is ``not accurate enough'' (if you know this
> Russian joke) :)))

Let's say negative values are reserved for 2^63 error codes. :)

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: how are callouts handled in cpu_idle() ?

2011-10-01 Thread Alexander Motin
Adrian Chadd wrote:
> On 1 October 2011 17:25, Alexander Motin  wrote:
> 
>> Use of critical section in cpu_idle() from the beginning was based on
>> number of assumptions about filter interrupt handler's limitations.
> 
> [snip]
> 
>> So, if you really need to use callout() in interrupt filter, we could
>> disable interrupts before calling cpu_idleclock(), as you have told. But
>> that is only a partial solution and you should be ready for the second
>> half of the problems. Depending on your needs I am not sure it will
>> satisfy you.
> 
> I'm not using callouts from a swi in ath(4), at least not yet. I
> haven't yet gone over all the drivers in sys/dev/ to see if any of
> them are actually doing this.
> I was just making an observation.

All I've said above relates only to filter interrupt handlers. An swi has
no problems from this point of view. An swi is like a regular threaded
interrupt handler; it has no limitations of this kind. If an interrupt
fires after cpu_idleclock() and schedules an swi, sched_runnable() will
return non-zero and the sleep will be canceled. After that, when it is
finally called, the swi has updated time and is allowed to use callouts as
it wants to.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: how are callouts handled in cpu_idle() ?

2011-10-01 Thread Alexander Motin
Hi.

Adrian Chadd wrote:
> What happens if this occurs:
> 
> * cpu_idle() is entered
> * you enter critical_enter() and call cpu_idleclock()
> * an interrupt occurs, scheduling an immediate callout
> * .. but you've already set the clock register, so it won't be
> serviced until the wait returns.
> 
> Perhaps interrupts have to be disabled before critical_enter() and
> cpu_idletick() to ensure an interrupt-driven callout doesn't get
> delayed?

The use of a critical section in cpu_idle() was from the beginning based on
a number of assumptions about the limitations of filter interrupt handlers.
These handlers are not guaranteed to get updated system time/ticks and
they are discouraged from using callouts. If a callout is scheduled from
an interrupt filter during system wake up, the system has not updated its
ticks counter yet and you may get the callout scheduled into the past (it
will be run immediately) or at least much earlier (up to 250ms) than
requested. In your case the callout indeed may get delayed (up to the same
250ms). All that is not a problem for regular threaded interrupts --
interrupt thread execution will be delayed until all the state gets
updated.

On some platforms and CPUs it is possible to enter/exit sleep with
interrupts disabled. That would allow covering all kinds of problems,
but that support is very selective, so it is not used for this purpose
now. We may want to get cpu_idle() rewritten again at some point to
handle that and possibly make the code choosing the sleep state and
measuring the sleep time more MI, like Linux does it, but that is
another project.

So, if you really need to use a callout in an interrupt filter, we could
disable interrupts before calling cpu_idleclock(), as you have said. But
that is only a partial solution and you should be ready for the second
half of the problems. Depending on your needs I am not sure it will
satisfy you.
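
For clarity, here is a rough x86-flavored sketch of the ordering that
workaround implies (illustrative only, not committed code; the helper
functions are the existing kernel ones, the function itself is
hypothetical):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/sched.h>
#include <machine/cpufunc.h>

/*
 * Interrupts are disabled before cpu_idleclock(), so a filter handler
 * cannot arm a callout against stale per-CPU clock state; "sti; hlt"
 * then re-enables interrupts atomically with entering the halt, so a
 * wakeup raised in between is not lost.
 */
static void
idle_sketch(void)
{
        critical_enter();
        disable_intr();
        if (sched_runnable()) {
                enable_intr();
                critical_exit();
                return;
        }
        cpu_idleclock();                /* switch this CPU's timers to idle mode */
        __asm __volatile("sti; hlt");   /* atomically enable interrupts and halt */
        cpu_activeclock();              /* back to active mode, time updated */
        critical_exit();
}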

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: how are callouts handled in cpu_idle() ?

2011-10-01 Thread Alexander Motin
Adrian Chadd wrote:
> Right. Hm, what about for i386/amd64 cases we call intr_disable()
> before the critical_enter and idleclock, then re-enable either just
> before the wait/halt call (like it is now) or just after the
> sched_running check (just like it is now.)
> 
> That way a filter handler which schedules a callout gets called correctly?

As I have described, that can fix only part of the problems (only the
delayed shots, but not the too early shots). It still won't allow reliably
using callouts from interrupt filter handlers.

Also I've just recalled one possible issue with that. cpu_idleclock()
calls binuptime() to get the present system time in order to properly
reprogram the timer hardware. In most cases binuptime() is quite an
undemanding call that can be made anywhere, but at least on some kinds of
sparc64 hardware it uses an IPI to read the time counter from another CPU,
because of the assumption that the CPUs may not be synchronized. I hate
that, but that's a fact. An attempt to wait for the IPI with interrupts
disabled may cause a deadlock between several CPUs. Respecting that, in
the event timers subsystem I was trying to avoid calling binuptime() with
interrupts disabled. To be honest, sparc64 doesn't have a cpu_idle()
function now at all, but if we are speaking about the general (perfect)
case, we should either continue to avoid this, or deny things like what
sparc64 does.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: sizeof(size_t) and other "semantic" types on 32 bit systems?

2011-10-01 Thread Alexander Motin
Ben Laurie wrote:
> 2011/9/30 Lev Serebryakov :
>> Hello, Alexander.
>> You wrote on 1 October 2011, 0:51:00:
>>
>>> GEOM uses 64bit off_t for media size and many other things.
>>  off_t is signed! It is ``not accurate enough'' (if you know this
>> Russian joke) :)))
> 
> No, but I now want to know the Russian joke.

I think it is about this one, sorry for my rough translation:

Man getting to a doctor:
M: Doctor, I have a problem THERE.
D: Undress please.
M: Can I accurately put shoes there?
D: Please.
Man carefully puts shoes aside to each other perpendicularly to the wall.
M: Can I accurately put my pants on a chair back?
D: Sure.
Man carefully puts pants on a chair.
M: Can I accurately put my underwear here?
D: OK.
Man carefully puts underwear on a couch, removing any pleats.
D: So what's the problem?
M: Can't you see? My testicles are on different height!
D: Don't worry, it is normal. You are not sick.
M: But that's not accurately enough! ("ne akkuratnenko kak-to" in
original transliteration)

http://blog.i.ua/user/2864472/779569/

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Parallels v4 regression (aka ada(4) oddity) in RELENG_9

2012-01-23 Thread Alexander Motin

On 01/23/12 20:06, Devin Teske wrote:

I have a Parallels virtual machine and it runs FreeBSD 4 through 8 just
swimmingly.

However, in RELENG_9 I notice something different. My once "ad0" is now showing
up as "ada0". However, something even stranger is that devfs is providing both
ad0 family devices AND ada0 family devices.


Those are compatibility shims. You can disable them if you want by adding 
this line to /boot/loader.conf: kern.cam.ada.legacy_aliases=0


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


[RFT][patch] Scheduling for HTT and not only

2012-02-05 Thread Alexander Motin

Hi.

I've analyzed the scheduler behavior and I think I've found the problem 
with HTT. SCHED_ULE knows about HTT, and when doing load balancing once a 
second it does the right things. Unluckily, if some other thread gets in 
the way, a process can easily be pushed out to another CPU, where it will 
stay for another second because of CPU affinity, possibly sharing a 
physical core with something else without need.


I've made a patch, reworking the SCHED_ULE affinity code, to fix that:
http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
 - Disables the strict affinity optimization when HTT is detected, to let 
the more sophisticated code take into account the load of the other 
logical core(s).
 - Adds affinity support to the sched_lowest() function to prefer the 
specified (last used) CPU (and the CPU groups it belongs to) in case of 
equal load (see the sketch after this list). The previous code always 
selected the first valid CPU among equally loaded ones, which caused 
thread migration to lower CPUs without need.
 - If the current CPU group has no CPU where the process with its priority 
can run now, sequentially check the parent CPU groups before doing a 
global search. That should improve affinity for the next cache levels.
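
A trivial sketch of the tie-breaking idea from the second item above
(purely illustrative, not the actual sched_lowest() code):

/*
 * On equal load, keep the thread on the CPU it last ran on instead of
 * always taking the lowest-numbered candidate.
 */
static int
pick_cpu(int cpu_a, int load_a, int cpu_b, int load_b, int last_cpu)
{
        if (load_a != load_b)
                return (load_a < load_b ? cpu_a : cpu_b);
        return (cpu_b == last_cpu ? cpu_b : cpu_a);     /* tie: prefer affinity */
}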


I've made several different benchmarks to test it, and so far the results 
look promising:
 - On an Atom D525 (2 physical cores + HTT) I've tested HTTP receive with 
fetch and FTP transmit with ftpd. On receive I've got 103MB/s on the 
interface; on transmit somewhat less -- about 85MB/s. In both cases the 
scheduler kept the interrupt thread and the application on different 
physical cores. Without the patch the speed fluctuates between about 
80-103MB/s on receive and is about 85MB/s on transmit.
 - On the same Atom I've tested TCP speed with iperf and got mostly the 
same results:
   - receive to the Atom with the patch -- 755-765Mbit/s, without the 
patch -- 531-765Mbit/s;
   - transmit from the Atom in both cases -- 679Mbit/s.
The fluctuating receive behavior in both tests can, I think, be explained 
by some heavy callout handled by the swi4:clock process, called on receive 
(seen in top and schedgraph), but not on transmit. Maybe it is specific 
to the Realtek NIC driver.

 - On the same Atom I tested the number of 512-byte reads from an SSD with 
dd in 1 and 32 streams. I found no regressions, but no benefits either, as 
with one stream there is no congestion and with multiple streams all cores 
are congested.


 - On a Core i7-2600K (4 physical cores + HTT) I've run more than 20 
`make buildworld`s with different -j values (1,2,4,6,8,12,16) for both 
the original and the patched kernel. I've found no performance regressions, 
while for -j4 I've got a 10% improvement:

# ministat -w 65 res4A res4B
x res4A
+ res4B
+-+
|+|
|++  xx  x|
|A||__M__A__| |
+-+
NMinMax  Median   AvgStddev
x   31554.861617.43 1571.62 1581.3033 32.389449
+   31420.69 1423.1 1421.36 1421.7167 1.2439587
Difference at 95.0% confidence
-159.587 ± 51.9496
-10.0921% ± 3.28524%
(Student's t, pooled s = 22.9197)
, and for -j6 -- 3.6% improvement:
# ministat -w 65 res6A res6B
x res6A
+ res6B
+-+
|  +  |
|  +   + x x x|
||_M__A___||__AM_||
+-+
NMinMax Median   AvgStddev
x   31381.171402.94 1400.3 1394.8033 11.880372
+   3 1340.41349.341341.23 1343.6567 4.9393758
Difference at 95.0% confidence
-51.1467 ± 20.6211
-3.66694% ± 1.47842%
(Student's t, pooled s = 9.09782)

Who wants to do independent testing to verify my results or do some more 
interesting benchmarks? :)


PS: Sponsored by iXsystems, Inc.

--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-06 Thread Alexander Motin

On 02/06/12 18:01, Alexander Best wrote:

On Mon Feb  6 12, Alexander Motin wrote:

I've analyzed scheduler behavior and think found the problem with HTT.
SCHED_ULE knows about HTT and when doing load balancing once a second,
it does right things. Unluckily, if some other thread gets in the way,
process can be easily pushed out to another CPU, where it will stay for
another second because of CPU affinity, possibly sharing physical core
with something else without need.

I've made a patch, reworking SCHED_ULE affinity code, to fix that:
http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
  - Disables strict affinity optimization when HTT detected to let more
sophisticated code to take into account load of other logical core(s).
  - Adds affinity support to the sched_lowest() function to prefer
specified (last used) CPU (and CPU groups it belongs to) in case of
equal load. Previous code always selected first valid CPU of evens. It
caused threads migration to lower CPUs without need.
  - If current CPU group has no CPU where the process with its priority
can run now, sequentially check parent CPU groups before doing global
search. That should improve affinity for the next cache levels.



Who wants to do independent testing to verify my results or do some more
interesting benchmarks? :)


i don't have any benchmarks to offer, but i'm seeing a massive increase in
responsiveness with your patch. with an unpatched kernel, opening xterm while
unrar'ing some huge archive could take up to 3 minutes!!! with your patch the
time it takes for xterm to start is never > 10 seconds!!!


Thank you for the report. I can suggest an explanation for this. The 
original code does only one pass looking for a CPU where the thread can 
run immediately. That pass is limited to the first level of the CPU 
topology (for HTT systems it is one physical core). If it sees no good 
candidate, it just looks for the CPU with minimal load, ignoring thread 
priority. I suppose that may lead to a priority violation, scheduling the 
thread to a CPU where a higher-priority thread is running, where it may 
wait for a very long time, while there is some other CPU running a 
minimal-priority thread. My patch does more searches, which allows it to 
handle priorities better.


Unluckily, in my newer tests of context-switch-intensive workloads (like 
doing 40K MySQL requests per second) I've found about a 3% slowdown 
because of these additional searches. I'll finish some more tests and 
try to find some compromise solution.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-06 Thread Alexander Motin

On 02/06/12 19:37, Tijl Coosemans wrote:

On Monday 06 February 2012 17:29:14 Alexander Motin wrote:

On 02/06/12 18:01, Alexander Best wrote:

On Mon Feb  6 12, Alexander Motin wrote:

I've analyzed scheduler behavior and think found the problem with HTT.
SCHED_ULE knows about HTT and when doing load balancing once a second,
it does right things. Unluckily, if some other thread gets in the way,
process can be easily pushed out to another CPU, where it will stay for
another second because of CPU affinity, possibly sharing physical core
with something else without need.

I've made a patch, reworking SCHED_ULE affinity code, to fix that:
http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
   - Disables strict affinity optimization when HTT detected to let more
sophisticated code to take into account load of other logical core(s).
   - Adds affinity support to the sched_lowest() function to prefer
specified (last used) CPU (and CPU groups it belongs to) in case of
equal load. Previous code always selected first valid CPU of evens. It
caused threads migration to lower CPUs without need.
   - If current CPU group has no CPU where the process with its priority
can run now, sequentially check parent CPU groups before doing global
search. That should improve affinity for the next cache levels.

Who wants to do independent testing to verify my results or do some more
interesting benchmarks? :)


i don't have any benchmarks to offer, but i'm seeing a massive increase in
responsiveness with your patch. with an unpatched kernel, opening xterm while
unrar'ing some huge archive could take up to 3 minutes!!! with your patch the
time it takes for xterm to start is never>   10 seconds!!!


Thank you for the report. I can suggest explanation for this. Original
code does only one pass looking for CPU where the thread can run
immediately. That pass limited to the first level of CPU topology (for
HTT systems it is one physical core). If it sees no good candidate, it
just looks for the CPU with minimal load, ignoring thread priority. I
suppose that may lead to priority violation, scheduling thread to CPU
where higher-priority thread is running, where it may wait for a very
long time, while there is some other CPU with minimal priority thread.
My patch does more searches, that allows to handle priorities better.


But why would unrar have a higher priority?


I am not good with ULE priority calculations yet, but I think there 
could be many factors. Both the GUI and unrar are probably running with 
the same nice level of 0, so initially they are equal. After they start, 
their priorities depend on spent CPU time, calculated "interactivity" and 
who knows what else. So possibly at some moments unrar may get a priority 
higher than the GUI. Also several kernel threads can participate, and 
they have higher priority by definition.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-06 Thread Alexander Motin

On 02/06/12 21:08, Florian Smeets wrote:

On 06.02.12 08:59, David Xu wrote:

On 2012/2/6 15:44, Alexander Motin wrote:

On 06.02.2012 09:40, David Xu wrote:

On 2012/2/6 15:04, Alexander Motin wrote:

Hi.

I've analyzed scheduler behavior and think found the problem with HTT.
SCHED_ULE knows about HTT and when doing load balancing once a second,
it does right things. Unluckily, if some other thread gets in the way,
process can be easily pushed out to another CPU, where it will stay
for another second because of CPU affinity, possibly sharing physical
core with something else without need.

I've made a patch, reworking SCHED_ULE affinity code, to fix that:
http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
- Disables strict affinity optimization when HTT detected to let more
sophisticated code to take into account load of other logical core(s).

Yes, the HTT should first be skipped, looking up in upper layer to find
a more idling physical core. At least, if system is a dual-core,
4-thread CPU,
and if there are two busy threads, they should be run on different
physical cores.


- Adds affinity support to the sched_lowest() function to prefer
specified (last used) CPU (and CPU groups it belongs to) in case of
equal load. Previous code always selected first valid CPU of evens. It
caused threads migration to lower CPUs without need.


Even some level of imbalance can be borne, until it exceeds a threshold,
this at least does not trash other cpu's cache, pushing a new thread
to another cpu trashes its cache. The cpus and groups can be arranged in
a circle list, so searching a lowest load cpu always starts from right
neighborhood to tail, then circles from head to left neighborhood.


- If current CPU group has no CPU where the process with its priority
can run now, sequentially check parent CPU groups before doing global
search. That should improve affinity for the next cache levels.

I've made several different benchmarks to test it, and so far results
look promising:
- On Atom D525 (2 physical cores + HTT) I've tested HTTP receive with
fetch and FTP transmit with ftpd. On receive I've got 103MB/s on
interface; on transmit somewhat less -- about 85MB/s. In both cases
scheduler kept interrupt thread and application on different physical
cores. Without patch speed fluctuating about 103-80MB/s on receive and
is about 85MB/s on transmit.
- On the same Atom I've tested TCP speed with iperf and got mostly the
same results:
- receive to Atom with patch -- 755-765Mbit/s, without patch --
531-765Mbit/s.
- transmit from Atom in both cases 679Mbit/s.
Fluctuating receive behavior in both tests I think can be explained by
some heavy callout handled by the swi4:clock process, called on
receive (seen in top and schedgraph), but not on transmit. May be it
is specifics of the Realtek NIC driver.

- On the same Atom tested number of 512 byte reads from SSD with dd in
1 and 32 streams. Found no regressions, but no benefits also as with
one stream there is no congestion and with multiple streams all cores
congested.

- On Core i7-2600K (4 physical cores + HTT) I've run more then 20
`make buildworld`s with different -j values (1,2,4,6,8,12,16) for both
original and patched kernel. I've found no performance regressions,
while for -j4 I've got 10% improvement:
# ministat -w 65 res4A res4B
x res4A
+ res4B
+-+
|+ |
|++ x x x|
|A| |__M__A__| |
+-+
N Min Max Median Avg Stddev
x 3 1554.86 1617.43 1571.62 1581.3033 32.389449
+ 3 1420.69 1423.1 1421.36 1421.7167 1.2439587
Difference at 95.0% confidence
-159.587 ± 51.9496
-10.0921% ± 3.28524%
(Student's t, pooled s = 22.9197)
, and for -j6 -- 3.6% improvement:
# ministat -w 65 res6A res6B
x res6A
+ res6B
+-+
| + |
| + + x x x |
||_M__A___| |__AM_||
+-+
N Min Max Median Avg Stddev
x 3 1381.17 1402.94 1400.3 1394.8033 11.880372
+ 3 1340.4 1349.34 1341.23 1343.6567 4.9393758
Difference at 95.0% confidence
-51.1467 ± 20.6211
-3.66694% ± 1.47842%
(Student's t, pooled s = 9.09782)

Who wants to do independent testing to verify my results or do some
more interesting benchmarks? :)

PS: Sponsored by iXsystems, Inc.


The benchmark is incomplete; a complete benchmark should at least
include cpu-intensive applications.
Testing with real-world databases and web servers and other important
applications is needed.


I plan to do this, but you may help. ;)


Thanks, I need to find time. I have cc'ed hackers@; my first mail seems
to have forgotten to include it. I think designing an SMP scheduler is
dirty work: much testing and refining, and still you may get an imperfect
result. ;-)



Here are my tests for PostgreSQL (i still use r229659 as the

Re: [RFT][patch] Scheduling for HTT and not only

2012-02-11 Thread Alexander Motin

On 02/11/12 15:35, Andriy Gapon wrote:

on 06/02/2012 09:04 Alexander Motin said the following:

I've analyzed scheduler behavior and think found the problem with HTT. SCHED_ULE
knows about HTT and when doing load balancing once a second, it does right
things. Unluckily, if some other thread gets in the way, process can be easily
pushed out to another CPU, where it will stay for another second because of CPU
affinity, possibly sharing physical core with something else without need.

I've made a patch, reworking SCHED_ULE affinity code, to fix that:
http://people.freebsd.org/~mav/sched.htt.patch

This patch does three things:
  - Disables strict affinity optimization when HTT detected to let more
sophisticated code to take into account load of other logical core(s).
  - Adds affinity support to the sched_lowest() function to prefer specified
(last used) CPU (and CPU groups it belongs to) in case of equal load. Previous
code always selected first valid CPU of evens. It caused threads migration to
lower CPUs without need.
  - If current CPU group has no CPU where the process with its priority can run
now, sequentially check parent CPU groups before doing global search. That
should improve affinity for the next cache levels.


Alexander,

I know that you are working on improving this patch and we have already
discussed some ideas via out-of-band channels.


I've heavily rewritten the patch already, so at least some of the ideas 
are already addressed. :) At this moment I am mostly satisfied with the 
results, and after final tests today I'll probably publish a new version.



Here's some additional ideas.  They are in part inspired by inspecting
OpenSolaris code.

Let's assume that one of the goals of a scheduler is to maximize system
performance / computational throughput[*].  I think that modern SMP-aware
schedulers try to employ the following two SMP-specific techniques to achieve 
that:
- take advantage of thread-to-cache affinity to minimize "cold cache" time
- distribute the threads over logical CPUs to optimize system resource usage by
minimizing[**] sharing of / contention over the resources, which could be
caches, instruction pipelines (for HTT threads), FPUs (for AMD Bulldozer
"cores"), etc.

1.  Affinity.
It seems that on modern CPUs the caches are either inclusive or some smart "as
if inclusive" caches.  As a result, if two cores have a shared cache at any
level, then it should be relatively cheap to move a thread from one core to the
other.  E.g. if logical CPUs P0 and P1 have private L1 and L2 caches and a
shared L3 cache, then on modern processors it should be much cheaper to move a
thread from P0 to P1 than to some processor P2 that doesn't share the L3 cache.


Absolutely true! On smack-mysql indexed select benchmarks I've found 
that on an Atom CPU with two cores and no L3 it is cheaper to move two 
mysql threads to one physical core (shared L2 cache), suffering from SMT, 
than to bounce data between the cores. At the same time, on a Core i7 
with a shared L3 and also SMT the results are strictly the opposite.



If this assumption is really true, then we can track only an affinity of a
thread with relation to a top level shared cache.  E.g. if migration within an
L3 cache is cheap, then we don't have any reason to constrain a migration scope
to an L2 cache, let alone L1.


In the present patch version I've implemented two different thresholds: 
one for the last-level cache and one for the rest. That's why I am 
waiting for your patch to properly detect cache topologies. :)



2. Balancing.
I think that the current balancing code is pretty good, but can be augmented
with the following:
  A. The SMP topology in longer term should include other important shared
resources, not only caches.  We already have this in some form via
CG_FLAG_THREAD, which implies instruction pipeline sharing.


At this moment I am using different penalty coefficients for SMT and 
shared caches (for unrelated processes sharing is not good). It is no 
problem to add more types there. A separate flag for a shared FPU could 
be used to apply different penalty coefficients to usual threads and 
FPU-less kernel threads.



  B. Given the affinity assumptions, sched_pickcpu can pick the best CPU only
among CPUs sharing a top level cache if a thread still has an affinity to it or
among all CPUs otherwise.  This should reduce temporary imbalances.


I've done it in a more complicated way. I apply cache affinity with 
weight 2 to all paths with _currently running_ threads of the same 
process and with weight 1 to the previous path where the thread was 
running. I believe that constant cache thrashing between two running 
threads is much worse than a single jump from one CPU to another on some 
context switches. Though it could be made configurable.



  C. I think that we should eliminate the bias in the sched_lowest() family of
functions.  I like how your patch started addressing this.  For the cases 

Re: [RFT][patch] Scheduling for HTT and not only

2012-02-13 Thread Alexander Motin

On 02/11/12 16:21, Alexander Motin wrote:

I've heavily rewritten the patch already. So at least some of the ideas
are already addressed. :) At this moment I am mostly satisfied with
results and after final tests today I'll probably publish new version.


It took more time, but finally I think I've put the pieces together:
http://people.freebsd.org/~mav/sched.htt23.patch

The patch is more complicated than the previous one, both logically and 
computationally, but with growing CPU power and complexity I think we 
can afford to spend some more time deciding how to spend time. :)


The patch formalizes several ideas of the previous code about how to select 
a CPU for running a thread and adds some new ones. Its main idea is that I've 
moved from comparing raw integer queue lengths to higher-resolution 
flexible values. That additional 8 bits of precision allows many factors 
affecting performance to be taken into account at the same time. Besides 
just choosing the best among equally-loaded CPUs, with the new code it may 
even happen that, because of SMT, cache affinity, etc., a CPU with more 
threads on its queue will be reported as less loaded, and vice versa.


The new code takes into account such factors:
 - SMT sharing penalty.
 - Cache sharing penalty.
 - Cache affinity (with separate coefficients for last-level and other 
level caches) to:
  - other running threads of its process,
  - the previous CPU where it was running,
  - the current CPU (usually where it was called from).
All of these factors are configurable via sysctls, but I think 
reasonable defaults should fit most cases; a rough sketch of the weighting 
follows below.
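To illustrate the general idea, here is a minimal sketch of such a weighted 
score. This is not the actual sched_ule.c code; the names and coefficients 
are invented for illustration:

/*
 * Hypothetical scoring of a CPU with 8 bits of sub-integer precision.
 * The fractional bits let penalties and affinity bonuses change a CPU's
 * rank without requiring a whole extra thread of queue-length difference.
 */
#define LOAD_SHIFT      8       /* one queued thread == 256 load units */
#define SMT_PENALTY     200     /* assumed sysctl-tunable coefficients */
#define CACHE_PENALTY   100
#define AFFINITY_BONUS  50

struct cpu_state {
        int     nthreads;       /* run queue length */
        int     smt_busy;       /* SMT sibling is running something */
        int     cache_busy;     /* cache-sharing neighbors are busy */
        int     affine;         /* thread or its process ran here recently */
};

/* Lower score means a more attractive CPU. */
static int
cpu_score(const struct cpu_state *c)
{
        int score = c->nthreads << LOAD_SHIFT;

        if (c->smt_busy)
                score += SMT_PENALTY;
        if (c->cache_busy)
                score += CACHE_PENALTY;
        if (c->affine)
                score -= AFFINITY_BONUS;
        return (score);
}

With coefficients like these, an idle CPU whose SMT sibling and cache 
neighbors are busy scores 200 + 100 = 300, while a CPU already running one 
affine thread scores 256 - 50 = 206, which is the "more threads on the queue 
but reported as less loaded" effect mentioned above.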


Also, compared to the previous patch, I've resurrected the optimized shortcut 
in CPU selection for the SMT case. Compared to the original code, which had 
problems with this, I've added a check of the other logical cores' load 
that should make it safe and still very fast when there are fewer running 
threads than physical cores.


I've tested it on Core i7 and Atom systems, but it would be more interesting 
to test it on a multi-socket system with properly detected topology to 
check the benefits from affinity.


At this moment the main issue I see is that this patch affects only the 
moment when a thread is starting. If a thread runs continuously, it will stay 
where it was, even if, due to a changed situation, that placement is no 
longer very effective (it causes SMT sharing, etc). I haven't looked much at 
the periodic load balancer yet, but it could probably also be improved somehow.


What is your opinion: is it too over-engineered, or is it the right way 
to go?


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-13 Thread Alexander Motin

On 02/13/12 22:23, Jeff Roberson wrote:

On Mon, 13 Feb 2012, Alexander Motin wrote:


On 02/11/12 16:21, Alexander Motin wrote:

I've heavily rewritten the patch already. So at least some of the ideas
are already addressed. :) At this moment I am mostly satisfied with
results and after final tests today I'll probably publish new version.


It took more time, but finally I think I've put pieces together:
http://people.freebsd.org/~mav/sched.htt23.patch


I need some time to read and digest this. However, at first glance, a
global pickcpu lock will not be acceptable. Better to make a rarely
imperfect decision than too often cause contention.


In my tests it was the opposite. Imperfect decisions under 60K MySQL 
requests per second on 8 cores quite often caused two threads to be 
pushed to one CPU or to one physical core, causing up to 5-10% 
performance penalties. I've tried both with and without the lock and at 
least on the 8-core machine the difference was significant enough to 
justify adding it. I understand that this is not good, but I have no 
machine with hundreds of CPUs to tell how it will work there. For really 
big systems it could be partitioned somehow, but that will also increase 
load imbalance.



The patch is more complicated then previous one both logically and
computationally, but with growing CPU power and complexity I think we
can possibly spend some more time deciding how to spend time. :)


It is probably worth more cycles but we need to evaluate this much more
complex algorithm carefully to make sure that each of these new features
provides an advantage.


The problem is that doing only half of these things may not give the full 
picture. What is the point of tuning affinity to save a few percent while 
the SMT effect is several times larger? At the same time, too many unknown 
variables in application behavior can easily make all of this pointless.



Patch formalizes several ideas of the previous code about how to
select CPU for running a thread and adds some new. It's main idea is
that I've moved from comparing raw integer queue lengths to
higher-resolution flexible values. That additional 8-bit precision
allows same time take into account many factors affecting performance.
Beside just choosing best from equally-loaded CPUs, with new code it
may even happen that because of SMT, cache affinity, etc, CPU with
more threads on it's queue will be reported as less loaded and opposite.

New code takes into account such factors:
- SMT sharing penalty.
- Cache sharing penalty.
- Cache affinity (with separate coefficients for last-level and other
level caches) to the:


We already used separate affinity values for different cache levels.
Keep in mind that if something else has run on a core the cache affinity
is lost in very short order. Trying too hard to preserve it beyond a few
ms never seems to pan out.


Previously it was only about a timeout, which was IMHO pointless, as it is 
impossible to predict when the cache will be purged. It could happen a 
microsecond or a second later, depending on application behavior.



- other running threads of it's process,


This is not really a great indicator of whether things should be
scheduled together or not. What workload are you targeting here?


When several threads are accessing/modifying the same shared memory, like 
MySQL server threads. I've noticed that on an Atom CPU with no L3 it is 
cheaper to move two threads to one physical core to share the cache than 
to handle coherency over the memory bus.



- previous CPU where it was running,
- current CPU (usually where it was called from).


These two were also already used. Additionally:

+ * Hide part of the current thread
+ * load, hoping it or the scheduled
+ * one complete soon.
+ * XXX: We need more stats for this.

I had something like this before. Unfortunately interactive tasks are
allowed fairly aggressive bursts of cpu to account for things like xorg
and web browsers. Also, I tried this for ithreads but they can be very
expensive in some workloads so other cpus will idle as you try to
schedule behind an ithread.


As I have noted, this needs more precise statistics about thread 
behavior. The present sampled statistics are almost useless there. The 
existing code always prefers to run a thread on the current CPU if there 
is no other CPU with no load. That logic works very well when 8 MySQL 
threads and 8 clients are working on 8 CPUs, but not so well in other 
situations.



All of these factors are configurable via sysctls, but I think
reasonable defaults should fit most.

Also, comparing to previous patch, I've resurrected optimized shortcut
in CPU selection for the case of SMT. Comparing to original code
having problems with this, I've added check for other logical cores
load that should make it safe and still very fast when there are less
running threads then physical cores.

I've tested in on Core i7 and Atom systems, but more interesting would
be to test it on multi-socket system with properly detected topology
to check ben

Re: [RFT][patch] Scheduling for HTT and not only

2012-02-13 Thread Alexander Motin

On 13.02.2012 23:39, Jeff Roberson wrote:

On Mon, 13 Feb 2012, Alexander Motin wrote:

On 02/13/12 22:23, Jeff Roberson wrote:

On Mon, 13 Feb 2012, Alexander Motin wrote:


On 02/11/12 16:21, Alexander Motin wrote:

I've heavily rewritten the patch already. So at least some of the
ideas
are already addressed. :) At this moment I am mostly satisfied with
results and after final tests today I'll probably publish new version.


It took more time, but finally I think I've put pieces together:
http://people.freebsd.org/~mav/sched.htt23.patch


I need some time to read and digest this. However, at first glance, a
global pickcpu lock will not be acceptable. Better to make a rarely
imperfect decision than too often cause contention.


On my tests it was opposite. Imperfect decisions under 60K MySQL
requests per second on 8 cores quite often caused two threads to be
pushed to one CPU or to one physical core, causing up to 5-10%
performance penalties. I've tried both with and without lock and at
least on 8-core machine difference was significant to add this. I
understand that this is not good, but I have no machine with hundred
of CPUs to tell how will it work there. For really big systems it
could be partitioned somehow, but that will also increase load imbalance.


It would be preferable to refetch the load on the target cpu and restart
the selection if it has changed. Even this should have some maximum
bound on the number of times it will spin and possibly be conditionally
enabled. That two cpus are making the same decision indicates that the
race window is occuring and contention will be guaranteed. As you have
tested on only 8 cores that's not a good sign.


The race window there exists by definition, as the code is not locked. The 
fact that we hit it often may just mean that some other locks (maybe in 
the application, maybe ours) cause that synchronization. Using almost 
equal requests in the benchmark also did not increase randomness. As for 
rechecking the load -- that is done, as you may see, to protect the fast 
paths, and it works. But there is a time window between the check and 
putting the request on the queue. The lock used doesn't fix it completely, 
but it significantly reduces the chances.
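To make the window concrete, an unlocked selection looks roughly like this 
(invented pseudo-C, not the actual code):

static int
pick_least_loaded(const int *cpu_load, int ncpus)
{
        int best = 0, cpu;

        for (cpu = 1; cpu < ncpus; cpu++)       /* step 1: read the loads */
                if (cpu_load[cpu] < cpu_load[best])
                        best = cpu;
        /*
         * Window: another CPU can execute the same loop right here and
         * reach the same answer before either thread is actually put on
         * the chosen run queue (step 2), so both land on one CPU.
         */
        return (best);
}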



The patch is more complicated then previous one both logically and
computationally, but with growing CPU power and complexity I think we
can possibly spend some more time deciding how to spend time. :)


It is probably worth more cycles but we need to evaluate this much more
complex algorithm carefully to make sure that each of these new features
provides an advantage.


Problem is that doing half of things may not give full picture. How to
do affinity trying to save some percents, while SMT effect is times
higher? Same time too many unknown variables in applications behavior
can easily make all of this pointless.


Patch formalizes several ideas of the previous code about how to
select CPU for running a thread and adds some new. It's main idea is
that I've moved from comparing raw integer queue lengths to
higher-resolution flexible values. That additional 8-bit precision
allows same time take into account many factors affecting performance.
Beside just choosing best from equally-loaded CPUs, with new code it
may even happen that because of SMT, cache affinity, etc, CPU with
more threads on it's queue will be reported as less loaded and
opposite.

New code takes into account such factors:
- SMT sharing penalty.
- Cache sharing penalty.
- Cache affinity (with separate coefficients for last-level and other
level caches) to the:


We already used separate affinity values for different cache levels.
Keep in mind that if something else has run on a core the cache affinity
is lost in very short order. Trying too hard to preserve it beyond a few
ms never seems to pan out.


Previously it was only about timeout, that was IMHO pointless, as it
is impossible to predict when cache will be purged. It could be done
in microsecond or second later, depending on application behavior.


This was not pointless. Eliminate it and see. The point is that after
some time has elapsed the cache is almost certainly useless and we
should select the most appropriate cpu based on load, priority, etc. We
don't have perfect information for any of these algorithms. But as an
approximation it is useful to know whether affinity should even be
considered. An improvement on this would be to look at the amount of
time the core has been idle since the selecting thread last ran rather
than just the current load. Tell me what the point of selecting for
affinity is if so much time has passed that valid cache contents are
almost guaranteed to be lost?


I am not saying I am going to keep affinity forever. You may see that I am 
also setting a limit on affinity time. What's IMHO pointless is trying to 
set the expiration time to 1/2/3ms for the L1/L2/L3 caches. These numbers 
don't mean anything real. What I was saying

Re: [RFT][patch] Scheduling for HTT and not only

2012-02-15 Thread Alexander Motin

On 02/14/12 00:38, Alexander Motin wrote:

I see no much point in committing them sequentially, as they are quite
orthogonal. I need to make one decision. I am going on small vacation
next week. It will give time for thoughts to settle. May be I indeed
just clean previous patch a bit and commit it when I get back. I've
spent too much time trying to make these things formal and so far
results are not bad, but also not so brilliant as I would like. May be
it is indeed time to step back and try some more simple solution.


I've decided to stop those cache black magic practices and focus on 
things that really exist in this world -- SMT and CPU load. I've dropped 
most of the cache-related things from the patch and made the rest more 
strict and predictable:

http://people.freebsd.org/~mav/sched.htt34.patch

This patch adds a check to skip the fast previous-CPU selection if its SMT 
neighbor is in use, not just if no SMT is present as in previous patches.
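A rough sketch of that check, with invented helper names (the real code 
walks the topology groups inside pickcpu()):

/*
 * Fast path: keep the thread on its previous CPU only if that CPU is idle
 * and none of its SMT siblings are running anything; otherwise fall back
 * to the full topology search.  cpu_load[] and smt_sibling_of[] are
 * invented stand-ins for the real per-CPU data.
 */
static int
prev_cpu_usable(int prev, const int *cpu_load, const int *smt_sibling_of,
    int ncpus)
{
        int cpu;

        if (cpu_load[prev] != 0)
                return (0);
        for (cpu = 0; cpu < ncpus; cpu++) {
                if (cpu != prev && smt_sibling_of[cpu] == prev &&
                    cpu_load[cpu] != 0)
                        return (0);     /* an SMT neighbor is busy */
        }
        return (1);
}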


I've taken the affinity/preference algorithm from the first patch and 
improved it. It makes pickcpu() prefer the previous core or its 
neighbors in case of equal load. That is very simple to keep, but 
should still give cache hits.


I've changed the general algorithm of topology tree processing. First I 
look for an idle core under the same last-level cache as before, with 
affinity to the previous core or its neighbors at higher cache levels. 
The original code could put an additional thread on an already busy core 
while the next socket was completely idle. Now, if there is no idle core 
under this cache, all other CPUs are checked.


CPU group comparison is now done in two steps: first, same as before, the 
summary load of all cores is compared; but now, if it is equal, I compare 
the load of the least/most loaded cores. That should allow differentiating 
whether a load of 2 really means 1+1 or 2+0. In that case the group with 
2+0 will be taken as more loaded than the one with 1+1, making the group 
choice more grounded and predictable (a sketch of this comparison follows 
below).
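A sketch of that two-step comparison (invented structure; the real code 
operates on the CPU topology groups):

/*
 * Compare two CPU groups: first by total load, then, on a tie, by the
 * load of the busiest core, so that a 2+0 group ranks as more loaded
 * than a 1+1 group.  Negative result means group 'a' is less loaded.
 */
struct group_load {
        int     total;          /* sum of per-core loads */
        int     max_core;       /* load of the most loaded core */
};

static int
group_cmp(const struct group_load *a, const struct group_load *b)
{
        if (a->total != b->total)
                return (a->total - b->total);
        return (a->max_core - b->max_core);     /* 2+0 sorts after 1+1 */
}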


I've added randomization for the case when all of the above factors are equal.

As before I've tested this on a Core i7-870 with 4 physical and 8 logical 
cores and an Atom D525 with 2 physical and 4 logical cores. On the Core i7 
I've got speedups of up to 10-15% in super-smack MySQL and PostgreSQL 
indexed selects for 2-8 threads and no penalty in other cases. pbzip2 shows 
up to a 13% performance increase for 2-5 threads and no penalty in other cases.


Tests on the Atom show mostly about the same performance as before in the 
database benchmarks: faster for 1 thread, slower for 2-3 and about the 
same in other cases. Single-stream network performance improved the same as 
with the first patch. That CPU is quite difficult to handle, as with its mix 
of effective SMT and lack of an L3 cache, different scheduling approaches 
give different results in different situations.


Specific performance numbers can be found here:
http://people.freebsd.org/~mav/bench.ods
Every point there includes at least 5 samples, and except for the pbzip2 
test, which is quite unstable with the previous sources, all are 
statistically valid.


Florian is now running an alternative set of benchmarks on dual-socket 
hardware without SMT.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-15 Thread Alexander Motin

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

As before I've tested this on Core i7-870 with 4 physical and 8
logical cores and Atom D525 with 2 physical and 4 logical cores. On
Core i7 I've got speedup up to 10-15% in super-smack MySQL and
PostgreSQL indexed select for 2-8 threads and no penalty in other
cases. pbzip2 shows up to 13% performance increase for 2-5 threads and
no penalty in other cases.


Can you also test buildworld or buildkernel with a -j value twice the
number of cores? This is an interesting case because it gets little
benefit from from affinity and really wants the best balancing possible.
It's also the first thing people will complain about if it slows.


I'll do it, but even now I can say that the existing balancing algorithm 
requires improvements to better handle SMT. If I understand correctly, 
the present code never takes the last running thread from its CPU. That is 
fine for non-SMT systems, but with SMT it may cause imbalance and, as a 
result, reduced total performance. While the current pickcpu() algorithm 
should be precise enough, its decision can easily be affected by some 
microsecond transient load, such as interrupt threads, etc., and the 
results of that decision may stay in effect for seconds.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-15 Thread Alexander Motin

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:


On 02/14/12 00:38, Alexander Motin wrote:

I see no much point in committing them sequentially, as they are quite
orthogonal. I need to make one decision. I am going on small vacation
next week. It will give time for thoughts to settle. May be I indeed
just clean previous patch a bit and commit it when I get back. I've
spent too much time trying to make these things formal and so far
results are not bad, but also not so brilliant as I would like. May be
it is indeed time to step back and try some more simple solution.


I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch


This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.


I have a feeling that for timeshare/idle threads the balancer should take 
into account not the priority, but the nice level. Priority is a very 
unstable thing that is recalculated all the time, while nice is what really 
describes how much time a thread should get in the long run, and those 
values should be distributed equally between CPUs. But I haven't thought 
about the specific math yet.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-16 Thread Alexander Motin

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

As before I've tested this on Core i7-870 with 4 physical and 8
logical cores and Atom D525 with 2 physical and 4 logical cores. On
Core i7 I've got speedup up to 10-15% in super-smack MySQL and
PostgreSQL indexed select for 2-8 threads and no penalty in other
cases. pbzip2 shows up to 13% performance increase for 2-5 threads and
no penalty in other cases.


Can you also test buildworld or buildkernel with a -j value twice the
number of cores? This is an interesting case because it gets little
benefit from from affinity and really wants the best balancing possible.
It's also the first thing people will complain about if it slows.


An all-night buildworld run on a Core i7-2600K (4/8 cores) with 8GB RAM 
and a RAID0 of two fast SSDs found no bad surprises:

-j      old         new         %
1   4242.33 4239.69 -0.0622299538
2   2376.4433   2340.47 -1.5137453521
4   1581.3033   1430.1733   -9.5573063055
6   1394.8033   1348.0533   -3.3517270858
8   1365.8067   1315.87 -3.6562055231
10  1312.8533
12  1350.23 1313.2667   -2.7375558238
16  1346.2267   1306.0733   -2.9826625783
20  1313.31

Each point there is averaged over 3 runs.

--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-16 Thread Alexander Motin

On 02/16/12 10:48, Alexander Motin wrote:

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

As before I've tested this on Core i7-870 with 4 physical and 8
logical cores and Atom D525 with 2 physical and 4 logical cores. On
Core i7 I've got speedup up to 10-15% in super-smack MySQL and
PostgreSQL indexed select for 2-8 threads and no penalty in other
cases. pbzip2 shows up to 13% performance increase for 2-5 threads and
no penalty in other cases.


Can you also test buildworld or buildkernel with a -j value twice the
number of cores? This is an interesting case because it gets little
benefit from from affinity and really wants the best balancing possible.
It's also the first thing people will complain about if it slows.


All night long buildworld run on Core i7-2600K (4/8 cores) with 8GB RAM
and RAID0 of two fast SSDs found no bad surprises:

old new %
1 4242.33 4239.69 -0.0622299538
2 2376.4433 2340.47 -1.5137453521
4 1581.3033 1430.1733 -9.5573063055
6 1394.8033 1348.0533 -3.3517270858
8 1365.8067 1315.87 -3.6562055231
10 1312.8533
12 1350.23 1313.2667 -2.7375558238
16 1346.2267 1306.0733 -2.9826625783
20 1313.31

Each point there averaged of 3 runs.


Ah! I've just recalled that this kernel included a patch from avg@ that 
removed an extra equal topology level from the tree. That could improve 
balancer behavior in this case of a single-socket system.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-17 Thread Alexander Motin

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch


This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.


I haven't got a good idea yet about balancing priorities, but I've 
rewritten the balancer itself. Since sched_lowest() / sched_highest() 
are more intelligent now, they allowed the topology traversal to be removed 
from the balancer itself. That should fix the double-swapping problem, allow 
keeping some affinity while moving threads and make balancing more fair. 
I ran a number of tests with 4, 8, 9 and 16 CPU-bound threads on 8 
CPUs. With 4, 8 and 16 threads everything is stationary, as it should be. 
With 9 threads I see regular and random load movement between all 8 CPUs. 
Measurements over a 5-minute run show a deviation of only about 5 seconds. 
It is the same deviation as I see caused just by scheduling 16 threads 
on 8 cores without any balancing needed at all. So I believe this code 
works as it should.


Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be the final patch of this series (more to come :)) and if 
there are no problems or objections, I am going to commit it (except 
some debugging KTRs) in about ten days. So now is a good time for 
reviews and testing. :)


--
Alexander Motin


Re: 8 to 9: Kernel modularization -- did it change?

2012-02-17 Thread Alexander Motin

On 02/17/12 18:20, Alex Goncharov wrote:

,--- You/Freddie (Fri, 17 Feb 2012 08:08:18 -0800) *
|>  So, how do I go about finding and modifying the sound card pin
|>  assignments in FreeBSD 9?  (If I can't do it without temporarily
|>  installing FreeBSD 8, it would be a huge disappointment. :)
|
| Stick the hint into /boot/loader.conf and reboot.

How do I find the correct hint if I can't reload the sound module in
the new kernel environment and explore 'dmesg', '/dev/sndstat' and the
physical headphones with the new hint without a reboot?

Stick something in a file (/boot/device.hints or /boot/loader.conf),
reboot and see if it works... if it doesn't put a different
combination of 'cad', 'nid' and 'seq' and reboot?... And again and
again, till it works?..


The improved HDA driver in HEAD allows changing the CODEC configuration via 
sysctls on the fly, without unloading. I am going to merge it to 9/8-STABLE 
in a few weeks. If somebody wants to write a nice GUI for it -- welcome! ;)


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-17 Thread Alexander Motin

On 17.02.2012 18:53, Arnaud Lacombe wrote:

On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin  wrote:

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch


This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.


I haven't got good idea yet about balancing priorities, but I've rewritten
balancer itself. As soon as sched_lowest() / sched_highest() are more
intelligent now, they allowed to remove topology traversing from the
balancer itself. That should fix double-swapping problem, allow to keep some
affinity while moving threads and make balancing more fair. I did number of
tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8 and 16
threads everything is stationary as it should. With 9 threads I see regular
and random load move between all 8 CPUs. Measurements on 5 minutes run show
deviation of only about 5 seconds. It is the same deviation as I see caused
by only scheduling of 16 threads on 8 cores without any balancing needed at
all. So I believe this code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be a final patch of this series (more to come :)) and if
there will be no problems or objections, I am going to commit it (except
some debugging KTRs) in about ten days. So now it's a good time for reviews
and testing. :)


is there a place where all the patches are available ?


All my scheduler patches are cumulative, so all you need is only the 
last one mentioned here, sched.htt40.patch.


But in some cases, especially for multi-socket systems, to let it show 
its best you may want to apply an additional patch from avg@ that better 
detects the CPU topology:

https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd


I intend to run
some tests on a 1x2x2 (atom D510), 1x4x1 (core-2 quad), and eventually
a 2x8x2 platforms, against r231573. Results should hopefully be
available by the end of the week-end/middle of next week[0].

[0]: the D510 will likely be testing a couple of Linux kernel over the
week-end, and a FreeBSD run takes about 2.5 days to complete.


--
Alexander Motin


Re: 8 to 9: Kernel modularization -- did it change?

2012-02-17 Thread Alexander Motin

On 17.02.2012 19:10, Alex Goncharov wrote:

,--- You/Alexander (Fri, 17 Feb 2012 18:36:53 +0200) *
| On 02/17/12 18:20, Alex Goncharov wrote:
|>  How do I find the correct hint if I can't reload the sound module in
|>  the new kernel environment and explore 'dmesg', '/dev/sndstat' and the
|>  physical headphones with the new hint without a reboot?
|>
|>  Stick something in a file (/boot/device.hints or /boot/loader.conf),
|>  reboot and see if it works... if it doesn't put a different
|>  combination of 'cad', 'nid' and 'seq' and reboot?... And again and
|>  again, till it works?..
|
| Improved HDA driver in HEAD allows to change CODEC configuration via
| sysctls on fly without unloading. I am going to merge it to 9/8-STABLE
| in few weeks. If somebody wants to write nice GUI for it -- welcome! ;)

Being mostly a FreeBSD freeloader (or a marginal contributor), I
shouldn't be complaining, and I am not, but permit me to make a
personal biased judgment: losing the ability to do a practically
important thing (a dynamic sound card tuning) which was available in
8, makes 9 a "not ready to be released" OS (GUI isn't relevant here.)

OK, I'll put my upgrades to 9 on hold... Thanks all for clarifying the
situation!

P.S. As an aside and IMHO:

Over the last year, I've been asking myself why I keep bothering with
FreeBSD when several Linux distros do everything painlessly out of box.

Obviously, I like FreeBSD general structure a lot, that's why.
FreeBSD won't miss me going back to Linux, but I may miss FreeBSD, so
I am still sticking on, but I see a lot of dangers to FreeBSD being a
meaningfully used platform, for various reasons (some of them have
been mentioned in several discussions on freebsd-stable.)  Breaking
POLA in a released OS, which I see with this sound card tuning issue,
doesn't add FreeBSD friends...


For many users these days, not having sound drivers enabled by default 
is more astonishing. Luckily most systems don't need hints to get 
working sound. So it depends on the point of view. And I definitely see no 
problem here that should block updating to 9.0, unless you are already 
biased against it.


--
Alexander Motin


Re: 8 to 9: Kernel modularization -- did it change?

2012-02-17 Thread Alexander Motin

On 02/17/12 19:43, Alex Goncharov wrote:

A technical question: I have the saved (from 8) copies of
/boot/device.hints for the laptops in question which have lines like:

   hint.hdac.0.cad0.nid22.config="as=1 seq=15 device=Headphones"

there.

What's the best way to use this in 9?  To add to the bottom of
/boot/device.hints?  Or use in /boot/loader.conf somehow, as was
suggested in one of the replies in this thread?  If in loader.conf,
what's the correct syntax?


I am not ready to tell you which way is more correct in theory, because 
in practice all of them work. I personally prefer to have 
everything set in my /boot/loader.conf. The syntax is the same.
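For example, the hint line you quoted can be put into /boot/loader.conf 
exactly as it was in /boot/device.hints:

# /boot/loader.conf (same syntax as /boot/device.hints)
hint.hdac.0.cad0.nid22.config="as=1 seq=15 device=Headphones"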


--
Alexander Motin


Re: callouts precision

2012-02-18 Thread Alexander Motin

On 18.02.2012 21:05, Andriy Gapon wrote:

Just want to double-check myself.
It seems that currently, thanks to event timers, we mostly should be able to
schedule a hardware timer to fire at almost arbitrary moment with very fine
precision.
OTOH, our callout subsystem still seems to be completely tick oriented in the
sense that all timeouts are specified and kept in ticks.
As a result, it's impossible to use e.g. nanosleep(2) with a precision better
than HZ.

How deeply ticks are ingrained into callout(9)?  Are they used only as a measure
of time?  Or are there any dependencies on them being integers, like for
indexing, etc?
In other words, how hard it would be to replace ticks with e.g. bintime as an
internal representation of time in callout(9) [leaving interfaces alone for the
start]?  Is it easier to retrofit that code or to replace it with something new?


Pending callouts are now stored in a large array of unsorted lists, where 
the last bits of the callout time are the array index. It is quite 
effective for insert/remove operations. It is ineffective for getting the 
next event time needed by the new event timers, but that is a rare 
operation. Using arbitrary time values in that scheme is very problematic. 
It would require a complete internal redesign.
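A simplified sketch of that indexing (illustrative only, not the 
kern_timeout.c code):

/*
 * Simplified callwheel: the expiration tick is hashed into a bucket by
 * its low bits.  Insertion is O(1) (removal is similarly cheap with the
 * doubly-linked lists the kernel uses), but finding the earliest pending
 * callout means scanning buckets, which is why arbitrary non-tick times
 * do not fit this scheme without a redesign.
 */
#define CALLWHEEL_SIZE  1024            /* must be a power of two */
#define CALLWHEEL_MASK  (CALLWHEEL_SIZE - 1)

struct callout_entry {
        struct callout_entry    *next;
        int                     expire_tick;
        void                    (*func)(void *);
        void                    *arg;
};

static struct callout_entry *callwheel[CALLWHEEL_SIZE];

static void
callout_insert(struct callout_entry *c)
{
        int bucket = c->expire_tick & CALLWHEEL_MASK;

        c->next = callwheel[bucket];    /* unsorted bucket: O(1) insert */
        callwheel[bucket] = c;
}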


--
Alexander Motin


Re: callouts precision

2012-02-18 Thread Alexander Motin

On 18.02.2012 22:40, Andriy Gapon wrote:

on 18/02/2012 21:42 Alexander Motin said the following:

On 18.02.2012 21:05, Andriy Gapon wrote:

Just want to double-check myself.
It seems that currently, thanks to event timers, we mostly should be able to
schedule a hardware timer to fire at almost arbitrary moment with very fine
precision.
OTOH, our callout subsystem still seems to be completely tick oriented in the
sense that all timeouts are specified and kept in ticks.
As a result, it's impossible to use e.g. nanosleep(2) with a precision better
than HZ.

How deeply ticks are ingrained into callout(9)?  Are they used only as a measure
of time?  Or are there any dependencies on them being integers, like for
indexing, etc?
In other words, how hard it would be to replace ticks with e.g. bintime as an
internal representation of time in callout(9) [leaving interfaces alone for the
start]?  Is it easier to retrofit that code or to replace it with something new?


Pending callouts are now stored in large array of unsorted lists, where last
bits of callout time is the array index. It is quite effective for insert/delete
operation. It is ineffective for getting next event time needed for new event
timers, but it is rare operation. Using arbitrary time values in that case is
very problematic. It would require complete internal redesign.



I see.  Thank you for the insight!

One possible hack that I can think of is to use "pseudo-ticks" in the callout
implementation instead of real ticks.  E.g. such a pseudo-tick could be set
equal to 1 microsecond instead of 1/hz (it could be tunable).  Then, of course,
instead of driving the callouts via hardclock/softclock, they would have to be
driven directly from event timers.  And they would have to use current
microseconds uptime instead of ticks, obviously.  This would also require a
revision of types used to store timeout values.  Current 'int' would not be
adequate anymore, it seems.


I don't think it will work. With so high a frequency it will make the 
callout distribution over the array almost random. While insert / remove 
operations will still be cheap, the search for the next event will be 1000 
times more expensive. Unless you propose increasing the array size 1000 
times, it will not be better than just using a single unsorted list,



It looks like Timer_Wheel_T from ACE has some useful enhancements in this 
direction.

BTW, it seems that with int ticks and HZ of 1000, ticks would overflow from
INT_MAX to INT_MIN in ~24 days.  I can imagine that some code might get confused
by such an overflow.  But that's a different topic.


Probably you are right. I've seen a few dangerous comparisons in the ULE code.

--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-03-02 Thread Alexander Motin

Hi.

On 03/03/12 05:24, Adrian Chadd wrote:

mav@, can you please take a look at George's traces and see if there's
anything obviously silly going on?
He's reporting that your ULE work hasn't improved his (very) degenerate case.


As I can see, my patch has nothing to do with the problem. My patch 
improves SMP load balancing, while in this case the problem is different. 
In some cases, when not all CPUs are busy, my patch could mask the problem 
by using more CPUs, but not in this case, when dnets consumes all 
available CPUs.


I still don't feel very comfortable with the ULE math, but as I understand 
it, in both illustrated cases there is a conflict between clearly CPU-bound 
dnets threads, which consume all available CPU and never do voluntary 
context switches, and more or less interactive other threads. If the other 
threads were detected as "interactive" in ULE terms, they would preempt 
the dnets threads and everything would be fine. But "batch" (in ULE terms) 
threads never preempt each other, switching context only about 10 times 
per second, as hardcoded in the sched_slice variable. A kernel build by 
definition consumes too much CPU time to be marked "interactive". 
The exo-helper-1 thread in interact.out could potentially be marked 
"interactive", but possibly once it consumed some CPU and became "batch", 
it is difficult for it to get back, as waiting on a runq is not counted 
as sleep, and each time it gets running it has some new work to 
do, so it remains "batch". Maybe if CPU time accounting were more 
precise it would work better (by accounting for those short periods when 
threads really sleep voluntarily), but not with the present sampled logic 
with 1ms granularity. As a result, while the dnets threads each time consume 
full 100ms time slices, the other threads are starving, getting to run only 
10 times per second and voluntarily switching out after just a few 
milliseconds.



On 2 March 2012 16:14, George Mitchell  wrote:

On 03/02/12 18:06, Adrian Chadd wrote:


Hi George,

Have you thought about providing schedgraph traces with your
particular workload?

I'm sure that'll help out the scheduler hackers quite a bit.

THanks,


Adrian



I posted a couple back in December but I haven't created any more
recently:

http://www.m5p.com/~george/ktr-ule-problem.out
http://www.m5p.com/~george/ktr-ule-interact.out

To the best of my knowledge, no one ever examined them.   -- George



--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-03-03 Thread Alexander Motin

On 03/03/12 10:59, Adrian Chadd wrote:

Right. Is this written up in a PR somewhere explaining the problem in
as much depth has you just have?


I have no idea. I am new to this area and haven't looked at the PRs yet.


And thanks for this, it's great to see some further explanation of the
current issues the scheduler faces.


By the way, I've just reproduced the problem with compilation. On a 
dual-core system net/mpd5 compilation in one stream takes 17 seconds. 
But with two low-priority non-interactive CPU-burning threads running it 
takes 127 seconds. I'll try to analyze it more now. I have a feeling that 
there could be more factors causing priority violation than I've 
described below.



On 2 March 2012 23:40, Alexander Motin  wrote:

On 03/03/12 05:24, Adrian Chadd wrote:


mav@, can you please take a look at George's traces and see if there's
anything obviously silly going on?
He's reporting that your ULE work hasn't improved his (very) degenerate
case.



As I can see, my patch has nothing to do with the problem. My patch improves
SMP load balancing, while in this case problem is different. In some cases,
when not all CPUs are busy, my patch could mask the problem by using more
CPUs, but not in this case when dnets consumes all available CPUs.

I still not feel very comfortable with ULE math, but as I understand, in
both illustrated cases there is a conflict between clearly CPU-bound dnets
threads, that consume all available CPU and never do voluntary context
switches, and more or less interactive other threads. If other threads
detected to be "interactive" in ULE terms, they should preempt dnets threads
and everything will be fine. But "batch" (in ULE terms) threads never
preempt each other, switching context only about 10 times per second, as
hardcoded in sched_slice variable. Kernel build by definition consumes too
much CPU time to be marked "interactive". exo-helper-1 thread in
interact.out could potentially be marked "interactive", but possibly once it
consumed some CPU to become "batch", it is difficult for it to get back, as
waiting in a runq is not counted as sleep and each time it is getting
running, it has some new work to do, so it remains "batch". May be if CPU
time accounting was more precise it would work better (by accounting those
short periods when threads really sleeps voluntary), but not with present
sampled logic with 1ms granularity. As result, while dnets threads each time
consume full 100ms time slices, other threads are starving, getting running
only 10 times per second to voluntary switch out in just a few milliseconds.



On 2 March 2012 16:14, George Mitchellwrote:


On 03/02/12 18:06, Adrian Chadd wrote:



Hi George,

Have you thought about providing schedgraph traces with your
particular workload?

I'm sure that'll help out the scheduler hackers quite a bit.

THanks,


Adrian



I posted a couple back in December but I haven't created any more
recently:

http://www.m5p.com/~george/ktr-ule-problem.out
http://www.m5p.com/~george/ktr-ule-interact.out

To the best of my knowledge, no one ever examined them.   -- George




--
Alexander Motin




--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-03-03 Thread Alexander Motin

On 03/03/12 11:12, Alexander Motin wrote:

On 03/03/12 10:59, Adrian Chadd wrote:

Right. Is this written up in a PR somewhere explaining the problem in
as much depth has you just have?


Have no idea. I am new at this area and haven't looked on PRs yet.


And thanks for this, it's great to see some further explanation of the
current issues the scheduler faces.


By the way I've just reproduced the problem with compilation. On
dual-core system net/mpd5 compilation in one stream takes 17 seconds.
But with two low-priority non-interactive CPU-burning threads running it
takes 127 seconds. I'll try to analyze it more now. I have feeling that
there could be more factors causing priority violation than I've
described below.


On closer look my test appeared to be not so clean, but instead much more 
interesting. Because of NFS use, there are not just context switches 
between make, cc and as, which are possibly optimized a bit now, but many 
short sleeps when the background process gets running. As a result, at some 
moments I see such wonderful traces for cc:


wait on runq for 81ms,
run for 37us,
wait NFS for 202us,
wait on runq for 92ms,
run for 30us,
wait NFS for 245us,
wait on runq for 53ms,
run for 142us,

About 0.05% CPU time use for a process that is supposed to be CPU-bound. And 
while for a small run/sleep time ratio the process could be nominated for 
interactivity, with such small absolute sleep times it will need ages to 
compensate for the 5 seconds of "batch" run history recorded before.



On 2 March 2012 23:40, Alexander Motin wrote:

On 03/03/12 05:24, Adrian Chadd wrote:


mav@, can you please take a look at George's traces and see if there's
anything obviously silly going on?
He's reporting that your ULE work hasn't improved his (very) degenerate
case.



As I can see, my patch has nothing to do with the problem. My patch
improves
SMP load balancing, while in this case problem is different. In some
cases,
when not all CPUs are busy, my patch could mask the problem by using
more
CPUs, but not in this case when dnets consumes all available CPUs.

I still not feel very comfortable with ULE math, but as I understand, in
both illustrated cases there is a conflict between clearly CPU-bound
dnets
threads, that consume all available CPU and never do voluntary context
switches, and more or less interactive other threads. If other threads
detected to be "interactive" in ULE terms, they should preempt dnets
threads
and everything will be fine. But "batch" (in ULE terms) threads never
preempt each other, switching context only about 10 times per second, as
hardcoded in sched_slice variable. Kernel build by definition
consumes too
much CPU time to be marked "interactive". exo-helper-1 thread in
interact.out could potentially be marked "interactive", but possibly
once it
consumed some CPU to become "batch", it is difficult for it to get
back, as
waiting in a runq is not counted as sleep and each time it is getting
running, it has some new work to do, so it remains "batch". May be if
CPU
time accounting was more precise it would work better (by accounting
those
short periods when threads really sleeps voluntary), but not with
present
sampled logic with 1ms granularity. As result, while dnets threads
each time
consume full 100ms time slices, other threads are starving, getting
running
only 10 times per second to voluntary switch out in just a few
milliseconds.



On 2 March 2012 16:14, George Mitchell wrote:


On 03/02/12 18:06, Adrian Chadd wrote:



Hi George,

Have you thought about providing schedgraph traces with your
particular workload?

I'm sure that'll help out the scheduler hackers quite a bit.

THanks,


Adrian



I posted a couple back in December but I haven't created any more
recently:

http://www.m5p.com/~george/ktr-ule-problem.out
http://www.m5p.com/~george/ktr-ule-interact.out

To the best of my knowledge, no one ever examined them. -- George


--
Alexander Motin



--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-03-03 Thread Alexander Motin

On 03.03.2012 17:26, Ivan Klymenko wrote:

I have FreeBSD 10.0-CURRENT #0 r232253M
Patch in r232454 broken my DRM
My system patched http://people.freebsd.org/~kib/drm/all.13.5.patch
After build kernel with only r232454 patch Xorg log contains:
...
[   504.865] [drm] failed to load kernel module "i915"
[   504.865] (EE) intel(0): [drm] Failed to open DRM device for 
pci::00:02.0: File exists
[   504.865] (EE) intel(0): Failed to become DRM master.
[   504.865] (**) intel(0): Depth 24, (--) framebuffer bpp 32
[   504.865] (==) intel(0): RGB weight 888
[   504.865] (==) intel(0): Default visual is TrueColor
[   504.865] (**) intel(0): Option "DRI" "True"
[   504.865] (**) intel(0): Option "TripleBuffer" "True"
[   504.865] (II) intel(0): Integrated Graphics Chipset: Intel(R) Sandybridge 
Mobile (GT2)
[   504.865] (--) intel(0): Chipset: "Sandybridge Mobile (GT2)"
and black screen...

do not even know why it happened ... :(


I've just rebuilt my Core2Duo laptop with r232454 and have no problems 
with Xorg and the (at least old) Intel video driver. I'm writing 
this mail from it now. I've now started rebuilding my home server. I am not 
sure how this change could cause such a specific effect. Are you sure you 
haven't changed anything else unexpectedly?


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-03-03 Thread Alexander Motin

On 03.03.2012 18:57, Mario Lobo wrote:

On Saturday 03 March 2012 13:30:50 Alexander Motin wrote:

On 03.03.2012 17:26, Ivan Klymenko wrote:

I have FreeBSD 10.0-CURRENT #0 r232253M
Patch in r232454 broken my DRM
My system patched http://people.freebsd.org/~kib/drm/all.13.5.patch
After build kernel with only r232454 patch Xorg log contains:
...
[   504.865] [drm] failed to load kernel module "i915"
[   504.865] (EE) intel(0): [drm] Failed to open DRM device for
pci::00:02.0: File exists [   504.865] (EE) intel(0): Failed to
become DRM master.
[   504.865] (**) intel(0): Depth 24, (--) framebuffer bpp 32
[   504.865] (==) intel(0): RGB weight 888
[   504.865] (==) intel(0): Default visual is TrueColor
[   504.865] (**) intel(0): Option "DRI" "True"
[   504.865] (**) intel(0): Option "TripleBuffer" "True"
[   504.865] (II) intel(0): Integrated Graphics Chipset: Intel(R)
Sandybridge Mobile (GT2) [   504.865] (--) intel(0): Chipset:
"Sandybridge Mobile (GT2)"
and black screen...

do not even know why it happened ... :(


I've just rebuilt my Core2Duo laptop with r232454 and have no any
problem with Xorg and (at least old) Intel video driver. Now writing
this mail from it. Now started rebuilding of my home server. I am not
sure how this change can cause such specific effect. Are you sure you
haven't changed anything else unexpectedly?


I'd like to test the patch on my 8.2-STABLE desktop.
Phenom II quad / 16 GRam

I have sched.htt40.patch here. Is this the latest?


It is, mostly. Code committed to the HEAD was slightly modified. Here is 
the patch as it is in SVN:

http://svnweb.freebsd.org/base/head/sys/kern/sched_ule.c?r1=229429&r2=232207&view=patch

And today I've fixed one found bug:
http://svnweb.freebsd.org/base/head/sys/kern/sched_ule.c?r1=232207&r2=232454&view=patch


Will it apply cleanly on it?
Any "gotchas"?


Sorry, I have no idea about the differences from 8-STABLE. You may try.

--
Alexander Motin


[RFC] Thread CPU load estimation for ULE, not sampled by hardclock()

2012-03-09 Thread Alexander Motin

Hi.

At this time the ULE scheduler uses hardclock() ticks via sched_tick() to 
estimate threads' CPU load. That strictly limits its best possible 
precision and is one more reason to call hardclock() on every HZ 
tick, even when there are no callouts scheduled.


I've made a patch to do CPU load estimation in ULE in an event-based way. 
It won't immediately increase precision, as I still use the ticks counter 
as the time source, but it should be trivial to use a more precise one 
later (if some global, fast and reliable one is available). What it gives 
now is that the sched_tick() function is now empty for ULE, same as it 
always was for 4BSD. With some more changes in other areas it should allow 
running hardclock() at the full HZ rate only on one non-idle CPU, not on 
each one as it is now. One more small step toward a tick-less kernel.


Patch can be found here:
http://people.freebsd.org/~mav/sched.notick4.patch

Any comments?


An important theoretical question: what exactly do we want to see as CPU 
load? ps(1) defines %cpu as a "decaying average over up to a minute of 
previous (real) time", but I think that definition is not exactly true now.


As I understand it, 4BSD calculates a pure decaying average without a time 
limit -- if not for precision limitations, the CPU load value would include 
data from the entire lifetime of the thread, with exponentially decreasing 
weight for old values. ULE behaves partially alike, but has a strict idle 
deadline after which the load is considered equal to zero, and in some 
cases it decays more linearly, which also seems to depend on if/how often 
the CPU load is read.


So which way should be considered right? Should it be a clear decaying 
average as 4BSD does, or should it be something closer to "average load 
for the last N seconds", following the ideas I see in ULE now?
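For reference, a pure decaying average in the 4BSD sense can be sketched 
like this (illustrative only; the real code uses fixed-point math, 
load-dependent decay and different constants):

/*
 * Exponentially decaying CPU load estimate: each period the old value is
 * scaled down and the ticks used in the new period are added on top, so
 * all history contributes with exponentially decreasing weight and there
 * is no hard cutoff.
 */
#define DECAY_NUM       9       /* assumed decay of 9/10 per period */
#define DECAY_DEN       10

static unsigned long
decay_cpu_load(unsigned long old_estimate, unsigned long new_ticks)
{
        return (old_estimate * DECAY_NUM / DECAY_DEN + new_ticks);
}

An "average load for the last N seconds" would instead keep only a window of 
recent samples and forget everything older, which is closer to what the ULE 
idle deadline effectively does.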


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-05 Thread Alexander Motin

On 05.04.2012 21:12, Arnaud Lacombe wrote:

Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin:

On 17.02.2012 18:53, Arnaud Lacombe wrote:


On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motinwrote:


On 02/15/12 21:54, Jeff Roberson wrote:


On Wed, 15 Feb 2012, Alexander Motin wrote:


I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch



This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.



I haven't got good idea yet about balancing priorities, but I've
rewritten
balancer itself. As soon as sched_lowest() / sched_highest() are more
intelligent now, they allowed to remove topology traversing from the
balancer itself. That should fix double-swapping problem, allow to keep
some
affinity while moving threads and make balancing more fair. I did number
of
tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8 and
16
threads everything is stationary as it should. With 9 threads I see
regular
and random load move between all 8 CPUs. Measurements on 5 minutes run
show
deviation of only about 5 seconds. It is the same deviation as I see
caused
by only scheduling of 16 threads on 8 cores without any balancing needed
at
all. So I believe this code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be a final patch of this series (more to come :)) and if
there will be no problems or objections, I am going to commit it (except
some debugging KTRs) in about ten days. So now it's a good time for
reviews
and testing. :)


is there a place where all the patches are available ?



All my scheduler patches are cumulative, so all you need is only the last
mentioned here sched.htt40.patch.


You may want to have a look to the result I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual package platform, your patch
is not a definite win.


But in some cases, especially for multi-socket systems, to let it show its
best, you may want to apply additional patch from avg@ to better detect CPU
topology:
https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd


test I conducted specifically for this patch did not showed much improvement...


If I understand right, this test runs thousands of threads sending and 
receiving data over pipes. It is quite likely that all CPUs will always be 
busy and so load balancing is not really important in this test. 
What looks good is that the more complicated new code is not slower than 
the old one.


While this test seems very scheduler-intensive, it may depend on many 
other factors, such as syscall performance, context switches, etc. I'll 
try to play more with it.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Alexander Motin

On 04/06/12 17:13, Attilio Rao wrote:

Il 05 aprile 2012 19:12, Arnaud Lacombe  ha scritto:

Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin:

On 17.02.2012 18:53, Arnaud Lacombe wrote:


On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motinwrote:


On 02/15/12 21:54, Jeff Roberson wrote:


On Wed, 15 Feb 2012, Alexander Motin wrote:


I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch



This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.



I haven't got good idea yet about balancing priorities, but I've
rewritten
balancer itself. As soon as sched_lowest() / sched_highest() are more
intelligent now, they allowed to remove topology traversing from the
balancer itself. That should fix double-swapping problem, allow to keep
some
affinity while moving threads and make balancing more fair. I did number
of
tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8 and
16
threads everything is stationary as it should. With 9 threads I see
regular
and random load move between all 8 CPUs. Measurements on 5 minutes run
show
deviation of only about 5 seconds. It is the same deviation as I see
caused
by only scheduling of 16 threads on 8 cores without any balancing needed
at
all. So I believe this code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be a final patch of this series (more to come :)) and if
there will be no problems or objections, I am going to commit it (except
some debugging KTRs) in about ten days. So now it's a good time for
reviews
and testing. :)


is there a place where all the patches are available ?



All my scheduler patches are cumulative, so all you need is only the last
mentioned here sched.htt40.patch.


You may want to have a look to the result I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual package platform, your patch
is not a definite win.


But in some cases, especially for multi-socket systems, to let it show its
best, you may want to apply additional patch from avg@ to better detect CPU
topology:
https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd


test I conducted specifically for this patch did not showed much improvement...


Can you please clarify on this point?
The test you did included cases where the topology was detected badly
against cases where the topology was detected correctly as a patched
kernel (and you still didn't see a performance improvement), in terms
of cache line sharing?


At this moment SCHED_ULE does almost nothing in terms of cache line 
sharing affinity (though it is probably worth some further experiments). 
What this patch may improve is the opposite case -- reducing cache 
sharing pressure for cache-hungry applications. For example, proper 
cache topology detection (such as the lack of a global L3 cache, but a 
shared L2 per pair of cores on Core2Quad class CPUs) increases pbzip2 
performance when the number of threads is less than the number of CPUs 
(i.e. when there is room for optimization).
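
As an aside, this kind of cache-sharing experiment can also be 
reproduced by hand by pinning the workload to a chosen pair of cores. 
Below is a minimal user-space sketch using FreeBSD's 
cpuset_setaffinity(2); the assumption that cores 0 and 1 share an L2 
cache is purely illustrative and depends on the actual topology of the 
machine:

/*
 * Illustrative only: pin the current process to cores 0 and 1, which are
 * assumed (not guaranteed) to share an L2 cache on a Core2Quad-class CPU.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* first core of the assumed L2-sharing pair */
	CPU_SET(1, &mask);	/* second core of the pair */

	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0)
		err(1, "cpuset_setaffinity");

	printf("pinned to CPUs 0 and 1; run the benchmark from this process\n");
	return (0);
}

Comparing runs pinned to a core pair that shares an L2 against a pair 
that does not gives a rough idea of the effect proper topology detection 
lets the scheduler exploit automatically.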


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Alexander Motin

On 04/06/12 17:30, Attilio Rao wrote:

Il 06 aprile 2012 15:27, Alexander Motin  ha scritto:

On 04/06/12 17:13, Attilio Rao wrote:


Il 05 aprile 2012 19:12, Arnaud Lacombeha scritto:


Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin:


On 17.02.2012 18:53, Arnaud Lacombe wrote:



On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin
  wrote:



On 02/15/12 21:54, Jeff Roberson wrote:



On Wed, 15 Feb 2012, Alexander Motin wrote:



I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the
rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch




This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load
balancing
as well.




I haven't got good idea yet about balancing priorities, but I've
rewritten
balancer itself. As soon as sched_lowest() / sched_highest() are more
intelligent now, they allowed to remove topology traversing from the
balancer itself. That should fix double-swapping problem, allow to
keep
some
affinity while moving threads and make balancing more fair. I did
number
of
tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8
and
16
threads everything is stationary as it should. With 9 threads I see
regular
and random load move between all 8 CPUs. Measurements on 5 minutes run
show
deviation of only about 5 seconds. It is the same deviation as I see
caused
by only scheduling of 16 threads on 8 cores without any balancing
needed
at
all. So I believe this code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be a final patch of this series (more to come :)) and
if
there will be no problems or objections, I am going to commit it
(except
some debugging KTRs) in about ten days. So now it's a good time for
reviews
and testing. :)


is there a place where all the patches are available ?




All my scheduler patches are cumulative, so all you need is only the
last
mentioned here sched.htt40.patch.


You may want to have a look to the result I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual package platform, your patch
is not a definite win.


But in some cases, especially for multi-socket systems, to let it show
its
best, you may want to apply additional patch from avg@ to better detect
CPU
topology:

https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd


test I conducted specifically for this patch did not showed much
improvement...



Can you please clarify on this point?
The test you did included cases where the topology was detected badly
against cases where the topology was detected correctly as a patched
kernel (and you still didn't see a performance improvement), in terms
of cache line sharing?



At this moment SCHED_ULE does almost nothing in terms of cache line sharing
affinity (though it probably worth some further experiments). What this
patch may improve is opposite case -- reduce cache sharing pressure for
cache-hungry applications. For example, proper cache topology detection
(such as lack of global L3 cache, but shared L2 per pairs of cores on
Core2Quad class CPUs) increases pbzip2 performance when number of threads is
less then number of CPUs (i.e. when there is place for optimization).


My asking is not referred to your patch really.
I just wanted to know if he correctly benchmark a case where the
topology was screwed up and then correctly recognized by avg's patch
in terms of cache level aggregation (it wasn't referred to your patch
btw).


I understand. I've just described a test case where properly detected 
topology could give a benefit. What the test really does is indeed a 
good question.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-09 Thread Alexander Motin

On 04/05/12 21:45, Alexander Motin wrote:

On 05.04.2012 21:12, Arnaud Lacombe wrote:

Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin:

On 17.02.2012 18:53, Arnaud Lacombe wrote:


On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin
wrote:


On 02/15/12 21:54, Jeff Roberson wrote:


On Wed, 15 Feb 2012, Alexander Motin wrote:


I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of cache related things from the patch and made the
rest
of things more strict and predictable:
http://people.freebsd.org/~mav/sched.htt34.patch



This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load
balancing
as well.



I haven't got good idea yet about balancing priorities, but I've
rewritten
balancer itself. As soon as sched_lowest() / sched_highest() are more
intelligent now, they allowed to remove topology traversing from the
balancer itself. That should fix double-swapping problem, allow to
keep
some
affinity while moving threads and make balancing more fair. I did
number
of
tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. With 4, 8
and
16
threads everything is stationary as it should. With 9 threads I see
regular
and random load move between all 8 CPUs. Measurements on 5 minutes run
show
deviation of only about 5 seconds. It is the same deviation as I see
caused
by only scheduling of 16 threads on 8 cores without any balancing
needed
at
all. So I believe this code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be a final patch of this series (more to come :))
and if
there will be no problems or objections, I am going to commit it
(except
some debugging KTRs) in about ten days. So now it's a good time for
reviews
and testing. :)


is there a place where all the patches are available ?



All my scheduler patches are cumulative, so all you need is only the
last
mentioned here sched.htt40.patch.


You may want to have a look to the result I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual package platform, your patch
is not a definite win.


But in some cases, especially for multi-socket systems, to let it
show its
best, you may want to apply additional patch from avg@ to better
detect CPU
topology:
https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd



test I conducted specifically for this patch did not showed much
improvement...


If I understand right, this test runs thousands of threads sending and
receiving data over the pipes. It is quite likely that all CPUs will be
always busy and so load balancing is not really important in this test,
What looks good is that more complicated new code is not slower then old
one.

While this test seems very scheduler-intensive, it may depend on many
other factors, such as syscall performance, context switch, etc. I'll
try to play more with it.


My profiling on an 8-core Core i7 system shows that the sched_ule.c 
code, while at the top of the profile, still consumes only 13% of 
kernel CPU time while doing a million context switches per second. 
cpu_search(), affected by this patch, takes even less -- only 8%. The 
rest of the time is spread between many other small functions. I did 
some optimizations in r234066 to reduce cpu_search() time to 6%, but 
given how unstable the results of this test are, hardly any difference 
there can really be measured by it.


I have a strong feeling that while this test may be interesting for 
profiling, its results in the first place depend not on how fast the 
scheduler is, but on pipe capacity and similar things. Can somebody 
hint me at what -- except pipe capacity and the context switch to an 
unblocked receiver -- prevents the sender from sending all its data in 
one batch and the receiver from then receiving it all in one batch? If 
different OSes have different policies there, I think the results could 
be incomparable.
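
To make the question concrete, here is a minimal standalone sketch (not 
hackbench itself; the message size and count are arbitrary) of the 
producer/consumer pattern involved. Whether the writer ever blocks -- 
and therefore forces a context switch to the reader -- depends on how 
much unread data the kernel's pipe buffer can hold, not on the 
scheduler alone:

/*
 * Minimal producer/consumer over a pipe; batching behaviour depends on
 * pipe capacity relative to the amount of unread data.
 */
#include <sys/wait.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define	MSG_SIZE	100	/* bytes per message (arbitrary) */
#define	MSG_COUNT	10000	/* messages per run (arbitrary) */

int
main(void)
{
	char buf[MSG_SIZE];
	int fds[2], i;
	pid_t pid;

	if (pipe(fds) != 0)
		err(1, "pipe");
	if ((pid = fork()) < 0)
		err(1, "fork");

	if (pid == 0) {				/* child: reader */
		close(fds[1]);
		for (i = 0; i < MSG_COUNT; i++)
			if (read(fds[0], buf, sizeof(buf)) <= 0)
				err(1, "read");
		_exit(0);
	}

	close(fds[0]);				/* parent: writer */
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < MSG_COUNT; i++)		/* blocks only once the pipe fills */
		if (write(fds[1], buf, sizeof(buf)) != sizeof(buf))
			err(1, "write");
	close(fds[1]);
	waitpid(pid, NULL, 0);
	printf("sent %d messages of %d bytes\n", MSG_COUNT, MSG_SIZE);
	return (0);
}

How often the writer actually blocks depends on the pipe buffering 
policy of the particular OS, which is exactly why such numbers are hard 
to compare across systems.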


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 19:58, Arnaud Lacombe wrote:

2012/4/9 Alexander Motin:

[...]

I have strong feeling that while this test may be interesting for profiling,
it's own results in first place depend not from how fast scheduler is, but
from the pipes capacity and other alike things. Can somebody hint me what
except pipe capacity and context switch to unblocked receiver prevents
sender from sending all data in batch and then receiver from receiving them
all in batch? If different OSes have different policies there, I think
results could be incomparable.


Let me disagree on your conclusion. If OS A does a task in X seconds,
and OS B does the same task in Y seconds, if Y>  X, then OS B is just
not performing good enough. Internal implementation's difference for
the task can not be waived as an excuse for result's comparability.


Sure, numbers are always numbers, but the question is what they are 
showing. Understanding the results is even more important for purely 
synthetic tests like this, especially when one test run gives 25 
seconds while another gives 50. This test is not completely clear to 
me, and that is what I said.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 20:18, Alexander Motin wrote:

On 04/10/12 19:58, Arnaud Lacombe wrote:

2012/4/9 Alexander Motin:

[...]

I have strong feeling that while this test may be interesting for
profiling,
it's own results in first place depend not from how fast scheduler
is, but
from the pipes capacity and other alike things. Can somebody hint me
what
except pipe capacity and context switch to unblocked receiver prevents
sender from sending all data in batch and then receiver from
receiving them
all in batch? If different OSes have different policies there, I think
results could be incomparable.


Let me disagree on your conclusion. If OS A does a task in X seconds,
and OS B does the same task in Y seconds, if Y> X, then OS B is just
not performing good enough. Internal implementation's difference for
the task can not be waived as an excuse for result's comparability.


Sure, numbers are always numbers, but the question is what are they
showing? Understanding of the test results is even more important for
purely synthetic tests like this. Especially when one test run gives 25
seconds, while another gives 50. This test is not completely clear to me
and that is what I've told.


A small illustration of my point. Simple scheduler tuning affects the 
thread preemption policy and changes this test's results by a factor of 
three:


mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 9.568

mav@test:/test/hackbench# sysctl kern.sched.interact=0
kern.sched.interact: 30 -> 0
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 5.163

mav@test:/test/hackbench# sysctl kern.sched.interact=100
kern.sched.interact: 0 -> 100
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 3.190

I think it affects the balance between pipe latency and bandwidth, 
while the test measures only the latter. Clearly, the conclusion drawn 
from these numbers depends on what we want to have.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 21:46, Arnaud Lacombe wrote:

On Tue, Apr 10, 2012 at 1:53 PM, Alexander Motin  wrote:

On 04/10/12 20:18, Alexander Motin wrote:

On 04/10/12 19:58, Arnaud Lacombe wrote:

2012/4/9 Alexander Motin:

I have strong feeling that while this test may be interesting for
profiling,
it's own results in first place depend not from how fast scheduler
is, but
from the pipes capacity and other alike things. Can somebody hint me
what
except pipe capacity and context switch to unblocked receiver prevents
sender from sending all data in batch and then receiver from
receiving them
all in batch? If different OSes have different policies there, I think
results could be incomparable.


Let me disagree on your conclusion. If OS A does a task in X seconds,
and OS B does the same task in Y seconds, if Y>  X, then OS B is just
not performing good enough. Internal implementation's difference for
the task can not be waived as an excuse for result's comparability.



Sure, numbers are always numbers, but the question is what are they
showing? Understanding of the test results is even more important for
purely synthetic tests like this. Especially when one test run gives 25
seconds, while another gives 50. This test is not completely clear to me
and that is what I've told.


Small illustration to my point. Simple scheduler tuning affects thread
preemption policy and changes this test results in three times:

mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 9.568

mav@test:/test/hackbench# sysctl kern.sched.interact=0
kern.sched.interact: 30 ->  0
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 5.163

mav@test:/test/hackbench# sysctl kern.sched.interact=100
kern.sched.interact: 0 ->  100
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 3.190

I think it affects balance between pipe latency and bandwidth, while test
measures only the last. It is clear that conclusion from these numbers
depends on what do we want to have.


I don't really care on this point, I'm only testing default values, or
more precisely, whatever developers though default values would be
good.

Btw, you are testing 3 differents configuration. Different results are
expected. What worries me more is the rather the huge instability on
the *same* configuration, say on a pipe/thread/70 groups/600
iterations run, where results range from 2.7s[0] to 7.4s, or a
socket/thread/20 groups/1400 iterations run, where results range from
2.4s to 4.5s.


For the reason I pointed out in my first message, this test is 
_extremely_ sensitive to the context switch interval. The more 
aggressively the scheduler switches threads, the smaller the pipe 
latency becomes, but so does the bandwidth. During a test run the 
scheduler constantly recalculates the interactivity index for each 
thread, trying to balance latency against switching overhead. With 
hundreds of threads running simultaneously and interfering with each 
other, it is quite an unpredictable process.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: ULE/sched issues on stable/9 - why isn't preemption occuring?

2012-05-31 Thread Alexander Motin

On 05/31/12 01:02, Adrian Chadd wrote:

I've re-run the test with powerd and sleep state stuff disabled - lo
and behold, UDP tests are now up around 240-250MBit, what I'd expect
for this 2 stream 11n device.

So why is it that I lose roughly 80MBit of throughput with powerd and
C2/C3 enabled, when there's plenty of CPU going around? The NIC
certainly isn't going to sleep (I've not even added that code.)


I've seen penalties from both of them myself on a latency-sensitive 
single-threaded disk benchmark: 17K IOPS instead of 30K IOPS without 
them.


The problem with powerd was that the CPU load during the test was below 
powerd's idle threshold, so it decided to drop the frequency, which 
proportionally increased the I/O handling latency. powerd can't know 
that, while the average CPU load is low, request handling latency is 
critical for this test.


As for C-states, in my tests on a Core2Duo system I've noticed that 
while ACPI reports an equal exit latency of 1us for the C1 and C2 
states there, they are not really equal -- C2 exit is measurably 
slower. On newer generations of systems (Core i) I've never seen C2 
latency reported as 1us; instead it has much higher values. With a 
realistically large value there, the system should automatically avoid 
entering those states under high interrupt rates so as not to take the 
penalty.
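
For reference, the latencies ACPI reports can be read back 
programmatically. Here is a small sketch using sysctlbyname(3); the 
dev.cpu.0.* and hw.acpi.cpu.* names assume the usual FreeBSD ACPI CPU 
attachment, and the exact cx_supported format varies between releases:

/*
 * Print the Cx states and exit latencies ACPI reports for cpu0, plus the
 * deepest state the kernel is currently allowed to enter.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

static void
print_str_sysctl(const char *name)
{
	char buf[256];
	size_t len = sizeof(buf);

	if (sysctlbyname(name, buf, &len, NULL, 0) != 0)
		err(1, "sysctlbyname(%s)", name);
	printf("%s: %s\n", name, buf);
}

int
main(void)
{
	/* e.g. "C1/1 C2/96 C3/245" -- state / exit latency in microseconds */
	print_str_sysctl("dev.cpu.0.cx_supported");
	/* the deepest Cx state currently allowed */
	print_str_sysctl("hw.acpi.cpu.cx_lowest");
	return (0);
}

The same values are of course visible from the command line via 
sysctl(8).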


But that is all about latency-sensitive tests. I am surprised to see 
such results for the network benchmarks. Handling packets in bursts 
should hide that latency -- unless you are losing packets because of 
overflows during these delays.


--
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Giant free GEOM/CAM XPT

2010-07-22 Thread Alexander Motin
Jerry Toung wrote:
> Hello List,
> while going through the xpt code (8.0 RELEASE), it seems to me that some
> gains can be had
> in src/sys/geom/geom_disk.c where dp->d_strategy(bp2) is surrounded by Giant
> lock. Especially in the case
> where one has 2+ controllers on the system with /dev/daXX attached to them
> during heavy I/O.

Giant is taken there only if the DISKFLAG_NEEDSGIANT flag is set, which
the da driver does not do.

> I am currently trying to get rid of giant there, but it branches in sys/cam
> and sys/dev/twa. Definitely not a
> trivial exercise. The dependency on Giant seems to come from the XPT code.
> 
> would be neat  if I could just use the SIM lock, which  is per controller.
> 
> Question: do you think it's worth the effort?

I think you misunderstood something. Most of CAM is protected by SIM
locks. If some SIMs use Giant for that purpose, that is their own
problem. But as far as I can see, twa uses its own lock, not Giant.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Intel TurboBoost in practice

2010-07-24 Thread Alexander Motin
Hi.

I've made some small observations of the Intel TurboBoost technology
under FreeBSD. This technology allows Intel Core i5/i7 CPUs to raise the
frequency of some cores if the other cores are idle and power/thermal
conditions permit. A CPU core is counted as idle if it has been put into
the C3 or deeper power state (which may correspond to the ACPI C2/C3
states). So some tuning may be needed to reach maximal effectiveness.

Here is my test case: FreeBSD 9-CURRENT on a Core i5 650 CPU, 3.2GHz +
1/2 TurboBoost steps (+133/+266MHz), with the boxed cooler in the open
air. I measured the build time of net/mpd5 from sources, using only one
CPU core (cpuset -l 0 time make).

Untuned system (hz=1000): 14.15 sec
Enabled ACPI C2 (hz=1000+C2): 13.85 sec
Enabled ACPI C3 (hz=1000+C3): 13.91 sec
Reduced HZ (hz=100):  14.16 sec
Enabled ACPI C2 (hz=100+C2):  13.85 sec
Enabled ACPI C3 (hz=100+C3):  13.86 sec
Timers tuned* (hz=100):   14.10 sec
Enabled ACPI C2 (hz=100+C2):  13.71 sec
Enabled ACPI C3 (hz=100+C3):  13.73 sec

All numbers were measured a few times and are repeatable to within +/-0.01 sec.

*) Timers were tuned to reduce interrupt rates and correspondingly
increase the idle cores' sleep time. These lines were added to
loader.conf:
kern.eventtimer.timer1=i8254
kern.eventtimer.timer2=NONE
kern.eventtimer.singlemul=1
kern.hz="100"

PS: In this case the benefit is small, but it is the least that can be
achieved; it depends on the CPU model. Some models allow the frequency
to be raised by up to 6 steps (+798MHz).

PPS: I expect an even better effect can be achieved by further reducing
interrupt rates on idle CPUs.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Intel TurboBoost in practice

2010-07-24 Thread Alexander Motin
Norikatsu Shigemura wrote:
> On Sat, 24 Jul 2010 16:53:10 +0300
> Alexander Motin  wrote:
>> PS: In this case benefit is small, but it is the least that can be
>> achieved, depending on CPU model. Some models allow frequency to be
>> risen by up to 6 steps (+798MHz).
> 
>   I tested on Core i7 640UM (Arrandale 1.2GHz -> 2.26GHz) with
>   openssl speed (w/o aesni(4)) and
>   /usr/src/tools/tools/crypto/cryptotest.c (w/ aesni(4)).
> 
>   http://people.freebsd.org/~nork/aesni/aes128cbc-noaesni.pdf [1]
>   http://people.freebsd.org/~nork/aesni/aes128cbc-aesni.pdf [2]
> 
>   In my environment, according to aes128cbc-noaesni.pdf, at least,
>   30% performace up by Turbo Boost (I think).

The numbers are interesting, though they do not prove much, because
many other factors may influence the result. It would be more
informative to run the tests with the C1 and C2/C3 states used.

>   And according to aes128cbc-aesni.pdf, at least, 100% performance
>   up by Turbo Boost (I think).

This IMHO is even more questionable. A single core, even boosted,
shouldn't be faster than 2, 3 or 4. I would say there is some
scalability problem -- maybe context switches, locking, or something
else.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Intel TurboBoost in practice

2010-07-24 Thread Alexander Motin
Rui Paulo wrote:
> On 24 Jul 2010, at 14:53, Alexander Motin wrote:
>> Here is my test case: FreeBSD 9-CURRENT on Core i5 650 CPU, 3.2GHz + 1/2
>> TurboBoost steps (+133/+266MHz) with boxed cooler at the open air. I was
>> measuring building time of the net/mpd5 from sources, using only one CPU
>> core (cpuset -l 0 time make).
>>
>> Untuned system (hz=1000): 14.15 sec
>> Enabled ACPI C2 (hz=1000+C2): 13.85 sec
>> Enabled ACPI C3 (hz=1000+C3): 13.91 sec
>> Reduced HZ (hz=100):  14.16 sec
>> Enabled ACPI C2 (hz=100+C2):  13.85 sec
>> Enabled ACPI C3 (hz=100+C3):  13.86 sec
>> Timers tuned* (hz=100):   14.10 sec
>> Enabled ACPI C2 (hz=100+C2):  13.71 sec
>> Enabled ACPI C3 (hz=100+C3):  13.73 sec
>>
>> All numbers tested few times and are repeatable up to +/-0.01sec.
>>
>> PS: In this case benefit is small, but it is the least that can be
>> achieved, depending on CPU model. Some models allow frequency to be
>> risen by up to 6 steps (+798MHz).
> 
> The numbers that you are showing doesn't show much difference. Have you tried 
> buildworld?

If you mean the relative difference -- as I have said, it's mostly
because of my CPU. Its maximal boost is 266MHz (8.3%), but 133MHz of
that is enabled most of the time if the CPU is not overheated. It
probably isn't, as it sits on a clear table under an air conditioner. So
the maximal effect I can expect is 4.2%. In that situation 2.8% is
probably not bad for illustrating that the feature works and that there
is room for further improvement. If I had a Core i5-750S I would expect
a 33% boost.

If you mean the absolute difference, here are the results of four buildworld runs:
hw.acpi.cpu.cx_lowest=C1: 4654.23 sec
hw.acpi.cpu.cx_lowest=C2: 4556.37 sec
hw.acpi.cpu.cx_lowest=C2: 4570.85 sec
hw.acpi.cpu.cx_lowest=C1: 4679.83 sec
The benefit is about 2.1%. Each time the build results were erased and
the sources pre-cached into RAM. Storage was an SSD, so disk should not
be an issue.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Intel TurboBoost in practice

2010-07-26 Thread Alexander Motin
Robert Watson wrote:
> On Sun, 25 Jul 2010, Alexander Motin wrote:
>>> The numbers that you are showing doesn't show much difference. Have
>>> you tried buildworld?
>>
>> If you mean relative difference -- as I have told, it's mostly because
>> of my CPU. It's maximal boost is 266MHz (8.3%), but 133MHz of them is
>> enabled most of time if CPU is not overheated. It probably doesn't, as
>> it works on clear table under air conditioner. So maximal effect I can
>> expect on is 4.2%. In such situation 2.8% probably not so bad to
>> illustrate that feature works and there is space for further
>> improvements. If I had Core i5-750S I would expect 33% boost.
> 
> Can I recommend the use of ministat(1) and sample sizes of at least 8
> runs per configuration?

Thanks for pushing me to do it right. :) Here are 3*15 runs with a
fresh kernel with debugging disabled. The results are quite close to the
original ones: -2.73% and -2.19% of time.
x C1
+ C2
* C3
+-+
|+*  x|
|+*  x|
|+*  x|
|+*  x|
|+*  x|
|+*  x|
|+*  x|
|+   **  x|
|+ + ** xx|
|+ + ** **  xx   x|
| |__M_A| |
|A|   |
||A|  |
+-+
NMinMax Median   AvgStddev
x  15  12.68  12.84  12.69 12.698667   0.039254966
+  15  12.35  12.36  12.35 12.351333  0.0035186578
Difference at 95.0% confidence
-0.347333 +/- 0.0208409
-2.7352% +/- 0.164119%
(Student's t, pooled s = 0.0278687)
*  15  12.41  12.44  12.42 12.42  0.0075592895
Difference at 95.0% confidence
-0.278667 +/- 0.0211391
-2.19446% +/- 0.166467%
(Student's t, pooled s = 0.0282674)
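
As a quick cross-check, the relative differences quoted above follow
directly from the averages ministat reports; the tiny program below just
re-derives them (values copied from the output above):

/* Recompute the C2-vs-C1 and C3-vs-C1 differences from the averages. */
#include <stdio.h>

int
main(void)
{
	const double c1 = 12.698667;	/* average time with C1, seconds */
	const double c2 = 12.351333;	/* average with C2 */
	const double c3 = 12.420000;	/* average with C3 */

	printf("C2 vs C1: %+.4f%%\n", (c2 - c1) / c1 * 100.0);	/* -2.7352% */
	printf("C3 vs C1: %+.4f%%\n", (c3 - c1) / c1 * 100.0);	/* -2.1945% */
	return (0);
}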

I also checked one more aspect -- TurboBoost works only when the CPU
runs at the highest EIST frequency (the P0 state). I reduced
dev.cpu.0.freq from 3201 to 3067 and repeated the test:
x C1
+ C2
* C3
+-+
| x   +  *|
| x   +  *|
| x   +  *|
| x   +  *   *|
| x  x+  *   *|
| x  x+  +   *   *|
| x  x+  +   *   *|
| x  x+  +   *   *|
| x  x+   +  +   +   *   *|
||MA| |
|   |_MA_||
|M_A_||
+-+
NMinMax Median   AvgStddev
x  15  13.72  13.73  13.72 13.72  0.0048795004
+  15  13.79  13.82   13.8 13.80  0.0072374686
Difference at 95.0% confidence
0.08 +/- 0.00461567
0.582949% +/- 0.0336337%
(Student's t, pooled s = 0.00617213)
*  15  13.89   13.9  13.8913.894  0.0050709255
Difference at 95.0% confidence
0.170667 +/- 0.00372127
1.24362% +/- 0.0271164%
(Student's t, pooled s = 0.00497613)

In that case using C2 or C3 predictably caused a small performance
reduction, since after falling asleep the CPU needs time to wake up.
Even if the tested CPU0 never sleeps during the test, its TLB shootdown
IPIs to the other cores could still suffer from waiting for those cores
to wake up.

Obviously, in the first test these 0.58% and 1.24% were subtracted from
TurboBoost's maximal benefit of 4.3% on this CPU.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Intel TurboBoost in practice

2010-07-27 Thread Alexander Motin
Alan Cox wrote:
> On Mon, Jul 26, 2010 at 9:11 AM, Alexander Motin <m...@freebsd.org> wrote:
> 
> In that case using C2 or C3 predictably caused small performance reduce,
> as after falling to sleep, CPU needs time to wakeup. Even if tested CPU0
> won't ever sleep during test, it's TLB shutdown IPIs to other cores
> still probably could suffer from waiting other cores' wakeup.
> 
> In the deeper sleep states, are the TLB contents actually maintained
> while the processor sleeps?  (I notice that in some configurations, we
> actually flush dirty data from the cache before sleeping.)

As I understand it, we flush caches only as a last resort, if the
platform does not support special techniques such as disabling
arbitration or making the CPU wake up on bus mastering. But the same
ACPI C-states could map to different CPU C-states. Some of those CPU
states (like C6) may imply cache invalidation, though I am not sure that
can be seen from the outside.

The ACPI 3.0 specification says nothing about TLBs, so I am not sure we
can count on their invalidation, unless we do it ourselves, as is done
for caches when the CPU can't keep them coherent while sleeping.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


One-shot-oriented event timers management

2010-08-29 Thread Alexander Motin
Hi.

I would like to present my new work on the event timer management code.

In my previous work I was mostly focused on reimplementing the existing
functionality in a better way. The result was not bad, but after looking
at the prospects of using event timers in one-shot (aperiodic) mode I
understood that the complexity of the implemented code made that hardly
possible. So I had to cut it down significantly and rewrite it with a
new approach, which is instead primarily oriented toward using timers in
one-shot mode. Since some systems have only periodic timers, I have kept
that functionality, though it is slightly limited.

The new management code implements two modes of operation: one-shot and
periodic. The specific mode to be used depends on the hardware
capabilities and can be controlled.

In one-shot mode hardware timers are programmed to generate a single
interrupt precisely at the time of the next wanted event. This is done
by comparing the current binuptime with the next scheduled times of the
system events (hard-/stat-/profclock). This approach has several
benefits: event timer precision is now irrelevant for system
timekeeping; hard- and statclocks are not aliased, even though only one
timer is used for them; and, most importantly, it allows us to define
which events we really want to handle and exactly when, without a strict
dependence on the fixed hz, stathz and profhz periods. Sure, our callout
system depends heavily on the hz value, but now at least we can skip
interrupts when we have no callouts to handle at the time. Later we can
go further.
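
To illustrate the idea with a simplified standalone sketch (this is not
the actual kernel code; all names and numbers are illustrative),
scheduling in one-shot mode boils down to picking the nearest of the
pending per-CPU event deadlines and arming the hardware timer for
exactly that moment:

/*
 * Simplified model of one-shot event timer scheduling: the timer is armed
 * for the earliest of the pending hardclock/statclock/profclock deadlines.
 */
#include <stdint.h>
#include <stdio.h>

struct cpu_events {
	uint64_t nexthard;	/* next hardclock deadline, ns from now */
	uint64_t nextstat;	/* next statclock deadline */
	uint64_t nextprof;	/* next profclock deadline */
};

/* Return the relative time at which the hardware timer should fire next. */
static uint64_t
next_event(const struct cpu_events *ev)
{
	uint64_t next = ev->nexthard;

	if (ev->nextstat < next)
		next = ev->nextstat;
	if (ev->nextprof < next)
		next = ev->nextprof;
	return (next);
}

int
main(void)
{
	/* With hz = 1000, stathz = 127, profhz = 8128 the deadlines could be: */
	struct cpu_events ev = {
		.nexthard = 1000000,	/* 1/1000 s */
		.nextstat = 7874015,	/* ~1/127 s */
		.nextprof = 123031,	/* ~1/8128 s */
	};

	printf("arm one-shot timer to fire in %ju ns\n",
	    (uintmax_t)next_event(&ev));
	return (0);
}

The real code keeps these deadlines as binuptime values and compares
them against the current binuptime, as described above; the sketch only
mirrors the selection step.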

Periodic mode now also uses similar principles for scheduling events.
But a timer running in periodic mode is simply unable to handle
arbitrary events, and event timers may not be synchronized to the system
timecounter and may drift from it, causing jitter effects. So I've used
the timer events themselves as the time source for scheduling. As a
result, the periodic timer runs at a fixed multiple of the hz rate,
while statclock and profclock are generated by dividing it accordingly.
(If somebody were to tell me that hardclock jitter is not really a big
problem, I would happily rip that artificial timekeeping out of there to
simplify the code.) Unfortunately this approach makes it impossible to
use two event timers to completely separate hard- and statclocks any
more, but as I have said, this mode is required only for the limited set
of systems without one-shot capable timers. Judging from my recent
experience with different platforms, that is not a big fraction.

The management code still handles both per-CPU and global timers.
Per-CPU timer usage is obvious. A global timer is programmed to handle
the needs of all CPUs. In periodic mode the global timer generates
periodic interrupts on some one CPU, and the management code then
redistributes them to the CPUs that really need them, using IPIs. In
one-shot mode the timer is always programmed for the first scheduled
event throughout the system. When that interrupt arrives, it also gets
redistributed to the CPUs that want it via IPI.

To demonstrate the features that such flexibility makes possible, I
have incorporated the idea and some parts of the dynamic ticks patches
of Tsuyoshi Ozawa. Now, when some CPU goes down into the C2/C3 ACPI
sleep state, that CPU stops scheduling hard-/stat-/profclock events
until the next registered callout event. If the CPU wakes up before that
time due to some unrelated interrupt, the missed ticks are run
artificially (this is currently needed to keep the system statistics
realistic). After the system is up to date, the interrupt is handled.
For now this is implemented only for ACPI systems with C2/C3 state
support, because ACPI resumes the CPU with interrupts disabled, which
allows catching up on the missed time before the interrupt handler or
some other process (in case of an unexpected task switch) may need it.
As far as I can see, Linux does similar things at the beginning of every
interrupt handler.

I have actively tested this code for a few days on my amd64 Core2Duo
laptop and i386 Core i5 desktop system. With C2/C3 states enabled the
systems experience only about 100-150 interrupts per second, with HZ set
to 1000. These events are mostly caused by several event-greedy
processes in our tree. I have traced and hacked several of the most
aggressive ones in this patch:
http://people.freebsd.org/~mav/tm6292_idle.patch . That allowed me to
get down to as few as 50 interrupts per second for the whole system,
including IPIs! Here is the output of `systat -vm 1` from my test
system: http://people.freebsd.org/~mav/systat_w_oneshot.txt . Obviously,
with additional tuning the results can be improved even further.

My latest patch against 9-CURRENT can be found here:
http://people.freebsd.org/~mav/timers_oneshot4.patch

Comments, ideas, propositions -- welcome!

Thanks to all who read this. ;)

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-30 Thread Alexander Motin
Brandon Gooch wrote:
> One thing I see:
> 
> Where is *frame pointing to? It isn't initialized in the function, so...

Thanks! Fixed. Patch updated.

> Also, for those of us testing, should we "reset" our timer settings
> back to defaults and work from there[1] (meaning, should we be futzing
> around with timer event sources, kern.hz, etc...)?

The general logic is still applicable.

Reducing HZ is less important now, but a lower value allows the system
to aggregate close events slightly, at the cost of precision.
Unfortunately we have no better mechanism for that now.

As for the event source -- only one timer is supported now and the
sysctl/tunable name changed to kern.eventtimer.timer, so the previous
options just won't work. Also, with support for one-shot mode, use of
the RTC and i8254 timers is not recommended any more -- they do not
support it. Use the LAPIC or HPET. If you have a Core-iX class CPU you
may use either of them; they are very close in functionality. If you use
a Core2 or earlier, prefer HPET, as the LAPIC dies in the C3 state.

If you use HPET on a Core2-class CPU (actually on ICHx-class south
bridges, which do not support MSI-like interrupts for HPET), you may
want to set these tunables:
hint.atrtc.0.clock=0
hint.attimer.0.clock=0
hint.hpet.0.legacy_route=1
This disables the RTC and i8254 timers, but grants their interrupts to
HPET, allowing it to work as a per-CPU timer on dual-CPU systems.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-30 Thread Alexander Motin
Gary Jennejohn wrote:
> Hmm.  I applied your patches and am now running the new kernel.  But I
> only installed the new kernel and didn't do make buildworld installworld.
> 
> Mu systat -vm 1 doesn't look anything like yours.  I'm seeing about 2300
> interrupts per second and most of those are coming from the hpet timers:
> 
> 1122 hpet0:t0
> 1124 hpet0:t1

That means 1000Hz of hardclock (hz) events mixed with 127Hz of
statclock (stathz) events. The HPET timer here works in one-shot mode
handling them.

> So, what else did you do to reduce interrupts so much?
> 
> Ah, I think I see it now.  My desktop has only C1 enabled.  Is that it?
> Unfortunately, it appears that only C1 is supported :(

Yes, as I have said, at the moment empty ticks are skipped only while
the CPU is in the C2/C3 states. In the C1 state there is no way to
handle lost events on wakeup. While that may not be very dangerous, it
is not very good.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-30 Thread Alexander Motin
Gary Jennejohn wrote:
> On Mon, 30 Aug 2010 13:07:38 +0300
> Alexander Motin  wrote:
>> Gary Jennejohn wrote:
>>> Ah, I think I see it now.  My desktop has only C1 enabled.  Is that it?
>>> Unfortunately, it appears that only C1 is supported :(
>> Yes, as I have said, at this moment empty ticks skipped only while CPU
>> is in C2/C3 states. In C1 state there is no way to handle lost events on
>> wake up. While it may be not very dangerous, it is not very good.
> 
> Too bad.  I'd say that systems which are limited to C1 don't benefit
> much (or not at all) from your changes.

For the moment -- indeed not much. As I have said, the tick-skipping
feature is at an early development stage. I've just implemented it in
the most straightforward way, abusing a feature provided by ACPI. To
benefit other systems and platforms, tighter integration with the
interrupt, callout and possibly scheduler subsystems will be needed.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-30 Thread Alexander Motin
YAMAMOTO, Taku wrote:
> On Mon, 30 Aug 2010 13:07:38 +0300
> Alexander Motin  wrote:
>> Gary Jennejohn wrote:
> (snip)
>>> So, what else did you do to reduce interrupts so much?
>>>
>>> Ah, I think I see it now.  My desktop has only C1 enabled.  Is that it?
>>> Unfortunately, it appears that only C1 is supported :(
>> Yes, as I have said, at this moment empty ticks skipped only while CPU
>> is in C2/C3 states. In C1 state there is no way to handle lost events on
>> wake up. While it may be not very dangerous, it is not very good.
> 
> There's an alternative way to catch exit-from-C1 atomically:
> use MWAIT with bit0 of ECX set (``Treat masked interrupts as break events''
> in Intel 64 and IA-32 Architecthres Software Developer's Manual).
> 
> In this way we can put each core individually into deeper Cx state without
> additional costs (SMIs and the like) as a bonus.
> 
> The problem is that it may be unavailable to earlier CPUs that support
> MONITOR/MWAIT instructions:
> we should check the presense of this feature by examining bit0 and bit1 of ECX
> that is returned by CPUID 5.

Thank you for the hint. I will investigate it now. But it still helps
only x86 systems. I have no idea how power management works on
arm/mips/ppc/..., but I assume that periodic wakeups there may not be
free either.
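
For x86 the feature check Taku mentions is cheap to do from user space.
Here is a hedged sketch; it relies on the GCC/Clang <cpuid.h> helper and
on the Intel SDM's definition of CPUID leaf 5, where ECX bit 0
advertises the MONITOR/MWAIT extensions and bit 1 the ability to treat
masked interrupts as break events:

/*
 * Check whether MWAIT can treat masked interrupts as break events
 * (CPUID leaf 5: ECX bit 0 = extensions enumerated, bit 1 = interrupt
 * break-event capability). Illustrative user-space check only.
 */
#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(5, &eax, &ebx, &ecx, &edx)) {
		printf("CPUID leaf 5 not supported\n");
		return (1);
	}
	printf("MONITOR/MWAIT extensions enumerated: %s\n",
	    (ecx & 0x1) ? "yes" : "no");
	printf("MWAIT breaks on masked interrupts:   %s\n",
	    (ecx & 0x2) ? "yes" : "no");
	printf("smallest/largest monitor line: %u/%u bytes\n",
	    eax & 0xffff, ebx & 0xffff);
	return (0);
}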

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-31 Thread Alexander Motin
Alexander Motin wrote:
> YAMAMOTO, Taku wrote:
>> On Mon, 30 Aug 2010 13:07:38 +0300
>> Alexander Motin  wrote:
>>> Gary Jennejohn wrote:
>> (snip)
>>>> So, what else did you do to reduce interrupts so much?
>>>>
>>>> Ah, I think I see it now.  My desktop has only C1 enabled.  Is that it?
>>>> Unfortunately, it appears that only C1 is supported :(
>>> Yes, as I have said, at this moment empty ticks skipped only while CPU
>>> is in C2/C3 states. In C1 state there is no way to handle lost events on
>>> wake up. While it may be not very dangerous, it is not very good.
>> There's an alternative way to catch exit-from-C1 atomically:
>> use MWAIT with bit0 of ECX set (``Treat masked interrupts as break events''
>> in Intel 64 and IA-32 Architecthres Software Developer's Manual).
>>
>> In this way we can put each core individually into deeper Cx state without
>> additional costs (SMIs and the like) as a bonus.
>>
>> The problem is that it may be unavailable to earlier CPUs that support
>> MONITOR/MWAIT instructions:
>> we should check the presense of this feature by examining bit0 and bit1 of 
>> ECX
>> that is returned by CPUID 5.
> 
> Thank you for the hint. I will investigate it now. But it still help
> only x86 systems. I have no idea how power management works on
> arm/mips/ppc/..., but I assume that periodic wake up there also may be
> not free.

I have looked at these MWAIT features. They indeed allow waking up with
interrupts disabled, but I am worried about the C-state the CPU enters
in that case. MWAIT states are CPU C-states, not ACPI C-states, so I
have doubts that using MWAIT instead of HLT on an ACPI system is
correct. Also, I have found some comments that MWAIT on AMD family 10h
CPUs does not allow the CPU to enter the C1E state, while HLT does. So
it is not a complete equivalent -- it is a worse one. Later AMD CPUs
simply do not support MWAIT.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: One-shot-oriented event timers management

2010-08-31 Thread Alexander Motin
Gary Jennejohn wrote:
> On Mon, 30 Aug 2010 12:11:48 +0200
> Gary Jennejohn  wrote:
> 
>> On Mon, 30 Aug 2010 13:07:38 +0300
>> Alexander Motin  wrote:
>>
>>> Gary Jennejohn wrote:
>>>> Hmm.  I applied your patches and am now running the new kernel.  But I
>>>> only installed the new kernel and didn't do make buildworld installworld.
>>>>
>>>> Mu systat -vm 1 doesn't look anything like yours.  I'm seeing about 2300
>>>> interrupts per second and most of those are coming from the hpet timers:
>>>>
>>>> 1122 hpet0:t0
>>>> 1124 hpet0:t1
>>> It means 1000Hz of hardclock (hz) events mixed with 127Hz of statclock
>>> (stathz) events. HPET timer here works in one-shot mode handling it.
>>>
>>>> So, what else did you do to reduce interrupts so much?
>>>>
>>>> Ah, I think I see it now.  My desktop has only C1 enabled.  Is that it?
>>>> Unfortunately, it appears that only C1 is supported :(
>>> Yes, as I have said, at this moment empty ticks skipped only while CPU
>>> is in C2/C3 states. In C1 state there is no way to handle lost events on
>>> wake up. While it may be not very dangerous, it is not very good.
>>>
>> Too bad.  I'd say that systems which are limited to C1 don't benefit
>> much (or not at all) from your changes.
>>
> 
> OK, this is purely anecdotal, but I'll report it anyway.
> 
> I was running pretty much all day with the patched kernel and things
> seemed to be working quite well.
> 
> Then, after about 7 hours, everything just stopped.
> 
> I had gkrellm running and noticed that it updated only when I moved the
> mouse.
> 
> This behavior leads me to suspect that the timer interrupts had stopped
> working and the mouse interrupts were causing processes to get scheduled.
> 
> Unfortunately, I wasn't able to get a dump and had to hit reset to
> recover.
> 
> As I wrote above, this is only anecdotal, but I've never seen anything
> like this before applying the patches.

One-shot timers have one weak point: if for some reason a timer
interrupt gets lost, there is nobody to reload the timer. Such cases
will probably require special attention. The same funny situation with a
mouse-driven scheduler also happens if the LAPIC timer dies when a
pre-Core-iX CPU goes into the C3 state.

-- 
Alexander Motin
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"

