>From: Christian Biere <[EMAIL PROTECTED]>
>Lloyd Bryant wrote:
> > No sort:               3.77 sec (0 ms avg sort time)
> > g_slist_sort:          3.83 sec (18.29 ms avg sort time)
> > sort_slist_with_qsort: 3.77 sec (10.79 ms avg sort time)
>
>Did you use a corrected version of the qsort() variant? Otherwise, the
>last result is bogus.

Since the qsort results are well below the slist results, I'm assuming that I'm using the right version (your message mentioned it taking LONGER). I can't verify the version, however, as I toasted that directory after completing the 100,000 file test.

> > I'm currently setting up another test, using 100,000 files. I want to see how sort time grows for the different sorting methods. It took 20 minutes to do the initial directory scan (directories not in cache, I guess), and I'm still waiting for the SHA1's to build so I can do the real testing.
>
>You might want to create a fake sha1_cache to speed things up.

I didn't think of that. Oh well - it finished (eventually).

> > The initial code, which started SHA1 calculations before directory scanning was complete, was actually a better solution for large numbers of files. It takes a LONG time to complete the directory scan, and having the SHA1 calculation routine sitting idle during that time doesn't make a lot of sense.
>
>I'm not sure whether it would really run the SHA-1 calculation in the background whilst scanning. I was rather surprised to notice that Gtk-Gnutella doesn't stall during the scan. There are some periodic calls to dispatch Gtk+ events to keep the GUI responsive, but it seems I/O is dispatched as well. I believe this wasn't always the case. Either Gtk+/GLib changed or that's the result of using kqueue|epoll|/dev/poll instead of the GLib stuff. That's also the reason you see the warnings when calling rescan from the shell. The shell commands are executed directly from the event dispatch callback. I'm not sure whether this can be ignored or whether the shell commands should rather be enqueued to prevent recursion.

Consider this - have a routine that takes a filename (with path), calculates the SHA1, and loads it into the SHA1 cache. It would not do ANY of the other work that request_sha1() does (such as checking whether the file is spam); it would just load the cache. This could safely be done while the scan is occurring, since the SHA1 cache does not contain any reference to that pesky file_index, and it would greatly speed up the request_sha1() calls, as the files would (hopefully) already be in the cache. It would need some sort of "drop dead" control flag, so that when share_scan() reaches the point of submitting files to request_sha1(), that routine can be told to stop processing files and let request_sha1() queue up any remaining files that still need SHA1s.
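
Something along these lines is what I'm picturing. Just a rough sketch: precache_sha1_run(), precache_sha1_shutdown() and sha1_cache_insert() are names I made up for illustration, not existing GtkG functions, and I'm using GLib's GChecksum purely to keep the example self-contained.

/*
 * Sketch only: the precache_sha1_*() and sha1_cache_insert() names are
 * made up, not real gtk-gnutella routines.  The idea is to hash files
 * ahead of share_scan() and drop the digests into the SHA-1 cache so the
 * later request_sha1() calls turn into cache hits.
 */
#include <glib.h>

/* Placeholder for whatever routine records a digest in the SHA-1 cache. */
extern void sha1_cache_insert(const gchar *pathname, const gchar *sha1_hex);

static gboolean precache_stop;		/* the "drop dead" flag */

/* share_scan() calls this when it starts submitting files to
 * request_sha1() itself; the pre-cache loop then bows out. */
void
precache_sha1_shutdown(void)
{
	precache_stop = TRUE;
}

/* Hash one file and record the result.  For illustration this slurps the
 * whole file and uses GLib's GChecksum; real code would hash in chunks
 * and could reuse GtkG's own SHA-1 code instead. */
static void
precache_one_file(const gchar *pathname)
{
	gchar *data;
	gsize len;
	GChecksum *ck;

	if (!g_file_get_contents(pathname, &data, &len, NULL))
		return;		/* unreadable: skip it, request_sha1() deals with it later */

	ck = g_checksum_new(G_CHECKSUM_SHA1);
	g_checksum_update(ck, (const guchar *) data, len);
	sha1_cache_insert(pathname, g_checksum_get_string(ck));
	g_checksum_free(ck);
	g_free(data);
}

/* Walk a list of pathnames until done or told to drop dead.  In practice
 * this would probably hash one file per main-loop callback so the GUI
 * stays responsive during the scan. */
void
precache_sha1_run(GSList *pathnames)
{
	GSList *sl;

	for (sl = pathnames; sl != NULL && !precache_stop; sl = g_slist_next(sl))
		precache_one_file(sl->data);
}

share_scan() would call precache_sha1_shutdown() right before it starts handing files to request_sha1(), and request_sha1() would then queue up whatever still isn't in the cache.
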
> > Second, the SHA1 calculation does not use enough CPU. Right now, I'm waiting for the SHA1's to complete - CPU (user + system) is running at about 5% (I have that machine offline, so there's no network load on it at the moment).
>
>One flaw of Gtk-Gnutella is that it cannot really take advantage of multi-core or multi-processor machines, except that the X11 server might be running on a different one. Normally, one would have split the GUI from the core right from the start. Now we have the issue that the core can stall the GUI and, worse, vice versa. Threads in C are uglier than hell, but I could imagine doing the SHA-1 calculations in a secondary process. We'd give that process (almost) idle priority. That way the OS scheduler does the dirty work and we can utilize another core. I'd prefer to do fork() + exec() instead of just fork() so that we don't unnecessarily use virtual memory, also because that looks scarier in "top" than it actually is.
>
>On an unrelated note, we could also kill the ADNS helper after some timeout and fork()+exec() another when we need it again, simply because DNS lookups are typically rarely needed, so keeping it around all the time looks somewhat sloppy.

I've been using pthreads for years - it isn't really that difficult to work with them. It would, however, be a MAJOR headache to take a large app like GtkG and convert it to multithreading - multithreading is only easy if the program is designed for it from the beginning.

Dbus seems to offer a better solution. I'm no expert on the subject, but from what I can see the Dbus daemon is capable of starting programs on an "as-needed" basis. Thus, whenever a "calculate SHA1" method is called, the daemon would start a process to handle it, and that process could terminate when there's no more work for it to do. That process could run with a nice of +19, so it would grab as much of the free CPU as it can without having much effect on any other processes.

There's no reason the UI and core can't exist as separate processes as well, passing info between them via Dbus. I started looking at Dbus for this reason - my headless box doesn't have X11 installed (not enough CPU power for a modern desktop anyway), but I would like to have GtkG's core running there and be able to connect a GUI running on another machine to it. The remote shell just isn't up to the job (and I've noticed that it's marked as "deprecated" anyway).

---------------------------------------------------------------------------

I've finished my 100,000 file test. There are 4 data sets this time: my patch (no sort), my patch (g_slist_sort), my patch (with your sort_slist_with_qsort function), and finally the code from SVN (r12351):

No sort:               21.3 sec (0 ms sorting)
g_slist_sort:          21.2 sec (475 ms sorting)
sort_slist_with_qsort: 21.1 sec (128 ms sorting)
SVN r12351:            (393 ms sorting)

I didn't add the metrics for total rescan time to the SVN test (I wanted to run with the code exactly as it is in SVN).

With these test results, sort_slist_with_qsort appears to be the hands-down winner. Your current sort is somewhat faster than my original patch (17%), but sort_slist_with_qsort is WAY faster. Could we dig that version back up and verify that it's actually sorting the list? I didn't connect to that machine to verify the sort results during the performance tests - I verified the patch before I submitted it, but not the sort_slist_with_qsort version. The fact that that version is so much faster than the SVN qsort makes me a bit suspicious of that result.

At any rate: on a P2-300 with 256 MB of RAM, processing 100,000 files, the current SVN version uses less than 0.4 seconds to perform the sort. Are we agreed that that machine with that many files pretty much defines a worst-case scenario as far as sorting performance is concerned?

Lloyd Bryant
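
P.S. While we're digging that version back up, here is the general shape I'd expect a sort_slist_with_qsort() to take: copy the data pointers into a flat array, qsort() the array, then write the pointers back into the list cells. To be clear, this is my own sketch of the technique, not your actual code, so treat the details as guesses.

#include <stdlib.h>
#include <glib.h>

/* qsort() hands the comparator pointers to the array slots, which hold
 * the list data pointers, so unwrap one level before calling the
 * GCompareFunc.  A file-static comparator keeps the sketch short; it
 * isn't reentrant. */
static GCompareFunc sort_slist_cmp;

static int
wrap_cmp(const void *a, const void *b)
{
	return sort_slist_cmp(*(gconstpointer *) a, *(gconstpointer *) b);
}

GSList *
sort_slist_with_qsort(GSList *list, GCompareFunc cmp)
{
	gpointer *array;
	GSList *sl;
	gsize n, i;

	n = g_slist_length(list);
	if (n < 2)
		return list;

	/* Copy the data pointers into a contiguous array. */
	array = g_malloc(n * sizeof array[0]);
	for (sl = list, i = 0; sl != NULL; sl = g_slist_next(sl), i++)
		array[i] = sl->data;

	sort_slist_cmp = cmp;
	qsort(array, n, sizeof array[0], wrap_cmp);

	/* Write the sorted pointers back into the existing list cells, so
	 * the caller's head pointer stays valid. */
	for (sl = list, i = 0; sl != NULL; sl = g_slist_next(sl), i++)
		sl->data = array[i];

	g_free(array);
	return list;
}

Assuming the real version works roughly like this, the main win would be that the sort itself runs over contiguous memory instead of chasing next pointers through the heap, and the list is only walked twice.
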
