Hi, I can't believe it's been 18 months since the last State of the Pooch email:
http://mail.gnome.org/archives/dashboard-hackers/2005-May/msg00011.html

It's fun to go back and reread it to see all of the stuff we've accomplished in that time. A follow-up has been far too long in coming.

Anyway, the purpose of this mail is to fill everyone in on the stuff I and others are doing, and hopefully to call to action people who are interested in hacking on Beagle but don't know where to start.

* Unified indexes

This is a big project I have been working on for the last couple of weeks. The gist of it is that Beagle today uses two Lucene indexes for every backend, and we now have (by my count) 17 backends. This is a waste of disk space and memory, and slows down overall search performance. Moreover, the number of items per index is very uneven (many have zero), which also slows down searches on the bigger ones.

This work will result in a fixed number of Lucene indexes regardless of how many backends there are, with a relatively even distribution of documents among them. It is currently being done on the beagle-unified-indexes-branch in CVS. I can go into a lot more detail on this if people are interested.

* Memory usage

The other big thing I've been working on is reducing memory usage. I've posted here and blogged about it some in the past, and it continues to be the biggest issue in Beagle and its adoption thus far. Fortunately there is a new Mono profiler out, called heap-shot:

http://primates.ximian.com/~lluis/blog/pivot/entry.php?id=56

This and heap-buddy are invaluable tools. I've already identified a few more "hotspots" that we can improve.

* Generics and .NET 2.0

Somewhat related to the memory usage, we will probably be switching to Mono's .NET 2.0 class libraries soon and starting to integrate generics into Beagle code. This is because Mono 1.1.18 declared the generics compiler stable, and a move to generics will also help reduce our memory usage.
In addition, many of the new 2.0 classes are more efficient than their 1.x counterparts.

* Showing status on the state of the index

One common complaint we hear on IRC (and sometimes on-list) comes from people who search for something but can't find it because Beagle hasn't indexed it yet and gives no indication that the initial indexing is still in progress. There is some infrastructure for this in place now, but only the Evolution mail backend uses it. This will be fleshed out more (especially for files), so that the UI makes it clear to users that the initial indexing process has not yet finished.

* Automatic document language detection

Paul Betts is working on code that will allow Beagle to automatically detect what language a document is in, so that we can do proper analysis on that document. Right now we assume everything is English, and apply English rules for stemming. This will allow us to search for documents based on language and handle language-specific search terms. Paul tells me he has most of the detection code finished and needs to hook it up into Beagle. We'll also probably need to bring in the Snowball stemmers to handle the document language correctly.

* Networked searches

Fredrik started integrating Kyle and Alexis's Summer of Code work on networked searches during the GNOME summit, and I know he's made good progress on it. I'm hoping this email will guilt him into finishing that work or at least giving us a status update on it. :)

* Spelling suggestions

This summer Fredrik also did a proof of concept implementation for giving spelling suggestions on searches. He opened a bugzilla bug about it and attached his work here:

http://bugzilla.gnome.org/show_bug.cgi?id=353534

and you can see a screenshot of it in action here:

http://bugzilla.gnome.org/attachment.cgi?id=72008&action=view

Fred highlighted a few problems with his implementation and Kevin also pointed out some issues he had.
It would be great if someone interested in this took this project on.

* Handling crashes in the index helper better

We have a problem right now with certain files -- usually Microsoft Word -- crashing the index helper process. Because Beagle is incredibly conservative about corrupting the index, after this happens we purge the index and start reindexing. Obviously this sucks if you have one of those crashy documents. We've tried to push these issues upstream to the wv1 developers, but the bugs have basically been ignored, so an upstream solution doesn't seem forthcoming.

A corrupt index in this case is extremely unlikely, so what we should probably do instead is not purge the index and be smarter about detecting a crash, so that when we push a batch of files from the daemon to the helper process, we can identify the crashy file, mark it, and move on. Yes, the helper will still crash -- we can't avoid that -- but we will become more robust to those problematic files.

* Removable media

Beagle needs to support indexing of data on removable media. There isn't any support for this right now. I don't really have in-depth details about this, but it's on the radar and (sadly) pretty far down on the TODO.

* Thunderbird memory usage

The Thunderbird backend is a bit of a hog right now. This is tracked at:

http://bugzilla.gnome.org/show_bug.cgi?id=355549

Kevin has been doing some work on this, but we really need people to take a look at it. I know that this backend has been disabled by default in Fedora Core 6.

* The return of D-Bus

In the last State of the Pooch I talked about removing D-Bus from Beagle due to its unsuitability for Beagle and the lack of stability in the Mono bindings. Now that there is a completely new, all-managed D-Bus implementation, we should revisit that decision and consider adding a D-Bus search API. A proof of concept on this would be very helpful, and could be done as a completely standalone project.
Essentially one could write a proxy in C# which exposes a D-Bus search interface, takes the requests, and then uses the C# Beagle APIs to run the search and return the results. (Make sure to implement live queries!)

I think that's it! There is always work to be done in supporting new file formats through filters and new data sources through backends, as well as improving our documentation on the Wiki. We've done a great job since the last State of the Pooch, and while I hope it's not quite as long until the next one, we can keep up this great work together.

Thanks,
Joe

_______________________________________________
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers
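
For anyone who wants to pick up the crash-handling item above, here is a rough sketch of the "identify the crashy file, mark it, and move on" idea. It is written in Python purely for illustration and is not Beagle code. Every name in it (HelperCrashed, run_helper, index_batch, the blacklist set) is hypothetical, and it assumes the daemon can tell how many files the helper finished before it died, for example because the helper reports progress per file. The helper crash is simulated with an exception.

```python
class HelperCrashed(Exception):
    """Stands in for the index helper process dying mid-batch (hypothetical)."""

    def __init__(self, completed):
        super().__init__(f"helper crashed after {completed} file(s)")
        # Assumption: the daemon knows how many files were fully indexed
        # before the crash.
        self.completed = completed


def run_helper(batch, crashy):
    """Simulated helper: indexes files in order and 'crashes' on a bad one."""
    for done, path in enumerate(batch):
        if path in crashy:
            raise HelperCrashed(done)
    return len(batch)


def index_batch(batch, crashy, blacklist=None):
    """Index a batch, recording and skipping any file that kills the helper.

    Returns (indexed, blacklist). Nothing is purged on a crash; only the
    one offending file is lost each time.
    """
    blacklist = set() if blacklist is None else blacklist
    indexed = []
    pending = [p for p in batch if p not in blacklist]
    while pending:
        try:
            run_helper(pending, crashy)
            indexed.extend(pending)
            pending = []
        except HelperCrashed as crash:
            # Files before the crash point were indexed successfully.
            indexed.extend(pending[:crash.completed])
            # The file at the crash point is the culprit: mark it, move on.
            blacklist.add(pending[crash.completed])
            pending = pending[crash.completed + 1:]
    return indexed, blacklist


if __name__ == "__main__":
    files = ["a.txt", "b.doc", "c.txt", "d.doc", "e.txt"]
    indexed, bad = index_batch(files, crashy={"b.doc", "d.doc"})
    print("indexed:", indexed)
    print("marked as crashy:", sorted(bad))
```

If the daemon cannot learn the crash point from the helper, the same loop could fall back to resubmitting the failed batch in halves until the bad file is isolated, at the cost of a few extra helper restarts.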