Hey hey,

Lately I've been thinking about how to improve the design and performance of TrackerMinerFS, as it's a big piece of code that's getting too intricate in places. It mainly has 2 roles that we should separate further:

* Keeping track of what files to index (fed either by the crawler or by the directory monitors)
* Actually indexing them

For each of these 2 roles TrackerMinerFS maintains one cache (mtimes for the first, URNs for the second) that's filled in per directory as processing goes, which introduces a latency directly related to how scattered the data is in the FS. Another source of latency is the need to have the parent folder's URN before inserting the data for the file at hand, which forces a flush/commit right before indexing the files within a folder in order to keep nfo:belongsToContainer consistent, but that one is harder to beat.

So, my idea to improve these situations is to move the first role into a separate object that is able to carry out caching operations at a higher level than folders (probably for entire configured directories), and that would hide the crawler and the monitor from the miner. That way the miner would query in one go what it now does in scattered chunks. Very rough testing seemed to show crawling is reduced to 30%-40% of the original time, just ~2x the effort of only adding the directory monitors.

Additionally, I think a filesystem abstraction object should be in place, where GFiles are canonicalized so every comparison afterwards can be performed through == and !=, and where directories (and related data: mtime, URN...) are cached for a longer term, while regular files are more short-lived. I'd expect slightly higher memory usage with this, but almost negligible, since we already have GFiles in memory for every monitored directory and every file waiting to be processed/indexed.

All this would especially help on non-first indexing runs, since on a first run the actual indexing (mostly bound to tracker-extract) outweighs these file operations.

Opinions?

  Carlos
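
P.S. To make the canonicalization idea a bit more concrete, here is a minimal sketch in plain C of what "one object per path, compared with == and !=" could look like. Everything in it is made up for illustration: FsNode, FsCache, fs_cache_get() and fs_cache_clear() are hypothetical names, not existing Tracker or GIO API, a path string stands in for a GFile, and only trailing slashes are normalized. A real version would presumably sit on top of GFile and a GHashTable instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define FS_BUCKETS 64

/* Hypothetical stand-in for a canonicalized GFile: one node per distinct path. */
typedef struct FsNode {
    char *path;            /* normalized path, owned by the node */
    long mtime;            /* cached data would live here (0 = not known yet) */
    struct FsNode *next;   /* next node in the same hash bucket */
} FsNode;

/* Hypothetical cache object; must be zero-initialized before first use. */
typedef struct {
    FsNode *buckets[FS_BUCKETS];
} FsCache;

static unsigned int fs_hash(const char *s)
{
    unsigned int h = 5381u;

    while (*s != '\0')
        h = h * 33u + (unsigned char) *s++;
    return h % FS_BUCKETS;
}

/* Return the one canonical node for this path, creating it on first use.
 * Only trailing slashes are normalized here; a real version would need more. */
FsNode *fs_cache_get(FsCache *cache, const char *path)
{
    size_t len;
    char *key;
    unsigned int h;
    FsNode *node;

    assert(cache != NULL && path != NULL);

    /* Drop trailing slashes, but keep a lone "/" intact. */
    len = strlen(path);
    while (len > 1 && path[len - 1] == '/')
        len--;

    key = malloc(len + 1);
    if (key == NULL)
        return NULL;
    memcpy(key, path, len);
    key[len] = '\0';

    /* Already known: hand back the existing node so pointers stay unique. */
    h = fs_hash(key);
    for (node = cache->buckets[h]; node != NULL; node = node->next) {
        if (strcmp(node->path, key) == 0) {
            free(key);
            return node;
        }
    }

    /* First time we see this path: the new node takes ownership of key. */
    node = malloc(sizeof *node);
    if (node == NULL) {
        free(key);
        return NULL;
    }
    node->path = key;
    node->mtime = 0;
    node->next = cache->buckets[h];
    cache->buckets[h] = node;
    return node;
}

/* Free every node and leave the cache empty and reusable. */
void fs_cache_clear(FsCache *cache)
{
    size_t i;

    for (i = 0; i < FS_BUCKETS; i++) {
        FsNode *node = cache->buckets[i];

        while (node != NULL) {
            FsNode *next = node->next;

            free(node->path);
            free(node);
            node = next;
        }
        cache->buckets[i] = NULL;
    }
}
```

With something like this, looking up "/home/user/Music" and "/home/user/Music/" hands back the same pointer, so later comparisons are plain == checks and whatever is cached on the node (mtime here, the URN in the real thing) is shared by every user of that directory.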