Jamie McCracken wrote:
> I've noticed when indexing *large* amounts of data that a lot of disk 
> thrashing takes place, which greatly slows down the performance of 
> both tracker and the system in general.
> 
> Also, the nice +10 is not throttling enough (I don't have ionice in my 
> kernel, so I don't know how good a job that does), so I will probably 
> add some sleeping intervals to smooth things out and keep CPU usage low 
> (with a --turbo command line option to disable this for those who want 
> faster indexing).
> 
> The cause of the slowdown is heavy fragmentation of the file-based hash 
> table.
> 
> Having indexed 30GB of stuff, the optimization routine shrank the 
> full-text index from nearly 300MB to 20MB, which means a massive 280MB 
> of fragmentation had occurred - this is obscene!
> 
> I note other indexers do not update the hash table directly but cache 
> the data in memory and then bulk upload it, to reduce fragmentation and 
> lessen the performance hit. The disadvantage of this is that newly 
> indexed content won't appear in searches until the cache is uploaded to 
> the hash table. (we could upload every 10-15 mins or something - 
> infrequent words should be updated more quickly though)

Why not hold back updates, but force flush to disk if a search is called?
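
Something like this, perhaps (a rough sketch only; tracker_search,
pending_updates_exist, flush_cache_to_index and run_query are made-up
names, not existing tracker functions):

    /* before serving a query, push any cached-but-unflushed updates
     * into the real index so fresh content is always searchable */
    static void
    tracker_search (const char *query)
    {
            if (pending_updates_exist ())
                    flush_cache_to_index ();

            run_query (query);
    }

That way the common case (indexing with no one searching) stays cheap,
and the flush cost is only paid when someone actually wants results.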

> 
> As we are memory conservative, I am planning to do something similar 
> but using sqlite (instead of precious memory) to cache new files and 
> then bulk upload. We could easily cache the data for many thousands of 
> files before uploading them.

If I remember correctly, sqlite3 has some built-in cache settings; you 
might want to tweak the default values a bit.
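
For example (untested, and the values are just guesses - tune to taste;
the filename is hypothetical):

    sqlite3 *db;

    sqlite3_open ("tracker-cache.db", &db);

    /* bigger page cache, and relaxed syncing should be fine for a
     * throwaway cache db that can be rebuilt if it is ever lost */
    sqlite3_exec (db, "PRAGMA cache_size = 8192;", NULL, NULL, NULL);
    sqlite3_exec (db, "PRAGMA synchronous = OFF;", NULL, NULL, NULL);
    sqlite3_exec (db, "PRAGMA temp_store = MEMORY;", NULL, NULL, NULL);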

> 
> We can actually do better than others here because firstly we are not 
> using any more RAM and can therefore have much bigger caches, and 
> secondly, unlike other indexers which upload all at once (which often 
> causes a CPU spike), we can do it incrementally in sqlite.
> 
> And no, sqlite will not fragment, as it's btree based and not a hash 
> table (btrees are much faster to update than hashes), and we will use 
> a separate db file which can be deleted when finished.
> 
> Will be experimenting on this tonight. There will be a few race 
> conditions to handle with this, but it's nothing too complex.

Looking forward to it :)
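
For what it's worth, I picture the staging db and the incremental flush
looking roughly like this (pure sketch - the table layout and
index_add_word are made up, not the real tracker API):

    /* staging table in the separate, throwaway cache db */
    sqlite3_exec (db,
                  "CREATE TABLE IF NOT EXISTS word_cache ("
                  "  word TEXT, file_id INTEGER, score INTEGER);",
                  NULL, NULL, NULL);

    /* flush one small batch at a time instead of everything at
     * once, so there is no big CPU spike */
    sqlite3_stmt *stmt;
    sqlite3_int64 last = 0;
    char *sql;

    sqlite3_prepare_v2 (db,
                        "SELECT rowid, word, file_id, score "
                        "FROM word_cache ORDER BY rowid LIMIT 500;",
                        -1, &stmt, NULL);

    while (sqlite3_step (stmt) == SQLITE_ROW) {
            last = sqlite3_column_int64 (stmt, 0);
            index_add_word ((const char *) sqlite3_column_text (stmt, 1),
                            sqlite3_column_int (stmt, 2),
                            sqlite3_column_int (stmt, 3));
    }
    sqlite3_finalize (stmt);

    /* drop the batch we just merged into the real index */
    sql = sqlite3_mprintf ("DELETE FROM word_cache WHERE rowid <= %lld;",
                           last);
    sqlite3_exec (db, sql, NULL, NULL, NULL);
    sqlite3_free (sql);

Repeat that from an idle callback until word_cache is empty, then the
whole db file can simply be unlinked, as you say.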

> 
> I am determined to get tracker running as smooth as a baby's bottom!
> 
> 
