-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/116692/
-----------------------------------------------------------

(Updated March 10, 2014, 11:12 a.m.)


Review request for Akonadi and Baloo.


Changes
-------

The first patch was missing a line that had been removed while testing what 
committing on every change would do; this adds it back so performance returns 
to being semi-reasonable.

(Interesting finding: Xapian writes a *LOT* to disk on each transaction commit)


Repository: baloo


Description
-------

Baloo uses Xapian to store processed results from the data fed to it by 
akonadi; it processes all the data it is sent to index, and only once this is 
complete is the data committed to the Xapian database. From 
http://xapian.org/docs/apidoc/html/classXapian_1_1WritableDatabase.html#acbea2163142de795024880a7123bc693
 we see: "For efficiency reasons, when performing multiple updates to a 
database it is best (indeed, almost essential) to make as many modifications as 
memory will permit in a single pass through the database. To ensure this, 
Xapian batches up modifications." This means that *all* the data to be stored 
in the Xapian database first ends up in RAM. When indexing large mailboxes (or 
any other large chunk of data) this results in a very large amount of memory 
allocation. On one test of 100k mails in a maildir folder this resulted in 
1.5GB of RAM used. In normal daily usage with maildir I find that it easily 
balloons to several hundred megabytes within days.
This makes the Baloo indexer unusable on systems with smaller amounts of 
memory (e.g. mobile devices, which typically have only 512MB-2GB of RAM).
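
To make that concrete, here is a minimal, hypothetical sketch (not the actual 
Baloo code) of the commit-at-the-end pattern described above: each 
add_document() call is only buffered in RAM, and nothing reaches disk until 
the single commit() at the end.

    #include <xapian.h>
    #include <string>
    #include <vector>

    // Hypothetical example, not the Baloo indexer itself: index a batch of
    // texts and commit once at the very end, as the pre-patch code does.
    void indexAllThenCommit(const std::vector<std::string> &texts)
    {
        Xapian::WritableDatabase db("/tmp/example-db", Xapian::DB_CREATE_OR_OPEN);
        Xapian::TermGenerator termGen;

        for (const std::string &text : texts) {
            Xapian::Document doc;
            termGen.set_document(doc);
            termGen.index_text(text);   // build up the document's terms
            db.add_document(doc);       // change is only buffered in memory...
        }

        db.commit();                    // ...and written to disk only here
    }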

Making this even worse is that the indexer is both long-lived *and* the default 
glibc allocator is unable to return the used memory back to the OS (probably 
due to memory fragmentation, though I have not confirmed this). Use of other 
allocators shows the temporary ballooning of memory during processing, but once 
that is done the memory is released and returned back to the OS. As such, this 
is not a memory leak... but it behaves like one on systems with the default 
glibc allocator, with akonadi_baloo_indexer taking increasingly large amounts 
of memory that never get returned to the OS. (This is actually how I 
noticed the problem in the first place.)

The approach used to address this problem is to periodically commit data to the 
Xapian database. This happens uniformly and transparently to the 
AbstractIndexer subclasses. The exact behavior is controlled by the 
s_maxUncommittedItems constant which is set arbitrarily to 100: after an 
indexer hits 100 uncommitted changes, the results are committed immediately. 
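
A rough sketch of that logic (hypothetical names apart from 
s_maxUncommittedItems; the real change lives in the diff below), assuming a 
counter that is bumped on every indexed item and triggers a commit as soon as 
it reaches the limit:

    #include <xapian.h>
    #include <string>

    static const int s_maxUncommittedItems = 100;

    // Hypothetical wrapper, for illustration only: count uncommitted changes
    // and flush them to disk as soon as the limit is reached, so pending
    // modifications never pile up in RAM.
    class BatchingIndexer
    {
    public:
        explicit BatchingIndexer(const std::string &path)
            : m_db(path, Xapian::DB_CREATE_OR_OPEN)
            , m_uncommitted(0)
        {
        }

        void index(const Xapian::Document &doc)
        {
            m_db.add_document(doc);
            if (++m_uncommitted >= s_maxUncommittedItems) {
                commit();
            }
        }

        void commit()   // also called once the current batch of items is done
        {
            if (m_uncommitted > 0) {
                m_db.commit();
                m_uncommitted = 0;
            }
        }

    private:
        Xapian::WritableDatabase m_db;
        int m_uncommitted;
    };

With a limit of 100, indexing the 100k-mail test set turns into roughly a 
thousand small commits instead of one giant in-memory transaction.
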
Caveats:

* This is not a guaranteed fix for the memory fragmentation issue experienced 
with glibc: it is still possible for the memory to grow slowly over time as 
each smaller commit leaves some % of un-releasable memory due to fragmentation. 
It has helped with day-to-day usage here, but in the "100k mails in a maildir 
structure" test memory still ballooned upwards.

* It makes indexing non-atomic from akonadi's perspective: data fed to 
akonadi_baloo_indexer to be indexed may show up in chunks and even, in the case 
of a crash of the indexer, be only partially added to the database.

Alternative approaches (not necessarily mutually exclusive to this patch or 
each other):

* send smaller data sets from akonadi to akonadi_baloo_indexer for processing. 
This would allow akonadi_baloo_indexer to retain the atomic commit approach 
while avoiding the worst of the Xapian memory usage; it would not address the 
issue of memory fragmentation
* restart the akonadi_baloo_indexer process from time to time; this would resolve 
the fragmentation-over-time issue but not the massive memory usage due to 
atomically indexing large datasets
* improve Xapian's chert backend (to become default in 1.4) to not fragment 
memory so much; this would not address the issue of massive memory usage due to 
atomically indexing large datasets
* use an allocator other than glibc's; this would not address the issue of 
massive memory usage due to atomically indexing large datasets


Diffs (updated)
-----

  src/pim/agent/emailindexer.cpp 05f80cf 
  src/pim/agent/abstractindexer.h 8ae6f5c 
  src/pim/agent/abstractindexer.cpp fa9e96f 
  src/pim/agent/akonotesindexer.h 83f36b7 
  src/pim/agent/akonotesindexer.cpp ac3e66c 
  src/pim/agent/contactindexer.h 49dfdeb 
  src/pim/agent/contactindexer.cpp a5a6865 
  src/pim/agent/emailindexer.h 9a5e5cf 

Diff: https://git.reviewboard.kde.org/r/116692/diff/


Testing
-------

I have been running with the patch for a couple of days, and one other person 
on IRC has tested an earlier (but functionally equivalent) version. Rather 
than reaching the previously common 250MB+ during regular usage, the indexer 
now idles at ~20MB (up from ~7MB when first started; so some fragmentation 
remains, as noted in the description, but with far better long-term results).


Thanks,

Aaron J. Seigo
