Re: [Tracker] more issues with indexer-split

Martyn Russell Wed, 13 Aug 2008 09:01:01 -0700

Jamie McCracken wrote:
> On Wed, 2008-08-13 at 17:12 +0200, Carlos Garnacho wrote:
>> Hi!,

Hi :)

>> On mar, 2008-08-12 at 14:18 -0400, Jamie McCracken wrote:
>>
>> <snip>
>>
>>> that sounds inefficient - trunk only ever checked for existing deleted
>>> or junk emails at startup because iterating through all emails in the
>>> summary files is expensive. 
>> From what I've read in trunk code, you still iterate through all the
>> mails in the summary in check_summary_file(), and you will have to
>> iterate over them again later to index new messages, etc...
> 
> yes, but when we are not doing the startup check we are skipping, so it's
> faster, and we are not stopping at any deleted or junk email and checking
> it

Of course it is faster, but that doesn't mean we are completely
synchronised - unless I missed the point. If an email was already in the
summary file and is later marked as deleted or junk, what we indexed
from the summary file is out of date. If this is done when Tracker isn't
running, or from another machine, etc., you would _HAVE_ to read the
summary file again on startup to make sure you were synchronised. At
least that's how I understand it.
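
To illustrate, here is a rough sketch in Python. It is only a toy model
and not Tracker's actual code: the summary is assumed to be a plain list
of (uid, flags) entries, and the function names are made up.

```python
# Toy model (hypothetical): a summary is a list of (uid, flags) entries.
DELETED = "deleted"
JUNK = "junk"

def scan_skipping(summary, already_indexed):
    """Skip the first `already_indexed` entries; return UIDs of new ones only."""
    return [uid for uid, _flags in summary[already_indexed:]]

def scan_full(summary):
    """Visit every entry; return (uids to keep indexed, uids to remove)."""
    keep, remove = [], []
    for uid, flags in summary:
        if DELETED in flags or JUNK in flags:
            remove.append(uid)
        else:
            keep.append(uid)
    return keep, remove

# When Tracker last ran, messages 1-3 were indexed. While it was stopped,
# message 2 was marked as junk and message 4 arrived.
summary = [(1, set()), (2, {JUNK}), (3, set()), (4, set())]

new_only = scan_skipping(summary, already_indexed=3)  # sees only message 4
keep, remove = scan_full(summary)                     # also notices message 2
```

The skipping scan never looks at message 2 again, so the junk flag set
while Tracker was stopped goes unnoticed until a full scan is done.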

>> As far as I know, it's quite unavoidable to parse again summaries, since
>> under some circumstances Message IDs could be reused, which would leave
>> you with inconsistent data in the DBs. Even if it isn't, expunging a
>> folder would render any stored offset for the summary file useless (even
>> dangerous).
> 
> true, but we would get a deletion from inotify of the summary file if
> that was the case. It's not a byte offset but a message count - so we skip
> x messages to get the new ones (similar to what beagle does)

As I illustrated above, you can't guarantee either of these:

1) Tracker is running all the time
2) email isn't deleted/etc. from another machine/client/webserver/etc.

>> Besides, when testing summary parsing, I remember it was pretty fast
>> (like 2-3 seconds for a ~6500 emails summary), of course without
>> inserting to DBs nor doing message body or attachments sniffing, which
>> is more or less what should happen if the junk/deleted flag is set.
> 
> with 100,000+ emails it's quite noticeable

The difference is not really an issue. Most people don't have that many
emails. Those that do can expect to wait a bit longer. Really, the
difference you are arguing about here is insignificant. If you have to
wait another 30 seconds because you have a ridiculous number of emails,
I don't think that is a problem, especially if you are guaranteeing
synchronisation.
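
As a back-of-the-envelope check, here is what Carlos's figure of 2-3
seconds for ~6500 emails works out to for 100,000 emails. This assumes
parsing time scales linearly with the number of emails, which is an
assumption and not something measured; the function name is made up.

```python
def extrapolate_scan_seconds(n_emails, ref_emails=6500, ref_seconds=(2.0, 3.0)):
    """Linearly scale the quoted ~6500-email parse time to n_emails.

    Linear scaling is an assumption made for this estimate only.
    """
    low, high = ref_seconds
    factor = n_emails / ref_emails
    return low * factor, high * factor

low, high = extrapolate_scan_seconds(100_000)
print(f"{low:.0f}-{high:.0f} seconds")  # prints "31-46 seconds"
```

That covers parsing only, without DB inserts or body/attachment
sniffing, as in Carlos's test.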

>>> the use of a separate junk email table meant
>>> lookups were confined to that table and not the services table so was
>>> faster when number of emails was high
>> You mean the JunkMails table in email-meta.db? As far as I see, this
>> table is just looked up to make sure there aren't duplicates when
>> inserting. And in the end, you still have to lookup/modify the Services
>> table, even if the junk mail wasn't there.
>>
> 
> no, when junk/deleted email is encountered during the startup scan its
> UID is checked against that table (JunkMails) to see if we already know
> about it. If it's not in that table then we add it and then delete it
> from our index. Ergo it's more efficient than what you have

So if you remove this step completely and just check the index on
startup, shouldn't it be JUST as efficient? I would imagine that
checking a table for junk and keeping that table synchronised is just as
wasteful as scanning the summary file.
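
A rough sketch of the two startup flows, again as a toy model in Python
with made-up names. The index and the junk table are plain sets here, so
this only counts lookups and says nothing about how fast a lookup is in
a small JunkMails table compared to a large Services table.

```python
def startup_with_junk_table(summary, index, junk_table):
    """Flow as described for trunk: check a side table, then fix the index."""
    lookups = 0
    for uid, is_junk in summary:
        if not is_junk:
            continue
        lookups += 1                # one lookup in the junk table
        if uid not in junk_table:
            junk_table.add(uid)     # extra write to keep the table in sync
            index.discard(uid)
    return lookups

def startup_index_only(summary, index):
    """Alternative: no side table, just check the index directly."""
    lookups = 0
    for uid, is_junk in summary:
        if not is_junk:
            continue
        lookups += 1                # one lookup in the index itself
        if uid in index:
            index.remove(uid)
    return lookups

# Messages 2 and 3 are junk; 3 was already handled on a previous run.
summary = [(1, False), (2, True), (3, True), (4, False)]

index_a, junk_table = {1, 2, 4}, {3}
lookups_a = startup_with_junk_table(summary, index_a, junk_table)

index_b = {1, 2, 4}
lookups_b = startup_index_only(summary, index_b)
```

In this model both flows do one lookup per junk entry and end up with
the same index; the side table adds a second structure to maintain.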

-- 
Regards,
Martyn
_______________________________________________
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list
