Re: [Tracker] more issues with indexer-split

Jamie McCracken Wed, 13 Aug 2008 11:47:41 -0700

On Wed, 2008-08-13 at 19:30 +0200, Carlos Garnacho wrote:
> Hi :),
> 
> On mié, 2008-08-13 at 11:47 -0400, Jamie McCracken wrote:
> > On Wed, 2008-08-13 at 17:12 +0200, Carlos Garnacho wrote:
> > > Hi!,
> > > 
> > > On mar, 2008-08-12 at 14:18 -0400, Jamie McCracken wrote:
> > > 
> > > <snip>
> > > 
> > > > that sounds inefficient - trunk only ever checked for existing deleted
> > > > or junk emails at startup because iterating through all emails in the
> > > > summary files is expensive. 
> > > 
> > > From what I've read in trunk code, you still iterate through all the
> > > mails in the summary in check_summary_file(), and you will have to
> > > iterate over them again later to index new messages, etc...
> > 
> > yes but when we are not doing the startup check, we are skipping so it's
> > faster, and we are not stopping at any deleted or junk email and checking
> > it
> 
> How much time do you plan to save doing fseek() instead of fread()? I've
> updated the code in indexer-split to just read over the message when it
> gets to a deleted/junk message, and read_summary() could be changed to
> do fseek() if no data is asked. That makes the indexer-split code do one
> pass where trunk does two. Less disk head movement I'd say :)
> 
> Also, take into account that you're forced to fread() even if you're
> skipping a message, since you have to know the string lengths to be able to
> skip them.
>


I know that - what I want to avoid is doing a lookup on the email
services table for every message, only for it to return null whenever
there is a new email


> > 
> > 
> > > 
> > > As far as I know, it's quite unavoidable to parse summaries again, since
> > > under some circumstances Message IDs could be reused, which would leave
> > > you with inconsistent data in the DBs. Even if it isn't, expunging a
> > > folder would render any stored offset for the summary file useless (even
> > > dangerous).
> > 
> > true but we would get a deletion from inotify of the summary file if
> > that was the case. It's not a byte offset but a message count - so we skip
> > x messages to get to the new ones (similar to what beagle does)
> 
> With "Expunge" I meant "tell $MAIL_APP to get rid of deleted messages in
> the mail folder", in Evolution that would change the summary file and
> mess up offsets for sure.
> 
> As far as I see, for mbox you're storing the offset in the stream:
> 
>         msg_offset = g_mime_parser_tell (mf->parser);
>         ....
>         mail_msg->offset = msg_offset;
>         
> For IMAP, I just get "0" in the Services table, also didn't get to see
> any code to do this.


imap stores a message count too - it's a count rather than a byte offset
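
For illustration, skipping by message count could look roughly like the
sketch below. The record layout (a 4-byte big-endian length followed by
the payload) and the names read_len() and skip_messages() are assumptions
made up for this example, not the real Evolution summary format or
tracker code. It also shows the point made above about fread(): the
length still has to be read before the payload can be fseek()ed over.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record layout, NOT the real Evolution summary format:
 * each message is a 4-byte big-endian payload length, then the payload. */

/* Read one 4-byte big-endian length; returns 0 on success, -1 on EOF/error. */
int read_len(FILE *f, uint32_t *len)
{
    unsigned char b[4];

    assert(f != NULL && len != NULL);
    if (fread(b, 1, sizeof b, f) != sizeof b)
        return -1;
    *len = ((uint32_t) b[0] << 24) | ((uint32_t) b[1] << 16) |
           ((uint32_t) b[2] << 8) | (uint32_t) b[3];
    return 0;
}

/* Skip the first `count` messages. The length is still fread(), but the
 * payload is stepped over with fseek(). Returns how many messages were
 * skipped; this is less than `count` if the file runs out of records.
 * A truncated final payload is not detected by this sketch. */
unsigned long skip_messages(FILE *f, unsigned long count)
{
    unsigned long skipped = 0;
    uint32_t len;

    assert(f != NULL);
    while (skipped < count && read_len(f, &len) == 0) {
        if (fseek(f, (long) len, SEEK_CUR) != 0)
            break;
        skipped++;
    }
    return skipped;
}
```

After skip_messages(f, stored_count) returns stored_count, the stream is
positioned at the first message that has not been indexed yet.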

> 
> > 
> > 
> > > 
> > > Besides, when testing summary parsing, I remember it was pretty fast
> > > (like 2-3 seconds for a ~6500 emails summary), of course without
> > > inserting to DBs nor doing message body or attachments sniffing, which
> > > is more or less what should happen if the junk/deleted flag is set.
> > 
> > with 100,000+ emails it's quite noticeable
> > 
> > 
> > > 
> > > > the use of a separate junk email table meant
> > > > lookups were confined to that table and not the services table so was
> > > > faster when number of emails was high
> > > 
> > > You mean the JunkMails table in email-meta.db? As far as I see, this
> > > table is just looked up to make sure there aren't duplicates when
> > > inserting. And in the end, you still have to lookup/modify the Services
> > > table, even if the junk mail wasn't there.
> > > 
> > 
> > no, when a junk/deleted email is encountered during the startup scan its
> > UID is checked against that table (JunkMails) to see if we already know
> > about it. If it's not in that table then we add it and then delete it
> > from our index. Ergo it's more efficient than what you have
> 
> Could you tell me where that code is? The only users for
> InsertJunk/LookupJunk (the stored procedures) are
> tracker_db_email_insert_junk() and tracker_db_email_lookup_junk(), the
> former is also the only user of the latter, and it doesn't do what you
> mention.
> 
> The only place I see where it could delete emails from the DB for
> Evolution is check_summary_file(), and tracker_db_email_delete_email()
> seems to be called unconditionally for any junk/deleted message found.

the way it should work is as described above

I had tested it and it works (deleted and junk emails are pruned on next
restart of trackerd)

How do you currently tell which emails are new in the summary file?
Without storing the count you cannot know without verifying each email
exists in the services table (which would obviously be unacceptable
performance-wise)


> 
> 
> > 
> > 
> > > > 
> > > > we should also avoid doing this whenever the summary file changes which
> > > > is why we stored an offset in trunk so we skip over messages to get to
> > > > the new ones only when summary files change or do nothing if no new ones
> > > > are present
> > > 
> > > As said above, I think there are pretty good reasons to avoid this.
> > > 
> > > > 
> > > > the trunk way is faster so i would prefer that restored
> > > 
> > > If you bear with me, I'd prefer to try a few optimizations before having
> > > to add special cases.
> > 
> > well not doing the junk/deletion check every time the summary file changes
> > must obviously be faster?
> 
> Sure, but it's also more beneficial for users if tracker DB contents are
> up to date with the actual data. Also, IMHO adding special cases like
> this would break a design that makes tracker really extensible and easy
> to develop for.


that can be done easily - for a quick synch test just check that the last
known UID in the summary file (found using the stored message count) exists
in services - if it does not then you have a count mismatch and a resync is
required

this can be done whenever a new email arrives as it's not expensive

suggest having a resync method to do the above and a check_synch one to test
it's ok
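
A minimal sketch of what such a check could look like, assuming the
summary UIDs and the services table are both available as plain in-memory
lists. The UidList type and uid_list_contains() are hypothetical
stand-ins for the summary parser and the services lookup; only the
check_synch name comes from the suggestion above, and its signature here
is invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a list of UIDs: either the UIDs parsed from the
 * summary file (in order) or the UIDs already present in services. */
typedef struct {
    const char **uids;
    size_t       n_uids;
} UidList;

/* Linear lookup standing in for a query against the services table. */
bool uid_list_contains(const UidList *list, const char *uid)
{
    assert(list != NULL && uid != NULL);
    for (size_t i = 0; i < list->n_uids; i++) {
        if (strcmp(list->uids[i], uid) == 0)
            return true;
    }
    return false;
}

/* Quick synch test: the message at position stored_count - 1 in the summary
 * is the last one we indexed, so its UID should exist in services.
 * Returns true if the stored count still looks valid, false if a resync
 * is required. */
bool check_synch(const UidList *summary, const UidList *services,
                 size_t stored_count)
{
    assert(summary != NULL && services != NULL);
    if (stored_count == 0)
        return true;    /* nothing indexed yet, nothing to verify */
    if (stored_count > summary->n_uids)
        return false;   /* summary shrank, e.g. after an expunge */
    return uid_list_contains(services, summary->uids[stored_count - 1]);
}
```

This is a single lookup per summary change, so it stays cheap however
many emails there are. It only checks one position, so a resync would
still be needed to repair things once a mismatch is found.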

jamie

> 
> Regards,
>    Carlos
> 

_______________________________________________
tracker-list mailing list
tracker-list@gnome.org
http://mail.gnome.org/mailman/listinfo/tracker-list
