Re: [Dbmail-dev] messageblks logic
Jesse Norell wrote:

>> I think we can safely go to a strategy where we put the message in 2 blocks, 1 for the header and 1 for the body. However, if we make our parsing code handle messages only in that way, we need to produce a script for migrating from a database with split messages. This shouldn't be too much of a problem though.
>
> Can a "header" flag be added to the message blocks table at this time?

That would make perfect sense and should be added when we make the 2-block header-body split.

Ilja
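For illustration, a minimal sketch of the 2-block split with a "header" flag. It is written in Python with sqlite3 standing in for the real backend. The table `messageblks_demo`, its columns, and both helper functions are hypothetical names chosen for this example, not the DBMail schema or code.

```python
import sqlite3

def split_header_body(raw):
    """Split a raw message at the first blank line (CRLF or bare LF)."""
    candidates = [(raw.find(sep), sep) for sep in (b"\r\n\r\n", b"\n\n")]
    candidates = [(pos, sep) for pos, sep in candidates if pos != -1]
    if not candidates:
        return raw, b""          # no blank line: treat it all as header
    pos, sep = min(candidates)   # earliest separator wins
    cut = pos + len(sep)
    return raw[:cut], raw[cut:]

def store_message(conn, message_id, raw):
    """Store exactly two blocks per message: header (flag 1) and body (flag 0)."""
    header, body = split_header_body(raw)
    conn.executemany(
        "INSERT INTO messageblks_demo (message_id, is_header, blk) VALUES (?, ?, ?)",
        [(message_id, 1, header), (message_id, 0, body)],
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messageblks_demo (message_id INTEGER, is_header INTEGER, blk BLOB)"
)
store_message(conn, 1, b"Subject: hi\r\nFrom: a@example.org\r\n\r\nHello world\r\n")
```

With the flag in place, a header-only fetch becomes a `WHERE is_header = 1` filter instead of relying on block order.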
Re: [Dbmail-dev] messageblks logic
Paul J Stevens <[EMAIL PROTECTED]> said:

> Jesse Norell wrote:
>
> > > This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.
> >
> > This would/could be another application for making a more generic per-message data cache, rather than solely for message headers.
>
> How about some serialization of meta-data, like python's shelve or php's type-serializer, and introducing a cache-field to store results from introspective queries?

What does this mean in English? Presumably serialized data could no longer be plain-text searched. That would seem to miss most of the point of the cache fields, which is that we want to have the data directly in the database as messageid/key/value tuples (possibly with joined tables, but that's another debate -- although joins won, iirc, and I recanted my claims that joins would be slower than overstretched indices).

If instead we have messageid/serialized-data pairs, then we'd have to parse the serialized data and search it from within dbmail. By using key/value pairs, we can let the database do the searching... which is what databases do!

Aaron
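To make the messageid/key/value argument concrete, here is a small sketch in Python with an in-memory sqlite3 database. The table `header_cache_demo`, its columns, the sample rows, and the `search` helper are all hypothetical and only illustrate the idea of letting the database do the matching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE header_cache_demo (message_id INTEGER, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO header_cache_demo VALUES (?, ?, ?)",
    [
        (1, "subject", "messageblks logic"),
        (1, "from", "aaron@example.org"),
        (2, "subject", "release plan"),
        (2, "from", "ilja@example.org"),
    ],
)

def search(conn, name, needle):
    """Return ids of messages whose cached field `name` contains `needle`.

    The match runs inside the database; nothing is deserialized in the
    application. LIKE wildcards in `needle` are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT message_id FROM header_cache_demo "
        "WHERE name = ? AND value LIKE ? ORDER BY message_id",
        (name, "%" + needle + "%"),
    )
    return [row[0] for row in rows]

matches = search(conn, "subject", "logic")
```

With messageid/serialized-data pairs, the same search would need every candidate row fetched and unpacked in the application before it could be compared.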
Re: [Dbmail-dev] messageblks logic
Jesse Norell wrote:

>>> It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.
>>
>> This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.
>
> This would/could be another application for making a more generic per-message data cache, rather than solely for message headers.

How about some serialization of meta-data, like python's shelve or php's type-serializer, and introducing a cache-field to store results from introspective queries?

--
Paul Stevens mailto:[EMAIL PROTECTED]
NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
The Netherlands http://www.nfg.nl
Re: [Dbmail-dev] messageblks logic
I agree. We should fill in the wiki some more and get a plan together.

Aaron

"Jesse Norell" <[EMAIL PROTECTED]> said:

> > > It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.
> >
> > This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.
>
> This would/could be another application for making a more generic per-message data cache, rather than solely for message headers.
>
> --
> Jesse Norell
> [EMAIL PROTECTED] is not my email address; change "administrator" to my first name.
> --
> ___
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
Re: [Dbmail-dev] messageblks logic
> > It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.
>
> This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.

This would/could be another application for making a more generic per-message data cache, rather than solely for message headers.

--
Jesse Norell
[EMAIL PROTECTED] is not my email address; change "administrator" to my first name.
Re: [Dbmail-dev] messageblks logic
> I think we can safely go to a strategy where we put the message in 2 blocks, 1 for the header and 1 for the body. However, if we make our parsing code handle messages only in that way, we need to produce a script for migrating from a database with split messages. This shouldn't be too much of a problem though.

Can a "header" flag be added to the message blocks table at this time?

--
Jesse Norell
[EMAIL PROTECTED] is not my email address; change "administrator" to my first name.
Re: [Dbmail-dev] messageblks logic
>> Yes, it makes the code more complicated, but I think it's only because we still haven't fully adapted our thinking to our model. It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.
>
> This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.

This would also lend itself to the header caching, or at least let me finish up a sort and thread implementation that would be very fast.

just my 0.02

-leif
Re: [Dbmail-dev] messageblks logic
Aaron Stone wrote:

> I don't think we can safely cap message sizes "at a few mb" -- I've certainly sent myself some hefty sized emails and would be quite frustrated if this weren't a possibility because of an intrinsic, hard-coded limit.

I wasn't suggesting capping at a few MB, but rather at something like 128MB.

> Yes, it makes the code more complicated, but I think it's only because we still haven't fully adapted our thinking to our model. It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.

This can be very interesting to do. Of course, we still need to store the message in its original format, but to store some information on mime-parts would be very beneficial for performance.

> If MySQL can retrieve huge (eg 1 GB) rows easily, and the limitation is only when inserting or updating, then we might be able to use string concatenation to append parts of the message until the body row has the whole thing.

Setting max_allowed_packet high enough (128MB, maybe even 256MB) would make this unnecessary.

Ilja
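For readers who want to see what the string-concatenation idea would look like, here is a rough sketch in Python using sqlite3 as a stand-in database. The table `body_demo`, the helper `append_body`, and the tiny chunk size are assumptions made for this example. SQLite concatenates with `||`; on MySQL the equivalent would be the `CONCAT()` function, and whether that avoids the packet limit in practice is not tested here.

```python
import sqlite3

CHUNK = 4  # deliberately tiny; stands in for "anything below the packet limit"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE body_demo (message_id INTEGER PRIMARY KEY, body TEXT)")

def append_body(conn, message_id, text, chunk=CHUNK):
    """Create an empty body row, then grow it one chunk per UPDATE."""
    conn.execute("INSERT INTO body_demo VALUES (?, '')", (message_id,))
    for start in range(0, len(text), chunk):
        conn.execute(
            "UPDATE body_demo SET body = body || ? WHERE message_id = ?",
            (text[start:start + chunk], message_id),
        )

append_body(conn, 1, "Hello, DBMail list!")
stored = conn.execute(
    "SELECT body FROM body_demo WHERE message_id = 1"
).fetchone()[0]
```

Each statement carries only one chunk, but the row is rewritten on every append, which is one reason raising the packet limit is the simpler route.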
Re: [Dbmail-dev] messageblks logic
I don't think we can safely cap message sizes "at a few mb" -- I've certainly sent myself some hefty sized emails and would be quite frustrated if this weren't a possibility because of an intrinsic, hard-coded limit.

Yes, it makes the code more complicated, but I think it's only because we still haven't fully adapted our thinking to our model. It occurred to me today that if we were MIME parsing at delivery time, we could also store the mime structure of a message in the database. An IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is probably the single most common request from an email client, especially web-based ones that need to refresh their message list on almost every page hit.

If MySQL can retrieve huge (eg 1 GB) rows easily, and the limitation is only when inserting or updating, then we might be able to use string concatenation to append parts of the message until the body row has the whole thing.

Aaron

Ilja Booij <[EMAIL PROTECTED]> said:

> MySQL used to have a limit on the client-server communication that forced us to limit the size of blocks being transferred. Nowadays that limit is much higher.
>
> Setting the max_allowed_packet variable to a high number (a few MB for instance) on both client and server will allow for sending big blocks between client and server.
>
> The TEXT field itself has no limit. In PostgreSQL, there's a 1GB limit on the TEXT field.
>
> I think we can safely go to a strategy where we put the message in 2 blocks, 1 for the header and 1 for the body. However, if we make our parsing code handle messages only in that way, we need to produce a script for migrating from a database with split messages. This shouldn't be too much of a problem though.
>
> Ilja
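As an illustration of parsing at delivery time, the sketch below walks a message once and produces a nested summary of its MIME parts. It uses Python's standard `email` package, not DBMail's parser or GMime, and the summary is a simple (content type, children) tree rather than the IMAP BODYSTRUCTURE format. The function name `mime_structure` and the sample message are made up for the example; how such a summary would be stored is left open.

```python
from email import message_from_bytes

def mime_structure(part):
    """Return (content_type, [children]) for a parsed message or MIME part."""
    if part.is_multipart():
        children = [mime_structure(sub) for sub in part.get_payload()]
        return (part.get_content_type(), children)
    return (part.get_content_type(), [])

raw = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"See attachment.\r\n"
    b"--b1\r\n"
    b"Content-Type: application/pdf\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"JVBERi0=\r\n"
    b"--b1--\r\n"
)

# Computed once at delivery; a later structure request would read the
# stored summary instead of re-parsing the whole message.
structure = mime_structure(message_from_bytes(raw))
```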
Re: [Dbmail-dev] messageblks logic
Ilja Booij wrote:

>> I'm not sure what the best way to handle this is, though. My thinking has always revolved around handling huge messages without causing resource starvation. So a gigabyte email should be parsed in pieces and not all allocated into memory at once.
>>
>> But a four megabyte email might as well go into memory. You'd have to get a heck of a lot of people each reading a four meg email at once for that to be a major problem. But, since we want DBMail to be properly scalable to really large installations, it is a distinct possibility to have that many people each reading an email that large (scenario: the CEO sounds out his latest crazy plan in a four meg powerpoint, and everyone in the whole company starts pounding on the mail server to retrieve their copy.)
>
> OTOH, a message which consists of multiple blocks will (almost) always be fetched from the database completely anyway. With that in mind, it wouldn't matter if the message data is in 2 blocks (header and body) instead of > 2 blocks.
>
> I can remember something about the maximum size for a TEXT field in MySQL being the original reason for the choice of splitting the message into parts. I'll ask Eelco and Roel, they should know this.

I've asked Eelco about this:

MySQL used to have a limit on the client-server communication that forced us to limit the size of blocks being transferred. Nowadays that limit is much higher.

Setting the max_allowed_packet variable to a high number (a few MB for instance) on both client and server will allow for sending big blocks between client and server.

The TEXT field itself has no limit. In PostgreSQL, there's a 1GB limit on the TEXT field.

I think we can safely go to a strategy where we put the message in 2 blocks, 1 for the header and 1 for the body. However, if we make our parsing code handle messages only in that way, we need to produce a script for migrating from a database with split messages. This shouldn't be too much of a problem though.

Ilja
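The core of such a migration could be as small as the sketch below, written in Python for illustration. It assumes that the first existing block holds exactly the header, as described elsewhere in this thread, and that the remaining blocks are in their original order. The function name `to_two_blocks` is hypothetical, and reading from and writing back to the database is left out.

```python
def to_two_blocks(blocks):
    """Collapse an ordered list of message blocks into [header, body].

    Assumes blocks[0] is the header block and the rest are body pieces
    in their original order.
    """
    if not blocks:
        raise ValueError("message has no blocks")
    return [blocks[0], b"".join(blocks[1:])]

old_blocks = [b"Subject: big mail\r\n\r\n", b"first part, ", b"second part"]
new_blocks = to_two_blocks(old_blocks)
```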
Re: [Dbmail-dev] messageblks logic
Aaron Stone wrote:

> It simply fills up a block of size READ_BLOCK_SIZE, inserts it, then starts filling another one.

Doesn't that break searching messages? Breaking up messages on a fixed char width is easiest of all, but then single words in messages could span messageblks. But then, READ_BLOCK_SIZE is .5 MB, and mails larger than that tend to be mime-encoded anyway.

> If GMIME has a callback architecture, we might be able to ask it to parse messages in blocks of READ_BLOCK_SIZE, so we'd be able to retrieve rows from the database one at a time and pass them on to GMIME.

Such callbacks can probably only be implemented if messageblks are logical mime units, such as a full message or a mime-part. Or at the very least such logical mime parts would have to be reassembled before initializing a gmime object.

> I'm not sure what the best way to handle this is, though. My thinking has always revolved around handling huge messages without causing resource starvation. So a gigabyte email should be parsed in pieces and not all allocated into memory at once.

GB sized emails seem to me to be not-of-this-world at present. I'm quite certain most if not all ISPs have a cap on the maximum message size that's quite a lot smaller than that. Still, a valid guideline though.

> But a four megabyte email might as well go into memory. You'd have to get a heck of a lot of people each reading a four meg email at once for that to be a major problem. But, since we want DBMail to be properly scalable to really large installations, it is a distinct possibility to have that many people each reading an email that large (scenario: the CEO sounds out his latest crazy plan in a four meg powerpoint, and everyone in the whole company starts pounding on the mail server to retrieve their copy.)

I guess only a real-world test can expose the actual bottlenecks involved. You got me thinking ...
--
Paul Stevens mailto:[EMAIL PROTECTED]
NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
The Netherlands http://www.nfg.nl
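The concern about words spanning messageblks can be reproduced in a few lines. The sketch below is in Python and is not DBMail code. It splits a body on a fixed width, shows that a naive per-block search misses a word that straddles a boundary, and shows one possible workaround that carries the tail of the previous block into the next comparison. All function names and the 16-byte block size are made up for the example.

```python
def split_fixed(data, block_size):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def found_per_block(blocks, needle):
    """Naive search: look at each block in isolation."""
    return any(needle in blk for blk in blocks)

def found_with_overlap(blocks, needle):
    """Search block by block, keeping len(needle) - 1 trailing bytes
    of what has been seen so far, so that boundary matches are caught."""
    keep = len(needle) - 1
    tail = b""
    for blk in blocks:
        window = tail + blk
        if needle in window:
            return True
        tail = window[-keep:] if keep > 0 else b""
    return False

blocks = split_fixed(b"please review the quarterly report", 16)
# "report" is cut into "repo" + "rt" by the 16-byte boundary.
missed = found_per_block(blocks, b"report")
caught = found_with_overlap(blocks, b"report")
```

The overlap approach only needs one block plus a short tail in memory at a time, at the cost of doing the search in the application rather than in the database.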
Re: [Dbmail-dev] messageblks logic
> I'm not sure what the best way to handle this is, though. My thinking has always revolved around handling huge messages without causing resource starvation. So a gigabyte email should be parsed in pieces and not all allocated into memory at once.
>
> But a four megabyte email might as well go into memory. You'd have to get a heck of a lot of people each reading a four meg email at once for that to be a major problem. But, since we want DBMail to be properly scalable to really large installations, it is a distinct possibility to have that many people each reading an email that large (scenario: the CEO sounds out his latest crazy plan in a four meg powerpoint, and everyone in the whole company starts pounding on the mail server to retrieve their copy.)

OTOH, a message which consists of multiple blocks will (almost) always be fetched from the database completely anyway. With that in mind, it wouldn't matter if the message data is in 2 blocks (header and body) instead of > 2 blocks.

I can remember something about the maximum size for a TEXT field in MySQL being the original reason for the choice of splitting the message into parts. I'll ask Eelco and Roel, they should know this.

Ilja
Re: [Dbmail-dev] messageblks logic
It simply fills up a block of size READ_BLOCK_SIZE, inserts it, then starts filling another one.

I'm not familiar with the output code (tried reading through it, but quickly got confused by the details of the MIME parser). My understanding, though, is that the entire message has to be reassembled in memory at some point for the MIME parsing to work. Ilja, is this correct?

If GMIME has a callback architecture, we might be able to ask it to parse messages in blocks of READ_BLOCK_SIZE, so we'd be able to retrieve rows from the database one at a time and pass them on to GMIME.

I'm not sure what the best way to handle this is, though. My thinking has always revolved around handling huge messages without causing resource starvation. So a gigabyte email should be parsed in pieces and not all allocated into memory at once.

But a four megabyte email might as well go into memory. You'd have to get a heck of a lot of people each reading a four meg email at once for that to be a major problem. But, since we want DBMail to be properly scalable to really large installations, it is a distinct possibility to have that many people each reading an email that large (scenario: the CEO sounds out his latest crazy plan in a four meg powerpoint, and everyone in the whole company starts pounding on the mail server to retrieve their copy.)

Aaron

Paul J Stevens <[EMAIL PROTECTED]> said:

> Hi all,
>
> Ilja, Aaron,
>
> I'm playing around with gmime to see if I can rebuild the message injection and extraction logic around glib/gmime.
>
> What I'd like to know: what would be the logic for splitting a message into messageblks?
>
> I know the first blk is for the email messageheader. Easy. But after that? What criteria are used? Line split? Fixed string sizes? Mimepart boundaries? I have a hard time understanding the current codebase.
>
> And are the current criteria implicitly required elsewhere in the code...
>
> I guess imap-searching requires splitting on line-boundaries at the minimum, or maybe word-boundaries.
>
> But other than that?
>
> --
> Paul Stevens mailto:[EMAIL PROTECTED]
> NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
> The Netherlands http://www.nfg.nl
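Whether GMime offers this kind of interface is left open in the thread. As an analogy only, Python's standard library has an incremental parser that accepts a message in arbitrary pieces, which shows the shape of "retrieve one row, hand it to the parser, repeat". The function `parse_from_blocks` and the 10-byte block size are invented for this sketch, and the parser still builds the complete message object in memory, so this avoids a second raw copy rather than bounding memory use.

```python
from email.parser import BytesFeedParser

def parse_from_blocks(blocks):
    """Feed message blocks to the parser one at a time, in order.

    `blocks` can be any iterable, for example a database cursor that
    yields one block per row.
    """
    parser = BytesFeedParser()
    for blk in blocks:
        parser.feed(blk)
    return parser.close()

raw = b"Subject: block test\r\nFrom: a@example.org\r\n\r\nline one\r\nline two\r\n"

# Stand-in for rows coming back from the database: 10-byte blocks that
# ignore line, word, and header boundaries.
rows = (raw[i:i + 10] for i in range(0, len(raw), 10))
msg = parse_from_blocks(rows)
subject = msg["Subject"]
```

Because the parser buffers partial lines itself, the block boundaries do not have to fall on line or MIME-part boundaries in this model.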