Re: [Dbmail-dev] messageblks logic

2004-05-25 Thread Ilja Booij

Jesse Norell wrote:
I think we can safely go to a strategy where we put the message in 2 
blocks, 1 for the header and 1 for the body. However, if we make our 
parsing code to handle messages only in that way, we need to produce a 
script for migrating from a database with split messages. This shouldn't
be too much of a problem though.



  Can a "header" flag be added to the message blocks table at this time?


That would make perfect sense and should be added when we make the 
2-block header-body split.
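
A minimal sketch of the 2-block split with a header flag, for illustration only. It is written in Python rather than DBMail's C, and `split_header_body` and the boolean flag are hypothetical names, not existing DBMail code. It assumes the header ends at the first empty line and keeps that separator in the header block, so joining the two blocks reproduces the original message byte for byte.

```python
from typing import List, Tuple


def split_header_body(raw: bytes) -> List[Tuple[bool, bytes]]:
    """Return two (is_header, data) blocks: the header first, then the body."""
    # The header ends at the first empty line; accept CRLF or bare LF endings.
    candidates = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        idx = raw.find(sep)
        if idx != -1:
            candidates.append((idx, len(sep)))
    if not candidates:
        # No empty line found: treat everything as header, with an empty body.
        return [(True, raw), (False, b"")]
    idx, sep_len = min(candidates)  # earliest separator wins
    cut = idx + sep_len
    return [(True, raw[:cut]), (False, raw[cut:])]


raw = b"From: a@example.org\r\nSubject: hi\r\n\r\nbody line\r\n"
blocks = split_header_body(raw)
```

With a flag on each row, a header-only fetch could select on the flag instead of relying on block order.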


Ilja



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Aaron Stone
Paul J Stevens <[EMAIL PROTECTED]> said:

> 
> Jesse Norell wrote:
> >>
> >>This can be very interesting to do. Of course, we still need to store 
> >>the message in its original format, but to store some information on 
> >>mime-parts would be very beneficial for performance.
> > 
> >   This would/could be another application for making a more generic
> > per-message data cache, rather than solely for message headers.
> 
> How about some serialization of meta-data like python's shelve or php's 
> type-serializer and introduce a cache-field to store results from 
> introspective queries.
> 

What does this mean in English?

Presumably serialized data could no longer be plain-text searched. That would
seem to miss most of the point of the cache fields, which is that we want to
have the data directly in the database as messageid/key/value tuples (possibly
with joined tables, but that's another debate -- although joins won, iirc, and
I recanted my claims that joins would be slower than overstretched indices).
If instead we have messageid/serialized-data pairs, then we'd have to parse
the serialized data and search it from within dbmail. By using key/value
pairs, we can let the database do the searching... which is what databases do!
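
A small sketch of the difference, using Python's sqlite3 module purely for illustration. The table and column names (`header_cache`, `serialized_cache`, `payload`) are made up for this example and are not the DBMail schema, and JSON stands in for whatever serializer would be used.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE header_cache (message_id INTEGER, name TEXT, value TEXT)")
conn.execute("CREATE TABLE serialized_cache (message_id INTEGER, payload TEXT)")

messages = {
    1: {"subject": "messageblks logic", "from": "aaron@example.org"},
    2: {"subject": "lunch", "from": "paul@example.org"},
}
for mid, headers in messages.items():
    conn.executemany(
        "INSERT INTO header_cache VALUES (?, ?, ?)",
        [(mid, name, value) for name, value in headers.items()],
    )
    conn.execute("INSERT INTO serialized_cache VALUES (?, ?)", (mid, json.dumps(headers)))

# messageid/key/value rows: the database does the searching.
kv_hits = [row[0] for row in conn.execute(
    "SELECT message_id FROM header_cache WHERE name = 'subject' AND value LIKE '%logic%'"
)]

# messageid/serialized-data pairs: every row is fetched and decoded by the application.
ser_hits = [
    mid
    for mid, payload in conn.execute("SELECT message_id, payload FROM serialized_cache")
    if "logic" in json.loads(payload)["subject"]
]
```

Both approaches find message 1, but only the first one lets the database use an index on the key and value columns.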

Aaron



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Paul J Stevens



Jesse Norell wrote:

It occurred to me today that if we were MIME parsing at delivery time, we
could also store the mime structure of a message in the database. An IMAP
BODY.PEEK for mime structures would be nearly instantaneous. This is probably
the single most common request from an email client, especially web-based ones
that need to refresh their message list on almost every page hit.


This can be very interesting to do. Of course, we still need to store 
the message in its original format, but to store some information on 
mime-parts would be very beneficial for performance.
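
As a rough illustration of what could be recorded at delivery time, here is a sketch that walks a parsed message and lists its parts. It uses Python's standard email package rather than gmime, `mime_structure` is a hypothetical helper, and the dotted part labels are only illustrative; they are not claimed to match IMAP section numbering.

```python
from email import message_from_bytes
from email.message import Message


def mime_structure(msg: Message, path: str = "1"):
    """Return a flat list of (part_label, content_type) pairs."""
    rows = [(path, msg.get_content_type())]
    if msg.is_multipart():
        # For multipart messages get_payload() returns the list of sub-parts.
        for i, part in enumerate(msg.get_payload(), start=1):
            rows.extend(mime_structure(part, f"{path}.{i}"))
    return rows


raw = (
    b"From: a@example.org\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"hello\r\n"
    b"--XYZ\r\n"
    b"Content-Type: application/pdf\r\n"
    b"\r\n"
    b"fake\r\n"
    b"--XYZ--\r\n"
)
structure = mime_structure(message_from_bytes(raw))
```

Rows like these could be stored next to the original message, so a structure request would not need to reparse the body.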



  This would/could be another application for making a more generic
per-message data cache, rather than solely for message headers.


How about some serialization of meta-data like python's shelve or php's 
type-serializer and introduce a cache-field to store results from 
introspective queries.


--
  
  Paul Stevens  mailto:[EMAIL PROTECTED]
  NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
  The Netherlands http://www.nfg.nl


Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Aaron Stone
I agree. We should fill in the wiki some more and get a plan together.

Aaron


""Jesse Norell"" <[EMAIL PROTECTED]> said:

> > > It occurred to me today that if we were MIME parsing at delivery time,
> > > we could also store the mime structure of a message in the database. An
> > > IMAP BODY.PEEK for mime structures would be nearly instantaneous. This is
> > > probably the single most common request from an email client, especially
> > > web-based ones that need to refresh their message list on almost every
> > > page hit.
> > 
> > This can be very interesting to do. Of course, we still need to store 
> > the message in its original format, but to store some information on 
> > mime-parts would be very beneficial for performance.
> 
>   This would/could be another application for making a more generic
> per-message data cache, rather than solely for message headers.
> 
> 
> 
> --
> Jesse Norell
> 
> [EMAIL PROTECTED] is not my email address;
> change "administrator" to my first name.
> --
> 
> ___
> Dbmail-dev mailing list
> Dbmail-dev@dbmail.org
> http://twister.fastxs.net/mailman/listinfo/dbmail-dev
> 








Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Jesse Norell

> > It occurred to me today that if we were MIME parsing at delivery time, we
> > could also store the mime structure of a message in the database. An IMAP
> > BODY.PEEK for mime structures would be nearly instantaneous. This is
> > probably the single most common request from an email client, especially
> > web-based ones that need to refresh their message list on almost every
> > page hit.
> 
> This can be very interesting to do. Of course, we still need to store 
> the message in its original format, but to store some information on 
> mime-parts would be very beneficial for performance.

  This would/could be another application for making a more generic
per-message data cache, rather than solely for message headers.



--
Jesse Norell

[EMAIL PROTECTED] is not my email address;
change "administrator" to my first name.
--



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Jesse Norell


> I think we can safely go to a strategy where we put the message in 2 
> blocks, 1 for the header and 1 for the body. However, if we make our 
> parsing code to handle messages only in that way, we need to produce a 
> script for migrating from a database with split messages. This shouldn't
> be too much of a problem though.

  Can a "header" flag be added to the message blocks table at this time?

--
Jesse Norell

[EMAIL PROTECTED] is not my email address;
change "administrator" to my first name.
--



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Leif Jackson
>>
>> Yes, it makes the code more complicated, but I think it's only because we
>> still haven't fully adapted our thinking to our model. It occurred to me
>> today that if we were MIME parsing at delivery time, we could also store the
>> mime structure of a message in the database. An IMAP BODY.PEEK for mime
>> structures would be nearly instantaneous. This is probably the single most
>> common request from an email client, especially web-based ones that need to
>> refresh their message list on almost every page hit.
>
> This can be very interesting to do. Of course, we still need to store
> the message in its original format, but to store some information on
> mime-parts would be very beneficial for performance.

This would also lend to the header caching, or at least let me finish up a
sort and thread implementation that would be very fast.

just my 0.02

-leif



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Ilja Booij

Aaron Stone wrote:

I don't think we can safely cap message sizes "at a few mb" -- I've certainly
sent myself some hefty sized emails and would be quite frustrated if this
weren't a possibility because of an intrinsic, hard-coded limit.


I wasn't suggesting capping at a few MB, but rather at something like 
128MB.


Yes, it makes the code more complicated, but I think it's only because we
still haven't fully adapted our thinking to our model. It occurred to me today
that if we were MIME parsing at delivery time, we could also store the mime
structure of a message in the database. An IMAP BODY.PEEK for mime structures
would be nearly instantaneous. This is probably the single most common request
from an email client, especially web-based ones that need to refresh their
message list on almost every page hit.


This can be very interesting to do. Of course, we still need to store 
the message in its original format, but to store some information on 
mime-parts would be very beneficial for performance.


If MySQL can retrieve huge (eg 1 GB) rows easily, and the limitation is only
when inserting or updating, then we might be able to use string concatenation
to append parts of the message until the body row has the whole thing.


Setting the max_allowed_packet high enough (128MB, maybe even 256MB) 
would make this unnecessary.


Ilja




Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Aaron Stone
I don't think we can safely cap message sizes "at a few mb" -- I've certainly
sent myself some hefty sized emails and would be quite frustrated if this
weren't a possibility because of an intrinsic, hard-coded limit.

Yes, it makes the code more complicated, but I think it's only because we
still haven't fully adapted our thinking to our model. It occurred to me today
that if we were MIME parsing at delivery time, we could also store the mime
structure of a message in the database. An IMAP BODY.PEEK for mime structures
would be nearly instantaneous. This is probably the single most common request
from an email client, especially web-based ones that need to refresh their
message list on almost every page hit.

If MySQL can retrieve huge (eg 1 GB) rows easily, and the limitation is only
when inserting or updating, then we might be able to use string concatenation
to append parts of the message until the body row has the whole thing.
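
A sketch of the append-by-concatenation idea, using Python's sqlite3 module so that it runs standalone. The table layout and the `store_body_in_chunks` helper are hypothetical. SQLite's `||` operator is used here; on MySQL the equivalent would presumably be `CONCAT()`, and whether the server handles very large rows well this way is exactly the open question.

```python
import sqlite3

CHUNK_SIZE = 8  # tiny on purpose, so the demo performs several appends


def store_body_in_chunks(conn, message_id, body, chunk_size=CHUNK_SIZE):
    """Insert an empty body row, then append the body one chunk at a time."""
    conn.execute(
        "INSERT INTO messageblks (message_id, data) VALUES (?, '')", (message_id,)
    )
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        # Each statement carries only one chunk over the client/server link.
        conn.execute(
            "UPDATE messageblks SET data = data || ? WHERE message_id = ?",
            (chunk, message_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messageblks (message_id INTEGER PRIMARY KEY, data TEXT)")
store_body_in_chunks(conn, 1, "a fairly long message body")
stored = conn.execute("SELECT data FROM messageblks WHERE message_id = 1").fetchone()[0]
```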

Aaron



Ilja Booij <[EMAIL PROTECTED]> said:

> MySQL used to have a limit on the client-server communication that 
> forced us to limit the size of blocks being transferred. Nowadays that 
> limit is much higher.
> 
> Setting the max_allowed_packet variable to a high number (a few MB for 
> instance) on both client and server will allow for sending big blocks 
> between client and server.
> 
> The TEXT field itself has no limit.
> In PostgreSQL, there's a 1GB limit on the TEXT field.
> 
> I think we can safely go to a strategy where we put the message in 2 
> blocks, 1 for the header and 1 for the body. However, if we make our 
> parsing code to handle messages only in that way, we need to produce a 
> script for migrating from a database with split messages. This shouldn't
> be too much of a problem though.
> 
> Ilja
> 
> 




Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Ilja Booij

Ilja Booij wrote:

I'm not sure what the best way to handle this is, though. My thinking has
always revolved around handling huge messages without causing resource
starvation. So a gigabyte email should be parsed in pieces and not all
allocated into memory at once. But a four megabyte email might as well go into
memory. You'd have to get a heck of a lot of people each reading a four meg
email at once for that to be a major problem. But, since we want DBMail to be
properly scalable to really large installations, it is a distinct possibility
to have that many people each reading an email that large (scenario: the CEO
sounds out his latest crazy plan in a four meg powerpoint, and everyone in the
whole company starts pounding on the mail server to retrieve their copy.)



OTOH, a message which consists of multiple blocks will (almost) always be
  fetched from the database completely anyway. With that in mind, it 
wouldn't matter if the message data is in 2 blocks (header and body) 
instead of > 2 blocks.


I can remember something about the maximum size for a TEXT field in 
MySQL being the original reason for the choice of splitting the message 
into parts. I'll ask Eelco and Roel, they should know this.


I've asked Eelco about this:

MySQL used to have a limit on the client-server communication that 
forced us to limit the size of blocks being transferred. Nowadays that 
limit is much higher.


Setting the max_allowed_packet variable to a high number (a few MB for 
instance) on both client and server will allow for sending big blocks 
between client and server.


The TEXT field itself has no limit.
In PostgreSQL, there's a 1GB limit on the TEXT field.

I think we can safely go to a strategy where we put the message in 2 
blocks, 1 for the header and 1 for the body. However, if we make our 
parsing code to handle messages only in that way, we need to produce a 
script for migrating from a database with split messages. This shouldn't
be too much of a problem though.
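
The core of such a migration could look roughly like the sketch below. It assumes that the first existing block of a message is the header and that the remaining blocks are stored in order; `merge_blocks` is a hypothetical helper, and reading from and writing back to the database is left out.

```python
from typing import List, Tuple


def merge_blocks(blocks: List[bytes]) -> List[Tuple[bool, bytes]]:
    """Collapse an ordered list of blocks into (is_header, data) pairs."""
    if not blocks:
        return []
    header, rest = blocks[0], blocks[1:]
    # All body blocks are joined in their stored order into a single body block.
    return [(True, header), (False, b"".join(rest))]


old_blocks = [b"Subject: hi\r\n\r\n", b"first part, ", b"second part"]
new_blocks = merge_blocks(old_blocks)
```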


Ilja



Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Paul J Stevens



Aaron Stone wrote:

It simply fills up a block of size READ_BLOCK_SIZE, inserts it, then starts
filling another one.


Doesn't that break searching messages? Breaking up messages on a fixed 
char width is easiest of all, but then single words in messages could 
span messageblks. But then, READ_BLOCK_SIZE is .5 MB, and mails larger 
than that tend to be mime-encoded anyway.
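
To make the concern concrete, here is a small sketch with a deliberately tiny block size. `split_fixed` and `search_blocks` are hypothetical helpers. Searching each block on its own misses a word that straddles a block boundary, while carrying over the last `len(term) - 1` characters of the previous block finds it without reassembling the whole message.

```python
def split_fixed(data: str, block_size: int):
    """Cut data into fixed-width blocks, ignoring word and line boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def search_blocks(blocks, term: str) -> bool:
    """Search block by block, keeping a short tail so boundary matches are seen."""
    tail = ""
    for block in blocks:
        window = tail + block
        if term in window:
            return True
        # Keep just enough characters to complete a match in the next block.
        tail = window[-(len(term) - 1):] if len(term) > 1 else ""
    return False


body = "please review the quarterly powerpoint today"
blocks = split_fixed(body, 30)

naive_hit = any("powerpoint" in block for block in blocks)
overlap_hit = search_blocks(blocks, "powerpoint")
```

Here `blocks` splits "powerpoint" after "po", so the per-block search fails and the overlapping search succeeds.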





If GMIME has a callback architecture, we might be able to ask it to parse
messages in blocks of READ_BLOCK_SIZE, so we'd be able to retrieve rows from
the database one at a time and pass it on to GMIME.


Such callbacks can probably only be implemented if messageblks are
logical mime units, such as a full message or a mime-part. Or at the 
very least such logical mime parts would have to be reassembled before 
initializing a gmime object.



I'm not sure what the best way to handle this is, though. My thinking has
always revolved around handling huge messages without causing resource
starvation. So a gigabyte email should be parsed in pieces and not all
allocated into memory at once. 


GB sized emails seem to me to be not-of-this-world at present. I'm quite
certain most if not all ISPs have a cap on the max mail message size
that's quite a lot smaller than that. Still, a valid guideline though.



But a four megabyte email might as well go into
memory. You'd have to get a heck of a lot of people each reading a four meg
email at once for that to be a major problem. But, since we want DBMail to be
properly scalable to really large installations, it is a distinct possibility
to have that many people each reading an email that large (scenario: the CEO
sounds out his latest crazy plan in a four meg powerpoint, and everyone in the
whole company starts pounding on the mail server to retrieve their copy.)


I guess only a real-world test can expose the actual bottlenecks 
involved. You got me thinking ...



--
  
  Paul Stevens  mailto:[EMAIL PROTECTED]
  NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
  The Netherlands http://www.nfg.nl


Re: [Dbmail-dev] messageblks logic

2004-05-24 Thread Ilja Booij

I'm not sure what the best way to handle this is, though. My thinking has
always revolved around handling huge messages without causing resource
starvation. So a gigabyte email should be parsed in pieces and not all
allocated into memory at once. But a four megabyte email might as well go into
memory. You'd have to get a heck of a lot of people each reading a four meg
email at once for that to be a major problem. But, since we want DBMail to be
properly scalable to really large installations, it is a distinct possibility
to have that many people each reading an email that large (scenario: the CEO
sounds out his latest crazy plan in a four meg powerpoint, and everyone in the
whole company starts pounding on the mail server to retrieve their copy.)


OTOH, a message which consists of multiple blocks will (almost) always be
 fetched from the database completely anyway. With that in mind, it 
wouldn't matter if the message data is in 2 blocks (header and body) 
instead of > 2 blocks.


I can remember something about the maximum size for a TEXT field in 
MySQL being the original reason for the choice of splitting the message 
into parts. I'll ask Eelco and Roel, they should know this.


Ilja



Re: [Dbmail-dev] messageblks logic

2004-05-23 Thread Aaron Stone
It simply fills up a block of size READ_BLOCK_SIZE, inserts it, then starts
filling another one. I'm not familiar with the output code (tried reading
through it, but quickly got confused by the details of the MIME parser). My
understanding, though, is that the entire message has to be reassembled in
memory at some point for the MIME parsing to work. Ilja, is this correct?

If GMIME has a callback architecture, we might be able to ask it to parse
messages in blocks of READ_BLOCK_SIZE, so we'd be able to retrieve rows from
the database one at a time and pass it on to GMIME.
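
Whether GMIME offers this is the open question here. As an illustration of the general idea only, Python's standard library has an incremental parser, `email.parser.BytesFeedParser`, that accepts input one block at a time. The sketch below feeds it blocks of a hypothetical, deliberately tiny READ_BLOCK_SIZE, as if they were rows fetched one by one. Note that the parsed message object still ends up holding the whole message in memory; feeding in blocks only avoids building one contiguous input buffer first.

```python
from email.parser import BytesFeedParser

READ_BLOCK_SIZE = 8  # tiny for the demo; the real value is far larger


def parse_from_blocks(blocks):
    """Feed blocks to the parser one at a time, as rows would arrive."""
    parser = BytesFeedParser()
    for block in blocks:
        parser.feed(block)  # partial lines are buffered until completed
    return parser.close()


raw = b"Subject: test\r\n\r\nline one\r\nline two\r\n"
blocks = (raw[i:i + READ_BLOCK_SIZE] for i in range(0, len(raw), READ_BLOCK_SIZE))
msg = parse_from_blocks(blocks)
```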

I'm not sure what the best way to handle this is, though. My thinking has
always revolved around handling huge messages without causing resource
starvation. So a gigabyte email should be parsed in pieces and not all
allocated into memory at once. But a four megabyte email might as well go into
memory. You'd have to get a heck of a lot of people each reading a four meg
email at once for that to be a major problem. But, since we want DBMail to be
properly scalable to really large installations, it is a distinct possibility
to have that many people each reading an email that large (scenario: the CEO
sounds out his latest crazy plan in a four meg powerpoint, and everyone in the
whole company starts pounding on the mail server to retrieve their copy.)

Aaron


Paul J Stevens <[EMAIL PROTECTED]> said:

> Hi all,
> 
> Ilja, Aaron,
> 
> I'm playing around with gmime to see if I can rebuild the message 
> injection and extraction logic around glib/gmime.
> 
> What I'd like to know: what would be the logic for splitting a message 
> into messageblks?
> 
> I know the first blk is for the email messageheader. Easy. But after 
> that? What criteria are used? Line split? Fixed string sizes? Mimepart
> boundaries?  I have a hard time understanding the current codebase.
> 
> And are the current criteria implicitly required elsewhere in the code...
> 
> I guess imap-searching requires splitting on line-boundaries at the 
> minimum, or maybe word-boundaries.
> 
> But other than that?
> 
> 
> -- 
>
>Paul Stevens  mailto:[EMAIL PROTECTED]
>NET FACILITIES GROUP PGP: finger [EMAIL PROTECTED]
>The Netherlands http://www.nfg.nl
> 


