new InputStream class for mail data

Richard O. Hammer Wed, 02 Jul 2003 09:04:54 -0700

Attached you will find a class that I offer for your consideration. It is called SMTPDataInputStream and it would be used in the doDATA() method of SMTPHandler. There it would replace CharTerminatedInputStream, BytesReadResetInputStream, SizeLimitedInputStream, and DotStuffingInputStream, since I believe it does the work of all those.

There are minor ways in which this class responds differently, and more correctly I believe, than the current CharTerminatedInputStream. The end of mail data indicator which this recognizes is a period alone in a line, rather than CRLF.CRLF as recognized in James at present. As such this class returns the CRLF which terminates the last line of message body data (the CRLF before the period), and it allows the possibility of empty mail data (with a period being the first character sent in the data).

I raised the question of whether these changes are indeed a better
interpretation of RFC 2821 on the mailing list [EMAIL PROTECTED]  If
you like you can see that thread starting at
http://www.imc.org/ietf-smtp/mail-archive/msg00703.html .  If you read
that thread and think the question remains unresolved you might adopt
the bias which I use in this case: multiply the value of each word
from Dan Bernsetin by 1000.

Improved performance would be another advantage which I would expect from adopting this class in James. The present stack of InputStreams makes more method calls for each byte read than this replacement. But, I have to admit, it could be that performance in this code is not a major concern.

It is possible that this class could also serve in NNTPHandler to
replace CharTerminatedInputStream there, but I am not familiar with
the code in NNTPHandler.

I also attach a file showing old and new blocks of code in SMTPHandler, showing changes needed to employ this new SMTPDataInputStream. In addition to these changes, the package name for SMTPDataInputStream will need to be corrected, as well as the importing of WatchDog and MessageSizeException.

I have tested this SMTPDataInputStream as a unit, and it performs correctly in every test I have thought to give it. But I have not tested it installed in SMTPHandler.

Rich Hammer
Hillsborough, N.C.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;


/** 
 * <p>An InputStream for SMTP message body data. Performs four of the functions
 *  needed while receiving the message body in the SMTP DATA command:
 * <ol>
 *   <li>watches for the end of mail data indicator, and signals this by
 *   returning EOF, that is <code>-1</code>.
 *   </li>
 *   <li>removes dot stuffing as described in RFC 2821 section 4.5.2.
 *   </li>
 *   <li>works with james.util.watchdog.Watchdog to police the minimum rate of 
 *   data transfer.  Calls Watchdog.reset() every time a certain number of
 *   bytes have been read, thus forestalling intervention of the watchdog for
 *   another time increment.
 *   </li>
 *   <li>optionally polices the total size of the data.  Throws 
 *   MessageSizeException if this exceeds a limit.
 *  </li>
 * </ol>
 * </p>
 * <p>The end of mail data indicator which this class recognizes is a period
 * alone in a line.  This indicator is often described as "CRLF.CRLF", but that
 * description leads to errors in possibly minor ways.  The better 
 * description which this class recognizes, "a period alone in a line", leads
 * to better behavior in two ways:
 * <ol>
 *   <li>When the end of mail data indicator is recognized in the input stream,
 *    the CRLF which immediately preceded the period in the indicator is 
 *    returned as part of the mail data as the CRLF which concludes the final
 *    line of mail data, rather than being discarded as part of the end of mail
 *    data indicator.
 *   </li>
 *   <li>The end of mail data indicator can occur in the very first line of 
 *   mail data, with the period being the first character read.
 *   </li>
 * </ol>
 * RFC 2821 discusses this in sections 2.3.7, 3.3, 4.1.1.4, 4.5.2.
 * </p>
 * <p> This class resets the WatchDog each time it has read a quota of bytes
 * as specified in the constructor.  But it does not reset or stop the 
 * WatchDog when it recognizes the end of mail data indicator and returns EOF.
 * </p>
 * <p>This class returns EOF in two circumstances: when it recognizes the end 
 * of mail data indicator in the stream (a normal occurrence); when the 
 * underlying stream signals EOF (probably an error of some sort).  This 
 * behavior may be okay, in that it mimics the behavior of the earlier James 
 * class CharTerminatedInputStream, but it may need further examination at some
 * point.
 * </p>
 * <p>An instance of this class can not be reset.  A new instance must 
 * constructed for each message's data.
 * </p>
 */
public class SMTPDataInputStream extends InputStream{
    BufferedInputStream in;

    /* For a discussion of some decisions made in designing this class,
     * see the comment at the end.
     */

    // The kinds of bytes we care about
    static final int
        EOF    = -1,
        CR     = 13,
        LF     = 10,
        PERIOD = 46;

    //the states in which this SMTPDataInputStream may be
    static final int
        LINE_STARTING_STATE  = 0, //at the start of a line
        MID_LINE_STATE       = 1, //the most common state
        CR_STATE             = 2, //a CR has been received        
        INIT_PERIOD_STATE    = 3, //a period at start of line
        INIT_PERIOD_CR_STATE = 4, //initial period then CR
        BUFFERING_STATE      = 5, //see comments further down
        EOF_STATE            = 6; //either EOF of end of data

    //the variable in which we keep the present state
    private int receivedState = LINE_STARTING_STATE;
        
    /* This comment describes the strategy for monitoring message size and
     * data transfer rate.  The five variables below serve these purposes.
     * 
     * The use of maxMessageSize should be obvious, but note that if it is
     * set to zero then it signals that there is no limit on message size.
     * 
     * Both of the limits (message size and data rate) are checked with one
     * operation in the most frequently used code by using a quota, kept in
     * currentQuota.  When the quota is reached then the code detects which 
     * limit has been reached and responds accordingly.
     * 
     * currentQuota will hold one of two values.  Normally it is 
     * minBytesPerUnitTime, as first set in the constructor.  It retains this
     * value until, if reading a large messsage, maxMessageSize is approached 
     * close enough that bytesSummedSoFar exceeds remnantQuotaThreshold (which
     * is set in the constructor to maxMessageSize - minBytesPerUnitTime).  
     * Then a smaller quota is set, sufficient to just exceed maxMessageSize.
     */    
    long maxMessageSize,
         bytesSummedSoFar,
         remnantQuotaThreshold;
    int currentQuota,
        bytesRemainingInQuota;
        
    WatchDog watchDog;
    
    /**
     * Constructor with the option to set maximum message size.
     * 
     * @param in the BufferedInputStream from which this will read
     * @param maxMessageSize the limit on the number of bytes which this will
     * return.  But, if set to zero, signifies that there is no limit.
     * @param watchDog a WatchDog object which will intervene (elsewhere, not
     * in this class) if too much time passes without it being reset. 
     * @param minBytesPerUnitTime the number of bytes to be read before each
     * call to <code>WatchDog.reset()</code>
     */
    public SMTPDataInputStream(BufferedInputStream in, long maxMessageSize, 
                WatchDog watchDog, int minBytesPerUnitTime) {
        if(in == null || maxMessageSize < 0 || 
            watchDog == null || minBytesPerUnitTime < 1)
        {   
            throw new IllegalArgumentException(
                        "Illegal argument to SMTPDataInputStream constructor");
        }
        this.in = in;
        this.maxMessageSize = maxMessageSize;
        this.watchDog = watchDog;
        if (maxMessageSize == 0){
            //there is no limit on message size, so ...
            currentQuota = minBytesPerUnitTime;
        }
        else{
            //currentQuota might be determined by maxMessageSize
            currentQuota = 
               (int)Math.min(maxMessageSize + 1, minBytesPerUnitTime);
            remnantQuotaThreshold = maxMessageSize - minBytesPerUnitTime;
        }
        bytesRemainingInQuota = currentQuota;
    }

    /**
     * Constructor for a message stream of unlimited size.
     * 
     * @param in the BufferedInputStream from which this will read
     * @param watchDog a WatchDog object which will intervene (elsewhere, not
     * in this class) if too much time passes without it being reset. 
     * @param minBytesPerUnitTime the number of bytes to be read before each
     * call to <code>WatchDog.reset()</code>
     */
    public SMTPDataInputStream(BufferedInputStream in,
                WatchDog watchDog, int minBytesPerUnitTime) {
        this(in, 0, watchDog, minBytesPerUnitTime);
    }
            
    /* This declares a buffer for one character.  We need to use this in
       one rare circumstance.  For more information look in the
       switch-case below for INIT_PERIOD_CR_STATE, in the "else" branch.
     */
    private int characterBuffer;

    /**
     * Reads the next byte of data from the input stream.  Removes dot stuffing
     * and watches for the SMTP end of mail data indicator, which is a period
     * alone in a line.  This method blocks until input data is available, the 
     * end of mail data indicator is recognized, the end of the underlying 
     * stream is reached, or an exception is thrown.
     * 
     * Monitors the number of bytes that have been read, and causes the 
     * Watchdog to be reset each time minBytesPerUnitTime bytes have been read.
     * This will also throw a MessageSizeException if an attempt to read more 
     * than the maximum number of bytes would return anything other than
     * <code>-1</code>.
     * 
     * @return the next byte of data, or <code>-1</code> (EOF) if the end of 
     * mail data indicator is recognized or the end of the underlying stream is
     * reached.
     * @throws IOException if an I/O error occurs in reading from the 
     * underlying stream.
     * @throws MessageSizeException upon attempt to read one byte more than
     * the set maximum.
     */
    public int read() throws IOException{
        /* Strategy: This stream is always in one of the seven states
         * enumerated above.  This state is kept in receivedState.
         * The state starts in LINE_STARTING_STATE.  On each
         * call we act based upon the current receivedState and we
         * update receivedState to reflect the new state. 
         */
        
        if (receivedState ==  EOF_STATE) {
            return EOF;
        }

        int b; //the byte we will get and return
        
        /* getByte is a labeled block in which we set b to be returned by the 
         * code after this block, in the usual case that is.  But this block 
         * can also be exited in either of the two cases when EOF is returned,
         * that is: 
         * EOF returned from underlying stream; or
         * end of mail data indicator recognized.
         */
        getByte:
        {        
            /* Check for the unusual circumstance in which we have
               buffered a character
             */
            if (receivedState == BUFFERING_STATE){
                if (characterBuffer == CR){
                    receivedState = CR_STATE;
                }
                else{
                    receivedState = MID_LINE_STATE;
                }
                b = characterBuffer;
                break getByte;
            }
    
            /* If we arrive here this is the usual case in which we
            need to read from the underlying stream.  We have a while loop
            since we may need to read as many as three characters from the
            underlying stream before we know what to return.  This while loop
            is exited from one of the branches with either "return EOF;"
            or "break getByte;".
            */
            while(true){
                b = in.read();
                
                if (b == EOF){
                    receivedState = EOF_STATE;
                    return EOF;
                }
                switch(receivedState){
                    case MID_LINE_STATE:
                        if (b == CR){
                            receivedState = CR_STATE;
                        }
                        else{
                            //receivedState remains MID_LINE_STATE
                        }
                        break getByte;
                    case LINE_STARTING_STATE:
                        switch(b){
                            case PERIOD:
                                receivedState = INIT_PERIOD_STATE;
                                /* We need to read more from the underlying
                                 stream before we know what, if anything,
                                 to return.
                                 */
                                break;
                            case CR:
                                receivedState = CR_STATE;
                                break getByte;
                            default:
                                //This is the normal start of a new line.
                                receivedState = MID_LINE_STATE;
                                break getByte;
                        }
                        break;
                    case CR_STATE:
                        switch(b){
                            case LF:
                                //We have received CRLF, the normal end of a line
                                receivedState = LINE_STARTING_STATE;
                                break getByte;
                            case CR: 
                                //receivedState remains CR_STATE
                                break getByte;
                            default:
                                receivedState = MID_LINE_STATE;
                                break getByte;
                        }
                    case INIT_PERIOD_STATE:
                        if (b == CR){
                            /* We may be receiving the end of data indicator.
                               We need to read one more character before we know.
                             */
                            receivedState = INIT_PERIOD_CR_STATE;
                            break;
                        }
                        else{
                            /* This is a case of dot stuffing.  We return
                               this character but not the dot which preceded it.
                             */
                            receivedState = MID_LINE_STATE;
                            break getByte;
                        }
                    case INIT_PERIOD_CR_STATE:
                        if (b == LF){ //we have reached the end of data indicator
                            receivedState = EOF_STATE;
                            return EOF;
                        }
                        else{
                            /* We have received the sequence
                               PERIOD CR
                               at the beginning of a line, but it is followed
                               by something other than LF .
                               So this is an unusual case of dot stuffing.
                               We return CR, and buffer b to return on the next
                               call.
                             */
                            receivedState = BUFFERING_STATE;
                            characterBuffer = b;
                            b = CR;
                            break getByte;
                        }
                    default: //should never get here
                        throw new RuntimeException("Program error in " +
                                "SMTPDataInputStream.  Unexpected state.");
                }
            }//while
        }//getByte block
        
        /* If we arrive here we have a byte in b ready to return.  Before
         * returning it we check the status of the quota.
         */
        if (--bytesRemainingInQuota <= 0){
            if (maxMessageSize != 0){//if this is a size-limited stream
                bytesSummedSoFar += currentQuota;
                if (bytesSummedSoFar > remnantQuotaThreshold){
                    if (bytesSummedSoFar > maxMessageSize){
                        throw new MessageSizeException("Maximum message size of "
                             + maxMessageSize + " exceeded.");
                    }else{
                        //next quota determined by maxMessageSize
                        currentQuota =
                            (int)((maxMessageSize - bytesSummedSoFar) + 1);
                    }
                }
            }
            watchDog.reset();
            bytesRemainingInQuota = currentQuota;
        }
        return b;
    }//method read()

    
    public void close() throws IOException{
        super.close();
    }

    /* This class was designed to fit in James as it existed in June 2003.
     * This comment describes my (Richard Hammer's) design decisions 
     * in writing this class.
     * 
     * In James, in SMTPHandler, the InputStream from a Socket is read 
     * alternately by two other classes.  First CRLFTerminatedReader reads 
     * SMTP envelope command lines.  Then, when it comes time to read message
     * data, this class picks up reading from the same InputStream, reading 
     * until the end of mail data is indicated, at which point this class
     * signals EOF.  But then more envelope command lines are read from the
     * underlying InputStream by CRLFTerminatedReader again.
     *  
     * In this environment each of the classes which read from the underlying
     * InputStream must leave it in the right place.  And it makes sense to 
     * have buffering done in the underlying InputStream.  Redundant 
     * buffering does not need to be done in either CRLFTerminatedReader or 
     * this class.  To help establish this practice, the constructors of this
     * class require a BufferedInputStream.
     * 
     * This class overrides InputStream method read(), but not read(byte[]) 
     * or read(byte[], int, int).  Those two InputStream methods work by making
     * repeated calls to read() in this class.  Thereby the bytes returned by
     * those methods deliver the modified functionality promised by this class.
     * 
     * Wanting to be a good boy scout, I undertook to write a 
     * read(byte[], int, int) method for this class.  What I got was large and
     * complex, and in the end I was not convinced that I had improved upon 
     * this class in its present form (in which it overrides only the one
     * read() method).
     * 
     * A difficulty arose in satisfying a part of the InputStream contract for
     * read(byte[], int, int).  That promises that no bytes will be changed in
     * the byte array except those bytes indicated by the number of the return
     * value.  In order to fulfill this promise it looked like I would have to
     * buffer in a byte array in this class, because this class needs to read
     * more bytes than it returns in the cases where it gets a period at the
     * start of a line.  Thus it looked like I would need to duplicate 
     * buffering already done in the supporting BufferedInputStream.  This may
     * not have improved performance, which was the only motive for writing
     * the method.
     *  
     * Early on I considered extending BufferedInputStream instead of 
     * InputStream.  But, in addition to the above-mentioned disadvantage of
     * buffering here rather than in the underlying stream, I discovered that 
     * extending BufferedInputStream would have required that I override
     * read(byte[], int, int).  This is because the BufferedInputStream
     * method read(byte[], int, int) --  unlike the same method in InputStream
     * -- gets its bytes from somewhere deeper.  The BufferedInputStream method
     * does not work by making repeated calls to read() in this class, so it
     * would not provide the additional functionality of this class.  This may
     * be obvious to more seasoned programmers, but it was not obvious to me at
     * the outset.
     */
}

First, in the SMTPHandler declarations, the declaration
        private InputStream in;
should be changed to
        private BufferedInputStream in;



Second, the main change.  First I copy the existing code, then I show the code
that would replace it.

<existing code in SMTPHandler doDATA()>
        responseString = "354 Ok Send data ending with <CRLF>.<CRLF>";
        writeLoggedFlushedResponse(responseString);
        InputStream msgIn = new CharTerminatedInputStream(in, SMTPTerminator);
        try {
            msgIn = new BytesReadResetInputStream(msgIn,
                                                  theWatchdog,
                                                  theConfigData.getResetLength());

            // if the message size limit has been set, we'll
            // wrap msgIn with a SizeLimitedInputStream
            long maxMessageSize = theConfigData.getMaxMessageSize();
            if (maxMessageSize > 0) {
                if (getLogger().isDebugEnabled()) {
                    StringBuffer logBuffer =
                        new StringBuffer(128)
                                .append("Using SizeLimitedInputStream ")
                                .append(" with max message size: ")
                                .append(maxMessageSize);
                    getLogger().debug(logBuffer.toString());
                }
                msgIn = new SizeLimitedInputStream(msgIn, maxMessageSize);
            }
            // Removes the dot stuffing
            msgIn = new DotStuffingInputStream(msgIn);
            // Parse out the message headers
            MailHeaders headers = new MailHeaders(msgIn);
            headers = processMailHeaders(headers);
            processMail(headers, msgIn);
            headers = null;
        } catch (MessagingException me) {
</existing code>


<replacement code in SMTPHandler doDATA()>
        responseString = "354 Ok Send data ending with <CRLF>.<CRLF>";
        writeLoggedFlushedResponse(responseString);
        InputStream msgIn = null;
        try {
            long maxMessageSize = theConfigData.getMaxMessageSize();
            if (maxMessageSize > 0) {
                // if the message size limit has been set
                if (getLogger().isDebugEnabled()) {
                    StringBuffer logBuffer =
                        new StringBuffer(128)
                                .append("Using SizeLimitedInputStream ")
                                .append(" with max message size: ")
                                .append(maxMessageSize);
                    getLogger().debug(logBuffer.toString());
                }
            }
            /* This construction of SMTPInputStream assumes that
                (maxMessageSize == 0) signifies no size limit.
             */
            msgIn = new SMTPDataInputStream(in, maxMessageSize,
                         theWatchdog, theConfigData.getResetLength());
            // Parse out the message headers
            MailHeaders headers = new MailHeaders(msgIn);
            headers = processMailHeaders(headers);
            processMail(headers, msgIn);
            headers = null;
        } catch (MessagingException me) {
</replacement code>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

new InputStream class for mail data

Reply via email to