Re: Caching support for large attachments

Senaka Fernando Sat, 15 Mar 2008 06:45:46 -0700

> Hi Manjula, Thilina and others,
>
> Yep, I think I hold exactly the same view as Thilina when it comes to
> handling attachment data, at least for the chunking part. I think I
> didn't get Thilina right in his first e-mail.
>
> However, a file per MIME part may not always be optimal. I would rather
> give each file a fixed maximum size and, if that is exceeded, split it
> into two. A user should also be given the option to choose between
> Thilina's method and this one through axis2.xml (or services.xml), so
> that memory use can be fine-tuned.
>
> When it comes to base64 encoded binary data, you can use a mechanism where
> the buffer would always have the size which is a multiple of 4, and then
> when you flush you decode it and copy it to the file, so that should
> essentially be the same to a user when it comes to caching.
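
To make the multiple-of-4 idea concrete, here is a minimal sketch of decoding one such buffer before it is flushed. It is only an illustration: b64_value and b64_decode_block are hypothetical names, not Axis2/C functions, and it assumes that line breaks inside the base64 text were already stripped before the buffer was filled.

```c
#include <stddef.h>
#include <string.h>

/* Maps one base64 character to its 6-bit value, or -1 if it is invalid. */
static int b64_value(char c)
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const char *p;

    if (c == '\0')
        return -1;
    p = strchr(table, c);
    return p != NULL ? (int)(p - table) : -1;
}

/*
 * Decodes in[0..in_len) into out, where in_len must be a multiple of 4.
 * out needs room for in_len / 4 * 3 bytes. Returns the number of bytes
 * written, or (size_t)-1 on malformed input. Hypothetical helper.
 */
size_t b64_decode_block(const char *in, size_t in_len, unsigned char *out)
{
    size_t written = 0;
    size_t i;

    if (in_len % 4 != 0)
        return (size_t)-1;
    for (i = 0; i < in_len; i += 4) {
        int v[4];
        int pad = 0;
        int k;

        for (k = 0; k < 4; k++) {
            if (in[i + k] == '=' && k >= 2) {
                v[k] = 0;
                pad++;
            } else {
                if (pad > 0)
                    return (size_t)-1;   /* data after padding */
                v[k] = b64_value(in[i + k]);
                if (v[k] < 0)
                    return (size_t)-1;   /* not a base64 character */
            }
        }
        if (pad > 0 && i + 4 != in_len)
            return (size_t)-1;           /* padding only ends the data */

        out[written++] = (unsigned char)((v[0] << 2) | (v[1] >> 4));
        if (pad < 2)
            out[written++] = (unsigned char)(((v[1] & 0x0F) << 4) | (v[2] >> 2));
        if (pad < 1)
            out[written++] = (unsigned char)(((v[2] & 0x03) << 6) | v[3]);
    }
    return written;
}
```

Because every group of 4 characters decodes independently, the decoded bytes of each flushed buffer can simply be appended to the cache file.
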
>
> OK, so Manjula, you mean when the MIME boundary appears partially in the
> first read and partially in the second?
>
> Well this is probably the best solution.
>
> You will allocate a buffer twice the size of the MIME boundary marker
> and, in your very first read, fill it completely. Then you will search
> the buffer for the MIME boundary. Next you will do a memmove() to move
> the contents from the midpoint to the end of the buffer down to the
> beginning of the buffer. After doing this, you will read an amount
> equal to half the buffer (which again is the size of the MIME boundary
> marker) and store it from the midpoint of the buffer to the end. Then
> you will search again. You will repeat this procedure until a read
> returns less than half the size of the buffer.


If you are interested in this mechanism, I used the same approach for
resending binary data in TCPMon. You may want to check that as well.
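
For illustration, the double-sized buffer procedure quoted above could be sketched roughly as follows. This is not the TCPMon code. The names read_fn, mem_stream, mem_read, find_bytes and find_boundary_offset are made up for the example, and it assumes the read callback only returns a short count at the end of the stream. The search here is length-based (memcmp) so that '\0' bytes in binary data do not matter.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical read callback: copies up to n bytes into dst and returns the
 * count copied. Assumption: a short count means end of stream. */
typedef size_t (*read_fn)(void *ctx, char *dst, size_t n);

/* In-memory stand-in for the transport stream, for illustration only. */
struct mem_stream { const char *data; size_t len; size_t pos; };

size_t mem_read(void *ctx, char *dst, size_t n)
{
    struct mem_stream *s = ctx;
    size_t left = s->len - s->pos;

    if (n > left)
        n = left;
    memcpy(dst, s->data + s->pos, n);
    s->pos += n;
    return n;
}

/* Length-based search, so embedded '\0' bytes are handled like any other. */
static const char *find_bytes(const char *hay, size_t hay_len,
                              const char *needle, size_t needle_len)
{
    size_t i;

    if (needle_len == 0 || hay_len < needle_len)
        return NULL;
    for (i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return hay + i;
    }
    return NULL;
}

/* Returns the stream offset of the first boundary, or -1 if there is none
 * (or if allocation fails). */
long find_boundary_offset(read_fn rd, void *ctx,
                          const char *boundary, size_t blen)
{
    char *buf;
    size_t filled;
    long base = 0;      /* stream offset of buf[0] */
    long result = -1;

    if (blen == 0 || (buf = malloc(2 * blen)) == NULL)
        return -1;

    filled = rd(ctx, buf, 2 * blen);            /* first read: whole buffer */
    for (;;) {
        const char *hit = find_bytes(buf, filled, boundary, blen);

        if (hit != NULL) {
            result = base + (long)(hit - buf);
            break;
        }
        if (filled < 2 * blen)
            break;                              /* short read: stream ended */
        memmove(buf, buf + blen, blen);         /* second half -> first half */
        base += (long)blen;
        filled = blen + rd(ctx, buf + blen, blen);  /* refill second half */
    }
    free(buf);
    return result;
}
```

Since the window always holds two marker lengths and slides by one, a marker that is split across two reads is complete in the window after the slide.
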

Also, strstr() has issues when there is a '\0' in the middle of the
buffer, so you will have to use a temporary search pointer (temp) in the
process. Before calling strstr(), check whether strlen(temp) is greater
than or equal to the length of the MIME boundary marker. If it is greater,
you only need to search once. If it is equal, you will need to search
exactly twice. If it is less, advance temp by strlen(temp) + 1 (to step
over the '\0') and repeat until you cross the midpoint. This makes the
search even more efficient.
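
A rough sketch of that search, under my own assumptions: find_marker is a hypothetical name, the marker itself contains no '\0', and the caller keeps one '\0' sentinel byte right after the data (buf[len] == '\0') so that strlen() and strstr() never run past the buffer. For simplicity it walks the whole buffer rather than stopping at the midpoint, and it steps over each '\0' by advancing strlen(temp) + 1.

```c
#include <stddef.h>
#include <string.h>

/*
 * Searches buf[0..len) for marker even when buf contains '\0' bytes.
 * Precondition (assumption of this sketch): buf[len] == '\0'.
 * Returns a pointer to the first match, or NULL if there is none.
 */
const char *find_marker(const char *buf, size_t len, const char *marker)
{
    size_t mlen = strlen(marker);
    const char *temp = buf;            /* temporary search pointer */
    const char *end = buf + len;

    if (mlen == 0)
        return NULL;
    while (temp < end) {
        size_t seg = strlen(temp);     /* bytes up to the next '\0' */

        if (seg >= mlen) {             /* only then can the marker fit */
            const char *hit = strstr(temp, marker);
            if (hit != NULL)
                return hit;
        }
        temp += seg + 1;               /* skip the segment and its '\0' */
    }
    return NULL;
}
```

Segments shorter than the marker are skipped without calling strstr() at all, which is where the saving comes from.
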

If you want to make the search more efficient still, you can make the
buffer size one less than the size of the MIME boundary marker, so that in
the "equal" case you will have to search only once.

What I rely on here is that strstr() and strlen() behave consistently
within a given implementation. On Windows, if strlen() is multibyte aware,
so is strstr(). So, no worries.

Regards,
Senaka

>
> HTH,
> Regards,
> Senaka
>
>>
>> On Sat, 2008-03-15 at 16:03 +0530, Senaka Fernando wrote:
>>> Hi Manjula,
>>>
>>> Please read my reply inline.
>>>
>>> > Hi Senaka,
>>> >
>>> > I am confused here. I think you are taking the discussion back to
>>> > the beginning, because on the receiving side we read until the end
>>> > of the stream. Please see my first mail.
>>>
>>> No, I'm not taking the discussion back to the starting point. Rather,
>>> I'm proposing an alternative implementation. With what I describe
>>> here, we will still read until the end of the stream, but we will not
>>> buffer everything we read in memory. We will flush the buffer to a
>>> file once it exceeds a threshold. However, when we read beyond the
>>> buffer size, we will not copy the entire content directly to the file
>>> without parsing it. Instead, we will use our fixed-size buffer to
>>> hold the content temporarily, parse it, and then write it to the
>>> file. Thus, the file will contain only the binary part. It will not
>>> contain the "--MIMEBoundary" statements etc. These, along with the
>>> file name(s), can be stored in the parsed attachment object that is
>>> created. Thus, memory consumption will be limited to the size of the
>>> fixed buffer and we will use the file for storage. This mechanism
>>> gives us the added plus of not having to worry about re-parsing what
>>> is written to the file, as it has already been parsed once. Please
>>> note that MIME parsing DOES NOT require us to store the entire
>>> content in memory.
>>
>> To me this is the same as what Thilina is saying. So again I need to
>> ask: what happens when the MIME boundary is divided between two
>> reads?
>>
>>>
>>> >
>>> > When sending, writing part by part to the stream is the same as
>>> > chunking, because when sending you should either specify a
>>> > Content-Length or mark the message as chunked.
>>>
>>> No, it is not the same as chunking. What I meant here is that you need
>>> not read the entire content into memory at once and write it to the
>>> stream in a single step. Rather, we can read part by part, write each
>>> part to the stream, and repeat the process until the whole large file
>>> is written. Here you will still be using the Content-Length.
>>> Chunking is a whole different story, where you transmit data as
>>> blocks. Using chunking we can send data of arbitrary length, where
>>> the length is not pre-calculated. Now you might wonder how we
>>> calculate the Content-Length without reading the entire content into
>>> memory. Well, you can seek through the file and find out the size of
>>> the content to be written. Add to it the lengths of the standard
>>> header block and the MIME boundary demarcation strings and you will
>>> get the Content-Length. This is not at all an expensive operation, as
>>> the file seek will scan the file as a block without reading it into
>>> memory. The OS will manage its efficiency.
>>
>> Here, too, I think it is the same as what Thilina is saying.
>>
>> Thanks,
>> -Manjula.
>>
>>>
>>> >
>>> > -Manjula.
>>>
>>> Regards,
>>> Senaka
>>>
>>> >
>>> > On Sat, 2008-03-15 at 13:39 +0530, Senaka Fernando wrote:
>>> >> >>>  BTW, this whole discussion is about the in path, that is,
>>> >> >>>  reading an incoming message. How about the out path? We have
>>> >> >>>  the same problems when sending attachments. Right now, we read
>>> >> >>>  the whole file into memory and only then send it over the wire.
>>> >> >> hmm... Why not write it in chunks.. Read a chunk from the file,
>>> >> >> then write it to the outstream.. Use the size of the file for the
>>> >> >> content-length calculation in case of non-chunking.. But mostly
>>> >> >> people will use chunking when using MTOM..
>>> >> >
>>> >> > No, chunking is not required. You also don't need to write the
>>> >> > entire data to be sent to the stream at once, because any HTTP
>>> >> > receiver will pull from the stream until it sees a valid ending
>>> >> > character sequence.
>>> >>
>>> >> It should rather read a length equal to the content length. The
>>> >> terminating sequence is for headers. Sorry for the confusion.
>>> >> Therefore, the HTTP receiver will pull from the stream until it has
>>> >> read content-length bytes or until an error occurs.
>>> >>
>>> >> >
>>> >> > I believe that you should be able to write part by part to the
>>> >> > stream and send it, then reuse the buffer, write part 2, send it,
>>> >> > and so on. This argument can be justified, because on the
>>> >> > receiving end we must read the multi-part data until we encounter
>>> >> > the MIME boundary, unlike an ordinary payload, which can be
>>> >> > terminated by a valid terminating character
>>> >>
>>> >> Same here. We'll be reading a length equal to content length.
>>> >>
>>> >> > sequence. We'll only have issues if we are to write large SOAP
>>> >> > payloads, which of course can be dealt with once we've implemented
>>> >> > Session in Axis2/C.
>>> >> >
>>> >> > Regards,
>>> >> > Senaka
>>> >> >
>>> >> >>
>>> >> >> thanks,
>>> >> >> Thilina
>>> >> >>
>>> >> >>
>>> >> >>>
>>> >> >>>  Samisa...
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>  > Regards,
>>> >> >>>  > Senaka
>>> >> >>>  >
>>> >> >>>  >
>>> >> >>>  >> Hi,
>>> >> >>>  >>
>>> >> >>>  >>>  > In Axis2/Java case we do write the attachment content
>>> >> >>>  >>>  > directly from the InputStream to the File when the
>>> >> >>>  >>>  > attachment size is larger than the threshold.  This
>>> >> >>>  >>>  > avoids loading the whole attachment to the memory at all.
>>> >> >>>  >>>
>>> >> >>>  >>>  In this case, to find out the attachment size, don't you
>>> >> >>>  >>>  need to do any MIME parsing? How do you find the attachment
>>> >> >>>  >>>  size without searching for the MIME boundaries?
>>> >> >>>  >>>
>>> >> >>>  >> Yes.. MIME is a boundary based packaging mechanism and you
>>> >> >>>  >> do not need to specify the length for each of the parts...
>>> >> >>>  >> Even the HTTP content length is not there if the message is
>>> >> >>>  >> chunked.
>>> >> >>>  >>
>>> >> >>>  >> What we did in Axis2/Java to overcome this is to read the
>>> >> >>>  >> data into a byte[] buffer of up to a certain size (the size
>>> >> >>>  >> threshold). If there is more data available in the MIME part
>>> >> >>>  >> (if we have not encountered the boundary yet) then we know
>>> >> >>>  >> this attachment is bigger than the threshold. So we create
>>> >> >>>  >> the temp file, pump the content in the buffer to the file,
>>> >> >>>  >> then pump the rest of the stream to the file.. In this way
>>> >> >>>  >> we do not need to know the size of the attachment upfront..
>>> >> >>>  >> BTW we do all of the above while we are parsing the MIME
>>> >> >>>  >> message at the MIME parser level..
>>> >> >>>  >>
>>> >> >>>  >>
>>> >> >>>  >>>  > This has the plus point that the attachment size will be
>>> >> >>>  >>>  > limited only by the available free space in the Temp
>>> >> >>>  >>>  > Directory.. Will that be possible in Axis2/C.. Or is that
>>> >> >>>  >>>  > what you have in mind :)..
>>> >> >>>  >>>
>>> >> >>>  >>>  Yes this is possible.
>>> >> >>>  >>>
>>> >> >>>  >> But in Axis2/Java we will get an OutOfMemory if we parse a
>>> >> >>>  >> large MIME part upfront, since it reads the attachment to
>>> >> >>>  >> memory. Maybe you can have a larger limit with C than in
>>> >> >>>  >> Java, but ultimately you'll come to a situation where you
>>> >> >>>  >> will not have enough memory to store that MIME part in
>>> >> >>>  >> memory at parsing time, unless you write into a File while
>>> >> >>>  >> parsing..
>>> >> >>>  >>
>>> >> >>>  >> thanks,
>>> >> >>>  >> Thilina
>>> >> >>>  >>
>>> >> >>>  >>
>>> >> >>>  >>>
>>> >> >>>  >>>  >
>>> >> >>>  >>>  > thanks,
>>> >> >>>  >>>  > Thilina
>>> >> >>>  >>>  >
>>> >> >>>  >>>  > >  and keeping the file name inside
>>> >> >>>  >>>  > >  data_handler instead of the whole buffer. So the service
>>> >> >>>  >>>  > >  or the client will get the file name instead of the
>>> >> >>>  >>>  > >  buffered stream when it receives an attachment. This
>>> >> >>>  >>>  > >  will not prevent buffering the attachment at the
>>> >> >>>  >>>  > >  transport but will prevent keeping it inside the om_tree
>>> >> >>>  >>>  > >  till it reaches the receiver.
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >  Before implementing this I would like to know your
>>> >> >>>  >>>  > >  suggestions regarding this.
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >  [1] https://issues.apache.org/jira/browse/AXIS2C-672
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >  Thanks,
>>> >> >>>  >>>  > >  -Manjula
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >  --
>>> >> >>>  >>>  > >  Manjula Peiris: http://manjula-peiris.blogspot.com/
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >  ---------------------------------------------------------------------
>>> >> >>>  >>>  > >  To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> >> >>>  >>>  > >  For additional commands, e-mail: [EMAIL PROTECTED]
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  > >
>>> >> >>>  >>>  >
>>> >> >>>  >>>  >
>>> >> >>>  >>>  >
>>> >> >>>  >>>
>>> >> >>>  >>>
>>> >> >>>  >>>  
>>> >> >>>  >>>
>>> >> >>>  >>>
>>> >> >>>  >>>
>>> >> >>>  >>
>>> >> >>>  >> --
>>> >> >>>  >> Thilina Gunarathne - http://thilinag.blogspot.com
>>> >> >>>  >>
>>> >> >>>  >> 
>>> >> >>>  >>
>>> >> >>>  >>
>>> >> >>>  >>
>>> >> >>>  >
>>> >> >>>  >
>>> >> >>>  > 
>>> >> >>>  >
>>> >> >>>  >
>>> >> >>>  >
>>> >> >>>  >
>>> >> >>>
>>> >> >>>
>>> >> >>>  --
>>> >> >>>  Samisa Abeysinghe
>>> >> >>>  Software Architect; WSO2 Inc.
>>> >> >>>
>>> >> >>>  http://www.wso2.com/ - "Oxygenating the Web Service Platform."
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>>  
>>> >> >>>
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Thilina Gunarathne - http://thilinag.blogspot.com
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>

