On 18-05-2011 16:53, Andrei Alexandrescu wrote:
On 5/18/11 6:07 AM, Jonas Drewsen wrote:
Select will wait for data to be ready and ask curl to handle the data
chunk. Curl in turn calls back to a registered callback handler with the
data read. That handler fills the buffer provided by the user. If not
enough data has been received, a new select is performed until the
requested amount of data is read. Then the blocking method can return.
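In rough code, the loop could look like the sketch below, assuming the etc.c.curl bindings. BlockingReader is a made-up name, and the handle setup (writefunction/writedata, curl_multi_add_handle) is omitted; this is an illustration of the select-and-refill idea, not the wrapper's actual code.

import etc.c.curl;
import core.sys.posix.sys.select;
import core.sys.posix.sys.time : timeval;

struct BlockingReader
{
    CURLM* multi;      // multi handle with one easy handle already added
    ubyte[] dest;      // buffer provided by the user
    size_t filled;     // bytes written so far

    // libcurl write callback: copy received bytes into the user's buffer.
    // A real implementation would park any overflow for the next read
    // instead of dropping it as this sketch does.
    extern (C) static size_t onData(char* ptr, size_t size, size_t nmemb,
            void* self)
    {
        auto r = cast(BlockingReader*) self;
        immutable n = size * nmemb;
        immutable room = r.dest.length - r.filled;
        immutable take = n < room ? n : room;
        r.dest[r.filled .. r.filled + take] = cast(ubyte[]) ptr[0 .. take];
        r.filled += take;
        return n; // tell curl everything was consumed
    }

    // Block until dest is full or the transfer ends: select on curl's
    // file descriptors, let curl process whatever became ready, repeat.
    void fill()
    {
        int running = 1;
        while (filled < dest.length && running)
        {
            fd_set rd, wr, ex;
            FD_ZERO(&rd); FD_ZERO(&wr); FD_ZERO(&ex);
            int maxfd = -1;
            curl_multi_fdset(multi, &rd, &wr, &ex, &maxfd);
            auto tv = timeval(1, 0); // fallback timeout of one second
            if (maxfd >= 0)
                select(maxfd + 1, &rd, &wr, &ex, &tv);
            curl_multi_perform(multi, &running);
        }
    }
}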

Perhaps this would be too complicated. In any case the core
functionality must receive top attention. And the core functionality is
streaming.

Currently there are two proposed ways to stream data from an HTTP
address: (a) by using the onReceive callback, and (b) by using
byLine/byChunk. If either of these performs slower than the
best-of-breed streaming using libcurl, we have failed.
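Concretely, the two styles look roughly like this. The Http type, the onReceive property, the byChunk signature, and the std.net.curl module name are assumptions based on this thread and may not match the reviewed wrapper exactly.

import std.net.curl;
import std.stdio;

void main()
{
    // (a) callback style: libcurl pushes each chunk into our delegate,
    // and the two sides block each other around the call.
    auto http = Http("http://dlang.org/");
    http.onReceive = (ubyte[] data) {
        stdout.rawWrite(data);
        return data.length; // report everything as consumed
    };
    http.perform();

    // (b) range style: we pull fixed-size chunks at our own pace.
    foreach (chunk; byChunk("http://dlang.org/", 16 * 1024))
        stdout.rawWrite(chunk);
}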

The onReceive method is not particularly appealing because the client
and libcurl block each other: the client is blocked while libcurl is
waiting for data, and the client blocks libcurl while inside the
callback. (Please correct me if I'm wrong.)

To make byLine/byChunk fast, the basic setup should include a hidden
thread that does the download in separation from the client's thread.
There should be K buffers allocated (K = 2 to e.g. 10), and a simple
protocol for passing the buffers back and forth between the client
thread and the hidden thread. That way, in the quiescent state, there is
no memory allocation and either both client and libcurl are busy doing
work, or one is much slower than the other, which waits.

The same mechanism should be used in byChunkAsync or byFileAsync.
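A minimal sketch of that buffer-recycling protocol over std.concurrency follows. The chunk count and the "download" are stand-ins; a real implementation would fill the buffers from libcurl's write callback. Buffers cross the thread boundary as immutable and are cast back on the other side, which is sound only because ownership is handed over, never shared.

import std.concurrency;

enum K = 4;                   // buffers in flight
enum chunkSize = 64 * 1024;

// Hidden thread: owns a pool of K buffers, fills one at a time, and
// sends it to the client; blocks for a recycled buffer when the pool
// runs dry. In the quiescent state nothing is allocated.
void downloader(Tid client)
{
    ubyte[][] pool;
    foreach (i; 0 .. K)
        pool ~= new ubyte[chunkSize];

    foreach (i; 0 .. 100)     // pretend download of 100 chunks
    {
        if (pool.length == 0) // dry: wait for the client to recycle one
            pool ~= cast(ubyte[]) receiveOnly!(immutable(ubyte)[])();
        auto buf = pool[$ - 1];
        pool = pool[0 .. $ - 1];
        buf[] = cast(ubyte) i;                     // "download" a chunk
        client.send(cast(immutable(ubyte)[]) buf); // hand it over
    }
    client.send(true);        // transfer finished
}

void main()
{
    auto worker = spawn(&downloader, thisTid);
    bool done;
    while (!done)
    {
        receive(
            (immutable(ubyte)[] chunk) {
                // ... process chunk here (byChunk's front/popFront) ...
                worker.send(chunk); // recycle the buffer for reuse
            },
            (bool) { done = true; });
    }
}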

If byChunk is using a hidden thread to download into buffers, then how does it differ from the byChunkAsync that you mention?

The current curl wrapper actually does the hidden-thread trick (based on a hint you gave me a while ago). It does not reuse buffers, because I thought all data had to be immutable or passed by value to go through the message-passing system. I'll fix this, since it is a good place to do some type casting to allow passing the buffers back for reuse.
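In isolation, that cast trick might look like this (the helper names are made up; the soundness hinges entirely on the sender never touching the buffer again after the send):

import std.concurrency;

// std.concurrency only passes immutable, shared, or by-value data, so a
// reusable mutable buffer is laundered through immutable on the way over.
void sendBuffer(Tid to, ubyte[] buf)
{
    to.send(cast(immutable(ubyte)[]) buf); // relinquish ownership
}

ubyte[] receiveBuffer()
{
    return cast(ubyte[]) receiveOnly!(immutable(ubyte)[])(); // reclaim it
}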

I think that we have to consider the context of the streaming before we can tell what the best solution is. I do not have any numbers to back the following up, but this is how I see it:

If the data being read is going to be processed in some way (e.g. compressed), it is most likely a benefit to spawn a thread to handle the data buffering.

If no processing is done (e.g. a simple copy from net to disk), I believe keeping things in the same thread and simply selecting on the file descriptors (disk or net) is fastest. That way no message passing or context switching takes place to cause overhead.

libcurl can give you access to the file descriptors for this exact purpose, but it does have some drawbacks: you are not in control of the buffers used by libcurl. This means that when reading from one curl connection and sending on another, you would have to copy the data. libcurl does in fact provide even simpler methods where you can provide your own buffers for reads/writes. Unfortunately, this is only supported for HTTP, and a lot of the convenience features such as redirections are lost. The more you want to control to get the last drop of performance, the more you have to handle manually yourself.
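For a single transfer, even the plain easy interface keeps everything on one thread; the fd/select route above generalizes this to many concurrent transfers. A sketch of the single-transfer case, assuming the etc.c.curl bindings (error checking omitted):

import etc.c.curl;
import std.stdio;

// curl drives the socket and the write callback drives the disk, so
// there is no second thread, no message passing, no context switch.
extern (C) size_t toFile(char* ptr, size_t size, size_t nmemb, void* userp)
{
    auto file = cast(File*) userp;
    file.rawWrite(ptr[0 .. size * nmemb]);
    return size * nmemb;
}

void main()
{
    auto sink = File("index.html", "wb");
    auto h = curl_easy_init();
    curl_easy_setopt(h, CurlOption.url, "http://dlang.org/".ptr);
    curl_easy_setopt(h, CurlOption.writefunction, &toFile);
    curl_easy_setopt(h, CurlOption.file, cast(void*) &sink); // CURLOPT_WRITEDATA
    curl_easy_perform(h); // blocks in this thread until the transfer is done
    curl_easy_cleanup(h);
}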

In my opinion, providing the performance of the standard libcurl API in the D wrapper is the way to go (as done in the current curl wrapper). Generic and efficient streaming across protocols is best done in std.net, where buffers can be handled entirely in D. I know this is not a small task, which is why I started out by wrapping libcurl.

Thanks
Jonas