On Fri, Mar 7, 2008 at 3:41 PM, Jon Blower <[EMAIL PROTECTED]> wrote:
> Hi John (et al),
>
> Thanks very much to everyone for very helpful responses on this.
> Perhaps I should go into a bit more detail about our application. We
> are writing an application for climate scientists that allows them to
> run climate simulation codes on remote compute clusters. The codes
> produce large amounts of data (100s of gigabytes as a typical example)
> and we want the client to be able to download the output files from
> the cluster as the simulation progresses (so that the user can monitor
> what's going on and also reduce the disk footprint on the remote
> cluster). The size of each file is of the order of gigabytes.

Does this imply that EGEE is backing off from OGSA-DAI as the one true
way to distribute physics events round the server farms? There is hope
for UK e-science after all.

-steve (who spent far too much time in grid standards body meetings)

>
> A client will typically be downloading tens of output files
> simultaneously, maybe more. We do not expect more than a handful of
> users to be connected to our server at any one time. Nevertheless we
> don't want to spawn a new thread for each file that is downloaded (we
> could end up with hundreds of threads), which is essentially what we
> are forced to do in our current servlet-based implementation. Another
> disadvantage of our current system is that if we exhaust the thread
> pool, new clients won't get any data at all until a thread is
> released. I would rather have every client see a slow trickle than
> have a single client monopolise the server.
>
> There will be minimal re-use of files (if all goes well a given file
> will be downloaded exactly once) so caching won't help unfortunately.
> We do have control over the clients generally but part of the point of
> our design is that people can use their browser to download files if
> they wish so we can't assume that this is always true.
>
> We can't simply use a straight web server (e.g. Apache) for this
> because there is some other logic that goes along with the downloading
> of files. For example, the files are generally append-only which
> means that we can start the process of downloading an output file
> before the file is completely written by the simulation code on the
> cluster. The logic on the server side detects when a file is finished
> and hence we can control when the client sees EOF. Apart from this
> there isn't much state associated with the downloading of each file.
>
> I'm thinking of implementing this by wrapping some simple code around
> an NIO FileChannel object that defers to this object for most
> operations but the wrapping code will control the detection of EOF.
>
> A related question: can I support HTTP range headers in Restlet? If
> so then we can support resumable downloads and also HTTP download
> accelerators that open multiple streams and download different blocks
> of data (the latter would of course increase the number of clients).
>
> Thanks, Jon
>
>
>
> On Fri, Mar 7, 2008 at 2:07 PM, John D. Mitchell <[EMAIL PROTECTED]> wrote:
> > On Thu, Mar 6, 2008 at 7:14 AM, Jon Blower <[EMAIL PROTECTED]> wrote:
> > [...]
> >
> > > We have an existing RESTful web application that involves clients
> > > downloading multiple streams of data simultaneously. Our current
> > > implementation is based on servlets and we are experiencing
> > > scalability problems with the number of threads involved in serving
> > > multiple large data streams simultaneously. I recently came across
> > > Restlet and was attracted by the potential to use NIO under the hood
> > > to enable more scalable large file transfers.
> >
> > Cool.
> >
> >
> > > In our case we are not necessarily serving large files that already
> > > exist on disk: we are essentially creating the files ourselves on the
> > > fly (so they are of unknown length when the file transfer starts). I
> > > was wondering if anyone could offer advice on how to support the
> > > serving of such data streams through Restlet in a scalable manner
> > > (ideally without creating a new thread on the server for each file
> > > transfer)?
> >
> > What do you mean by "large files"? I.e., are you talking about generating
> > content that is merely large relative to a web page (i.e., measured in
> > megabytes) or are you talking about something like complete hi-def
> > video (GBs in size) or something both large and nominally endless like
> > live video streams?
> >
> > For the first case, if they are small enough I'd start by just fully
> > rendering the contents to a Representation as usual and profile how
> > well you can use the existing Jetty connector (with tuning, etc.). As
> > you add more simultaneous clients, add more servers. Also, run your
> > experiments with the new Grizzly connector and track that as it and
> > v1.1+ stabilizes.
> >
> > For the second case (or where you have content sizes in the first case
> > but lots of slow clients), I'd actually have that part of my origin
> > servers either be fronted by a reverse-caching-proxy (e.g., squid) or
> > generate and dump the contents from the origin server into a local
> > file and redirect the client to get that content from e.g., lighttpd
> > (+mod_secdownload). Depending on the nature of your client
> > applications, the potential reuse of the generated content, etc. you
> > can tune how you clean up the caches.
> >
> > For the last case, if I controlled the clients then I'd probably have
> > the clients request good-sized chunks of the data in a loop and
> > devolve to the appropriate combination of the first two approaches. Of
> > course, that's more or less presuming that you can generate those
> > chunks more or less independently (i.e., with minimal state
> > information needed to keep the continuity from chunk to chunk). If
> > you have heavy amounts of state and/or if you don't control the
> > clients then I'd want to know a good bit more before making any
> > recommendation.
> >
> > Hope this helps,
> > John
> >
> >
> >
>
> --
> --------------------------------------------------------------
> Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
> Technical Director         Tel: +44 118 378 8741 (ESSC)
> Reading e-Science Centre   Fax: +44 118 378 6413
> ESSC                       Email: [EMAIL PROTECTED]
> University of Reading
> 3 Earley Gate
> Reading RG6 6AL, UK
> --------------------------------------------------------------
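
For reference, the idea in Jon's message of wrapping an NIO FileChannel so
that the wrapper, not the file system, decides when EOF is reported might
be sketched roughly as below. This is only an illustration under stated
assumptions: it uses Java 8+ APIs, and the class name GrowingFileChannel
and the "finished" callback are hypothetical, not part of Restlet or of
the application described in the thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical wrapper around a FileChannel for an append-only file that is
 * still being written. It defers to the FileChannel for reading, but reports
 * end-of-stream only once the caller-supplied check says the file is complete.
 */
final class GrowingFileChannel implements ReadableByteChannel {
    private final FileChannel delegate;
    // Assumed server-side logic that detects when the writer has finished.
    private final BooleanSupplier finished;

    GrowingFileChannel(Path file, BooleanSupplier finished) throws IOException {
        this.delegate = FileChannel.open(file, StandardOpenOption.READ);
        this.finished = finished;
    }

    /**
     * Returns the number of bytes read, 0 if no new data is available yet,
     * or -1 once the file is finished and fully consumed.
     */
    @Override
    public int read(ByteBuffer dst) throws IOException {
        // Sample the flag BEFORE reading: if the writer appends its last
        // bytes and then marks the file finished, we must not report EOF
        // until a later read has picked those bytes up.
        boolean doneBeforeRead = finished.getAsBoolean();
        int n = delegate.read(dst);
        if (n >= 0) {
            return n;
        }
        // The FileChannel hit the current end of the file. That is a real
        // EOF only if the file was already complete before this read.
        return doneBeforeRead ? -1 : 0;
    }

    @Override
    public boolean isOpen() {
        return delegate.isOpen();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

A caller would treat a return value of 0 as "nothing new yet, try again
later" rather than as end of stream, which suits a non-blocking transfer
loop. The sketch assumes the "finished" flag is set only after the writer's
final append has completed.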