On Fri, Mar 7, 2008 at 3:41 PM, Jon Blower <[EMAIL PROTECTED]> wrote:
> Hi John (et al),
>
>  Thanks very much to everyone for very helpful responses on this.
>  Perhaps I should go into a bit more detail about our application.  We
>  are writing an application for climate scientists that allows them to
>  run climate simulation codes on remote compute clusters.  The codes
>  produce large amounts of data (100s of gigabytes as a typical example)
>  and we want the client to be able to download the output files from
>  the cluster as the simulation progresses (so that the user can monitor
>  what's going on and also reduce the disk footprint on the remote
>  cluster).  The size of each file is of the order of gigabytes.

Does this imply that EGEE is backing off from OGSA-DAI as the one true
way to distribute physics events round the server farms? There is hope
for UK e-science after all.

-steve

(who spent far too much time in grid standards body meetings)

>
>  A client will typically be downloading tens of output files
>  simultaneously, maybe more.  We do not expect more than a handful of
>  users to be connected to our server at any one time.  Nevertheless we
>  don't want to spawn a new thread for each file that is downloaded (we
>  could end up with hundreds of threads), which is essentially what we
>  are forced to do in our current servlet-based implementation.  Another
>  disadvantage of our current system is that if we exhaust the thread
>  pool, new clients won't get any data at all until a thread is
>  released.  I would rather have every client see a slow trickle than
>  have a single client monopolise the server.
>
>  There will be minimal re-use of files (if all goes well a given file
>  will be downloaded exactly once) so caching won't help unfortunately.
>  We do have control over the clients generally but part of the point of
>  our design is that people can use their browser to download files if
>  they wish so we can't assume that this is always true.
>
>  We can't simply use a straight web server (e.g. Apache) for this
>  because there is some other logic that goes along with the downloading
>  of files.  For example, the files are generally append-only which
>  means that we can start the process of downloading an output file
>  before the file is completely written by the simulation code on the
>  cluster.  The logic on the server side detects when a file is finished
>  and hence we can control when the client sees EOF.  Apart from this
>  there isn't much state associated with the downloading of each file.
>
>  I'm thinking of implementing this by wrapping some simple code around
>  an NIO FileChannel object that defers to this object for most
>  operations but the wrapping code will control the detection of EOF.
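The wrapper Jon describes could look something like the following: a `ReadableByteChannel` that delegates reads to an underlying `FileChannel` but reports end-of-stream only once server-side logic has flagged the file as finished. This is only a sketch; the class and method names are illustrative, not from any existing codebase.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Wraps a FileChannel over a file that is still being appended to.
 * read() returns 0 (not -1) when it catches up with the writer, and
 * only reports end-of-stream once setComplete() has been called.
 */
public class GrowingFileChannel implements ReadableByteChannel {
    private final FileChannel delegate;
    private final AtomicBoolean writerFinished = new AtomicBoolean(false);
    private long position = 0;

    public GrowingFileChannel(FileChannel delegate) {
        this.delegate = delegate;
    }

    /** Called by server-side logic when the simulation closes the file. */
    public void setComplete() {
        writerFinished.set(true);
    }

    @Override
    public int read(ByteBuffer dst) throws IOException {
        int n = delegate.read(dst, position);
        if (n > 0) {
            position += n;
            return n;
        }
        // Caught up with the writer: EOF only if the file is finished.
        return writerFinished.get() ? -1 : 0;
    }

    @Override
    public boolean isOpen() {
        return delegate.isOpen();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

A caller that gets 0 back knows the writer is still running and can retry later without blocking a thread in the meantime.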
>
>  A related question: can I support HTTP range headers in Restlet?  If
>  so then we can support resumable downloads and also HTTP download
>  accelerators that open multiple streams and download different blocks
>  of data (the latter would of course increase the number of clients).
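Whatever the framework provides, parsing the `Range` header itself is straightforward. A minimal sketch, handling single byte-ranges only (the method name is illustrative):

```java
public class RangeHeader {
    /**
     * Parses a single "bytes=start-end" Range header into {start, end}
     * offsets (inclusive), clamped to the entity length. Returns null
     * for absent or unsupported ranges, in which case the server should
     * send the full entity with a 200 rather than a 206.
     */
    public static long[] parseRange(String header, long length) {
        if (header == null || !header.startsWith("bytes=")) return null;
        String spec = header.substring("bytes=".length());
        if (spec.contains(",")) return null;          // multi-range: punt
        int dash = spec.indexOf('-');
        if (dash < 0) return null;
        String from = spec.substring(0, dash);
        String to = spec.substring(dash + 1);
        long start, end;
        if (from.isEmpty()) {                         // suffix range: last N bytes
            long n = Long.parseLong(to);
            start = Math.max(0, length - n);
            end = length - 1;
        } else {
            start = Long.parseLong(from);
            end = to.isEmpty() ? length - 1
                               : Math.min(Long.parseLong(to), length - 1);
        }
        return (start > end) ? null : new long[]{start, end};
    }
}
```

With the offsets in hand, serving the range is a positioned read on the `FileChannel` plus a `Content-Range` response header and a 206 status.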
>
>  Thanks, Jon
>
>
>
>  On Fri, Mar 7, 2008 at 2:07 PM, John D. Mitchell <[EMAIL PROTECTED]> wrote:
>  > On Thu, Mar 6, 2008 at 7:14 AM, Jon Blower <[EMAIL PROTECTED]> wrote:
>  >  [...]
>  >
>  >  >  We have an existing RESTful web application that involves clients
>  >  >  downloading multiple streams of data simultaneously.  Our current
>  >  >  implementation is based on servlets and we are experiencing
>  >  >  scalability problems with the number of threads involved in serving
>  >  >  multiple large data streams simultaneously.  I recently came across
>  >  >  Restlet and was attracted by the potential to use NIO under the hood
>  >  >  to enable more scalable large file transfers.
>  >
>  >  Cool.
>  >
>  >
>  >  >  In our case we are not necessarily serving large files that already
>  >  >  exist on disk: we are essentially creating the files ourselves on the
>  >  >  fly (so they are of unknown length when the file transfer starts).  I
>  >  >  was wondering if anyone could offer advice on how to support the
>  >  >  serving of such data streams through Restlet in a scalable manner
>  >  >  (ideally without creating a new thread on the server for each file
>  >  >  transfer)?
>  >
>  >  What do you mean by "large files"?  I.e., are you talking about generating
>  >  content that is merely large relative to a web page (i.e., measured in
>  >  megabytes) or are you talking about something like complete hi-def
>  >  video (GBs in size) or something both large and nominally endless like
>  >  live video streams?
>  >
>  >  For the first case, if they are small enough I'd start by just fully
>  >  rendering the contents to a Representation as usual and profile how
>  >  well you can use the existing Jetty connector (with tuning, etc.).  As
>  >  you add more simultaneous clients, add more servers.  Also, run your
>  >  experiments with the new Grizzly connector and track that as it and
>  >  v1.1+ stabilize.
>  >
>  >  For the second case (or where you have content sizes in the first case
>  >  but lots of slow clients), I'd actually have that part of my origin
>  >  servers either be fronted by a reverse-caching-proxy (e.g., squid) or
>  >  generate and dump the contents from the origin server into a local
>  >  file and redirect the client to get that content from e.g., lighttpd
>  >  (+mod_secdownload).  Depending on the nature of your client
>  >  applications, the potential reuse of the generated content, etc. you
>  >  can tune how you clean up the caches.
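For the lighttpd route John mentions, the mod_secdownload configuration is only a few lines; the values below are purely illustrative (the secret, paths, and timeout would come from the origin server's setup):

```
server.modules += ( "mod_secdownload" )
secdownload.secret        = "some-shared-secret"
secdownload.document-root = "/var/spool/model-output/"
secdownload.uri-prefix    = "/dl/"
secdownload.timeout       = 3600
```

The origin server computes the tokenised URL from the same secret and redirects the client to it, so lighttpd serves the bytes while the origin server keeps control over who gets what and for how long.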
>  >
>  >  For the last case, if I controlled the clients then I'd probably have
>  >  the clients request good-sized chunks of the data in a loop and
>  >  devolve to the appropriate combination of the first two approaches. Of
>  >  course, that's more or less presuming that you can generate those
>  >  chunks more or less independently (i.e., with minimal state
>  >  information needed to keep the continuity from chunk to chunk).  If
>  >  you have heavy amounts of state and/or if you don't control the
>  >  clients then I'd want to know a good bit more before making any
>  >  recommendation.
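The chunk-requesting loop John sketches can be written so the loop logic is independent of the HTTP plumbing, which also makes it easy to test. Below, `ChunkFetcher` stands in for an HTTP GET carrying a `Range: bytes=start-end` header; all names are illustrative:

```java
import java.io.ByteArrayOutputStream;

/**
 * Sketch of a chunked-download client loop. The fetcher returns the
 * bytes served for the requested range: a full chunk in the middle of
 * the resource, a short (possibly empty) chunk at or past the end.
 */
public class ChunkedDownloader {
    public interface ChunkFetcher {
        byte[] fetch(long start, long endInclusive) throws Exception;
    }

    public static byte[] download(ChunkFetcher fetcher, int chunkSize)
            throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long offset = 0;
        while (true) {
            byte[] chunk = fetcher.fetch(offset, offset + chunkSize - 1);
            out.write(chunk);
            offset += chunk.length;
            if (chunk.length < chunkSize) break;  // short chunk => done
        }
        return out.toByteArray();
    }
}
```

Each iteration is a fresh, stateless request, which is what lets the server stay thread-frugal: no connection has to be held open for the lifetime of a gigabyte transfer.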
>  >
>  >  Hope this helps,
>  >  John
>  >
>
>
>
>
>
> --
>  --------------------------------------------------------------
>  Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
>  Technical Director         Tel: +44 118 378 8741 (ESSC)
>  Reading e-Science Centre   Fax: +44 118 378 6413
>  ESSC                       Email: [EMAIL PROTECTED]
>  University of Reading
>  3 Earley Gate
>  Reading RG6 6AL, UK
>  --------------------------------------------------------------
>
