The F1 FPGAs are connected to the PCIe bus and not directly connected to
the network. I've been trying to figure out how it all works as a system,
but I'm not that far along yet. I do know the boards have a dedicated 64 GB
of RAM for the FPGA and talk to the rest of the system over PCIe, so they
have good bandwidth to the CPU but nothing direct to the network card. I've
been told the bandwidth is very high, but I haven't done any testing on it
myself.

What I'd like is to be able to see e.g. UDP packets directly and avoid TCP
for an FPGA design; less connection state would make things easier. I don't
yet know how that would work on the kernel/userspace side, but a small
sketch of what parsing the UDP frames might look like is below.
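
To make that concrete, here's a minimal userspace sketch in Go (nothing
FPGA-specific; the struct and variable names are mine) that reads
memcached's documented 8-byte UDP frame header (request ID, sequence
number, total datagrams, reserved) before the ASCII payload:

package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// udpFrameHeader mirrors the 8-byte frame header memcached prepends to
// every UDP datagram (all fields are 16-bit, network byte order).
type udpFrameHeader struct {
	RequestID uint16 // opaque ID, echoed back in the response
	SeqNum    uint16 // 0..Total-1 when a response spans datagrams
	Total     uint16 // total datagrams in this message
	Reserved  uint16 // must be zero
}

func main() {
	// Listen on the usual memcached port (adjust to taste).
	conn, err := net.ListenPacket("udp", ":11211")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	buf := make([]byte, 65535)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil || n < 8 {
			continue // too short to carry a frame header
		}
		hdr := udpFrameHeader{
			RequestID: binary.BigEndian.Uint16(buf[0:2]),
			SeqNum:    binary.BigEndian.Uint16(buf[2:4]),
			Total:     binary.BigEndian.Uint16(buf[4:6]),
			Reserved:  binary.BigEndian.Uint16(buf[6:8]),
		}
		// buf[8:n] is the normal ASCII command, e.g. "get foo\r\n".
		fmt.Printf("%v req=%d seq=%d/%d payload=%q\n",
			addr, hdr.RequestID, hdr.SeqNum, hdr.Total, buf[8:n])
	}
}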

Dormando, I'm going to send a separate email about pipelining/batching,
since it's something I've been working on recently[1] and having less data
change hands would be beneficial to us as well. A rough sketch of the
general idea is included below.

[1]: https://github.com/netflix/rend/tree/104-batching
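
For anyone following along, here's a minimal sketch in Go of the kind of
coalescing I mean (not rend's actual code; the host/port and key names are
placeholders): several pending gets collapse into one ASCII multiget, so a
single write and a single response stream replace N round trips.

package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"strconv"
	"strings"
)

// multiGet folds all pending keys into one ASCII "get k1 k2 ..." command,
// so one write and one response stream cover every key.
func multiGet(conn net.Conn, keys []string) (map[string][]byte, error) {
	rw := bufio.NewReadWriter(bufio.NewReader(conn), bufio.NewWriter(conn))

	if _, err := fmt.Fprintf(rw, "get %s\r\n", strings.Join(keys, " ")); err != nil {
		return nil, err
	}
	if err := rw.Flush(); err != nil {
		return nil, err
	}

	out := make(map[string][]byte)
	for {
		line, err := rw.ReadString('\n')
		if err != nil {
			return nil, err
		}
		line = strings.TrimRight(line, "\r\n")
		if line == "END" {
			return out, nil // server has answered everything it had
		}
		// Expect "VALUE <key> <flags> <bytes>" followed by the data block.
		parts := strings.Fields(line)
		if len(parts) < 4 || parts[0] != "VALUE" {
			return nil, fmt.Errorf("unexpected line: %q", line)
		}
		size, err := strconv.Atoi(parts[3])
		if err != nil {
			return nil, err
		}
		data := make([]byte, size+2) // value plus trailing \r\n
		if _, err := io.ReadFull(rw, data); err != nil {
			return nil, err
		}
		out[parts[1]] = data[:size]
	}
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:11211")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	vals, err := multiGet(conn, []string{"foo", "bar", "baz"})
	if err != nil {
		panic(err)
	}
	for k, v := range vals {
		fmt.Printf("%s = %q\n", k, v)
	}
}

The same idea applies on the write side: many "set ... noreply" commands
can be stacked into one flush before expecting anything back.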


Scott Mansfield

Product > Consumer Science Eng > EVCache > Sr. Software Eng
{
  M: 352-514-9452
  E: smansfi...@netflix.com
  K: {M: mobile, E: email, K: key}
}

On Tue, Jan 24, 2017 at 12:12 PM, dormando <dorma...@rydia.net> wrote:

> To be blunt, all of the alternative hardware ideas people have tried for
> memcached have had practicality issues. Either cost, complexity, or
> communication with the main CPU tends to kill them.
>
> It's fun to toy with, and at some point someone will make something
> usable, I hope. The MS change is neat, Solarflare cards have been
> interesting, and Amazon F1 is interesting (but not connected to the
> network so far as I can tell).
>
> Honestly it's probably more practical for building low-power cache
> systems, or low-power medium-usage systems with high-speed interconnects
> to flash storage. You need to implement the TCP stack, the entire daemon,
> etc., away from the x86 machine, which is a lot of work.
>
> Hybrid approaches (like hot key offloading) seem alright, but without
> fast access to main memory any sort of scanning workload won't work. You
> can get close with offloading by having the FPGA handle networking and
> DMA buffers to/from userland network stacks, or data structures for
> attaching item data to a response; i.e., an FPGA would have 32-64G of
> memory directly attached and manage the stack, connection buffers, hash
> table, and object headers (plus small values directly, most likely).
> Larger values could be held in main memory, and data could be
> bulk-requested to manage high rates.
>
> It's either that, or the FPGA gets pools of RAM or flash via
> bankswitching/something to deal with the cache memory directly. This is
> how the exotic tilera/etc machines worked, with lots of NUMA banks (at a
> high level anyway).
>
> Unless something new has happened I don't know about :) PCIe has a lot of
> bandwidth, but transaction delay is still a limiter.
>
> In a practical sense, memcached can saturate 20-40gbps easily if the time
> spent in the kernel is minimized. You can get there quickly by pipelining.
> You can do it today by sticking to multigets, or by stacking sets/gets
> via proxies and using binprot or ASCII noreply.
>
> Some advances should be coming to memcached's frontend to help narrow the
> gap and allow all types of requests to pipeline/batch. Then the FPGAs
> aren't quite as relevant for getting high performance/low latency.
>
> Then you could have, say, a proxy on each client machine that gathers
> requests together, pipelining them to memcached servers (the spymemcached
> client has done this internally forever, but the current implementation
> of binprot generates too many packets at large sizes).
>
> Give me a month or two, maybe? I just merged some fixes for the frontend
> and have been giving it more thought.
>
> On Tue, 24 Jan 2017, 'Scott Mansfield' via memcached wrote:
>
> > A colleague recently forwarded this 2014 paper to me:
> > https://www.cs.princeton.edu/courses/archive/spring16/cos598F/06560058.pdf
> > It's an interesting read. I believe the speedup was based on being able
> > to serve hits for hot keys effectively out of the FPGA which would
> > otherwise forward the request to the main process. This would require
> > your FPGA to be in the hot path from NIC to CPU, though, so that may or
> > may not work for you.
> >
> > IMO this won't work well for small things (e.g. hashing) because the
> > overhead of data transfer alone would be slower than the action
> > performed.
> >
> > Not directly related, but I'd hope you're aware of the in-line network
> > acceleration Microsoft has done in their datacenters. It's some really
> > cool stuff and could enlighten you on techniques to use for an inline
> > accelerator as it relates to parsing network data:
> > https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/
> >
> >
> > Scott Mansfield
> > Product > Consumer Science Eng > EVCache > Sr. Software Eng
> > {
> >   M: 352-514-9452
> >   E: smansfi...@netflix.com
> >   K: {M: mobile, E: email, K: key}
> > }
> >
> > On Tue, Jan 24, 2017 at 8:11 AM, Ravikiran Gummaluri
> > <ravikiran.gummal...@xilinx.com> wrote:
> >       Hi,
> >       We are trying to offload some of the functionality of memcached to
> >       an FPGA to accelerate it. We are exploring possible software
> >       bottlenecks and accelerating them using FPGAs. If anyone has
> >       already done some profiling, we would appreciate help
> >       understanding which functionalities could improve performance. Any
> >       suggestions are welcome.
> >
> >       Thanks & Regards
> >       Ravi G
> >
> >       From: Scott Mansfield [mailto:smansfi...@netflix.com]
> >       Sent: Tuesday, January 24, 2017 7:40 AM
> >       To: memcached <memcached@googlegroups.com>
> >       Cc: Ravikiran Gummaluri <rgum...@xilinx.com>; Venkata Ravi Shankar
> >       Jonnalagadda <vjon...@xilinx.com>; Sunita Jain <suni...@xilinx.com>
> >       Subject: Re: Ordering of commands per connection
> >
> >       I'm actually also very interested to see anything you can share
> >       about your project.
> >
> >       On Monday, January 23, 2017 at 12:50:03 PM UTC-8, Dormando wrote:
> >       Hey,
> >
> >       I've always wanted to try implementing a server with a Xilinx
> >       chip. Seems like you folks would be more qualified to do that :)
> >
> >       The short answer is that the server does guarantee order right
> >       now. The ASCII protocol doesn't work very well if you reorder the
> >       results, but primarily all clients will have been written with
> >       that assumption in mind.
> >
> >       The longer answer is that binary protocol can technically allow
> >       reordering, but it's unclear if any clients support that. Binprot
> >       uses opaques or returns keys to tag requests with responses.
> >
> >       You can still parallelize an ordered ASCII multiget (ie: "get key1
> >       key2 key3") by creating the iovec structures ahead of time, doing
> >       the hashing/lookup in parallel and filling the results before
> >       sending the response.
> >
> >       With binprot each get/response is independently packaged so it's
> >       a bit easier, although the protocol bloat makes it less useful at
> >       high rates.
> >
> >       People have also written papers already on implementing memcached
> >       with FPGAs or highly parallel microprocessors (tilera, MIT's
> >       tilera precursor, etc). Hopefully you're familiar with them before
> >       diving into this.
> >
> >       May I ask if you can share any other details of this project? Is
> >       it a proof of concept or some kind of a product?
> >
> >       have fun,
> >       -Dormando
> >
> >       On Mon, 23 Jan 2017, Ravi Kiran wrote:
> >
> >       > Hi,
> >       > We are planning to use the memcached software and accelerate it
> >       > with hardware offload. We would like to know, from a protocol
> >       > perspective, whether each connection should maintain the order
> >       > in which it receives commands when sending responses back.
> >       > For example: if we receive GET1 GET2 SET1 GET3, do we need to
> >       > send the responses in the same order (GET1 GET2 SET1 GET3)? Can
> >       > we parallelize commands and send them out of order?
> >       >
> >       > Thanks & Regards
> >       > Ravi G
> >       >