Re: Plasma store implementation status across client libraries

Thomas Browne Mon, 04 Jan 2021 11:37:31 -0800

Okay so here is my motivation and why I was/am really excited about the Plasma Store.

I have built a number of big mathematical relative value ("RV") systems in the finance world (mostly fixed income), using lots of financial time series, over the past 10 years (this is me: https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists). Daily data is not a problem. Most TSDBs (Influx, TimeScale, Exasol, even raw Postgres and Redis), coupled with 1 (sometimes 10) Gbit Ethernet, allow for fairly responsive distribution of TS data to users. I was a data infrastructure consultant in 2019/20 at a big hedge fund with 40 different user "pods", all trying to access a centralised data pool. We got it working, but it's hard.

The problem is that RV is becoming increasingly data intensive and latency sensitive, even for "discretionary" managers (i.e. not high frequency trading). It is no longer a single series or a few dozen daily-resolution series. It's hundreds of daily series and, even worse, intraday data, often down to minute or even tick level. To give you an idea, 9 months of EUR-USD exchange rate data from Bloomberg is over 500k data points at minute resolution. Try RV across 20+ liquid FX pairs, and you're potentially accessing 10m data points. All current solutions choke. The only one that doesn't is Redis, but as @Chris Nuernberger points out, Redis is strictly in memory, and for responsiveness at "modern" scale, Ethernet is just too slow even at 10 Gbps. Users each have to run Redis on their own machines, and that brings with it the very hard "cache coherency" problem. Redis seems to be trying to solve this (https://redis.io/topics/client-side-caching), but the memory-constrained data size limit and single-threadedness remain problematic, as does its non-columnar style.
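
As a rough back-of-envelope check on the scale described above, the sketch below turns the quoted figures into bytes and ideal wire times. The 8-byte float64 value size and the zero-overhead transfer model are assumptions for illustration, not figures from this thread; real payloads (timestamps, bid/ask, several fields per series) would be multiples of this.

```python
# Back-of-envelope sizing for the figures quoted in the message.
POINTS_PER_PAIR = 500_000  # from the message: ~9 months of minute data, one pair
PAIRS = 20                 # from the message: 20+ liquid FX pairs
BYTES_PER_VALUE = 8        # assumption: a single float64 value column per series

total_points = POINTS_PER_PAIR * PAIRS
total_bytes = total_points * BYTES_PER_VALUE


def transfer_seconds(n_bytes: int, gbit_per_s: float) -> float:
    """Ideal wire time, ignoring protocol overhead and (de)serialisation."""
    return n_bytes * 8 / (gbit_per_s * 1e9)


print(f"{total_points:,} points, {total_bytes / 1e6:.0f} MB per value column")
for rate in (1, 10):
    print(f"{rate} Gbit/s: {transfer_seconds(total_bytes, rate):.3f} s")
```

Every additional field and every repeated query multiplies these numbers, which is the cost that local shared memory avoids.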

Data locality is key. Coupled with Flight, there is a very big opportunity to build distributed data systems in finance (and other high frequency time series domains) using the Arrow format, which ticks so many boxes, not least the "native" on-the-wire format for all clients and, of course, the column store aspect. Also very attractive is the spillover potential to SSD via mmapping, as this allows for "huge" data sizes. (As an aside, Intel Optane / Micron 3D XPoint is a very interesting technology in this regard because of its order-of-magnitude better seek capability than traditional NAND SSDs. It really does random access much better.)

But the Plasma store would be necessary, because network throughput is a real problem at these data sizes and latency requirements. A combination of Flight, plus Plasma on the local machine, would be a true game changer in this environment. Almost instant, cross-language access to the local store. Buh-bye the (super-expensive) KDB+ which many resort to.

Of course the coherency problem would still exist. As would the need for the Arrow format to accept rapidly updating record batches. But there are so many boxes that are ticked by Apache Arrow that these additions could make for a killer solution.

Unfortunately, though I'm a very accomplished R/Python guy, I don't do C++, or I would offer my support. I am learning Rust though - perhaps that's where I might help?


Hope the above is useful. I strongly believe this space has a massive gap.

Thomas

On 04/01/2021 18:15, Neal Richardson wrote:

I believe Plasma only has Python bindings. FWIW it has not seen active development in quite a while.


Neal

On Mon, Jan 4, 2021 at 8:58 AM Chris Nuernberger wrote:


    Yes that makes sense.  I guess you also need something to broker
    shared memory filenames/ids.  The database isn't in-memory,
    however, although I know what you mean.  One huge advantage of
    mmap is you can have much larger than memory storage act like
    in-memory storage; so the plasma store can be roughly the size of
    your disk and larger than your RAM, but your program, unless it
    attempts to verbatim copy a column, wouldn't know any better.
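
To illustrate the mmap point above, here is a minimal sketch using only Python's standard library rather than Plasma or Arrow. The file name, the raw float64 layout, and the column size are hypothetical stand-ins; a real setup would map an Arrow file instead.

```python
import mmap
import os
import tempfile
from array import array

N = 1_000_000  # hypothetical column length (float64 values)
path = os.path.join(tempfile.mkdtemp(), "column.f64")  # hypothetical file name

# Write the column in chunks, so the writer never holds all of it in memory.
with open(path, "wb") as f:
    for start in range(0, N, 100_000):
        array("d", (float(i) for i in range(start, start + 100_000))).tofile(f)

# Map the file read-only. Nothing is copied up front; the OS pages data in
# on demand, so the mapping can in principle be larger than physical RAM.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    with memoryview(mm) as raw, raw.cast("d") as column:
        n = len(column)              # looks like an ordinary in-memory sequence
        tail = column[-3:].tolist()  # touches only the last page(s) of the file

print(n, tail)
```

The program indexes `column` as if it were in memory; copying the whole column (for example `column.tolist()`) is the case that would actually pull everything into RAM.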

    Numerical larger-than-memory-but-in-memory redis indeed; that is
    an interesting way to think of it.

    On Mon, Jan 4, 2021 at 9:45 AM Thomas Browne wrote:

        Interesting and agreed. I guess this is a big advantage of the
        "on the wire" unserialised format - just read it in and it's
        already native. I'll go this way possibly.

        However I also note the beginnings of more advanced
        functionality in the Plasma store, for example, notification
        API on buffer seal (i.e. when something changes, all clients can
        be notified).

        
        https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html#pyarrow.plasma.PlasmaClient.subscribe

        I'm assuming the plasma store will add functionality over
        time, and if this is the case, having all client libraries
        implement it means I can almost have a redis-like column-store
        specialising in numerical computation (which would be
        awesome), and for which I don't need to write my own
        functionality for each client library.

        A numerical in-memory database, if you will.

        On 04/01/2021 15:55, Chris Nuernberger wrote:

        Julia, Python, and R all have some support for mmap operations.

        On Mon, Jan 4, 2021 at 8:55 AM Chris Nuernberger wrote:

            Could simply saving the arrow file in streaming mode to
            shared memory and then mmap-ing the result in each
            language solve your problem?  Plasma seems to me to be a
            layer on top of basic mmap operations; as long as you
            have shared memory and mmap then you can have multiple
            processes talking to the same logical block of memory.

            On Mon, Jan 4, 2021 at 8:27 AM Thomas Browne wrote:

                I am hoping to use the Apache Arrow project for cross-language
                numerical computation, and for that the shared-memory idea is
                very powerful. Am I correct that the Plasma Store is the
                enabling technology for this, especially for soft real-time
                computation (i.e. not moving to Parquet or any file-based
                sharing system)?

                If so, I'm wondering which client libraries, other than Python
                (and I assume C[++]), implement the Plasma Store. This table
                doesn't feature a row for Plasma:

                https://arrow.apache.org/docs/status.html

                and I can't seem to find any reference to the Plasma store in
                the Julia, R, or JavaScript libraries.

                https://arrow.apache.org/docs/r/

                https://arrow.apache.org/docs/js/

                https://arrow.juliadata.org/stable/


                Thank you,

                Thomas
