Okay so here is my motivation and why I was/am really excited about the Plasma Store.

I have built a number of big mathematical relative value ("RV") systems in the finance world (mostly fixed income), using lots of financial time series, over the past 10 years (this is me: https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists). Daily data is not a problem. Most TSDBs (Influx, TimeScale, Exasol, even raw Postgres and Redis), coupled with 1 (sometimes 10) gbit ethernet, allow for fairly responsive distribution of TS data to users. I was a data infrastructure consultant in 2019/20 at a big hedge fund with 40 different user "pods", all trying to access a centralised data pool. We got it working, but it's hard.

The problem is that RV is becoming much more data intensive and latency sensitive, even for "discretionary" managers (ie not high frequency trading). It is no longer a single series or a few dozen daily-resolution series. It's hundreds of daily series and, even worse, intraday data, often down to minute or even tick level. To give you an idea, 9 months of EUR-USD exchange rate data from Bloomberg is over 500k data points at minute resolution. Try RV across 20+ liquid FX pairs, and you're potentially accessing 10m data points. All current solutions choke. The only one that doesn't is Redis, but as @Chris Nuernberger points out, Redis is strictly in memory, and for responsiveness at "modern" scale, Ethernet is just too slow even at 10gbps. Users each have to run Redis on their own machines, and that brings with it the very hard "cache coherency" problem. Redis seems to be trying to solve this (https://redis.io/topics/client-side-caching), but its memory-constrained data size limit and single-threadedness remain problematic, as does its non-columnar style.
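To put rough numbers on the network point, here is a back-of-the-envelope sketch in Python. The 16 bytes per data point (an 8-byte timestamp plus an 8-byte float64 value) is my assumption, and the link is treated as running at its full nominal rate with no protocol overhead, no serialisation cost and no other users, so these are best-case lower bounds rather than measurements.

```python
def transfer_seconds(n_points: int, bytes_per_point: int, link_gbps: float) -> float:
    """Best-case time to move n_points over a link running at its nominal rate."""
    payload_bits = n_points * bytes_per_point * 8
    return payload_bits / (link_gbps * 1e9)


# 20 FX pairs x ~500k minute bars each, the figures quoted above.
N_POINTS = 20 * 500_000

# Assumption: 8-byte timestamp + 8-byte float64 value per data point.
BYTES_PER_POINT = 16

for gbps in (1, 10):
    secs = transfer_seconds(N_POINTS, BYTES_PER_POINT, gbps)
    print(f"{gbps:>2} Gbit/s: {secs:.3f} s per full pull, per user")
```

That is for a single user doing one full pull on an otherwise idle link. Contention between many users, repeated pulls and (de)serialisation all add to it.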

Data locality is key. Coupled with Flight, there is a very big opportunity to build distributed data systems in finance (and other high frequency time series domains) using the Arrow format, which ticks so many boxes, not least the "native" on-the-wire format for all clients and, of course, the column store aspect. Very attractive too is the spillover potential to SSD via mmapping, as this allows for "huge" data sizes. (As an aside, Intel Optane / Micron 3D XPoint is a very interesting technology in this regard because of its order-of-magnitude better seek capability than traditional NAND SSDs. It really does random access much better.)

But the Plasma store would be necessary, because network throughput is a real problem at these data sizes and latency requirements. A combination of Flight, plus Plasma on the local machine, would be a true game changer in this environment. Almost instant, cross-language access to the local store. Buh-bye the (super-expensive) KDB+ which many resort to.

Of course the coherency problem would still exist. As would the need for the Arrow format to accept rapidly updating record batches. But there are so many boxes that are ticked by Apache Arrow that these additions could make for a killer solution.

Unfortunately, though I'm a very accomplished R/Python guy, I don't do C++, or I would offer my support. I am learning Rust though - perhaps that's where I might help?

Hope the above is useful. I strongly believe this space has a massive gap.

Thomas

On 04/01/2021 18:15, Neal Richardson wrote:
I believe Plasma only has Python bindings. FWIW it has not seen active development in quite a while.

Neal

On Mon, Jan 4, 2021 at 8:58 AM Chris Nuernberger <[email protected] <mailto:[email protected]>> wrote:

    Yes that makes sense.  I guess you also need something to broker
    shared memory filenames/ids.  The database isn't in-memory,
    however, although I know what you mean.  One huge advantage of
    mmap is you can have much larger than memory storage act like
    in-memory storage; so the plasma store can be roughly the size of
    your disk and larger than your RAM, but your program, unless it
    attempts to verbatim copy a column, wouldn't know any better.

    Numerical larger-than-memory-but-in-memory redis indeed; that is
    an interesting way to think of it.

    On Mon, Jan 4, 2021 at 9:45 AM Thomas Browne <[email protected]
    <mailto:[email protected]>> wrote:

        Interesting and agreed. I guess this is a big advantage of the
        "on the wire" unserialised format - just read it in and it's
        already native. I'll go this way possibly.

        However I also note the beginnings of more advanced
        functionality in the Plasma store, for example, notification
        API on buffer seal (ie when something changes, all clients can
        be notified).

        
        https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html#pyarrow.plasma.PlasmaClient.subscribe

        I'm assuming the plasma store will add functionality over
        time, and if this is the case, having all client libraries
        implement it means I can almost have a redis-like column-store
        specialising in numerical computation (which would be
        awesome), and for which i don't need to write my own
        functionality for each client library.

        A numerical in-memory database, if you will.

        On 04/01/2021 15:55, Chris Nuernberger wrote:
        Julia, Python, and R all have some support for mmap operations.

        On Mon, Jan 4, 2021 at 8:55 AM Chris Nuernberger
        <[email protected] <mailto:[email protected]>> wrote:

            Could simply saving the arrow file in streaming mode to
            shared memory and then mmap-ing the result in each
            language solve your problem ?  Plasma seems to me to be a
            layer on top of basic mmap operations; as long as you
            have shared memory and mmap then you can have multiple
            processes talking to the same logical block of memory.

            On Mon, Jan 4, 2021 at 8:27 AM Thomas Browne
            <[email protected] <mailto:[email protected]>> wrote:

                I am hoping to use the Apache Arrow project for
                cross-language numerical
                computation, and for that the shared-memory idea is
                very powerful. Am I
                correct that the Plasma Store is the enabling
                technology for this,
                especially for soft real-time computation (ie not
                moving to parquet or
                any file-based sharing system)?

                Is that the case? And if so, then I'm wondering which
                client libraries,
                other than Python (and I assume C[++]), implement the
                Plasma Store. This
                table doesn't feature a row for Plasma:

                https://arrow.apache.org/docs/status.html

                and I can't seem to find any reference to the Plasma
                store in the Julia,
                R, or Javascript libraries.

                https://arrow.apache.org/docs/r/

                https://arrow.apache.org/docs/js/

                https://arrow.juliadata.org/stable/


                Thank you,

                Thomas

