Okay so here is my motivation and why I was/am really excited about the
Plasma Store.
I have built a number of big mathematical relative value ("RV") systems
in the finance world (mostly fixed income), using lots of financial time
series, over the past 10 years (this is me:
https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists).
Daily data is not a problem. Most TSDBs (Influx, TimeScale, Exasol, even
raw Postgres and Redis), coupled with 1 Gbit (sometimes 10 Gbit)
Ethernet, allow for fairly responsive distribution of TS data to users. I was a
data infrastructure consultant in 2019/20 at a big hedge fund with 40
different user "pods", all trying to access a centralised data pool. We
got it working, but it's hard.
The problem is that RV is increasingly becoming much more data
intensive and latency sensitive, even for "discretionary" managers (i.e.
not high frequency trading). It is no longer a single series or a few
dozen daily-resolution series. It's hundreds of daily series and, even
worse, intraday data, often down to minute or even tick level. To give
you an idea, 9 months of EUR-USD exchange rate data from Bloomberg is
over 500k data points at minute resolution. Try RV across 20+ liquid FX
pairs, and you're potentially accessing 10m data points. All current
solutions choke. The only one that doesn't is Redis, but as @Chris
Nuernberger points out, Redis is strictly in memory, and for
responsiveness at "modern" scale, Ethernet is just too slow even at
10 Gbps. Each user has to run Redis on their own machine, and that
brings with it the very hard "cache coherency" problem. Redis seems to
be trying to solve this (https://redis.io/topics/client-side-caching),
but the memory-constrained data size limit and single-threadedness
remain problematic, as does its non-columnar style.
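As a rough illustration of the sizes quoted above, here is a
back-of-envelope sketch in Python. The point counts come from the
paragraph above; the 8 bytes per point (one float64 value, no
timestamps or extra fields) and the ideal, overhead-free link rates are
assumptions, and `transfer_seconds` is a hypothetical helper name.

```python
# Back-of-envelope wire times for the data sizes mentioned above.
POINTS_PER_PAIR = 500_000   # from the text: ~9 months of minute data, one pair
PAIRS = 20                  # from the text: 20+ liquid FX pairs
BYTES_PER_POINT = 8         # assumption: a single float64 per point


def transfer_seconds(n_bytes: int, gbit_per_s: float) -> float:
    """Ideal transfer time: payload bits divided by the link rate in bits/s."""
    return n_bytes * 8 / (gbit_per_s * 1e9)


total_points = POINTS_PER_PAIR * PAIRS
total_bytes = total_points * BYTES_PER_POINT

if __name__ == "__main__":
    print(f"{total_points:,} points, {total_bytes / 1e6:.0f} MB")
    for rate in (1, 10):
        ms = transfer_seconds(total_bytes, rate) * 1e3
        print(f"{rate:>2} Gbit/s: {ms:.0f} ms")
```

Under these assumptions a full pull is 80 MB, or about 640 ms at
1 Gbit/s and 64 ms at 10 Gbit/s, per user and per refresh, before any
serialisation, protocol overhead, or contention between users is
counted.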
Data locality is key. Coupled with Flight, there is a very big
opportunity to build distributed data systems in finance (and other
high frequency time series domains) using the Arrow format, which ticks
so many boxes, not least the "native" on-the-wire format for all clients
and, of course, the column store aspect. Very attractive too is the
spillover potential to SSD via mmapping, as this allows for "huge" data
sizes. (As an aside, Intel Optane / Micron 3D XPoint is a very
interesting technology in this regard because of its order-of-magnitude
better seek capability than traditional NAND SSDs. It really does random
access much better.)
But the Plasma store would be necessary, because network throughput
is a real problem at these data sizes and latency requirements. A
combination of Flight, plus Plasma on the local machine, would be a true
game changer in this environment. Almost instant, cross-language access
to the local store. Buh-bye the (super-expensive) KDB+ which many resort
to.
Of course the coherency problem would still exist. As would the need for
the Arrow format to accept rapidly updating record batches. But there
are so many boxes that are ticked by Apache Arrow that these additions
could make for a killer solution.
Unfortunately, though I'm a very accomplished R/Python guy, I don't do
C++, or I would offer my support. I am learning Rust though - perhaps
that's where I might help?
Hope the above is useful. I strongly believe this space has a massive gap.
Thomas
On 04/01/2021 18:15, Neal Richardson wrote:
I believe Plasma only has Python bindings. FWIW it has not seen active
development in quite a while.
Neal
On Mon, Jan 4, 2021 at 8:58 AM Chris Nuernberger wrote:
Yes that makes sense. I guess you also need something to broker
shared memory filenames/ids. The database isn't in-memory,
however, although I know what you mean. One huge advantage of
mmap is that you can have much-larger-than-memory storage act like
in-memory storage; so the plasma store can be roughly the size of
your disk, and larger than your RAM, but your program, unless it
attempts to verbatim copy a column, wouldn't know any better.
Numerical larger-than-memory-but-in-memory Redis indeed; that is
an interesting way to think of it.
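The on-demand behaviour of mmap described in this message can be
sketched with the Python standard library alone. This is only an
illustration: the 256 MiB file is a small stand-in so that it runs
anywhere (the same calls apply to a file larger than RAM), and
`FILE_SIZE`, `OFFSET`, `PAYLOAD` and `touch_one_region` are hypothetical
names.

```python
import mmap
import tempfile

FILE_SIZE = 256 * 1024 * 1024   # stand-in size; a real store could exceed RAM
OFFSET = 200 * 1024 * 1024      # a location deep inside the file
PAYLOAD = b"tick"


def touch_one_region() -> tuple:
    """Map a large file and read a few bytes from the middle of it."""
    with tempfile.TemporaryFile() as f:
        f.truncate(FILE_SIZE)    # sets the size; sparse on most filesystems
        f.seek(OFFSET)
        f.write(PAYLOAD)
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The whole file is addressable, but the OS only pages in the
            # region that is actually touched by this slice.
            return len(mm), mm[OFFSET:OFFSET + len(PAYLOAD)]


if __name__ == "__main__":
    mapped_size, data = touch_one_region()
    print(mapped_size, data)
```

The program addresses the mapping as if it were all in memory, and the
operating system decides which pages are resident.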
On Mon, Jan 4, 2021 at 9:45 AM Thomas Browne wrote:
Interesting and agreed. I guess this is a big advantage of the
"on the wire" unserialised format - just read it in and it's
already native. I'll possibly go this way.
However I also note the beginnings of more advanced
functionality in the Plasma store, for example a notification
API on buffer seal (i.e. when a buffer is sealed, all subscribed
clients can be notified).
https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html#pyarrow.plasma.PlasmaClient.subscribe
I'm assuming the Plasma store will add functionality over
time, and if this is the case, having all client libraries
implement it means I can almost have a Redis-like column store
specialising in numerical computation (which would be
awesome), and for which I don't need to write my own
functionality for each client library.
A numerical in-memory database, if you will.
On 04/01/2021 15:55, Chris Nuernberger wrote:
Julia, Python, and R all have some support for mmap operations.
On Mon, Jan 4, 2021 at 8:55 AM Chris Nuernberger wrote:
Could simply saving the Arrow file in streaming mode to
shared memory and then mmap-ing the result in each
language solve your problem? Plasma seems to me to be a
layer on top of basic mmap operations; as long as you
have shared memory and mmap then you can have multiple
processes talking to the same logical block of memory.
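A minimal sketch of the shared-memory-plus-mmap idea in this message,
using only the Python standard library. A raw float64 column stands in
for an Arrow stream (this is not the Arrow IPC format), the reader runs
in a separate interpreter process to mimic a second client, and
`shared_dir`, `write_column`, `read_in_other_process` and `demo` are
hypothetical helper names. It assumes /dev/shm is available for true
shared memory and falls back to the ordinary temp directory otherwise.

```python
import os
import struct
import subprocess
import sys
import tempfile

# Reader executed in a separate process: it maps the file and sums the
# float64 column through a memoryview, without copying the column first.
READER = """
import mmap, sys
with open(sys.argv[1], "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as raw, raw.cast("d") as col:
            print(len(col), sum(col))
"""


def shared_dir() -> str:
    """Prefer tmpfs-backed /dev/shm (Linux); otherwise use the temp dir."""
    shm = "/dev/shm"
    if os.path.isdir(shm) and os.access(shm, os.W_OK):
        return shm
    return tempfile.gettempdir()


def write_column(path: str, values: list) -> None:
    """Write the values as contiguous native-endian float64."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}d", *values))


def read_in_other_process(path: str) -> tuple:
    """Run READER in a child interpreter and parse its output."""
    result = subprocess.run([sys.executable, "-c", READER, path],
                            capture_output=True, text=True, check=True)
    count, total = result.stdout.split()
    return int(count), float(total)


def demo() -> tuple:
    fd, path = tempfile.mkstemp(suffix=".f64", dir=shared_dir())
    os.close(fd)
    try:
        write_column(path, [1.0, 2.5, 4.0])
        return read_in_other_process(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The only thing the two processes exchange is the file path, which is the
piece a broker for shared memory filenames/ids would have to manage.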
On Mon, Jan 4, 2021 at 8:27 AM Thomas Browne wrote:
I am hoping to use the Apache Arrow project for cross-language
numerical computation, and for that the shared-memory idea is
very powerful. Am I correct that the Plasma Store is the
enabling technology for this, especially for soft real-time
computation (ie not moving to parquet or any file-based
sharing system)?
Is that the case? And if so, then I'm wondering which client
libraries, other than Python (and I assume C[++]), implement
the Plasma Store. This table doesn't feature a row for Plasma:
https://arrow.apache.org/docs/status.html
and I can't seem to find any reference to the Plasma store in
the Julia, R, or Javascript libraries.
https://arrow.apache.org/docs/r/
https://arrow.apache.org/docs/js/
https://arrow.juliadata.org/stable/
Thank you,
Thomas