Re: Arrow streaming computation engine

QP Hou Tue, 11 Jan 2022 17:32:56 -0800

For datafusion (the Rust engine that Weston mentioned), the community
is about to start building a PoC for a streaming engine. The discussion
is happening at
https://github.com/apache/arrow-datafusion/issues/1544.


On Tue, Jan 11, 2022 at 3:29 PM Weston Pace <[email protected]> wrote:
>
> First, note that there are different computation engines in different
> languages.  The Rust implementation has datafusion[1] for example.
> For the rest of this email, I will speak in more detail specifically
> about the C++ computation engine (which I am more familiar with) that
> is in place today.  The C++ engine is documented here[2] although that
> documentation is a little scarce and we are working on an updated
> version[3].
>
> Also note that the docs describe a "Streaming execution engine"
> because it operates on the data in a batch-oriented fashion.  However,
> this doesn't guarantee that it will use a small amount of memory.  For
> example, if you were to request that the engine sort the data then the
> engine may need to cache the entire dataset into memory (in the future
> this may mean spilling into temporary tables as memory runs out) in
> order to fulfill that query (because the very last row you read might
> be the very first row you need to emit).  However, for properly
> constructed queries, the engine should be able to operate as you are
> describing.  The queries you are describing sound to me like what one
> might expect to find in a "time series database" which is another term
> I've heard thrown around.
>
> I am not an expert in time series databases so I don't know the extent
> of the computation required.  However, the example you give (7 day
> rolling mean of daily US stock prices) is not something that could be
> efficiently computed today.  It is something that could be efficiently
> computed once "window functions" are supported.  Window functions[4]
> are a query engine feature that enables the sliding window needed for
> a rolling average.  I believe there are people at Voltron Data that
> are hoping to add support for these window functions to the C++
> streaming execution engine but that is future work that is not
> currently in progress.  That being said, a time series execution
> engine would probably also need to know about indices, statistics,
> whether the data on disk is sorted or not (and by what columns),
> downsampling functions, interpolation functions, etc.  In addition,
> beyond execution / computation there are concerns such as retention
> policies, streaming / appending data to disk, etc.
>
> > So I am wondering if there is
> > a way to design an engine that can satisfy both streaming and batch mode of
> > processing. Or maybe it needs to be separate engines but we can minimize
> > the amount of duplication?
>
> Regardless of the timeline and plans for window functions the answer
> to this specific question is probably "yes" but I'm not enough of an
> expert in time series processing to answer with certainty.  The
> streaming execution engine in Arrow today is quite generic.  A graph
> of "exec nodes" is constructed.  Data is passed through these exec
> nodes starting from one or more sources and then ending at a sink.
> The sources could be live data to satisfy your request for (3).  The
> plan is currently run much like an actor model, where batches are
> pushed from one node to another.  I'm hoping to add more support for
> scheduling and backpressure at some point.  Given what I know of the
> types of queries you are describing I think this model should suffice
> to run those queries efficiently.
>
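The push-based model described above (sources feeding a chain of exec
nodes that ends at a sink) can be sketched in a few lines. This is only
an illustration and not the Arrow C++ API: the names Node, Sink and run
are hypothetical, a batch is stood in for by a plain Python list, and
there is no threading, scheduling or backpressure.

```python
from typing import Callable, Iterable, List, Optional

Batch = List[int]  # stand-in for an Arrow record batch (assumption)


class Node:
    """Hypothetical exec node: transforms a batch and pushes it downstream."""

    def __init__(self, fn: Callable[[Batch], Batch],
                 downstream: Optional["Node"] = None):
        self.fn = fn
        self.downstream = downstream

    def push(self, batch: Batch) -> None:
        out = self.fn(batch)
        if self.downstream is not None:
            self.downstream.push(out)


class Sink(Node):
    """Terminal node that collects every batch that reaches it."""

    def __init__(self) -> None:
        super().__init__(fn=lambda b: b)
        self.received: List[Batch] = []

    def push(self, batch: Batch) -> None:
        self.received.append(batch)


def run(source: Iterable[Batch], first: Node) -> None:
    """The source drives the plan by pushing each batch into the first node."""
    for batch in source:
        first.push(batch)


# Plan: source -> keep even values -> double them -> sink
sink = Sink()
doubled = Node(lambda b: [x * 2 for x in b], downstream=sink)
evens = Node(lambda b: [x for x in b if x % 2 == 0], downstream=doubled)
run([[1, 2, 3], [4, 5, 6]], evens)
```

Because each node only ever holds the batch it is currently handling,
the source here could just as well be a generator over live data.
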
> So, summarizing, I think some of the work we are doing will be useful
> to you (though possibly not sufficient) and it would be a good idea to
> reuse & share where possible.
>
> [1] https://docs.rs/datafusion/latest/datafusion/
> [2] https://arrow.apache.org/docs/cpp/streaming_execution.html
> [3] https://github.com/apache/arrow/pull/12033
> [4] https://medium.com/an-idea/rolling-sum-and-average-window-functions-mysql-7509d1d576e6
>
> On Tue, Jan 11, 2022 at 11:19 AM Li Jin <[email protected]> wrote:
> >
> > Hi,
> >
> > This is a somewhat lengthy email with thoughts on a streaming
> > computation engine for Arrow datasets, on which I would like to hear
> > feedback from Arrow devs.
> >
> > The main use cases that we have in mind for the streaming engine involve
> > time series data, i.e., data that arrives in time order (e.g. daily US stock
> > prices) and queries that often follow the time order of the data (e.g.,
> > compute the 7 day rolling mean of daily US stock prices).
> >
> > The main motivations for a streaming engine are (1) performance: it always
> > keeps a small amount of hot data in memory and cache; (2)
> > memory efficiency: the engine only needs to keep small amounts of data in
> > memory, e.g., for the 7 day rolling mean case, the engine never needs to
> > keep more than 7 days' worth of stock price data, even if it is computing
> > this for a stream of 20 years of data; (3) live data applications: data
> > arrives in real time.
> >
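To make the memory-efficiency point concrete, here is a minimal sketch
of a rolling mean over a time-ordered stream that never holds more than
one window of prices. It is plain Python, not Arrow code. The function
name rolling_mean is hypothetical, and the window is counted in rows,
which assumes exactly one price per day with no gaps.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Optional, Tuple


def rolling_mean(
    prices: Iterable[Tuple[str, float]], window: int = 7
) -> Iterator[Tuple[str, Optional[float]]]:
    """Yield (date, mean of the last `window` prices) for a time-ordered stream.

    Emits None as the mean until a full window has been seen.
    """
    buf: Deque[float] = deque()  # never holds more than `window` prices
    total = 0.0
    for date, price in prices:
        buf.append(price)
        total += price
        if len(buf) > window:
            total -= buf.popleft()  # evict the price that left the window
        yield date, (total / window if len(buf) == window else None)


# Usage: the input can be any iterator, including one over 20 years of data.
stream = ((f"day{i}", float(i)) for i in range(1, 9))
results = list(rolling_mean(stream, window=7))
```

The buffer size depends only on the window length, not on the length of
the stream, which is the property described in (2) above.
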
> > I have talked to Phillip Cloud and am aware of an effort going on to build
> > a computation engine for SQL-like queries (mostly query on the entire
> > dataset) but am unfamiliar with the details. So I am wondering if there is
> > a way to design an engine that can satisfy both streaming and batch mode of
> > > processing. Or maybe it needs to be separate engines but we can minimize
> > the amount of duplication?
> >
> > Looking forward to any thoughts around this.
> >
> > Li
