Having spent a few solid days looking at FINOS Perspective a while back, I
think it has a lot of potential, and also a few rough edges.  Like JS
Arrow, the documentation is sparse and experimentation is required.  It
does handle Arrow data well, with possibly some caveats.  One is that, if I
recall correctly, it seemed to lose the ability to distinguish between
0 and null in integer columns once the data grew past about 64K rows, even
though we had told Perspective that the column was a nullable integer.
This was based on Arrow files generated via JS Arrow, which may be where
the problem lies; I can't remember, and we may have been doing something
wrong.

Perspective does offer streaming support, and I think the streamed data can
be in Arrow IPC format.  It is also built on WASM and has good capabilities
for parallelism using multiple workers.  And it's quite simple to use.

I think Arrow for in-browser analytics has a lot of promise in terms of
bandwidth, performance, and low memory usage. My team at work has started
working on an analytics project and we've been trying Arrow as a data
format.  For analytics datasets with no high-cardinality dimensions, it's
been really great.  Taking a simple dataset of just over 100K rows that is
32MB+ as a JSON array of objects, we were able to get it down to right
around 2MB via Arrow (~94% smaller), with no decompression and basically no
parsing of the data needed in the browser.  Once in the browser, we wrote
some functions for filtering, group by, simple joins, etc., and the query
performance has been quite good (around 10-50ms per query on average with a
100K+ row dataset for most slice-and-dice operations, a bit worse when
joins are in play, so we try to avoid them).  At this point these are all
just iterations over the entire dataset(s) using a for-of loop and the Row
proxy, since we had some difficulties with the scan API, so there is room
for improvement here.
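
As an illustration only (not the code described above), a single-pass
filter plus group-by over columnar data might look like the sketch below.
It assumes plain typed arrays as stand-ins for Arrow columns rather than
the JS Arrow API, and the column names (region, product, sales) and the
function name are hypothetical.

```typescript
// Plain typed arrays stand in for Arrow columns; this is NOT the JS Arrow API.
// Column names are hypothetical.
interface SalesTable {
  region: Int32Array;   // dimension, stored as integer keys
  product: Int32Array;  // dimension, stored as integer keys
  sales: Float64Array;  // measure
}

// Sum `sales` per `region`, keeping only rows for one product.
// The whole dataset is scanned once; filter and aggregation share the loop.
function sumSalesByRegion(
  table: SalesTable,
  productKey: number
): Map<number, number> {
  const totals = new Map<number, number>();
  for (let row = 0; row < table.sales.length; row++) {
    if (table.product[row] !== productKey) continue; // filter
    const region = table.region[row];
    totals.set(region, (totals.get(region) || 0) + table.sales[row]); // group by
  }
  return totals;
}
```

Reading each column by row index avoids building a per-row object, which
is one plausible place to look for the "room for improvement" over a Row
proxy, though that would need to be measured.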

One key point is that we perform our own dictionary encoding of the data
before generating the Arrow file, so basically all of the dimensional data
in the Arrow file itself consists of just integers that serve as keys into
an array of strings stored outside the Arrow file.  This reduced the size
of the Arrow file by ~50%.  It also sped up the in-browser queries by about
300%.  In a multipart MIME response, we send down the Arrow file along with
a JSON array that serves as the "dictionary."  In the browser, queries are
run by translating strings into the numeric keys contained in the Arrow
file, performing the query, and then, only at the end when the result is
small, "unpacking" the data back into strings using the dictionary.  The
~300% improvement mentioned includes the time for this packing and
unpacking.
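
A minimal sketch of the pack / query / unpack idea, assuming the integer
keys live in a typed array and the dictionary is a plain string array (as
a JSON array would be after parsing).  The helper names encodeColumn,
countMatches, and decodeKeys are hypothetical and not part of JS Arrow.

```typescript
// Hypothetical helpers for illustration; not part of JS Arrow or any library.
interface EncodedColumn {
  keys: Int32Array;      // one integer key per row (what would go in the Arrow file)
  dictionary: string[];  // key -> string, shipped separately (e.g. as a JSON array)
}

// Server-side step: replace each string with the index of its first occurrence.
function encodeColumn(values: string[]): EncodedColumn {
  const lookup = new Map<string, number>();
  const dictionary: string[] = [];
  const keys = new Int32Array(values.length);
  values.forEach((value, row) => {
    let key = lookup.get(value);
    if (key === undefined) {
      key = dictionary.length;
      dictionary.push(value);
      lookup.set(value, key);
    }
    keys[row] = key;
  });
  return { keys, dictionary };
}

// "Pack" the query: translate the search string to its key once,
// then compare integers for every row instead of comparing strings.
function countMatches(column: EncodedColumn, wanted: string): number {
  const key = column.dictionary.indexOf(wanted);
  if (key === -1) return 0; // string never appears in the data
  let count = 0;
  for (let row = 0; row < column.keys.length; row++) {
    if (column.keys[row] === key) count++;
  }
  return count;
}

// "Unpack" a (small) result back into strings at the very end.
function decodeKeys(keys: ArrayLike<number>, dictionary: string[]): string[] {
  return Array.from(keys, (k) => dictionary[k]);
}
```

The string-to-key translation happens once per query rather than once per
row, which is consistent with the packing and unpacking cost staying small
relative to the scan.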

For larger datasets, we've tried processing them using web workers, just
for experimentation purposes.  We tried this with over 1M rows and it
worked nicely, with only slightly noticeable lag for the end user when
running slice-and-dice operations.  For our use case, a few hundred
thousand rows or fewer, the overhead of the web workers hasn't been worth
it, but it would allow parallel processing if needed in the future.

If your dataset has high-cardinality fields, then obviously compression
will suffer greatly, but for our specific use case this approach has
shown a lot of promise.

We haven't looked at streaming yet, but we've anticipated either using
micro batches rather than real-time or handling the streaming data outside
of Arrow since it should incrementally represent smaller amounts of data
(e.g., queries in the browser might query over the Arrow data and
separately the streamed data and aggregate the results, then periodically
maybe put the out-of-band data into Arrow format in the browser).  This
also would lend itself to parallel query processing via web workers. We
haven't looked at Flight as of yet, but it sounds really interesting, and
with WASM too, even better.
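
A rough sketch of the micro-batch idea above, assuming a typed array as a
stand-in for the immutable Arrow data and a plain array for the streamed
rows.  The class name HybridColumn and the maxTail threshold are
hypothetical; this is an illustration, not something we have built.

```typescript
// Hypothetical sketch: immutable "base" data plus a small mutable "tail"
// of streamed rows, folded into the base periodically (micro-batching).
class HybridColumn {
  private base: Float64Array;   // stand-in for data held in Arrow format
  private tail: number[] = [];  // out-of-band rows streamed in since last compaction
  private maxTail: number;

  constructor(initial: Float64Array, maxTail: number) {
    this.base = initial;
    this.maxTail = maxTail;
  }

  // Streamed rows go to the tail; the base is never mutated in place.
  append(value: number): void {
    this.tail.push(value);
    if (this.tail.length >= this.maxTail) this.compact();
  }

  // Periodically rebuild the base with the tail rows included.
  compact(): void {
    const merged = new Float64Array(this.base.length + this.tail.length);
    merged.set(this.base, 0);
    merged.set(this.tail, this.base.length);
    this.base = merged;
    this.tail = [];
  }

  // Query both parts separately and combine the partial results.
  sum(): number {
    let total = 0;
    for (let i = 0; i < this.base.length; i++) total += this.base[i];
    for (let i = 0; i < this.tail.length; i++) total += this.tail[i];
    return total;
  }

  get baseLength(): number { return this.base.length; }
  get tailLength(): number { return this.tail.length; }
}
```

Because the base and tail are queried independently and merged at the end,
the same shape would let each part be handed to a separate web worker.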

~Mike

On Sat, Aug 15, 2020 at 6:01 PM Pierre Belzile <pierre.belz...@gmail.com>
wrote:

> Mark,
>
> Did you take a look at FINOS Perspective? It seems to have some interesting
> overlaps with your goals. I've come across it but have not dug in.
>
> I'd be curious to get your thoughts on it.
>
> Cheers
>
> On Sat., Aug. 15, 2020, 13:05 , <m...@markfarnan.com> wrote:
>
> > David,
> >
> > Still investigating, but I suspect for streaming I may have to fall back
> > to some form of "custom" Flight implementation over Websockets.
> >
> > Assuming Arrow/Flight actually makes sense for that link, which will
> > probably depend on how well it compresses.   However it will be very nice
> > if it does, to allow common format everywhere.
> >
> > The data I need to move around is highly variable in 'type' (arrays of
> > Floats, Ints & Strings with occasional Binary, or vector (array of an
> > array of floats in my case)), and the number of columns and their types
> > vary by dataset and visualization choices.  So far Arrow seems a good
> > choice rather than any 'roll your own', and it will be nice to use the
> > same format on the Client side as well as in the Server system.
> >
> > My use case is primarily 'Get', consuming large datasets for
> > visualization.   I doubt I'll need Put or Exchange from the browser.
> >
> > Mark.
> >
> > -----Original Message-----
> > From: David Li <li.david...@gmail.com>
> > Sent: Saturday, August 15, 2020 5:53 PM
> > To: dev@arrow.apache.org
> > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> >
> > I am curious what you accomplish with Arrow + Flight from the browser.
> > Right now, Flight is all gRPC-based, and browser compatibility is a bit
> > mixed. I expect the various transcoders/gRPC-Web can handle
> > GetFlightInfo/DoGet fine, though IIRC for DoGet, at least some of the
> > transcoders would have to buffer the entire stream before sending it to
> > the browser. DoPut/DoExchange seem harder/impossible to bridge right now
> > due to the bidirectional streaming.
> >
> > Best,
> > David
> >
> > On 8/14/20, m...@markfarnan.com <m...@markfarnan.com> wrote:
> > > Thanks Wes,
> > >
> > > I'll likely work on that once I get my head around Arrow in general
> > > and confirm I will use it for the project.
> > >
> > > How to account for the streaming append problem with an otherwise
> > > immutable dataset is my current concern.  Still thinking through
> > > that.
> > >
> > > Regards
> > >
> > > Mark.
> > >
> > > -----Original Message-----
> > > From: Wes McKinney <wesmck...@gmail.com>
> > > Sent: Wednesday, August 12, 2020 3:59 PM
> > > To: dev <dev@arrow.apache.org>
> > > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> > >
> > > There's a WIP patch for Flight support in Go
> > >
> > > https://github.com/apache/arrow/pull/6731
> > >
> > > I hope to see someone taking up this work as first-class Flight
> > > support in Go would be very useful for building data services.
> > >
> > > On Wed, Aug 12, 2020 at 5:08 AM Adam Lippai <a...@rigo.sk> wrote:
> > >>
> > >> Arrow is mainly about batching data and leveraging all the
> > >> opportunities this gives.
> > >> This means you either have to buffer the data yourself and flush it
> > >> when a reasonably sized batch is complete, or play with preallocating
> > >> Arrow structures.  This was discussed recently; you might be interested
> > >> in the thread:
> > >> https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
> > >>
> > >> Note: I'm not an Arrow developer, I'm just following the "streaming"
> > >> features of the Arrow lib, I'm interested in having a "rolling window"
> > >> API (like a fixed size FIFO queue).
> > >>
> > >> Best regards,
> > >> Adam Lippai
> > >>
> > >> On Wed, Aug 12, 2020 at 11:29 AM <m...@markfarnan.com> wrote:
> > >>
> > >> > I'm looking at using Arrow for a realtime IoT project which
> > >> > includes use cases both on the server and for transferring/using
> > >> > data in a Browser via WASM, and have a few questions.
> > >> >
> > >> >
> > >> >
> > >> > Language in use is Go.
> > >> >
> > >> >
> > >> >
> > >> > Is anyone working on implementing Arrow Flight in Go?  (According to
> > >> > the feature matrix, nothing is ready yet, so I wanted to check.)
> > >> >
> > >> >
> > >> >
> > >> > Has anyone tried using Apache Arrow in Go WASM (WebAssembly)?  If so,
> > >> > any issues?
> > >> >
> > >> >
> > >> >
> > >> > Any pointers/documentation on using/extending Arrow for realtime
> > >> > streaming cases?  (Specifically where a DataFrame is requested, but
> > >> > then it needs to 'grow' as new data arrives, often at high speed.)
> > >> >
> > >> > Not language specific, just trying to understand the right pattern
> > >> > for using Arrow for this, and couldn't find much in the docs.
> > >> >
> > >> >
> > >> >
> > >> > Regards
> > >> >
> > >> >
> > >> >
> > >> > Mark.
> > >> >
> > >> >
> > >
> > >
> >
> >
>
