Sorry to hijack the thread.

> One key point is that we perform our own dictionary encoding of the data
> before generating the Arrow file, so basically all of the dimensional data
> in the Arrow file itself consists of just numbers (integers) that represent
> keys into an array of strings stored outside the Arrow file.


Just curious: is there a reason you didn't use the built-in dictionary
support in the Arrow format?

-Micah

On Mon, Aug 17, 2020 at 9:02 PM Michael Stephenson <domehead...@gmail.com>
wrote:

> Having spent a few solid days looking at FINOS Perspective a while back, I
> think it has a lot of potential, and also a few rough edges.  Like JS
> Arrow, the documentation is sparse and experimentation is required.  It
> does handle Arrow data well, with possibly some caveats.  One, if I recall
> correctly, is that it seemed to lose the ability to distinguish between 0
> and null for columns we had told Perspective were nullable integers once
> the dataset grew past about 64K rows.  This was based on Arrow files
> generated via JS Arrow, which may be where the problem lies (I can't
> remember), or maybe we were doing something wrong.
>
> Perspective does offer streaming support, and I think the streamed data can
> be in Arrow IPC format.  It is also WASM-based and has good support for
> parallelism using multiple workers.  And it's quite simple to use.
>
> I think Arrow for in-browser analytics has a lot of promise in terms of
> bandwidth, performance, and low memory usage. My team at work has started
> working on an analytics project and we've been trying Arrow as a data
> format.  For analytics datasets with no high-cardinality dimensions, it's
> been really great.  Taking a simple dataset of just over 100K rows that is
> 32MB+ in array-of-objects format in JSON, we were able to get it to right
> at 2MB via Arrow (a ~94% size reduction), with no decompression and
> basically no parsing of the data needed in the browser.  Once in the
> browser, we wrote
> some functions for filtering, group by, simple joins, etc., and the query
> performance has been quite good (around 10-50ms per query on average with a
> 100K+ row dataset for most slice and dice types of operations, a bit worse
> when joins are in play so we try to avoid them).  At this point these are
> all just iterations over the entire dataset(s) using a for-of loop and the
> Row proxy, since we had some difficulties with the scan API, so there is
> room for improvement here.
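A rough sketch of the single-pass, for-of style of query described above (this is illustrative only, not the team's actual code, and the `region`/`sales` column names are invented):

```typescript
// Hypothetical row shape; `region` and `sales` are invented column names,
// standing in for Arrow's Row proxy objects.
interface Row { region: string; sales: number }

// One full pass over the rows with for-of, accumulating a per-group sum,
// the way a group-by over Arrow Row proxies could be done.
function groupBySum(rows: Iterable<Row>): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.region, (totals.get(row.region) ?? 0) + row.sales);
  }
  return totals;
}
```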
>
> One key point is that we perform our own dictionary encoding of the data
> before generating the Arrow file, so basically all of the dimensional data
> in the Arrow file itself consists of just numbers (integers) that represent
> keys into an array of strings stored outside the Arrow file.  This improved
> the size of the Arrow file by ~50%.  It also speeds up the in-browser
> queries over the data by about 300%.  In a multipart MIME
> response, we send down the Arrow file along with a JSON array that serves
> as the "dictionary."  In the browser, queries are run by transforming
> strings into the numeric keys contained in the Arrow file, performing the
> query, and then only at the end when the result is small is the data
> "unpacked" back into strings using the dictionary.  The ~300% improvement
> mentioned includes the time for this packing and unpacking.
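A minimal sketch of this kind of external dictionary encoding (all names here are invented for illustration; this is not the actual implementation):

```typescript
// Map each distinct string in a column to a small integer key; the integer
// column goes into the Arrow file, and the dictionary array is shipped
// separately (e.g., as JSON in a multipart response).
function encodeColumn(values: string[]): { keys: Int32Array; dictionary: string[] } {
  const index = new Map<string, number>();
  const keys = new Int32Array(values.length);
  const dictionary: string[] = [];
  values.forEach((v, i) => {
    let k = index.get(v);
    if (k === undefined) {
      k = dictionary.length;
      index.set(v, k);
      dictionary.push(v);
    }
    keys[i] = k;
  });
  return { keys, dictionary };
}

// Querying: translate the filter string to its key once, compare integers
// in the hot loop, and only "unpack" back to strings at the end.
function filterEquals(keys: Int32Array, dictionary: string[], value: string): number[] {
  const k = dictionary.indexOf(value);
  if (k < 0) return []; // the value never occurs in the column
  const hits: number[] = [];
  for (let i = 0; i < keys.length; i++) {
    if (keys[i] === k) hits.push(i);
  }
  return hits;
}
```

Comparing integers rather than strings in the inner loop is presumably where much of the reported speedup comes from.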
>
> For larger datasets, we've tried processing them using web workers just for
> experimentation purposes. We tried this with over 1M rows and it worked
> nicely, with only a slightly noticeable lag for the end user when running
> slice-and-dice operations.  For our use case, a few hundred thousand rows
> or less, the
> overhead of the web workers hasn't been worth it, but it would allow
> parallel processing if needed in the future.
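The worker split could look roughly like the following (a sketch only: the postMessage plumbing is omitted, and just the partition/merge logic that each side would run is shown, with invented names):

```typescript
// Split the rows into roughly equal slices, one per worker.
function partition<T>(rows: T[], parts: number): T[][] {
  const size = Math.ceil(rows.length / parts);
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}

// Each worker would compute a partial aggregate over its slice; the main
// thread merges the partial results into one final answer.
function mergeCounts(partials: Map<string, number>[]): Map<string, number> {
  const merged = new Map<string, number>();
  for (const p of partials) {
    for (const [k, v] of p) merged.set(k, (merged.get(k) ?? 0) + v);
  }
  return merged;
}
```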
>
> If your dataset has high-cardinality fields, then obviously compression
> will suffer greatly, etc., but for our specific use case this approach has
> shown a lot of promise.
>
> We haven't looked at streaming yet, but we've anticipated either using
> micro batches rather than real-time, or handling the streaming data
> outside of Arrow, since it should represent incrementally smaller amounts
> of data (e.g., queries in the browser might run over the Arrow data and,
> separately, over the streamed data, aggregate the results, and then
> periodically fold the out-of-band data into Arrow format in the browser).
> This also would lend itself to parallel query processing via web workers.
> We haven't looked at Flight as of yet, but it sounds really interesting,
> and with WASM too, even better.
>
> ~Mike
>
> On Sat, Aug 15, 2020 at 6:01 PM Pierre Belzile <pierre.belz...@gmail.com>
> wrote:
>
> > Mark,
> >
> > Did you take a look at FINOS Perspective? It seems to have some
> > interesting overlaps with your goals. I've come across it but have not
> > dug in.
> >
> > I'd be curious to get your thoughts on it.
> >
> > Cheers
> >
> > On Sat., Aug. 15, 2020, 13:05 , <m...@markfarnan.com> wrote:
> >
> > > David,
> > >
> > > Still investigating, but I suspect for streaming I may have to fall
> > > back to some form of "custom" Flight implementation over WebSockets.
> > >
> > > Assuming Arrow/Flight actually makes sense for that link, which will
> > > probably depend on how well it compresses.  However, it will be very
> > > nice if it does, to allow a common format everywhere.
> > >
> > > The data I need to move around is highly variable in type (arrays of
> > > floats, ints, and strings, with occasional binary, or vectors, i.e.,
> > > an array of arrays of floats in my case), and the number of columns
> > > and their types vary by dataset and visualization choices.  So far
> > > Arrow seems a good choice rather than any 'roll your own', and it
> > > will be nice to use the same format on the client side as well as in
> > > the server system.
> > >
> > > My use case is primarily 'Get', consuming large datasets for
> > > visualization.   I doubt I'll need Put or Exchange from the browser.
> > >
> > > Mark.
> > >
> > > -----Original Message-----
> > > From: David Li <li.david...@gmail.com>
> > > Sent: Saturday, August 15, 2020 5:53 PM
> > > To: dev@arrow.apache.org
> > > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> > >
> > > I am curious what you aim to accomplish with Arrow + Flight from the
> > > browser.  Right now, Flight is all gRPC-based, and browser
> > > compatibility is a bit mixed.  I expect the various
> > > transcoders/gRPC-Web can handle GetFlightInfo/DoGet fine, though IIRC
> > > for DoGet, at least some of the transcoders would have to buffer the
> > > entire stream before sending it to the browser.  DoPut/DoExchange
> > > seem harder/impossible to bridge right now due to the bidirectional
> > > streaming.
> > >
> > > Best,
> > > David
> > >
> > > On 8/14/20, m...@markfarnan.com <m...@markfarnan.com> wrote:
> > > > Thanks Wes,
> > > >
> > > > I'll likely work on that once I get my head around Arrow in general
> > > > and confirm we will use it for the project.
> > > >
> > > > How to account for the streaming-append problem in an otherwise
> > > > immutable dataset is my current concern.  Still thinking through
> > > > that.
> > > >
> > > > Regards
> > > >
> > > > Mark.
> > > >
> > > > -----Original Message-----
> > > > From: Wes McKinney <wesmck...@gmail.com>
> > > > Sent: Wednesday, August 12, 2020 3:59 PM
> > > > To: dev <dev@arrow.apache.org>
> > > > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> > > >
> > > > There's a WIP patch for Flight support in Go
> > > >
> > > > https://github.com/apache/arrow/pull/6731
> > > >
> > > > I hope to see someone taking up this work as first-class Flight
> > > > support in Go would be very useful for building data services.
> > > >
> > > > On Wed, Aug 12, 2020 at 5:08 AM Adam Lippai <a...@rigo.sk> wrote:
> > > >>
> > > >> Arrow is mainly about batching data and leveraging all the
> > > >> opportunities this gives.
> > > >> This means you either have to buffer the data yourself and flush
> > > >> it when a reasonably sized batch is complete, or play with
> > > >> preallocating Arrow structures.  This was discussed recently; you
> > > >> might be interested in the thread:
> > > >> https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
> > > >>
> > > >> Note: I'm not an Arrow developer, I'm just following the
> > > >> "streaming" features of the Arrow lib.  I'm interested in having a
> > > >> "rolling window" API (like a fixed-size FIFO queue).
> > > >>
> > > >> Best regards,
> > > >> Adam Lippai
> > > >>
> > > >> On Wed, Aug 12, 2020 at 11:29 AM <m...@markfarnan.com> wrote:
> > > >>
> > > >> > I'm looking at using Arrow for a realtime IoT project which
> > > >> > includes use cases both on the server and also for
> > > >> > transferring/using in a browser via WASM, and have a few
> > > >> > questions.
> > > >> >
> > > >> > The language in use is Go.
> > > >> >
> > > >> > Is anyone working on implementing Arrow Flight in Go?  (According
> > > >> > to the feature matrix, nothing is ready yet, so I wanted to
> > > >> > check.)
> > > >> >
> > > >> > Has anyone tried using Apache Arrow in Go WASM (WebAssembly)?  If
> > > >> > so, any issues?
> > > >> >
> > > >> > Any pointers/documentation on using/extending Arrow for realtime
> > > >> > streaming cases?  (Specifically where a DataFrame is requested,
> > > >> > but then it needs to 'grow' as new data arrives, often at high
> > > >> > speed.)
> > > >> >
> > > >> > Not language specific, just trying to understand the right
> > > >> > pattern for using Arrow for this, and I couldn't find much in the
> > > >> > docs.
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > Mark.
> > > >> >
> > > >> >
> > > >
> > > >
> > >
> > >
> >
>
