Sorry to thread hijack.
> One key point is that we perform our own dictionary encoding of the data
> before generating the Arrow file, so basically all of the dimensional data
> in the Arrow file itself consists of just numbers (integers) that represent
> keys into an array of strings stored outside the Arrow file.

Just curious, is there a reason why you didn't use the built-in dictionary
support in the Arrow format?

-Micah

On Mon, Aug 17, 2020 at 9:02 PM Michael Stephenson <domehead...@gmail.com>
wrote:

> Having spent a few solid days looking at Finos Perspective a while back, I
> think it has a lot of potential, and also a few rough edges. Like JS
> Arrow, the documentation is sparse and experimentation is required. It
> does handle Arrow data well with possibly some caveats. One is, if I
> recall correctly, that it seems to lose the ability to discriminate between
> 0 and null for integer columns after about 64K rows in size where we told
> Perspective that a column was supposed to be a nullable integer. This was
> based on Arrow files generated via JS Arrow, which may be where the problem
> lies, I can't remember, or maybe we were doing something wrong.
>
> Perspective does offer streaming support, and I think the streamed data can
> be in Arrow IPC format. It is also WASM and has good capabilities for
> parallelism using multiple workers. And it's quite simple to use.
>
> I think Arrow for in-browser analytics has a lot of promise in terms of
> bandwidth, performance, and low memory usage. My team at work has started
> working on an analytics project and we've been trying Arrow as a data
> format. For analytics datasets with no high-cardinality dimensions, it's
> been really great. Taking a simple dataset of just over 100K rows that is
> 32MB+ in array-of-objects format in JSON, we were able to get it to right
> at 2MB via Arrow (~94% compression) with no decompression and basically no
> parsing of the data needed in the browser.
> Once in the browser, we wrote
> some functions for filtering, group by, simple joins, etc., and the query
> performance has been quite good (around 10-50ms per query on average with a
> 100K+ row dataset for most slice and dice types of operations, a bit worse
> when joins are in play so we try to avoid them). At this point these are
> all just iterations over the entire dataset(s) using a for-of loop and the
> Row proxy since we had some difficulties with the scan api, so there is
> room for improvement here.
>
> One key point is that we perform our own dictionary encoding of the data
> before generating the Arrow file, so basically all of the dimensional data
> in the Arrow file itself consists of just numbers (integers) that represent
> keys into an array of strings stored outside the Arrow file. This improved
> the size of the Arrow file by ~50%. It also speeds up the in-browser
> queries over the data in the browser by about 300%. In a multipart mime
> response, we send down the Arrow file along with a JSON array that serves
> as the "dictionary." In the browser, queries are run by transforming
> strings into the numeric keys contained in the Arrow file, performing the
> query, and then only at the end when the result is small is the data
> "unpacked" back into strings using the dictionary. The 300ish% improvement
> mentioned includes the time for this packing and unpacking.
>
> For larger datasets, we've tried processing them using web workers just for
> experimentation purposes. We tried this with over 1M rows and it worked
> nicely, only slightly noticeable lag time for the end user when running
> slice and dice operations. For our use case, a few 100K rows or less, the
> overhead of the web workers hasn't been worth it, but it would allow
> parallel processing if needed in the future.
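
For anyone following along, here is a minimal sketch of the out-of-band dictionary encoding described in the quoted paragraphs above. It is not Mike's actual code and it uses no Arrow API: a plain `Int32Array` stands in for the integer column that would live in the Arrow file, and the names `EncodedColumn`, `dictionaryEncode`, `rowsWhereEquals`, and `countByValue` are hypothetical.

```typescript
// Hypothetical helper names; a plain typed array stands in for an Arrow column.
interface EncodedColumn {
  keys: Int32Array;     // one integer key per row (what would go in the Arrow file)
  dictionary: string[]; // distinct strings, shipped separately (e.g. as a JSON array)
}

// Replace each string with its index in a dictionary of distinct values.
function dictionaryEncode(values: string[]): EncodedColumn {
  const index = new Map<string, number>();
  const dictionary: string[] = [];
  const keys = new Int32Array(values.length);
  values.forEach((value, row) => {
    let key = index.get(value);
    if (key === undefined) {
      key = dictionary.length;
      dictionary.push(value);
      index.set(value, key);
    }
    keys[row] = key;
  });
  return { keys, dictionary };
}

// Filter: translate the search string to its key once, then compare integers.
function rowsWhereEquals(column: EncodedColumn, value: string): number[] {
  const key = column.dictionary.indexOf(value);
  const rows: number[] = [];
  if (key === -1) return rows; // the value never occurs in this column
  for (let row = 0; row < column.keys.length; row++) {
    if (column.keys[row] === key) rows.push(row);
  }
  return rows;
}

// Group-by count on the keys; strings are restored only for the small result.
function countByValue(column: EncodedColumn): Map<string, number> {
  const counts = new Int32Array(column.dictionary.length);
  for (let row = 0; row < column.keys.length; row++) {
    counts[column.keys[row]]++;
  }
  const result = new Map<string, number>();
  counts.forEach((count, key) => result.set(column.dictionary[key], count));
  return result;
}

const region = dictionaryEncode(["east", "west", "east", "north", "west", "east"]);
```

The queries touch only integers until the final step, which is presumably where the speedup reported above comes from; Arrow's built-in dictionary-encoded layout stores the same keys-plus-dictionary pair inside the file instead of alongside it.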
>
> If your dataset has high-cardinality fields, then obviously compression
> will suffer greatly, etc., but for our specific use case this approach has
> shown a lot of promise.
>
> We haven't looked at streaming yet, but we've anticipated either using
> micro batches rather than real-time or handling the streaming data outside
> of Arrow since it should incrementally represent smaller amounts of data
> (e.g., queries in the browser might query over the Arrow data and
> separately the streamed data and aggregate the results, then periodically
> maybe put the out-of-band data into Arrow format in the browser). This
> also would lend itself to parallel query processing via web workers. We
> haven't looked at Flight as of yet, but it sounds really interesting, and
> with WASM too, even better.
>
> ~Mike
>
> On Sat, Aug 15, 2020 at 6:01 PM Pierre Belzile <pierre.belz...@gmail.com>
> wrote:
>
> > Mark,
> >
> > Did you take a look at Finos Perspective? It seems to have some
> > interesting overlaps with your goals. I've come across it but have not
> > dug in.
> >
> > Be curious to get your thoughts on it.
> >
> > Cheers
> >
> > On Sat., Aug. 15, 2020, 13:05 , <m...@markfarnan.com> wrote:
> >
> > > David,
> > >
> > > Still investigating, but I suspect for streaming I may have to fall
> > > back to some form of "custom" Flight implementation over Websockets.
> > >
> > > Assuming Arrow/Flight actually makes sense for that link, which will
> > > probably depend on how well it compresses. However it will be very
> > > nice if it does, to allow a common format everywhere.
> > >
> > > The data I need to move around is highly variable in 'type' (Arrays
> > > of Floats, Ints & Strings with occasional Binary, or vector (array of
> > > an array of floats in my case)), but the number of columns and their
> > > types vary by dataset and visualization choices.
> > > So far Arrow seems a good choice rather than any 'roll your own', and
> > > it will be nice to use the same format on the Client side as well as
> > > in the Server system.
> > >
> > > My use case is primarily 'Get', consuming large datasets for
> > > visualization. I doubt I'll need Put or Exchange from the browser.
> > >
> > > Mark.
> > >
> > > -----Original Message-----
> > > From: David Li <li.david...@gmail.com>
> > > Sent: Saturday, August 15, 2020 5:53 PM
> > > To: dev@arrow.apache.org
> > > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> > >
> > > I am curious what you accomplish with Arrow + Flight from the browser.
> > > Right now, Flight is all gRPC-based, and browser compatibility is a bit
> > > mixed. I expect the various transcoders/gRPC-Web can handle
> > > GetFlightInfo/DoGet fine, though IIRC for DoGet, at least some of the
> > > transcoders would have to buffer the entire stream before sending it to
> > > the browser. DoPut/DoExchange seem harder/impossible to bridge right
> > > now due to the bidirectional streaming.
> > >
> > > Best,
> > > David
> > >
> > > On 8/14/20, m...@markfarnan.com <m...@markfarnan.com> wrote:
> > > > Thanks Wes,
> > > >
> > > > I'll likely work on that once I get my head around Arrow in general
> > > > and confirm I will use it for the project.
> > > >
> > > > How to account for the streaming append problem on an otherwise
> > > > immutable dataset is my current concern. Still thinking through
> > > > that.
> > > >
> > > > Regards
> > > >
> > > > Mark.
> > > >
> > > > -----Original Message-----
> > > > From: Wes McKinney <wesmck...@gmail.com>
> > > > Sent: Wednesday, August 12, 2020 3:59 PM
> > > > To: dev <dev@arrow.apache.org>
> > > > Subject: Re: Arrow Flight + Go, Arrow for Realtime
> > > >
> > > > There's a WIP patch for Flight support in Go
> > > >
> > > > https://github.com/apache/arrow/pull/6731
> > > >
> > > > I hope to see someone taking up this work as first-class Flight
> > > > support in Go would be very useful for building data services.
> > > >
> > > > On Wed, Aug 12, 2020 at 5:08 AM Adam Lippai <a...@rigo.sk> wrote:
> > > >>
> > > >> Arrow is mainly about batching data and leveraging all the
> > > >> opportunities this gives.
> > > >> This means you either have to buffer the data yourself and flush it
> > > >> when a reasonable sized batch is complete or play with preallocating
> > > >> Arrow structures. This was discussed recently, you might be
> > > >> interested in the thread:
> > > >> https://www.mail-archive.com/dev@arrow.apache.org/msg19862.html
> > > >>
> > > >> Note: I'm not an Arrow developer, I'm just following the "streaming"
> > > >> features of the Arrow lib, I'm interested in having a "rolling
> > > >> window" API (like a fixed size FIFO queue).
> > > >>
> > > >> Best regards,
> > > >> Adam Lippai
> > > >>
> > > >> On Wed, Aug 12, 2020 at 11:29 AM <m...@markfarnan.com> wrote:
> > > >>
> > > >> > I'm looking at using Arrow for a realtime IoT project which
> > > >> > includes use cases both on server, and also for transferring/using
> > > >> > in a Browser via WASM, and have a few questions.
> > > >> >
> > > >> > Language in use is Go.
> > > >> >
> > > >> > Is anyone working on implementing Arrow-Flight in Go? (According
> > > >> > to the feature matrix, nothing ready yet, so wanted to check.)
> > > >> >
> > > >> > Has anyone tried using Apache Arrow in Go WASM (WebAssembly)?
> > > >> > If so, any issues?
> > > >> >
> > > >> > Any pointers/documentation on using/extending Arrow for realtime
> > > >> > streaming cases? (Specifically where a DataFrame is requested, but
> > > >> > then it needs to 'grow' as new data arrives, often at high speed.)
> > > >> >
> > > >> > Not language specific, just trying to understand the right pattern
> > > >> > for using Arrow for this, and couldn't find much in the docs.
> > > >> >
> > > >> > Regards
> > > >> >
> > > >> > Mark.
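
As a footnote to Adam's suggestion earlier in the thread (buffer incoming rows yourself and flush once a reasonably sized batch is complete), the pattern might be sketched roughly as follows. This is only an illustration under stated assumptions: `BatchBuffer` is a hypothetical class, not part of any Arrow library, and the `onBatch` callback stands in for whatever would turn the buffered rows into an Arrow record batch.

```typescript
// Hypothetical class; not part of any Arrow library.
class BatchBuffer<T> {
  private pending: T[] = [];
  private readonly batchSize: number;
  private readonly onBatch: (rows: T[]) => void;

  constructor(batchSize: number, onBatch: (rows: T[]) => void) {
    this.batchSize = batchSize;
    this.onBatch = onBatch;
  }

  // Append one row; flush automatically once the batch is full.
  push(row: T): void {
    this.pending.push(row);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Hand the buffered rows to the consumer as one batch and start a new buffer.
  flush(): void {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onBatch(batch);
  }
}

const batches: number[][] = [];
const buffer = new BatchBuffer<number>(3, (rows) => { batches.push(rows); });
for (let reading = 1; reading <= 7; reading++) buffer.push(reading);
buffer.flush(); // e.g. from a timer, so a partial batch is not held indefinitely
```

Each flushed batch can then be treated as immutable, which fits the append-to-an-immutable-dataset concern raised above: the dataset grows by whole batches rather than by mutating existing ones.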