Hi Michael, How much data are you sending over the wire to JavaScript? For grouping, you might be able to send the original ungrouped table as an Arrow buffer and use a new JavaScript library from the University of Washington called Arquero (https://github.com/uwdata/arquero), which natively supports Arrow. Here are a few Observable notebooks to get you started:
https://observablehq.com/@uwdata/arquero-roll-up-and-drill-down https://observablehq.com/@uwdata/arquero-and-apache-arrow Regards, Naveen On Wed, Mar 3, 2021 at 2:00 PM Michael Lavina <michael.lav...@factset.com> wrote: > Hey Weston, > > Do you have any public code examples I could take a look at? This does > sound very related to what I am doing. > > One particular question I have related to grouping is how you define > row-grouping. Column grouping is fairly simple; I think you can just define > a Struct that tells you how columns of data are grouped, but how would you > go about grouping rows of data? For example: > > User Table > > First Name | Last Name | Country | State | City | Occupation > > // some data > > I have thought of basically two ways to do this. Send some metadata array, > i.e. groupBy, that denotes how the data should be grouped; it’s a simple > algorithm, maybe something like [country, state, city]. But then you would > need to store some mapping so that a given rowIndex returns some rows of > children based on that algorithm. And I think this would require all the > data to be available to do the grouping. > > The other way is defining the structure of the data, maybe something like > (this could be entirely wrong, I am new to Arrow, sorry) > list<struct<country, list<struct<state, list<struct<city, > list<struct<firstName, lastName, occupation>>>>>>>> > > but basically the idea would be that if you were to retrieve the data for a > given index of, let’s say, a state, it would return all the cities and vectors > of data related to that given state. > > I also don’t know if this is a limitation of my understanding of > Arrow or of the Arrow JS SDK library; this might be something very easy that I am > just not seeing. 
> > -Michael > From: Weston Pace <weston.p...@gmail.com> > Date: Friday, February 26, 2021 at 9:34 PM > To: dev@arrow.apache.org <dev@arrow.apache.org> > Cc: Michael Lavina <michael.lav...@factset.com> > Subject: Re: [JS] Exploring usage of apache arrow at my company for > complex table rendering > I used Arrow for this purpose in the past. I don't have much to add > but just a few thoughts off the top of my head... > > * The line between data and metadata can be blurry - For most > measurements we were able to store the "expected distribution" as > metadata (e.g. this measurement should have an expected value of 10 > +/- 3) and that could be used for drawing limit lines. For some > measurements however the common practice in place was to store the > upper/lower limit as separate columns because they often changed > depending on the various independent variables. In that case the same > "concept" (limit) might be stored in data or metadata. > > * Distinction between "data" and a "chart" - For us, we introduced a > separate representation called the "chart" between the data and the > rendering layer. So using that limit line example before if we wanted > to plot a histogram of some column then we would create a bar chart > from the column. This bar chart itself was also an array of numbers > but, since these arrays were much smaller (one per bin, hard limit to > bin count in the thousands based on # of pixels in display), and the > structure was much more deeply nested, we ended up just using JSON for > charts. The "limit" metadata belonged to the data and it was > translated into a vertical line element as part of the chart. > > * Processing layer - For us it was too expensive to send the data > across the Internet for display. So the conversion from data -> chart > happened with the datacenter close to the actual data. The JS UI was > simply responsible for chart -> pixels (well, SVG). It sounds like > you plan on doing the processing in JS. 
This can work, I'm just > tossing out alternatives to think about. You can even have a hybrid > model where some initial filtering happens in the datacenter and then > chart calculation / rendering happens in JS. > > * Expressions for group/split - Arrow expressions / compute are > starting to become available (and more work is being done on in-arrow > query engines). These can be very helpful for things like grouping or > splitting. For example, if you want to plot two line charts, one for > model X and one for model Y then you can define your split using > expressions. Unfortunately, these are pretty big features and I don't > think they are in the JS library. However, the existing C++/Rust work > could serve as examples for how you might want to tackle this. You > will need a fair amount of compute to go from data to chart > (histograms, averages, standard deviations, etc.). In my case I used > pandas pretty extensively for this since the Arrow compute features > didn't exist yet. There are some JS libraries for this (e.g. d3) so > you can probably investigate that avenue as well. > > On Fri, Feb 26, 2021 at 12:05 PM Paul Taylor <ptay...@apache.org> wrote: > > > > Hi Michael, > > > > The answer to your question about metadata will likely be > > application-specific. > > > > For small amounts of metadata (i.e. communicating a bounding box of > > included geometry), there isn't much room for optimization, so a string > > could be fine. > > > > For larger amounts of metadata (or other constraints, like if the > metadata > > needs to be constantly modified independent of the data), custom > encodings > > or a second service and/or arrow table of the metadata could be the way > to > > go. > > > > The metadata keys/values are UTF-8 strings, so nothing should prevent you > > from stuffing a base64-encoded protobuf in there. 
> > > > As for whether the library is maintained -- yes it is, but lately I've > only > > had time to work on bug fixes or features required to maintain parity > with > > the spec and other libs. > > > > I will be using Arrow JS in my work again soon, and that could justify > more > > "quality of life" improvements again, but without other maintainers > jumping > > in to contribute or needing it for my work, those things don't get done. > > > > I'd be happy to do a call with you or your team to give a short overview > > and introduction to the JS lib. You can also email me directly or reach me in the > > #arrow-js channel on the-asf.slack.com with any questions. > > > > Best, > > Paul > > > > On Fri, Feb 26, 2021 at 1:47 PM Michael Lavina < > michael.lav...@factset.com> > > wrote: > > > > > Hey Neal, > > > > > > Thanks for the response and I am glad I am using this correctly. I have > > > never really used email servers so hopefully this works. > > > > > > That’s exactly what I was thinking of doing: creating a standard > > > metadata schema to build on top of Apache Arrow with some predefined > user > > > types. > > > > > > I guess I was just wondering if I was trying to use a screwdriver as a > > > hammer. It can work because we are using the metadata and that could be > > > anything, but maybe, like you said, we should be creating a separate > standard > > > entirely for defining the schema to render tables instead of defining > it > > > within Arrow. > > > > > > Does it defeat the value of Arrow if we are sending the data using buffers > > > and streams plus a giant string of stringified metadata, when I could > maybe > > > define the metadata in protobuf binary separately? > > > > > > In addition, I was curious whether, with all these visualization tools, > someone > > > has already developed a standard metadata for Arrow to help with rendering: > > > stuff like how to denote grouping of data, relationships between > columns, and > > > hidden information. 
> > > > > > -Michael > > > > > > From: Neal Richardson <neal.p.richard...@gmail.com> > > > Date: Friday, February 26, 2021 at 1:38 PM > > > To: dev <dev@arrow.apache.org> > > > Subject: Re: [JS] Exploring usage of apache arrow at my company for > > > complex table rendering > > > The Arrow IPC specification allows for custom metadata in both the > Schema > > > and the individual Fields: > > > > > > https://arrow.apache.org/docs/format/Columnar.html#schema-message > > > > > > Might that work for you? Another alternative would be to track your > > > metadata in a separate object outside of the Arrow data. > > > > > > Neal > > > > > > On Fri, Feb 26, 2021 at 5:02 AM Michael Lavina < > michael.lav...@factset.com > > > > > > > wrote: > > > > > > > Hello Everyone, > > > > > > > > > > > > > > > > Some background. My name is Michael and I work at FactSet, which if > you > > > > use Arrow you may have heard because one of our architects did a > talk on > > > > using Arrow and Dremio. 
> > > > > > > > https://hello.dremio.com/eliminate-data-transfer-bottlenecks-with-apache-arrow-flight.html?utm_medium=social-free&utm_source=linkedin&utm_term=na&utm_content=na&utm_campaign=eliminate-data-transfer-bottlenecks-with-apache-arrow-flight > > > > > > > > > > > > His team has decided to use Arrow as a tabular data interchange > format. > > > > Other teams are doing other things. We are working on standardizing > our > > > > tabular data interchange format at our company. > > > > > > > > > > > > > > > > We have our own open-sourced columnar based schema defined in > protobuf. 
> > > > > > > > https://github.com/factset/stachschema > > > > > > > > > > > > We looked into Apache Arrow a few years ago, but decided not to use > it as > > > > it was not mature enough at the time and we had two specific > requirements: > > > > > > > > 1) We needed this data not just for analytics but for rendering as well, > and > > > > rendering requires a lot more complicated information, such as > > > understanding > > > > the type of data and the relationships between data, i.e. grouping. > > > > > > > > 2) We need SDKs that support TypeScript/JavaScript in both browser and > Node > > > > and support both creating and consuming Arrow. > > > > > > > > > > > > > > > > Now that Apache Arrow is more mature and stabilized, i.e. the schema > and > > > > SDKs are post 1.x, we are looking into it again. > > > > > > > > > > > > > > > > 1. We are thinking of defining specific metadata, in a similar way to what > we > > > > do for STACH, that lets us define some rendering-specific information, e.g. > adding > > > a > > > > metadata entry to a Field Schema called isHidden to denote whether we > should > > > > render the data column or not. > > > > 2. It seems like there is a well developed JavaScript SDK that we > can > > > > use. I am still reading the source code and the Observable > articles to > > > > truly understand how it works. > > > > 1. I read in one of the issues that the JS library might be out > > > > of sync, so do people know how actively that repo is maintained? > > > > 2. 
If there needs to be work done I think we would be able to > help > > > > if we had some help getting started with understanding that > repo. > > > > > > > > > > > > > > > > If possible we would be interested to continue to chat about the > above > > > > ideas, get more information about whether Apache Arrow is right for the > job, > > > and > > > > learn whether there is already discussion of other people using Arrow for > > > > rendering in addition to analytics. > > > > > > > > > > > > > > > > To clarify what I mean by existing render technologies: I know stuff > like > > > > Falcon and Perspective exist, but those seem to be for basic > > > > rendering of simple tables. I mean to create a superset of Arrow by > > > > defining metadata that allows for complex nested headers and nested > > > rows. > > > > Something like the image below. Then you can imagine even more data > > > > attached, such as describing the data and relationships to other data > on > > > the > > > > page. You can imagine that in the dataset there is some `personId` that is > set > > > to > > > > not be rendered. This personId can then be used to gather more > > > information > > > > in another API call if you wanted to render a tooltip with maybe > some bio > > > > information. In short, rendered tables require a lot more information > > > than > > > > just the data. Does it make sense to build this upon Arrow? > > > > > > > > > > > > > > > > > > > > > > > > -Thanks > > > > > > > > Michael > > > > > > > > > > > > > > > > -- ----------------------------------- Naveen Michaud-Agrawal
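
For reference, the first row-grouping idea discussed in this thread (ship the ungrouped table plus a groupBy list such as [country, state, city], then derive a mapping from each group to its child rows) could look roughly like the sketch below. It is only an illustration under stated assumptions: plain string arrays stand in for Arrow vectors, and the names Columns, GroupNode, groupRows and the sample data are hypothetical, not part of Arrow JS or Arquero.

```typescript
// Hypothetical columnar table: plain arrays stand in for Arrow vectors.
type Columns = Record<string, string[]>;

// One node of the group tree (assumed shape, for illustration only).
interface GroupNode {
  key: string;           // value of the grouping column at this level
  rows: number[];        // indices of every row that falls under this node
  children: GroupNode[]; // sub-groups for the next groupBy column
}

// Build a nested group tree from an ordered list of column names.
// Only row indices are stored, so the column data itself is never copied.
function groupRows(
  columns: Columns,
  groupBy: string[],
  rows: number[],
  depth: number = 0
): GroupNode[] {
  if (depth >= groupBy.length) return [];
  const column = columns[groupBy[depth]];
  const buckets = new Map<string, number[]>();
  for (const row of rows) {
    const key = column[row];
    const bucket = buckets.get(key);
    if (bucket) bucket.push(row);
    else buckets.set(key, [row]);
  }
  const nodes: GroupNode[] = [];
  // Map iteration follows insertion order, so groups appear in first-seen order.
  buckets.forEach((groupedRows, key) => {
    nodes.push({
      key,
      rows: groupedRows,
      children: groupRows(columns, groupBy, groupedRows, depth + 1),
    });
  });
  return nodes;
}

// Made-up sample data shaped like the "User Table" from the thread.
const users: Columns = {
  firstName: ["Ada", "Bo", "Cy", "Di"],
  country: ["US", "US", "US", "CA"],
  state: ["NY", "NY", "CA", "ON"],
  city: ["Albany", "Buffalo", "Fresno", "Toronto"],
};
const allRows = users["country"].map((_, i) => i);
const tree = groupRows(users, ["country", "state", "city"], allRows);
console.log(JSON.stringify(tree, null, 2));
```

As noted in the thread, this approach needs all rows to be available before grouping. The tree holds only indices, so a renderer can look up cell values in the original columns on demand.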