Is there any chance you could point me to those abstractions so that I may have a look and play around with them?
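For context, a minimal pyarrow sketch of what I understand those
abstractions to be: a Table built from record batches holds its columns
as ChunkedArrays that view the original buffers, so combining and
slicing is cheap. (This assumes the Table/ChunkedArray pair is what's
meant; the field names are invented.)

    import pyarrow as pa

    schema = pa.schema([("id", pa.int64()),
                        ("session_time", pa.float64())])

    # Two record batches, e.g. arriving from a stream.
    batch1 = pa.record_batch([pa.array([1, 2]),
                              pa.array([10.0, 20.0])], schema=schema)
    batch2 = pa.record_batch([pa.array([3, 4]),
                              pa.array([30.0, 40.0])], schema=schema)

    # from_batches does not copy the underlying buffers; each column
    # becomes a ChunkedArray that views the batches' memory.
    table = pa.Table.from_batches([batch1, batch2])

    # Slicing is zero-copy too, so a "window" over the logical table
    # is cheap to take.
    window = table.slice(1, 2)
    print(window.to_pydict())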
> On 20 Sep 2020, at 05:17, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
>> Furthermore, these types of queries seem to fit what I would call (for
>> lack of a better word) "sliding" dataframes. Arrow's aim (as I
>> understand it) is to standardize the static dataframe memory model; can
>> it also support a sliding version?
>
> I don't think there are any explicit library features planned around
> sliding windows, but the current abstractions allow for lazily combining
> columns into a single logical structure and working on that. I imagine
> sliding-window abstractions could be built on top of it.
>
>> On Thu, Sep 10, 2020 at 2:04 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>> Hi Micah,
>>
>> Thank you for your reply and the links; the threads were quite
>> interesting. You are right, I opened the Flink issue regarding Arrow
>> support to understand whether it was on their roadmap.
>>
>> My use-case is processing a stream of events (or rows, if you will) to
>> compute ~100-150 sliding-window aggregations over a subset of the
>> received fields (say 10 out of a row with 80+ fields).
>>
>> Something of the sort:
>> average(session_time) group by ID over 1 hour.
>>
>> For the query above only 3 fields are required: session_time, ID, and
>> the timestamp (implicitly required to define time windows). This means
>> we can discard a significant amount of information from the original
>> event.
>>
>> Furthermore, these types of queries seem to fit what I would call (for
>> lack of a better word) "sliding" dataframes. Arrow's aim (as I
>> understand it) is to standardize the static dataframe memory model; can
>> it also support a sliding version?
>>
>> Usually these queries are defined by data scientists and domain experts
>> who are comfortable using Python, not Java or C++, which are the
>> languages streaming engines are built in.
>>
>> My goal is to understand whether existing streaming engines like Flink
>> can converge on a common model that could, in the future, help develop
>> efficient cross-language streaming engines.
>>
>> I hope I've been able to clarify some points.
>>
>> Thanks
>>
>>> On Fri, Sep 4, 2020 at 20:17, Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> Hi Pedro,
>>> I think the answer is: it likely depends. The main trade-off in using
>>> Arrow in a streaming process is the high metadata overhead if you have
>>> very few rows per batch. There have been prior discussions on the
>>> mailing list about row-based formats and streaming [1][2] that expand
>>> on the trade-offs.
>>>
>>> For some additional color: Brian Hulette gave a talk [3] a while ago
>>> about potentially using Arrow within Beam (I believe Flink has a high
>>> overlap with the Beam API) and some of the challenges. It also looks
>>> like there was a Flink JIRA (that you might be on?) about using Arrow
>>> directly in Flink and some of the trade-offs [4].
>>>
>>> The questions you posed are a little bit vague; more context might
>>> help make the conversation more productive.
>>>
>>> -Micah
>>>
>>> [1] https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
>>> [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
>>> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
>>> [4] https://issues.apache.org/jira/browse/FLINK-10929
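As a concrete illustration of the metadata overhead Micah describes:
every record batch in an Arrow IPC stream carries its own framing
metadata, so with very few rows per batch the framing can dominate the
payload. A rough sketch to measure it (assuming pyarrow; the exact byte
counts will vary by version):

    import pyarrow as pa
    import pyarrow.ipc as ipc

    schema = pa.schema([("id", pa.int64()),
                        ("session_time", pa.float64())])

    def stream_size(batches):
        # Serialize the batches as an Arrow IPC stream, return its size.
        sink = pa.BufferOutputStream()
        with ipc.new_stream(sink, schema) as writer:
            for b in batches:
                writer.write_batch(b)
        return sink.getvalue().size

    big = pa.record_batch([pa.array(range(10_000)),
                           pa.array([1.0] * 10_000)], schema=schema)
    tiny = [big.slice(i, 1) for i in range(10_000)]  # same rows, 1 per batch

    print("one 10,000-row batch:  ", stream_size([big]), "bytes")
    print("10,000 one-row batches:", stream_size(tiny), "bytes")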
>>> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <pedro.cl...@gmail.com> wrote:
>>>
>>> > Hello,
>>> >
>>> > This may be a stupid question, but is Arrow used for, or designed
>>> > with, stream-processing use-cases in mind, where data is
>>> > non-stationary (e.g. Flink stream-processing jobs)?
>>> >
>>> > In particular, is it possible, from a given event source (say,
>>> > Kafka), to efficiently generate incremental record batches for
>>> > stream processing?
>>> >
>>> > Suppose there is a data source that continuously generates messages
>>> > with 100+ fields. You want to compute grouped aggregations (sums,
>>> > averages, count distinct, etc.) over a select few of those fields,
>>> > say 5 fields at most across all queries.
>>> >
>>> > Is this a valid use-case for Arrow?
>>> > What if time is important and some windowing technique has to be
>>> > applied?
>>> >
>>> > Thank you very much for your time!
>>> > Have a good day.
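For what it's worth, a hypothetical sketch of the use-case in the
original question: buffer decoded messages from the source, keep only
the fields the queries need, and emit one record batch per buffer. The
names (NEEDED, messages_to_batch, the sample message) are invented for
illustration; a real pipeline would poll a Kafka client such as
confluent-kafka, and the group_by step needs a recent pyarrow.

    import pyarrow as pa

    NEEDED = ["id", "session_time", "event_ts"]  # a few of the 100+ fields

    def messages_to_batch(messages):
        # Build a record batch from decoded messages (dicts), dropping
        # every field the queries do not use.
        columns = {name: [m.get(name) for m in messages] for name in NEEDED}
        return pa.RecordBatch.from_pydict(columns)

    # In practice `msgs` would come from a Kafka consumer poll loop.
    msgs = [{"id": 1, "session_time": 12.5, "event_ts": 1599200000,
             "other_field": "dropped"}]
    batch = messages_to_batch(msgs)

    # A grouped aggregation over the pruned columns, e.g.
    # average(session_time) group by id:
    table = pa.Table.from_batches([batch])
    print(table.group_by("id").aggregate([("session_time", "mean")]))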