>
> Is there any chance you could point me to those abstractions so that I may
> have a look and play around with them?

Sorry, nothing like this exists in Java (and I realize that might have
been what you were expecting).  I was thinking of C++/Python, which have
ChunkedArray classes [1].  The computational kernels in C++ (some of
which are exposed in Python) process data at this level.  A pyarrow Table
is made up of chunked arrays.
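
To make this concrete, here is a minimal sketch in Python (the column
values are made up; it only uses documented pyarrow APIs):

import pyarrow as pa
import pyarrow.compute as pc

# A ChunkedArray is one logical column backed by one or more contiguous
# Array chunks, so new chunks can be appended without copying old data.
col = pa.chunked_array([[1, 2, 2], [3, 3, 3]])
print(col.num_chunks)      # 2

# The C++ kernels (exposed in Python) operate on the chunked structure
# directly, e.g. value_counts [1] or the compute module's aggregations.
print(col.value_counts())  # counts per distinct value
print(pc.sum(col))         # 14

# A pyarrow Table is just a collection of named ChunkedArrays.
table = pa.table({"x": col})
print(type(table.column("x")))  # <class 'pyarrow.lib.ChunkedArray'>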


[1]
https://arrow.apache.org/docs/python/generated/pyarrow.ChunkedArray.html#pyarrow.ChunkedArray.value_counts

On Sun, Sep 20, 2020 at 2:43 AM Pedro Silva <[email protected]> wrote:

> Is there any chance you could point me to those abstractions so that I may
> have a look and play around with them?
>
> Sent from my iPhone
>
> On 20 Sep 2020, at 05:17, Micah Kornfield <[email protected]> wrote:
>
> 
>
>> Furthermore, these types of queries seem to fit what I would call (for
>> lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
>> it) is to standardize the static dataframe memory model; can it also
>> support a sliding version?
>
>
> I don't think there are any explicit library features planned around
> sliding windows, but the current abstractions allow for lazily combining
> columns into a single logical structure and working on that.  I imagine
> sliding window abstractions could be built from that.
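>
> To illustrate (a rough sketch with made-up column names, not a built-in
> sliding-window API): independently produced columns can be combined into
> one logical table without copying buffers, and a window is then a
> zero-copy slice of that table.
>
> import pyarrow as pa
>
> # Columns arriving from different sources, each already chunked.
> ts = pa.chunked_array([[1, 2], [3, 4]])
> vals = pa.chunked_array([[10.0, 20.0], [30.0, 40.0]])
>
> # Combining them into a Table is zero-copy: the Table just references
> # the existing chunks.
> table = pa.table({"timestamp": ts, "value": vals})
>
> # A sliding window abstraction could advance a zero-copy slice like
> # this one as new chunks arrive.
> window = table.slice(offset=1, length=2)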
>
> On Thu, Sep 10, 2020 at 2:04 AM Pedro Silva <[email protected]> wrote:
>
>> Hi Micah,
>>
>> Thank you for your reply and the links, the threads were quite
>> interesting.
>> You are right, I opened the Flink issue regarding Arrow support to
>> understand whether it is on their roadmap.
>>
>> My use-case is processing a stream of events (or rows if you will) to
>> compute ~100-150 sliding window aggregations over a subset of the received
>> fields (say 10 out of a row with 80+ fields).
>>
>> Something of the sort:
>> *average (session_time) group by ID over 1 hour.*
>>
>> For the query above only 3 fields are required: session_time, ID, and
>> timestamp (implicitly required to define time windows), meaning that we
>> can discard a significant amount of information from the original event.
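>>
>> As a sketch of what I mean (made-up field names, not our real schema),
>> projecting the needed columns is cheap in pyarrow:
>>
>> import pyarrow as pa
>>
>> # A wide event batch with many fields; only 3 matter for this query.
>> wide = pa.table({
>>     "ID": ["a", "b"],
>>     "session_time": [30.0, 45.0],
>>     "timestamp": [1, 2],
>>     "other_field": [0, 0],  # stand-in for the ~80 unused fields
>> })
>>
>> # select() keeps references to the chosen columns; nothing is copied,
>> # so the remaining fields can be discarded cheaply.
>> narrow = wide.select(["ID", "session_time", "timestamp"])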
>>
>> Furthermore, these types of queries seem to fit what I would call (for
>> lack of a better word) "sliding" dataframes. Arrow's aim (as I understand
>> it) is to standardize the static dataframe memory model; can it also
>> support a sliding version?
>>
>> Usually these queries are defined by data scientists and domain experts
>> who are comfortable using Python, not Java or C++, which are the
>> languages streaming engines are built on.
>>
>> My goal is to understand whether existing streaming engines like Flink
>> can converge on a common model that could in the future help develop
>> efficient cross-language streaming engines.
>>
>> I hope I've been able to clarify some points.
>>
>> Thanks
>>
>>
>> On Fri, Sep 4, 2020 at 8:17 PM Micah Kornfield <
>> [email protected]> wrote:
>>
>>> Hi Pedro,
>>> I think the answer is that it likely depends.  The main trade-off in
>>> using Arrow in a streaming process is the high metadata overhead if you
>>> have very few rows.  There have been prior discussions on the mailing
>>> list about row-based formats and streaming [1][2] that might be useful
>>> in expanding on the trade-offs.
>>>
>>> For some additional color: Brian Hulette gave a talk [3] a while ago
>>> about potentially using Arrow within Beam (I believe Flink has a high
>>> overlap with the Beam API) and some of the challenges.  It also looks
>>> like there was a Flink JIRA (that you might be on?) about using Arrow
>>> directly in Flink and some of the trade-offs [4].
>>>
>>> The questions you posed are a little bit vague; if there is more
>>> context, it might help make the conversation more productive.
>>>
>>> -Micah
>>>
>>> [1]
>>>
>>> https://lists.apache.org/thread.html/33a4e1a272e77d4959c851481aa25c6e4aa870db172e4c1bbf2e3a35%40%3Cdev.arrow.apache.org%3E
>>> [2]
>>>
>>> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6%40%3Cdev.arrow.apache.org%3E
>>> [3] https://www.youtube.com/watch?v=avy1ifTZlhE
>>> [4] https://issues.apache.org/jira/browse/FLINK-10929
>>>
>>>
>>> On Fri, Sep 4, 2020 at 12:39 AM Pedro Silva <[email protected]>
>>> wrote:
>>>
>>> > Hello,
>>> >
>>> > This may be a stupid question, but is Arrow used for, or designed
>>> > with, stream processing use cases in mind where data is
>>> > non-stationary, e.g. Flink stream processing jobs?
>>> >
>>> > Particularly, is it possible from a given event source (say Kafka) to
>>> > efficiently generate incremental record batches for stream processing?
>>> >
>>> > Suppose there is a data source that continuously generates messages
>>> > with 100+ fields. You want to compute grouped aggregations (sums,
>>> > averages, count distinct, etc.) over a select few of those fields,
>>> > say 5 fields at most across all queries.
>>> >
>>> > Is this a valid use-case for Arrow?
>>> > What if time is important and some windowing technique has to be
>>> > applied?
>>> >
>>> > Thank you very much for your time!
>>> > Have a good day.
>>> >
>>>
>>
