Kafka has an interesting model that might be applicable.

You can think of Kafka as a queueing system. Writers are called
producers, and readers are called consumers. The server is called a broker.
A "topic" is like a named queue.

Producers are independent: they can write to a "topic" at will. Consumers
(i.e. your nested aggregates) need to be independent of each other and of
the broker. The broker receives data from producers and stores it using
memory and disk. Consumers read from the broker, and each consumer maintains
its own cursor. Because the client maintains the cursor, one consumer cannot
impact other producers and consumers.
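
For concreteness, here is a minimal consumer sketch in Scala using the
standard kafka-clients API; the broker address, group id, and topic name
are all made up:

  import java.util.Properties
  import scala.collection.JavaConverters._
  import org.apache.kafka.clients.consumer.KafkaConsumer

  val props = new Properties()
  props.put("bootstrap.servers", "broker:9092")  // made-up broker address
  props.put("group.id", "nested-aggregate-1")    // each group has its own cursor
  props.put("enable.auto.commit", "false")       // we advance the cursor ourselves
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("events").asJava)      // "events" is a made-up topic

  while (true) {
    val records = consumer.poll(500)             // fetch whatever has arrived
    records.asScala.foreach(r => println(s"${r.offset}: ${r.value}"))
    consumer.commitSync()  // move this consumer's cursor; no one else is affected
  }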

I would think the tricky part for Spark would be knowing when the data can
be deleted. In the Kafka world, each topic can define a TTL SLA, i.e. the
consumers must read the data within a limited window of time.
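
For illustration, retention is configured per topic; these are real config
keys (settable at topic creation or later with the kafka-configs tool) but
the values here are made up:

  # delete data older than 24 hours
  retention.ms=86400000
  # and/or cap each partition at ~1 GB
  retention.bytes=1073741824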

Andy

From:  Michael Armbrust <mich...@databricks.com>
Date:  Thursday, July 7, 2016 at 2:31 PM
To:  Arnaud Bailly <arnaud.oq...@gmail.com>
Cc:  Sivakumaran S <siva.kuma...@me.com>, "user @spark"
<user@spark.apache.org>
Subject:  Re: Multiple aggregations over streaming dataframes

> We are planning to address this issue in the future.
> 
> At a high level, we'll have to add a delta mode so that updates can be
> communicated from one operator to the next.
> 
> On Thu, Jul 7, 2016 at 8:59 AM, Arnaud Bailly <arnaud.oq...@gmail.com> wrote:
>> Indeed. But nested aggregation does not work with Structured Streaming,
>> that's the point. I would like to know if there is a workaround, or what the
>> plan is regarding this feature, which seems to me quite useful. If the
>> implementation is not overly complex and it is just a matter of manpower, I
>> am fine with devoting some time to it.
>> 
>> 
>> 
>> -- 
>> Arnaud Bailly
>> 
>> twitter: abailly
>> skype: arnaud-bailly
>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>> 
>> On Thu, Jul 7, 2016 at 2:17 PM, Sivakumaran S <siva.kuma...@me.com> wrote:
>>> Arnauld,
>>> 
>>> You could aggregate the first table and then merge it with the second table
>>> (assuming that they are similarly structured) and then carry out the second
>>> aggregation. Unless the data is very large, I don't see why you should
>>> persist it to disk. IMO, nested aggregation is more elegant and readable
>>> than a complex single stage.
>>> 
>>> Regards,
>>> 
>>> Sivakumaran
>>> 
>>> 
>>> 
>>>> On 07-Jul-2016, at 1:06 PM, Arnaud Bailly <arnaud.oq...@gmail.com> wrote:
>>>> 
>>>> It's aggregation at multiple levels in a query: first do some aggregation
>>>> on one table, then join with another table and do a second aggregation. I
>>>> could probably rewrite the query in such a way that it does aggregation in
>>>> one pass but that would obfuscate the purpose of the various stages.
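>>>> 
>>>> In DataFrame terms the shape is roughly the following (a sketch only;
>>>> table and column names are made up):
>>>> 
>>>>   import org.apache.spark.sql.SparkSession
>>>>   import org.apache.spark.sql.functions.{count, sum}
>>>> 
>>>>   val spark = SparkSession.builder.appName("example").getOrCreate()
>>>>   val clicks = spark.table("clicks")  // stand-in for the first table
>>>>   val users  = spark.table("users")   // stand-in for the other table
>>>> 
>>>>   val perUser = clicks.groupBy("user").agg(count("*").as("n"))  // first aggregation
>>>>   val byCountry = perUser.join(users, "user")                   // join
>>>>     .groupBy("country").agg(sum("n"))                           // second aggregation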
>>>> 
>>>> On 7 Jul 2016 at 12:55, "Sivakumaran S" <siva.kuma...@me.com> wrote:
>>>>> Hi Arnauld,
>>>>> 
>>>>> Sorry for the doubt, but what exactly is multiple aggregation? What is the
>>>>> use case?
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Sivakumaran
>>>>> 
>>>>> 
>>>>>> On 07-Jul-2016, at 11:18 AM, Arnaud Bailly <arnaud.oq...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I understand multiple aggregations over streaming dataframes is not
>>>>>> currently supported in Spark 2.0. Is there a workaround? Off the top
>>>>>> of my head I could think of a two-stage approach:
>>>>>>  - first query writes its output to disk/memory using "complete" mode
>>>>>>  - second query reads from this output (rough sketch below)
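>>>>>> 
>>>>>> A rough sketch of that idea, assuming Spark 2.0's memory sink; the
>>>>>> socket source and the schema are made up for illustration:
>>>>>> 
>>>>>>   import org.apache.spark.sql.SparkSession
>>>>>>   import org.apache.spark.sql.functions.sum
>>>>>> 
>>>>>>   val spark = SparkSession.builder.appName("two-stage").getOrCreate()
>>>>>>   import spark.implicits._
>>>>>> 
>>>>>>   // stand-in streaming source; any streaming DataFrame would do
>>>>>>   val events = spark.readStream.format("socket")
>>>>>>     .option("host", "localhost").option("port", "9999").load()
>>>>>>     .as[String].map { line => val a = line.split(","); (a(0), a(1)) }
>>>>>>     .toDF("user", "page")
>>>>>> 
>>>>>>   // stage 1: first aggregation, materialized to an in-memory table
>>>>>>   events.groupBy("user", "page").count()
>>>>>>     .writeStream
>>>>>>     .outputMode("complete")
>>>>>>     .format("memory")
>>>>>>     .queryName("stage1")   // registers a temp table named "stage1"
>>>>>>     .start()
>>>>>> 
>>>>>>   // stage 2: a second aggregation over the materialized output
>>>>>>   val stage2 = spark.table("stage1").groupBy("user").agg(sum("count"))
>>>>>> 
>>>>>> The catch is that stage 2 is a plain batch query that has to be re-run
>>>>>> periodically; it is not itself a streaming query.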
>>>>>> 
>>>>>> Does this make sense?
>>>>>> 
>>>>>> Furthermore, I would like to understand the technical hurdles that are
>>>>>> preventing Spark SQL from implementing multiple aggregations right now.
>>>>>> 
>>>>>> Thanks,
>>>>>> -- 
>>>>>> Arnaud Bailly
>>>>>> 
>>>>>> twitter: abailly
>>>>>> skype: arnaud-bailly
>>>>>> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>>>>> 
>>> 
>> 
> 

