Re: Side-effects of DataSet::count

Ovidiu-Cristian MARCU Tue, 31 May 2016 03:05:04 -0700

Hi Stephan and all,

Some reference to this may be https://issues.apache.org/jira/browse/FLINK-2250 
<https://issues.apache.org/jira/browse/FLINK-2250> ?
I agree your priorities on streaming are very high, it will make a big +1 for 
the community to create a discussion/place for the design proposal improvement 
and eventually launch an initial draft (including new requirements). As one can 
try to dig in, is quite complex what you have already achieved (for example 
FLINK-2097, FLINK-1350, FLINK-1359 and related, mainly FLINK-986). These issues 
are a pain for DataSets.


Best,
Ovidiu

> On 31 May 2016, at 11:27, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi!
> 
> There was some preliminary work on this. By now, the requirements have
> grown a bit. The backtracking needs to handle
> 
>  - Scheduling for execution (the here raised point), possibly resuming
> from available intermediate results
>  - Recovery from partially executed programs, where operators execute
> whole or not (batch style)
>  - Recover from intermediate result since latest completed checkpoint
>  - Eventually even recover superstep-based iterations.
> 
> So the design needs to be extended slightly. We do not have a design
> writeup for this, but I agree, it would be great to have one.
> I have a pretty good general idea about this, let me see if I can get to
> that next week.
> 
> In general, for such things (long standing ideas and designs), we should
> have something like Kafka has with its KIPs (Kafka Improvement Proposal) -
> a place where to collect them, refine them over time, and
> see how people react to them or step up to implement them. We could call
> them 3Fs (Flink Feature Forms) ;-)
> 
> Greetings,
> Stephan
> 
> 
> On Tue, May 31, 2016 at 1:02 AM, Greg Hogan <c...@greghogan.com> wrote:
> 
>> Hi Stephan,
>> 
>> Is there a design document, prior discussion, or background material on
>> this enhancement? Am I correct in understanding that this only applies to
>> DataSet since streams run indefinitely?
>> 
>> Thanks,
>> Greg
>> 
>> On Mon, May 30, 2016 at 5:49 PM, Stephan Ewen <se...@apache.org> wrote:
>> 
>>> Hi Eron!
>>> 
>>> Yes, the idea is to actually switch all executions to a backtracking
>>> scheduling mode. That simultaneously solves both fine grained recovery
>> and
>>> lazy execution, where later stages build on prior stages.
>>> 
>>> With all the work around streaming, we have not gotten to this so far,
>> but
>>> it is one feature still in the list...
>>> 
>>> Greetings,
>>> Stephan
>>> 
>>> 
>>> On Mon, May 30, 2016 at 9:55 PM, Eron Wright <ewri...@live.com> wrote:
>>> 
>>>> Thinking out loud now…
>>>> 
>>>> Is the job graph fully mutable?   Can it be cleared?   For example,
>>>> shouldn’t the count method remove the sink after execution completes?
>>>> 
>>>> Can numerous job graphs co-exist within a single driver program?    How
>>>> would that relate to the session concept?
>>>> 
>>>> Seems the count method should use ‘backtracking’ schedule mode, and
>> only
>>>> execute the minimum needed to materialize the count sink.
>>>> 
>>>>> On May 29, 2016, at 3:08 PM, Márton Balassi <
>> balassi.mar...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hey Eron,
>>>>> 
>>>>> Yes, DataSet#collect and count methods implicitly trigger a JobGraph
>>>>> execution, thus they also trigger writing to any previously defined
>>>> sinks.
>>>>> The idea behind this behavior is to enable interactive querying (the
>>> one
>>>>> that you are used to get from a shell environment) and it is also a
>>> great
>>>>> debugging tool.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Marton
>>>>> 
>>>>> On Sun, May 29, 2016 at 11:28 PM, Eron Wright <ewri...@live.com>
>>> wrote:
>>>>> 
>>>>>> I was curious as to how the `count` method on DataSet worked, and
>> was
>>>>>> surprised to see that it executes the entire program graph.
>> Wouldn’t
>>>> this
>>>>>> cause undesirable side-effects like writing to sinks?    Also
>> strange
>>>> that
>>>>>> the graph is mutated with the addition of a sink (that isn’t
>>>> subsequently
>>>>>> removed).
>>>>>> 
>>>>>> Surveying the Flink code, there aren’t many situations where the
>>> program
>>>>>> graph is implicitly executed (`collect` is another).   Nonetheless,
>>> this
>>>>>> has deepened my appreciation for how dynamic the application might
>> be.
>>>>>> 
>>>>>> // DataSet.java
>>>>>> public long count() throws Exception {
>>>>>>  final String id = new AbstractID().toString();
>>>>>> 
>>>>>>  output(new Utils.CountHelper<T>(id)).name("count()");
>>>>>> 
>>>>>>  JobExecutionResult res = getExecutionEnvironment().execute();
>>>>>>  return res.<Long> getAccumulatorResult(id);
>>>>>> }
>>>>>> Eron
>>>> 
>>>> 
>>> 
>>

Re: Side-effects of DataSet::count

Reply via email to