Hi,
there are currently no built-in metrics for InputSplit consumption but I do see 
that this could be quite helpful. I think you can have a custom RichInputFormat 
that uses metrics to record stuff, though.

I think adding built-in metrics should be possible at this point in the code: 
https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144
 
<https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144>

Best,
Aljoscha
> On 19. Apr 2017, at 09:25, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> 
> Hi Aljoscha,
> thanks for the reply, it was not urgent and I was aware of the FF...btw, 
> congratulations for it, I saw many interesting talks!
> Flink community has grown a lot since it was Stratosphere ;)
> Just one last question: in many of my use cases it could be helpful to see 
> how many of the created splits were "consumed" by an inputFormat/source.
> Is it possible to monitor this part somewhere in the dashboards or with a 
> custom metric?
> 
> On Tue, Apr 18, 2017 at 5:24 PM, Aljoscha Krettek <aljos...@apache.org 
> <mailto:aljos...@apache.org>> wrote:
> Hi,
> sorry for not getting any responses but I think everyone was quite busy with 
> Flink Forward SF. I’m also no expert on the topic but I’ll try and give some 
> answers.
> 
> Regarding a Google Doc version, I don’t think that there is any. You would 
> have to modify the Markdown version we have in the doc.
> 
> For the other answers I’ll reuse an example program that consists of Source 
> -> Map -> Sink, with chaining disabled and parallelism 2. We’ll this have 
> three Tasks: Source, Map, and Sink, with each having two subtasks. Let’s 
> denote the subtasks by a number in parenthesis so the first subtask for 
> Source is Source(1), second one is Source(2). I’ll also refer to Source(1) -> 
> Map(1) -> Sink(1) as a slice of the execution graph since these can be 
> executed within one slot.
> 
> Regarding 1, I think this is true. However, a single slot can execute a 
> complete slice of the execution graph where each subtask (from a different 
> task) would be executed by its own thread.
> 
> Regarding 2.1, Yes, I think it cannot run multiple subtasks of the same task 
> while it is possible (and in fact done) to execute all the subtasks of a 
> slide in the same slot.
> 
> Regarding 2.2, This is so to allow executing a pipeline of parallelism 8 
> using a cluster that has 8 free slots. Basically, each slice fills one slot.
> 
> Regarding 3, I don’t really have an answer.
> 
> Regarding 4, Yes, this can get a bit out of hand if you have very long 
> pipelines.
> 
> Best,
> Aljoscha
> 
>> On 11. Apr 2017, at 14:37, Flavio Pompermaier <pomperma...@okkam.it 
>> <mailto:pomperma...@okkam.it>> wrote:
>> 
>> Any feedback here..?
>> 
>> On Wed, Apr 5, 2017 at 7:43 PM, Flavio Pompermaier <pomperma...@okkam.it 
>> <mailto:pomperma...@okkam.it>> wrote:
>> Hi to all,
>> I had a very long but useful chat with Fabian and I understood a lot of 
>> concepts that was not clear at all to me. We started from the Flink runtime 
>> documentation page 
>> (https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html
>>  
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html>)
>>  but
>> I discovered that the terminology is very inconsistent and misleading along 
>> the page...
>> 
>> For example, one of the very first sentences is :
>> "Flink chains operator subtasks together into tasks. Each task is executed 
>> by one thread."
>> What I first understood was that every operator can be executed only by a 
>> single thread in all the cluster....probably it should be better "one thread 
>> per task slot" (at least). 
>> Moreover, if I'm not wrong, a Task Slot can execute only 1 subtask (aka 
>> parallel instance) of each task and there's no limit to the number of 
>> subtasks per slot (and this is not highlighted at all in that document). The 
>> only constraint is that they should belong to different tasks (right?).
>> 
>> If there's a google doc version of that page I could try to rewrite it down 
>> in order to make it easier to understand some parts...however I still have 
>> some more questions:
>> Is it correct that a single Task Slot can execute only a single subtask of 
>> each task and that this task is executed by a single thread within the slot)?
>> If it so:
>> why at that page there's written "By default, Flink allows subtasks to share 
>> slots even if they are subtasks of different tasks, so long as they are from 
>> the same job"? It seems that it is more common to run multiple subtasks of 
>> the same task (in a slot) than executing different substasks of different 
>> tasks, although this is still permitted...from what I understood a slot 
>> cannot run multiple subtask of the same task at all!
>> and why this constraint? Is there any good reason for that? A subtask is 
>> mapped to 1 thread in the TaskManager, so why a TM with 2 slots can run 2 
>> subtasks of the same task (in the same JVM) while a TM with 1 slot cannot  
>> (while it can execute an arbitrary number of subtasks of different tasks)? 
>> It it is not so, there's no images representing such a situation in that 
>> page...
>> Isn't dangerous to allow (potentially) an unlimited number of threads per TM 
>> slot?? 
>> Cheers,
>> Flavio
>> 
>> 
> 

Reply via email to