Hi, there are currently no built-in metrics for InputSplit consumption but I do see that this could be quite helpful. I think you can have a custom RichInputFormat that uses metrics to record stuff, though.
I think adding built-in metrics should be possible at this point in the code: https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144 <https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144> Best, Aljoscha > On 19. Apr 2017, at 09:25, Flavio Pompermaier <pomperma...@okkam.it> wrote: > > Hi Aljoscha, > thanks for the reply, it was not urgent and I was aware of the FF...btw, > congratulations for it, I saw many interesting talks! > Flink community has grown a lot since it was Stratosphere ;) > Just one last question: in many of my use cases it could be helpful to see > how many of the created splits were "consumed" by an inputFormat/source. > Is it possible to monitor this part somewhere in the dashboards or with a > custom metric? > > On Tue, Apr 18, 2017 at 5:24 PM, Aljoscha Krettek <aljos...@apache.org > <mailto:aljos...@apache.org>> wrote: > Hi, > sorry for not getting any responses but I think everyone was quite busy with > Flink Forward SF. I’m also no expert on the topic but I’ll try and give some > answers. > > Regarding a Google Doc version, I don’t think that there is any. You would > have to modify the Markdown version we have in the doc. > > For the other answers I’ll reuse an example program that consists of Source > -> Map -> Sink, with chaining disabled and parallelism 2. We’ll this have > three Tasks: Source, Map, and Sink, with each having two subtasks. Let’s > denote the subtasks by a number in parenthesis so the first subtask for > Source is Source(1), second one is Source(2). I’ll also refer to Source(1) -> > Map(1) -> Sink(1) as a slice of the execution graph since these can be > executed within one slot. > > Regarding 1, I think this is true. However, a single slot can execute a > complete slice of the execution graph where each subtask (from a different > task) would be executed by its own thread. > > Regarding 2.1, Yes, I think it cannot run multiple subtasks of the same task > while it is possible (and in fact done) to execute all the subtasks of a > slide in the same slot. > > Regarding 2.2, This is so to allow executing a pipeline of parallelism 8 > using a cluster that has 8 free slots. Basically, each slice fills one slot. > > Regarding 3, I don’t really have an answer. > > Regarding 4, Yes, this can get a bit out of hand if you have very long > pipelines. > > Best, > Aljoscha > >> On 11. Apr 2017, at 14:37, Flavio Pompermaier <pomperma...@okkam.it >> <mailto:pomperma...@okkam.it>> wrote: >> >> Any feedback here..? >> >> On Wed, Apr 5, 2017 at 7:43 PM, Flavio Pompermaier <pomperma...@okkam.it >> <mailto:pomperma...@okkam.it>> wrote: >> Hi to all, >> I had a very long but useful chat with Fabian and I understood a lot of >> concepts that was not clear at all to me. We started from the Flink runtime >> documentation page >> (https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html >> >> <https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html>) >> but >> I discovered that the terminology is very inconsistent and misleading along >> the page... >> >> For example, one of the very first sentences is : >> "Flink chains operator subtasks together into tasks. Each task is executed >> by one thread." >> What I first understood was that every operator can be executed only by a >> single thread in all the cluster....probably it should be better "one thread >> per task slot" (at least). >> Moreover, if I'm not wrong, a Task Slot can execute only 1 subtask (aka >> parallel instance) of each task and there's no limit to the number of >> subtasks per slot (and this is not highlighted at all in that document). The >> only constraint is that they should belong to different tasks (right?). >> >> If there's a google doc version of that page I could try to rewrite it down >> in order to make it easier to understand some parts...however I still have >> some more questions: >> Is it correct that a single Task Slot can execute only a single subtask of >> each task and that this task is executed by a single thread within the slot)? >> If it so: >> why at that page there's written "By default, Flink allows subtasks to share >> slots even if they are subtasks of different tasks, so long as they are from >> the same job"? It seems that it is more common to run multiple subtasks of >> the same task (in a slot) than executing different substasks of different >> tasks, although this is still permitted...from what I understood a slot >> cannot run multiple subtask of the same task at all! >> and why this constraint? Is there any good reason for that? A subtask is >> mapped to 1 thread in the TaskManager, so why a TM with 2 slots can run 2 >> subtasks of the same task (in the same JVM) while a TM with 1 slot cannot >> (while it can execute an arbitrary number of subtasks of different tasks)? >> It it is not so, there's no images representing such a situation in that >> page... >> Isn't dangerous to allow (potentially) an unlimited number of threads per TM >> slot?? >> Cheers, >> Flavio >> >> >