Hi Biao, you're right. What you've described is a totally valid use case, and we should design the interfaces such that specialized implementations are possible for cases where you can exploit things like a common DFS. I think Nico's design should include this.
Cheers,
Till

On Fri, Jun 16, 2017 at 4:10 PM, Biao Liu <mmyy1...@gmail.com> wrote:

> Hi Till
>
> I agree with you about Flink's DC. It is indeed another topic. I just
> thought we could think more about it before refactoring the BLOB service,
> to make sure it is easy to implement DC on top of the refactored
> architecture.
>
> I have another question about the BLOB service. Can we abstract the BLOB
> service into some high-level interfaces? Maybe just some put/get methods
> in the interfaces. Being easy to extend would be useful in some scenarios.
>
> For example, in YARN mode there are some features that interest us:
> 1. YARN can localize files only once per slave machine, and all TMs of the
> same job can share these files. That may save a lot of bandwidth for
> large-scale jobs or jobs with large BLOBs.
> 2. We can skip uploading files if they are already on DFS. That's a common
> scenario in distributed cache.
> 3. Going further, we actually don't need a BlobServer component in YARN
> mode at all. We can rely on DFS to distribute files; there is always a DFS
> available in a YARN cluster.
>
> If we do so, the network-based BLOB service can be the default
> implementation. It would work in any situation, and it is also clear that
> it does not depend on Hadoop explicitly. And we could optimize for
> different kinds of clusters without any hacking.
>
> Those are just some rough ideas, but I think well-abstracted interfaces
> would be very helpful.
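
To make the proposal concrete, here is a minimal sketch of what such an abstraction could look like. All names here (`BlobService`, `InMemoryBlobService`) are hypothetical illustrations, not Flink's actual API: a small put/get interface, plus a trivial in-memory implementation standing in for the default network-based BlobServer, so that a YARN/DFS-backed variant could be swapped in behind the same interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction: just put/get, as suggested in the thread.
// A network-based default and a DFS-backed YARN variant would both
// implement this same interface.
interface BlobService {
    // Stores the given bytes and returns a key to retrieve them later.
    String put(byte[] data);

    // Returns the bytes stored under the key, or null if unknown.
    byte[] get(String key);
}

// Stand-in for the default implementation; a real one would talk to the
// BlobServer over the network, a YARN one could delegate to DFS instead.
class InMemoryBlobService implements BlobService {
    private final Map<String, byte[]> store = new ConcurrentHashMap<>();
    private int counter = 0;

    @Override
    public synchronized String put(byte[] data) {
        String key = "blob-" + (counter++);
        store.put(key, data);
        return key;
    }

    @Override
    public byte[] get(String key) {
        return store.get(key);
    }
}

public class BlobServiceDemo {
    public static void main(String[] args) {
        BlobService service = new InMemoryBlobService();
        String key = service.put("job-jar-bytes".getBytes());
        System.out.println(new String(service.get(key)));
    }
}
```

Callers would only ever see `BlobService`, so the cluster-specific optimizations (shared YARN localization, skipping uploads for files already on DFS) stay behind the interface.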