Hi Biao,

you're right. What you've described is a totally valid use case and we
should design the interfaces such that you can have specialized
implementations for the cases where you can exploit things like a common
DFS. I think Nico's design should include this.

Cheers,
Till

On Fri, Jun 16, 2017 at 4:10 PM, Biao Liu <mmyy1...@gmail.com> wrote:

> Hi Till
>
> I agree with you about Flink's distributed cache (DC); it is indeed
> another topic. I just thought we could think about it more before
> refactoring the BLOB service, to make sure it's easy to implement the DC
> on top of the refactored architecture.
>
> I have another question about the BLOB service. Can we abstract the BLOB
> service into some high-level interfaces? Maybe just some put/get methods
> in the interfaces. Being easy to extend would be useful in some scenarios.
>
> For example, in Yarn mode there are some cool features that interest us:
> 1. Yarn can localize files only once per slave machine, so all TMs of the
> same job can share those files. That may save a lot of bandwidth for
> large-scale jobs or jobs with large BLOBs.
> 2. We can skip uploading files if they are already on the DFS. That's a
> common scenario for the distributed cache.
> 3. Going further, we actually don't need a BlobServer component in Yarn
> mode at all; we can rely on the DFS to distribute files. A DFS is always
> available in a Yarn cluster.
>
> If we do so, the network-based BLOB service can be the default
> implementation, which would work in any situation. It also makes clear
> that the service does not depend on Hadoop explicitly, and we could
> optimize for different kinds of clusters without any hacking.
>
> These are just some rough ideas, but I think well-abstracted interfaces
> would be very helpful.
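The abstraction described above could be sketched roughly as follows. This is only an illustrative sketch, not Flink's actual API: the interface and class names (`BlobService`, `InMemoryBlobService`) and the key format are made up here, and the in-memory map merely stands in for the network-based default, where a DFS-backed or Yarn-localized store would be alternative implementations behind the same put/get contract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical high-level BLOB abstraction: only put/get, so a network
// BlobServer, a DFS-backed store, or a Yarn-localized store could each
// implement it without callers knowing the difference.
interface BlobService {
    String put(byte[] data);          // stores the blob, returns its key
    Optional<byte[]> get(String key); // empty if the blob is unknown
}

// Stand-in default implementation (in-memory, in place of the network
// BlobServer) just to show the contract in action.
class InMemoryBlobService implements BlobService {
    private final Map<String, byte[]> store = new HashMap<>();
    private int counter = 0;

    @Override
    public String put(byte[] data) {
        String key = "blob-" + (counter++);
        store.put(key, data.clone());
        return key;
    }

    @Override
    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(store.get(key));
    }
}

public class BlobServiceDemo {
    public static void main(String[] args) {
        BlobService blobs = new InMemoryBlobService();
        String key = blobs.put("job.jar contents".getBytes());
        System.out.println(blobs.get(key).isPresent());
        System.out.println(blobs.get("missing").isPresent());
    }
}
```

With such an interface, picking a Yarn/DFS-optimized implementation would be a deployment-time choice rather than a code change.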
>