Please ignore my previous email; I didn't know Apache requires all the discussions to be "open".
On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:

> Hi Jun,
>
> Thank you very much for your feedback!
>
> Can we schedule a meeting in your Palo Alto office in December? I think a
> face-to-face discussion is much more efficient than email. Both Harsha
> and I can visit you. Satish may be able to join us remotely.
>
> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
>
>> Hi, Satish and Harsha,
>>
>> The following is more detailed, high-level feedback on the KIP. Overall,
>> the KIP seems useful. The challenge is how to design it so that it's
>> general enough to support different ways of implementing this feature,
>> while also supporting existing features.
>>
>> 40. Local segment metadata storage: The KIP assumes that the metadata
>> for the archived log segments is cached locally on every broker, and it
>> provides a specific implementation of that local storage in the
>> framework. We should probably discuss this more. For example, some tier
>> storage providers may not want to cache the metadata locally and may
>> instead rely on a remote key/value store, if such a store is already
>> present. If a local store is used, there are different ways of
>> implementing it (e.g., customized local files, an embedded store like
>> RocksDB, etc.). An alternative design is to provide only an interface
>> for retrieving the tier segment metadata and leave the details of how
>> the metadata is obtained outside the framework.
>>
>> 41. RemoteStorageManager interface and its usage in the framework: I am
>> not sure the interface is general enough. For example, it seems that
>> RemoteLogIndexEntry is tied to one specific way of storing the metadata
>> in remote storage. The framework uses the listRemoteSegments() API in a
>> pull-based approach, but in other implementations a push-based approach
>> may be preferred. I don't have a concrete proposal yet.
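The alternative raised in point 40 (the framework depends only on an interface for retrieving tier segment metadata, leaving the backing store to the plugin) could look roughly like the sketch below. All names here are hypothetical illustrations, not APIs from the KIP:

```java
// Hypothetical sketch of the "interface only" alternative from point 40:
// the framework looks up segment metadata through a pluggable provider and
// never knows whether it is backed by local files, RocksDB, or a remote
// key/value store. None of these names come from the KIP itself.
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

interface RemoteSegmentMetadataProvider {
    /** Metadata for the remote segment covering the given offset, if any. */
    Optional<RemoteSegmentMetadata> metadataForOffset(String topic, int partition, long offset);

    /** All remote segments known for the partition (the pull-based listing). */
    List<RemoteSegmentMetadata> listRemoteSegments(String topic, int partition);
}

/** Minimal metadata holder; the fields are illustrative only. */
final class RemoteSegmentMetadata {
    final long baseOffset;
    final long endOffset;
    final String remoteLocation; // e.g. an object-store URI

    RemoteSegmentMetadata(long baseOffset, long endOffset, String remoteLocation) {
        this.baseOffset = baseOffset;
        this.endOffset = endOffset;
        this.remoteLocation = remoteLocation;
    }
}

/** Trivial in-memory implementation, standing in for any real backend. */
final class InMemoryMetadataProvider implements RemoteSegmentMetadataProvider {
    private final List<RemoteSegmentMetadata> segments = new ArrayList<>();

    void add(RemoteSegmentMetadata segment) {
        segments.add(segment);
    }

    @Override
    public Optional<RemoteSegmentMetadata> metadataForOffset(String topic, int partition, long offset) {
        // Linear scan is fine for a sketch; a real store would index by offset.
        return segments.stream()
                .filter(s -> s.baseOffset <= offset && offset <= s.endOffset)
                .findFirst();
    }

    @Override
    public List<RemoteSegmentMetadata> listRemoteSegments(String topic, int partition) {
        return new ArrayList<>(segments);
    }
}
```

With this shape, a local-cache implementation and a remote key/value-store implementation both sit behind the same interface, so the caching decision in point 40 stays out of the framework.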
>> But it would be useful to give this area some more thought and see
>> whether we can make the interface more general.
>>
>> 42. In the diagram, the RemoteLogManager sits side by side with the
>> LogManager. The KIP only discusses how fetch requests are handled
>> between the two layers. However, we should also consider how other
>> requests that touch the log are handled, e.g., list offsets by
>> timestamp, delete records, etc. Also, in this model it's not clear
>> which component is responsible for managing the log start offset: it
>> seems the log start offset could be changed by both the
>> RemoteLogManager and the LogManager.
>>
>> 43. There are quite a few existing features not covered by the KIP. It
>> would be useful to discuss each of them.
>> 43.1 I wouldn't say that compacted topics are rarely used and always
>> small. For example, Kafka Streams uses compacted topics for storing
>> state, and the topic can sometimes be large. While it might be OK not
>> to support compacted topics initially, it would be useful to have a
>> high-level idea of how they could be supported down the road, so that
>> we don't have to make incompatible API changes in the future.
>> 43.2 We need to discuss how EOS is supported. In particular, how is the
>> producer state integrated with the remote storage?
>> 43.3 Now that KIP-392 (allow consumers to fetch from the closest
>> replica) is implemented, we need to discuss how reading from a follower
>> replica works with tier storage.
>> 43.4 We need to discuss how JBOD is supported with tier storage.
>>
>> Thanks,
>>
>> Jun
>>
>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
>>
>> > Thanks for those insights Ying.
>> >
>> > On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid>
>> > wrote:
>> >
>> > > > Thanks, I missed that point.
>> > > > However, there's still a point at which consumer fetches start
>> > > > getting served from remote storage (even if that point isn't as
>> > > > soon as the local log retention time/size). This represents a kind
>> > > > of performance cliff edge, and what I'm really interested in is
>> > > > how easy it is for a consumer that falls off that cliff to catch
>> > > > up so that its fetches come from local storage again. Obviously
>> > > > this depends on all sorts of factors (like production rate and
>> > > > consumption rate), so it's not guaranteed (just as it's not
>> > > > guaranteed for Kafka today), but it would represent a new failure
>> > > > mode.
>> > >
>> > > As I explained in my last mail, it's a very rare case that a
>> > > consumer needs to read remote data. In our experience at Uber, this
>> > > only happens when the consumer service has had an outage for several
>> > > hours.
>> > >
>> > > There is no "performance cliff" as you assume. The remote storage is
>> > > actually faster than local disks in terms of bandwidth. Reading from
>> > > remote storage has higher latency than reading from local disk, but
>> > > since the consumer is catching up on several hours of data, it is
>> > > not sensitive to sub-second latency, and each remote read request
>> > > fetches a large amount of data, so the overall performance is better
>> > > than reading from local disks.
>> > >
>> > > > Another aspect I'd like to understand better is the effect that
>> > > > serving fetch requests from remote storage has on the broker's
>> > > > network utilization.
>> > > > If we're just trimming the amount of data held locally (without
>> > > > increasing the overall local+remote retention), then we're
>> > > > effectively trading disk bandwidth for network bandwidth when
>> > > > serving fetch requests from remote storage (which I understand to
>> > > > be a good thing, since brokers are often/usually disk bound). But
>> > > > if we're increasing the overall local+remote retention, then it's
>> > > > more likely that the network itself becomes the bottleneck. I
>> > > > appreciate this is all rather hand-wavy; I'm just trying to
>> > > > understand how this would affect broker performance, so I'd be
>> > > > grateful for any insights you can offer.
>> > >
>> > > Network bandwidth usage is a function of produce rate; it has
>> > > nothing to do with remote retention. Once the data has been shipped
>> > > to remote storage, you can keep it there for 1 day, 1 year, or 100
>> > > years without consuming any network resources.
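The catch-up question in this exchange can be made concrete with a back-of-envelope estimate: a lagging consumer drains its backlog only as fast as its read throughput exceeds the produce rate, regardless of whether the reads hit local or remote storage. The numbers below are illustrative assumptions, not measurements from Uber:

```java
// Back-of-envelope sketch of the catch-up discussion above.
// All rates and lag figures are illustrative assumptions.
class CatchUpEstimate {
    /**
     * Hours until a lagging consumer is back on local data, assuming it can
     * read faster than producers write. Returns -1 if it can never catch up.
     */
    static double hoursToCatchUp(double lagHours, double produceMBps, double consumeMBps) {
        if (consumeMBps <= produceMBps) return -1; // backlog only grows
        double backlogMB = lagHours * 3600 * produceMBps;
        double drainMBps = consumeMBps - produceMBps; // net backlog drain rate
        return backlogMB / drainMBps / 3600;
    }

    public static void main(String[] args) {
        // e.g. 6 hours behind, producers at 50 MB/s, consumer reading 200 MB/s
        System.out.println(hoursToCatchUp(6, 50, 200)); // prints 2.0
    }
}
```

At these assumed rates a consumer six hours behind is back on local data in about two hours; if its read throughput cannot exceed the produce rate, it never catches up, which is the same constraint Kafka has today without tiered storage.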