#general


@ehwalee: @ehwalee has joined the channel
@noahprince8: How does fault tolerance work with servers in Pinot? I.e., what happens when a server crashes? My guess would be that, as a helix participant, somehow the controller sees it has crashed, and as such sends out messages to other servers to take over the segments from the crashed server? Then, when the server reboots, the controller sees a new server is available, and starts distributing segments to it? Requiring a rebalance to truly get a bunch of segments back onto it?
  @tanmay.movva: In my understanding, yes that is how it happens. All helix participants and controllers maintain their state in zookeeper. As per the config/replication we’ve set, the controller always works to maintain the ideal state of the cluster and its segments. In the case where a new server joins the helix cluster, the controller triggers a rebalance by redistributing segments across servers to maintain a balanced ideal state. It asks the new server to load segments from the deep store by putting messages into the server’s queue. This happens till the ideal state of the cluster is reached. Hope this helps. Correct me if I don’t make sense.
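(As an aside, the “ideal state” mentioned here is just a Helix resource record stored in ZooKeeper. A minimal sketch of inspecting it with the plain Helix admin API, using placeholder ZK address, cluster, and table names:)
```
// Minimal sketch: reading a Pinot table's Helix IdealState directly.
// "localhost:2181", "PinotCluster", and "myTable_OFFLINE" are placeholder values.
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class IdealStateInspector {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState idealState = admin.getResourceIdealState("PinotCluster", "myTable_OFFLINE");

    // Each partition of the resource is a Pinot segment; the instance-state map says
    // which server instances should host it and in what state (ONLINE, OFFLINE, ...).
    for (String segment : idealState.getPartitionSet()) {
      System.out.println(segment + " -> " + idealState.getInstanceStateMap(segment));
    }
  }
}
```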
  @noahprince8: It does. So another question, I’m digging into the internals and it doesn’t look like there’s a message triggering the removal of a segment, other than from the transition from ONLINE to OFFLINE
  @noahprince8: I started to trace down how rebalances work, but it seems pretty complex. When a rebalance occurs, how are segments removed? What is that event, and how does it hook in? I would expect it to call `HelixInstanceDataManager.removeSegment`
  @noahprince8: For context, I’m working on
  @tanmay.movva: From what I understand, there is a state assigned/attached to individual segments as well, which has info such as which servers it should be present on, etc. So when the controller triggers a rebalance, I’m expecting it updates the state of the segments, which in turn changes the ideal state of the cluster, and to achieve that ideal state it asks servers to load/unload segments onto their disks/volumes. I haven’t delved into the code myself, so I would not be able to help with the exact code blocks where this happens. But would love to know it. :slightly_smiling_face:
  @g.kishore: @noahprince8 read a bit about Helix here
  @g.kishore: once you get that following Pinot code will be easier.
  @g.kishore: all callbacks from Helix are already handled in Pinot; when you trigger a rebalance, new segments are loaded via offlineToOnline callbacks
  @g.kishore: see SegmentOnlineOfflineStateModel for more details
  @g.kishore: ```
// Delete segment from local directory.
@Transition(from = "OFFLINE", to = "DROPPED")
public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
```
  @g.kishore: this gets triggered when a segment is removed from the ideal state for a specific server
  @noahprince8: So, it looks like that directly removes the directory but never removes it from the table data manager
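(For illustration, a hypothetical variant of that callback that also unregisters the segment from the instance data manager, the step noted above as missing. This is not Pinot's actual implementation; the fields, the constructor, and the removeSegment(table, segment) signature are assumptions, and the import for HelixInstanceDataManager is omitted.)
```
// Hypothetical sketch, not Pinot's actual code: an OFFLINE -> DROPPED callback that
// unregisters the segment from the instance data manager before deleting the local copy.
import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.Transition;

public class SegmentDropSketch {
  // Assumed collaborators; HelixInstanceDataManager is the class referenced earlier in the thread.
  private final HelixInstanceDataManager _instanceDataManager;
  private final File _dataDir;

  public SegmentDropSketch(HelixInstanceDataManager instanceDataManager, File dataDir) {
    _instanceDataManager = instanceDataManager;
    _dataDir = dataDir;
  }

  @Transition(from = "OFFLINE", to = "DROPPED")
  public void onBecomeDroppedFromOffline(Message message, NotificationContext context) {
    String tableNameWithType = message.getResourceName();
    String segmentName = message.getPartitionName();

    // Unregister the in-memory bookkeeping first so queries stop hitting the segment...
    _instanceDataManager.removeSegment(tableNameWithType, segmentName);

    // ...then delete the on-disk copy.
    FileUtils.deleteQuietly(new File(new File(_dataDir, tableNameWithType), segmentName));
  }
}
```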
@karinwolok1: :wave: Hi everyoneeeee!!! I'm new to the Pinot community and I want to shape some of the Pinot community programs for the future. :wine_glass: I'm looking to get to know some people in the community (newbies and/or seasoned). Anyone here open to having a 30 min video conversation with me? Pleaaaassssseeeee!!!! :star-struck: (I only look creepy in the pic of me hiding in corn fields... I'm not _that_ bad). Send a DM or just schedule a time to chat:
@narayanan.arunachalam: @narayanan.arunachalam has joined the channel
@kennybastani: Welcome @karinwolok1! So excited to have you
  @karinwolok1: Thank youuuuu, Kenny! :heart:
@gregsimons84: Hey @karinwolok1 Welcome to the wonderful world of Apache Pinot. Great to have you onboard !
  @karinwolok1: Thanks, Greg!
@kennybastani: Hey all. I highly recommend spending some time with @karinwolok1 to talk about your use case and experience with Pinot. This is extremely crucial to the project and for our future. I rarely use `@here` notifications to this channel, but in this case, it’s important. Thanks everyone.
  @karinwolok1: Thanks, Kenny! Maybe I am downplaying all the awesome things we can do with the Pinot community and that this is an opportunity to really be part of the movement. :D
  @kennybastani: I think you’re spot on.
@dzpanda: @dzpanda has joined the channel

#random


@ehwalee: @ehwalee has joined the channel
@narayanan.arunachalam: @narayanan.arunachalam has joined the channel
@dzpanda: @dzpanda has joined the channel

#feat-text-search


@gary.a.stafford: @gary.a.stafford has joined the channel

#feat-presto-connector


@gary.a.stafford: @gary.a.stafford has joined the channel

#troubleshooting


@gary.a.stafford: @gary.a.stafford has joined the channel

#segment-cold-storage


@noahprince8: @noahprince8 has joined the channel
@noahprince8: @noahprince8 set the channel purpose: Discussing
@g.kishore: @g.kishore has joined the channel
@fx19880617: @fx19880617 has joined the channel
@g.kishore: thanks for creating this channel
@noahprince8: Feel free to add anyone as necessary. Felt like it might be better/quicker to discuss things here than the github issue
@mayanks: @mayanks has joined the channel
@jackie.jxt: @jackie.jxt has joined the channel
@noahprince8: So, the original plan was to create a class of servers that lazily loads segments on demand for queries. While this works, the idea of cold storage dovetails with the idea that you’re probably going to have _many_ segments. Each segment means overhead for helix reaching an ideal state. What if we instead create a separate class of OFFLINE table with some COLD_STORAGE flag enabled? Cold storage tables are not part of the helix ideal state. We explicitly accept they are going to have a lot of segments that are used infrequently. Assignment to servers _explicitly_ happens on demand. A query would look like this:
1. Broker prunes and finds the ideal list of segments
2. Routing table indicates particular segments are unassigned.
3. Broker somehow communicates the need for these segments to be assigned to a server.
4. Broker waits for ideal state
5. Broker runs the query as usual.
6. A minion prunes cold storage segments that haven’t been used in a while
@noahprince8: The nice thing about this is that the code for the server doesn’t need to change. As far as it’s concerned, it’s just getting assigned normal OFFLINE segments from the deep store.
@noahprince8: This also makes it easier to evenly distribute cold storage segments for queries across servers, potentially avoiding hotspotting if all of the cold storage segments for a particular time range are on the same server.
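(To make that flow concrete, a purely illustrative broker-side sketch. None of these classes or methods exist in Pinot today; routingTable, coldStorageAssigner, externalViewWatcher, and queryRouter are stand-ins for whatever components would implement steps 1-6 above:)
```
// Purely hypothetical broker-side flow for cold-storage tables; every identifier
// below is a placeholder, not an existing Pinot API.
List<String> selectedSegments = routingTable.pruneSegments(query);        // step 1

List<String> unassigned = selectedSegments.stream()                       // step 2
    .filter(seg -> routingTable.getServersFor(seg).isEmpty())
    .collect(Collectors.toList());

if (!unassigned.isEmpty()) {
  coldStorageAssigner.requestAssignment(tableName, unassigned);           // step 3: add to ideal state
  externalViewWatcher.awaitOnline(tableName, unassigned, timeoutMs);      // step 4: wait, or stream partial results
}

queryRouter.execute(query, selectedSegments);                             // step 5
// step 6 (pruning idle cold segments) runs out-of-band in a periodic task or minion.
```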
@npawar: @npawar has joined the channel
@noahprince8: This actually fits pretty well into the tiering model, with `storageType=DEEP_STORE`
@g.kishore: how will the servers know which segments to load for a query when they’re not in the ideal state
@noahprince8: The broker will need to handle that. It has to assign servers to segments if the segment has not been assigned to any servers. It then has to wait for that state to be achieved before running the query
@g.kishore: ah, you are saying segments are still part of idealstate
@noahprince8: But only actively used segments
@g.kishore: its just that the table is not part of Helix
@noahprince8: So effectively the segments only become part of helix when someone is trying to query them
@noahprince8: Else, they are just stored in the routing table with no assigned servers
@mayanks: How does this solve the memory issue in broker, once a bunch of queries that have touched all segments are executed?
@g.kishore: but all the things you have mentioned can still be achieved by having a LazyImmutableSegmentLoader right
@noahprince8: It doesn’t solve that issue, we’d need a separate solution. It does solve the issue of too many messages as in
@noahprince8: I think for cold storage it may be reasonable to impose tier configuration around the maximum number of segments a query is allowed to materialize.
@mayanks: How about something like Redis Cache?
@mayanks: So we keep caching orthogonal to state management?
@noahprince8: My concern is helix managing millions of segments, 99% of which are never touched.
  @mayanks: But this proposal is not addressing that problem, is it?
  @noahprince8: It is, you only tell helix about that segment (and to try to assign it to a server) when someone attempts to query it
  @mayanks: Side note: query latency would go out the window if helix segment assignment is happening during query execution.
  @noahprince8: Yeah, I think that’s acceptable for cold storage. You don’t have to wait for the _full_ ideal state. You can start to run pieces of the query when the segments finish downloading. Not sure how hard that would be to achieve.
  @noahprince8: Even in the lazy model, you still have that problem.
@mayanks: I think there are separate issues here: ```
1. Huge Ideal State
2. Broker memory pressure
3. Storage cost on server
```
@noahprince8: So
> 1. Huge Ideal State
Do not include cold storage segments in the ideal state until they are queried. The broker will need to request segments to be added to the ideal state if they aren’t currently included. A cleanup minion is required.
> 2. Broker memory pressure
Allow configuration on cold storage tiers that limits the number of segments that can be materialized by a single query.
> 3. Storage cost on server
Because unused segments are not part of the ideal state, OFFLINE servers will not load them onto disk.
@mayanks: This ^^ will also need unloading of segments from IS, broker & server
@noahprince8: IS?
@mayanks: ideal-state
@noahprince8: Yeah, I figure you have a minion doing that
@mayanks: I see
@noahprince8: Keep it simple to start and just have a time and size based eviction.
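(A rough sketch of what that eviction step could look like against the plain Helix admin API. How idleSegments is computed is left out, and a real implementation would want Pinot's compare-and-swap ideal-state update helpers rather than a blind overwrite:)
```
// Sketch of a periodic eviction task that drops idle cold-storage segments from the
// ideal state; dropping an entry triggers ONLINE -> OFFLINE -> DROPPED on the servers.
import java.util.Set;
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class ColdSegmentEvictionTask {
  public void evict(String zkAddress, String clusterName, String tableNameWithType,
                    Set<String> idleSegments) {
    HelixAdmin admin = new ZKHelixAdmin(zkAddress);
    IdealState idealState = admin.getResourceIdealState(clusterName, tableNameWithType);

    // Remove each idle segment's instance-state mapping from the ideal state.
    for (String segment : idleSegments) {
      idealState.getRecord().getMapFields().remove(segment);
    }

    // Write the shrunk ideal state back (no compare-and-swap here; sketch only).
    admin.setResourceIdealState(clusterName, tableNameWithType, idealState);
  }
}
```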
@mayanks: How would the selectivity of the query be?
@noahprince8: What do you mean? Like selectivity of cold storage queries?
@mayanks: Low selectivity -> large number of rows selected to process, and vice-versa
@noahprince8: Well, for our use case high selectivity. But I imagine not everyone’s is like that.
@noahprince8: Ideally, you want high selectivity with this. But this is why ```Allow configuration on cold storage tiers that limits the number of segments that can be materialized by a single query``` is important.
@mayanks: Ok, that is good. So we don't have to worry about server memory pressure during query execution.
@noahprince8: I think we can implement 1 and 3 without 2, though. Allowing configuration to limit the number of segments used in a single query is a separate, nice to have feature.
@noahprince8: As a proof of concept, we can tell people not to make dumb queries :slightly_smiling_face:
@noahprince8: Really, that feature is useful on all OFFLINE tables. Not just cold storage
@mayanks: I am wondering how much of a bottleneck 1 really is
@mayanks: We have a cluster with 10's of millions of segments (across tables)
@noahprince8: It is a bottleneck, according to Uber and LinkedIn, though I personally do not have experience with it
@mayanks: I'd start off with 3 first, and see if 1 is really a bottleneck
@g.kishore: LinkedIn’s definition of a bottleneck is very different
@g.kishore: we try to work at millisecond and sub-second latencies
@noahprince8: Though I think, in general, it’s a better design to manage the caching outside of the individual server
@noahprince8: Makes it easier to avoid hotspotting
@noahprince8: It’s just a nice bonus that this design also avoids message pressure on helix
@mayanks: But LazyLoader impl is independent of server right?
@noahprince8: No, lazy loader would be on the server itself. It knows the segments assigned to it, and only downloads them from deep store as needed
@noahprince8: Probably use something like Caffeine to handle the LRU aspect
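(Sketch of that server-side piece: Caffeine providing the size- and time-based eviction for lazily downloaded segments. downloadFromDeepStore and unloadSegment are placeholders for the real server hooks, and the limits are arbitrary:)
```
// Sketch: lazy, LRU-style cache of locally materialized segments using Caffeine.
import java.io.File;
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class LazySegmentCache {
  private final LoadingCache<String, File> _segments = Caffeine.newBuilder()
      .maximumSize(1_000)                         // cap on locally materialized segments
      .expireAfterAccess(Duration.ofHours(6))     // time-based eviction for idle segments
      .removalListener((String segmentName, File segmentDir, RemovalCause cause) ->
          unloadSegment(segmentName, segmentDir)) // drop the local copy on eviction
      .build(this::downloadFromDeepStore);        // fetch lazily on first access

  public File getSegment(String segmentName) {
    return _segments.get(segmentName);
  }

  private File downloadFromDeepStore(String segmentName) {
    // Placeholder: fetch the segment tarball from deep store and untar it locally.
    return new File("/tmp/segments", segmentName);
  }

  private void unloadSegment(String segmentName, File segmentDir) {
    // Placeholder: release the segment from the data manager and delete segmentDir.
  }
}
```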
@mayanks: Correct. And this way we have taken out Helix from the business of sending more messages to load/unload?
@g.kishore: lets start with that
@g.kishore: LazyLoader
@g.kishore: I have some ideas w.r.t Helix but lets tackle the lazyloading first
@g.kishore: Note that we can also enhance Helix
  @mayanks: @noahprince8 in case you were not aware, @g.kishore is the author of Helix :grinning:
  @noahprince8: Hahah yeah, I know, he’s got an impressive resume. It’s honestly very cool to be having these conversations. Especially given a year ago I was doing web dev :smile:
  @mayanks: Very impressive transition from web dev :clap:
  @noahprince8: Data infra is where it’s at. Got tired of transpiling things 70 times :slightly_smiling_face: . The two fields are kind of similar on the backend, though. When you get into multiple dbs, microservices, job queues, load balancers, event driven systems. Wasn’t a huge setback switching fields, luckily
  @noahprince8: But yeah, architecture discussions with the titans of the data infra industry, on the bleeding edge. Pretty damn cool.
  @mayanks: :heavy_plus_sign:
@noahprince8: Well, my concern is that if we pull this lazy loading away from the server, these code changes aren’t really necessary
@noahprince8: No server change necessary if you use cluster state to manage lazy loading
@noahprince8: And really, which segments are lazy loaded _is_ an important part of the cluster state. It’s useful from a monitoring perspective.
  @mayanks: Good point
@g.kishore: I see
@g.kishore: let me think about that idea
@noahprince8: There are also practical aspects. Like the UI will force materialization of segments because it wants to show all of the segments assigned to the server
@noahprince8: So I’d need to “hack” that to only show the ones that have actually materialized.
@npawar: the unloading could also be done by the same periodic task which is responsible for moving from tier to tier (assuming you are planning to use the tiered storage design). Based on some flag in the tier config, for storageType=deepStore
  @noahprince8: Cache unloading?
  @npawar: unloading from ideal state
  @npawar: instead of minions
  @noahprince8: Ah yeah. Is that a scheduled task or long running?
  @mayanks: Scheduled
  @noahprince8: Would love for some way to listen to an event bus of hit segments. Because something like would be super nice for something like this.
  @noahprince8: And really, making cache eviction configurable. Because for specific use cases, you may want an intelligent caching strategy.
@npawar: will help in reducing 1 component
@g.kishore: lets do a zoom call?
@noahprince8: I’m down. When?
@g.kishore: 3 pm pst?
@noahprince8: Earlier might be better, if you’re available
@g.kishore: 1:30 pm pst
@noahprince8: Works for me
@mayanks: Works for me too
@mayanks: Could you guys send me your email id's to send the invite to?
  @noahprince8:
  @mayanks: @npawar?
  @npawar:
  @mayanks: Sent invite for 1:30-2:00pm today
  @jackie.jxt: @mayanks Can you please also add me to the invite:
  @mayanks: Done
  @jackie.jxt: Thanks