All, I've run into the a usage pattern that seems like it would pop up elsewhere, so I'd like to kick around ways to solve it and perhaps land on a common and reusable approach. Here it is:
Consider a use of Spark Streaming's updateStateByKey, but the state being maintained may be too large to fit into memory (say tens or hundreds of terabytes). Some of the state may be "cold" in that it's not likely to be updated...but there isn't a way to know beforehand which state key/values are cold and which are hot. It doesn't seem like updateStateByKey in its current form is intended for this use case, since the state is in an RDD that is loaded into memory every time. So I'm considering keeping the large, possibly cold state in an external store (like HBase) and loading values into an "state" RDD as needed. We can then apply our "reduce" like operation against the DStream and loaded state to get the desired results. (Think Google's MillWheel.) The catch has been dealing with the possible re-computation of a DStream...we can't just update our external state store since it needs to appear immutable to re-compute a portion of our stream due to failure. At this point I think we can work around this by keeping the large initial state of this operation in an external store (loading it as needed), and do something like what updateStateByKey does today, keeping all updates to the state in their own, immutable RDD's. But if the state produced by all of the processed updates gets large, that underlying state RDD becomes unwieldy. I suppose the ideal world is some sort of State RDD that evicts cold data to disk or other storage while keeping hot keys available for efficient processing. We can simulate something like this in user code over Spark, although it's not trivial. Does this path seem reasonable? If others are have similar needs, I'd be interested in working out a common solution. Best, Ryan CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.