All,

I've run into the a usage pattern that seems like it would pop up elsewhere, so 
I'd like to kick around ways to solve it and perhaps land on a common and 
reusable approach. Here it is:

Consider a use of Spark Streaming's updateStateByKey, but the state being 
maintained may be too large to fit into memory (say tens or hundreds of 
terabytes). Some of the state may be "cold" in that it's not likely to be 
updated...but there isn't a way to know beforehand which state key/values are 
cold and which are hot.

It doesn't seem like updateStateByKey in its current form is intended for this 
use case, since the state is in an RDD that is loaded into memory every time. 
So I'm considering keeping the large, possibly cold state in an external store 
(like HBase) and loading values into an "state" RDD as needed. We can then 
apply our "reduce" like operation against the DStream and loaded state to get 
the desired results. (Think Google's MillWheel.)

The catch has been dealing with the possible re-computation of a DStream...we 
can't just update our external state store since it needs to appear immutable 
to re-compute a portion of our stream due to failure.

At this point I think we can work around this by keeping the large initial 
state of this operation in an external store (loading it as needed), and do 
something like what updateStateByKey does today, keeping all updates to the 
state in their own, immutable RDD's. But if the state produced by all of the 
processed updates gets large, that underlying state RDD becomes unwieldy.

I suppose the ideal world is some sort of State RDD that evicts cold data to 
disk or other storage while keeping hot keys available for efficient 
processing. We can simulate something like this in user code over Spark, 
although it's not trivial.

Does this path seem reasonable? If others are have similar needs, I'd be 
interested in working out a common solution.

Best,
Ryan

CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.

Reply via email to