Hi, I want to join a DStream with some other dataset, e.g., join a click stream with a spam IP list. I can think of two possible solutions: one is to use a broadcast variable, and the other is to use the transform operation as described in the manual.
But the problem is that the spam IP list will be updated outside of the Spark Streaming program, so how can the program be notified to reload the list? Broadcast variables are immutable. For the transform operation, is it costly to reload the RDD on every batch? If it is, and I use RDD.persist(), does that mean I need to launch a thread to regularly unpersist it so that it picks up the updates? Any ideas will be appreciated. Thanks.

-- Jerry
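For the transform approach, one possible shape is a small driver-side cache that re-reads the list only when it has gone stale, unpersisting the old RDD at reload time (so no separate unpersist thread is needed). This is only a sketch under assumptions not in the original question: the spam list is a text file with one IP per line, and the file path, refresh interval, and helper names are all hypothetical.

```scala
// Sketch of the transform-based approach. The HDFS path, refresh interval,
// and the SpamList helper are hypothetical, not part of any Spark API.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SpamList {
  @volatile private var cached: RDD[(String, Boolean)] = null
  @volatile private var lastLoadMs = 0L
  private val refreshMs = 5 * 60 * 1000L // re-read at most every 5 minutes

  // transform's closure runs on the driver once per batch, so checking a
  // timestamp here is cheap; we only touch storage when the cache is stale.
  def get(sc: SparkContext): RDD[(String, Boolean)] = synchronized {
    val now = System.currentTimeMillis()
    if (cached == null || now - lastLoadMs > refreshMs) {
      if (cached != null) cached.unpersist() // drop the outdated copy
      cached = sc.textFile("hdfs:///path/to/spam_ips.txt") // hypothetical path
        .map(ip => (ip.trim, true))
        .persist()
      lastLoadMs = now
    }
    cached
  }
}

// Usage: join each batch against the (possibly refreshed) list and drop spam.
// `clicks` is assumed to be a DStream[(String, String)] keyed by client IP.
//
//   val clean = clicks.transform { rdd =>
//     rdd.leftOuterJoin(SpamList.get(rdd.sparkContext))
//        .collect { case (ip, (click, None)) => (ip, click) } // keep non-spam
//   }
```

Because the reload happens inside the batch-scheduling path on the driver, updates to the file are picked up within one refresh interval without restarting the streaming job; the persist()/unpersist() pair keeps only the current copy cached.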