Hello Experts,

I have a requirement to maintain a list of ids for every customer, across all time, and to provide a distinct count of those ids on demand. All the examples I have seen so far maintain counts directly as the state. My concern is that with counts alone I cannot recover the cumulative distinct values. Also, would the framework tolerate state that large?
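For concreteness, what I would ideally keep per customer is the set of ids itself, not a count. A minimal sketch of such an update function, assuming the state is a plain Set[String] small enough to serialize and checkpoint (updateUniqueIds is my own naming):

```scala
// Sketch: keep the full set of ids seen so far as the per-customer state.
// The state is an ordinary Set[String], which updateStateByKey can serialize,
// unlike an RDD.
def updateUniqueIds(newIds: Seq[String], state: Option[Set[String]]): Option[Set[String]] =
  Some(state.getOrElse(Set.empty[String]) ++ newIds)

// On demand, the distinct count is just the size of the set, e.g.:
// idsByCustomer.updateStateByKey(updateUniqueIds).mapValues(_.size)
```

This sidesteps the union problem below, but I am unsure whether a set this large per key is acceptable as checkpointed state.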
Here is what I am attempting:

    val updateUniqueIds: (Seq[String], Option[RDD[String]]) => Option[RDD[String]] =
      (values: Seq[String], state: Option[RDD[String]]) => {
        val currentLst = values.distinct
        // getOrElse needs a default, and there is no SparkContext here
        // to build an empty RDD as that default
        val previousLst: RDD[String] = state.getOrElse().asInstanceOf[RDD[String]]
        Some(currentLst.union(previousLst).distinct) // does not compile: Seq vs RDD
      }

Another challenge is concatenating an RDD[String] with a Seq[String] without access to the SparkContext, since this function has to adhere to updateFunc: (Seq[V], Option[S]) => Option[S].

I am also trying to figure out whether I can use the (iterator: Iterator[(K, Seq[V], Option[S])]) variant, but I haven't worked it out yet.

Appreciate any suggestions in this regard.

regards
Sunita

P.S.: I am aware of mapWithState, but I am not on the latest version as of now.
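P.P.S.: To clarify what I mean by the iterator variant: if the state were a plain Set[String] instead of an RDD, I imagine the per-partition update would look something like this (untested sketch; updateAll and the key type String are my own assumptions):

```scala
// Sketch of the iterator-based form: each partition receives an iterator of
// (key, new values in this batch, previous state) and emits the new state
// per key.
def updateAll(
    batch: Iterator[(String, Seq[String], Option[Set[String]])]
): Iterator[(String, Set[String])] =
  batch.map { case (customer, newIds, state) =>
    customer -> (state.getOrElse(Set.empty[String]) ++ newIds)
  }
```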