Hi all, I'm experiencing the same issue with Spark 1.2.0 (not verified with previous versions).
Could you please help us with this? Thanks,
Alessandro

On Tue, Nov 18, 2014 at 1:40 AM, mtimper <mich...@timper.com> wrote:

> Hi, I'm running a standalone cluster with 8 worker servers.
> I'm developing a streaming app that is adding new lines of text to several
> different RDDs each batch interval. Each line has a well-randomized unique
> identifier that I'm trying to use for partitioning, since the data stream
> does contain duplicate lines. I'm doing the partitioning with this:
>
> val eventsByKey = streamRDD.map { event => (getUID(event), event) }
> val partionedEventsRdd = sparkContext.parallelize(eventsByKey.toSeq)
>   .partitionBy(new HashPartitioner(numPartions)).map(e => e._2)
>
> I'm adding to the existing RDD with this:
>
> val mergedRDD = currentRDD.zipPartitions(partionedEventsRdd, true) {
>   (currentIter, batchIter) =>
>     val uniqEvents = ListBuffer[String]()
>     val uids = Map[String, Boolean]()
>     Array(currentIter, batchIter).foreach { iter =>
>       iter.foreach { event =>
>         val uid = getUID(event)
>         if (!uids.contains(uid)) {
>           uids(uid) = true
>           uniqEvents += event
>         }
>       }
>     }
>     uniqEvents.iterator
> }
>
> val count = mergedRDD.count
>
> The reason I'm doing it this way is that when I was doing:
>
> val mergedRDD = currentRDD.union(batchRDD).coalesce(numPartions).distinct
> val count = mergedRDD.count
>
> it would start taking a long time and doing a lot of shuffles.
>
> The zipPartitions approach does perform better, though after running an hour
> or so I start seeing this in the web UI:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n19112/Executors.png
>
> As you can see, most of the data is skewing to just 2 executors, with 1
> getting more than half the blocks. These become a hotspot and eventually I
> start seeing OOM errors. I've tried this half a dozen times and the 'hot'
> executors change, but not the skewing behavior.
>
> Any idea what is going on here?
>
> Thanks,
>
> Mike
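For anyone trying to reproduce this, here is a minimal, self-contained sketch of the merge-and-dedup approach described above. It is only a sketch under assumptions, not the actual job: getUID, the input paths, and numPartitions are placeholders, and it assumes both RDDs are keyed by UID and partitioned with the same HashPartitioner, so that zipPartitions pairs up partitions holding the same hashed key ranges.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import scala.collection.mutable.{ListBuffer, Map => MutableMap}

object DedupMergeSketch {
  // Placeholder: in the real app the UID would be parsed out of the event line.
  def getUID(event: String): String = event.takeWhile(_ != ',')

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dedup-merge-sketch"))
    val numPartitions = 8
    val partitioner = new HashPartitioner(numPartitions)

    // Key both RDDs by UID and partition them with the *same* partitioner, so
    // partition i of each RDD holds the same range of hashed keys.
    val currentRDD = sc.textFile("hdfs:///path/to/current")   // placeholder path
      .map(e => (getUID(e), e)).partitionBy(partitioner).map(_._2)
    val batchRDD = sc.textFile("hdfs:///path/to/batch")       // placeholder path
      .map(e => (getUID(e), e)).partitionBy(partitioner).map(_._2)

    // Merge the batch into the current data, keeping the first event seen per UID.
    val mergedRDD = currentRDD.zipPartitions(batchRDD, preservesPartitioning = true) {
      (currentIter, batchIter) =>
        val seen = MutableMap[String, Boolean]()
        val uniqEvents = ListBuffer[String]()
        (currentIter ++ batchIter).foreach { event =>
          val uid = getUID(event)
          if (!seen.contains(uid)) {
            seen(uid) = true
            uniqEvents += event
          }
        }
        uniqEvents.iterator
    }

    println(s"merged count: ${mergedRDD.count()}")
    sc.stop()
  }
}

Note that zipPartitions only pairs partitions by index, so the deduplication is global only if both sides were hash-partitioned by UID into the same number of partitions.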