2015-04-11
I have two RDDs:

leftRDD  = RDD[(Long, (DetailInputRecord, VISummary, Long))]
rightRDD = RDD[(Long, com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum)]

DetailInputRecord is an object that contains (guid, sessionKey,
sessionStartDate, siteId).

There are 10 records in leftRDD (confirmed with leftRDD.count, and each
DetailInputRecord in leftRDD has data in its members).

I do leftRDD.leftOuterJoin(rightRDD)

viEventsWithListings  = leftRDD
spsLvlMetric   = rightRDD

val viEventsWithListingsJoinSpsLevelMetric =
  viEventsWithListings.leftOuterJoin(spsLvlMetric).map {
    case (sellerId, ((viEventDetail, viSummary, itemId), spsLvlMetric)) =>
      // Print each member of the DetailInputRecord as seen inside the join
      println("sellerId:" + sellerId)
      println("sessionKey:" + viEventDetail.get("sessionKey"))
      println("guid:" + viEventDetail.get("guid"))
      println("sessionStartDate:" + viEventDetail.get("sessionStartDate"))
      println("siteId:" + viEventDetail.get("siteId"))

      if (spsLvlMetric.isDefined) {
        // do something
      }
  }

I print each of the members of the DetailInputRecord (viEventDetail) of
viEventsWithListings both before and within the leftOuterJoin. Before the
leftOuterJoin I get the expected value for each member of every record
(10 records in total).

Within the join, the same prints show the guid as the value of every member.
How is this possible?
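
For reference, leftOuterJoin on pair RDDs yields elements of shape
(K, (V, Option[W])), which is what the nested pattern in the map
destructures. Below is a minimal sketch of that shape on plain Scala
collections, so the destructuring can be checked without a cluster.
JoinShape, the sample tuples and the string stand-ins for the record types
are all hypothetical, and this models only the result shape, not Spark's
shuffle or serialization:

```scala
// Local model of the (K, (V, Option[W])) shape produced by a left outer join.
// Plain collections stand in for the RDDs; every name here is illustrative.
object JoinShape {
  def leftOuterJoin[K, V, W](left: Seq[(K, V)],
                             right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] = {
    // Group the right side by key so each left element can look up its matches.
    val rightByKey: Map[K, Seq[W]] =
      right.groupBy(_._1).map { case (k, kws) => k -> kws.map(_._2) }

    left.flatMap { case (k, v) =>
      rightByKey.get(k) match {
        // One output element per matching right value.
        case Some(ws) => ws.map(w => (k, (v, Some(w): Option[W])))
        // Left elements without a match are kept, paired with None.
        case None     => Seq((k, (v, Option.empty[W])))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Strings stand in for DetailInputRecord, VISummary and SpsLevelMetricSum.
    val left  = Seq((1L, ("guid-a", "summary-a", 100L)),
                    (2L, ("guid-b", "summary-b", 200L)))
    val right = Seq((1L, "metric-1"))

    leftOuterJoin(left, right).foreach {
      case (sellerId, ((detail, summary, itemId), metricOpt)) =>
        println(s"sellerId=$sellerId detail=$detail itemId=$itemId metric=$metricOpt")
    }
  }
}
```

If the destructuring behaves as expected on local data like this, the
pattern match itself is unlikely to be what replaces the member values.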

Within the join (output of the print statements; every value is a guid):

What went wrong? I have debugged this multiple times but still fail to
understand it.
Appreciate your help.

2015-04-11
I took that RDD, ran through it and printed 4 elements from each record;
they all printed correctly.

val x = viEvents.map {
  case (itemId, event) =>
    // Print the same 4 elements as the collect version further down
    println(event.get("guid"), itemId, event.get("itemId"), event.get("siteId"))
    (itemId, event)
}

The above code prints


viEvents.collect.foreach(a => println(a._2.get("guid"), a._1,
a._2.get("itemId"), a._2.get("siteId")))

*Now I collected it, and this might have led to serialization of the RDD.*
When I print the same 4 elements now, *I only get guid values for all of
them. Has this got something to do with serialization?*
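
One way to probe the serialization suspicion, sketched under assumptions:
copy the members into a plain case class of strings before the data is
collected or joined, then compare what comes out. FieldSource is a
hypothetical stand-in for the record's get(name) accessor and
DetailSnapshot is an illustrative name; neither is part of the original
code.

```scala
// Hypothetical stand-in for the record's get(name) accessor.
trait FieldSource {
  def get(name: String): AnyRef
}

// Plain immutable snapshot of the four members named in the post.
final case class DetailSnapshot(guid: String,
                                sessionKey: String,
                                sessionStartDate: String,
                                siteId: String)

object DetailSnapshot {
  // Render a member as a String; a missing (null) member becomes "null".
  private def str(r: FieldSource, name: String): String =
    String.valueOf(r.get(name))

  // Copy the members out of the record into a self-contained value.
  def from(r: FieldSource): DetailSnapshot =
    DetailSnapshot(
      str(r, "guid"),
      str(r, "sessionKey"),
      str(r, "sessionStartDate"),
      str(r, "siteId"))
}
```

If snapshots taken inside the first map keep their values after collect
while the original records do not, that would point at how the record class
is serialized rather than at the join.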


The RDD is of type org.apache.spark.rdd.RDD[(Long,

At the time of context creation I did this:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "1")






The class hierarchy is:

DetailInputRecord extends InputRecord extends SessionRecord extends
ExperimentationRecord extends
   org.apache.avro.generic.GenericData.Record(schema: Schema)

Please suggest.
