Hello, I need a design recommendation. I need to run a couple of calculations with minimal shuffling and better performance. I have a nested structure where, say, a class has n students, and the structure will be similar to this:
{ classId: String, StudentId: String, Score: Int, AreaCode: String }
Now I have to validate two things: first, that all students in a class belong to the same area code, and second, whether any student is taking more than one class. I can group by classId and count(AreaCode) to get (classId, count), and similarly group by StudentId and count(classId) to get an aggregated structure, then join these two on, say, StudentId. But this takes multiple shuffles, and the data is huge, so I can't really use a broadcast join. Can you please suggest a better approach?
Regards, sk
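For illustration, the two checks described above can be sketched in plain Python on a tiny in-memory sample. This is only meant to pin down the intended results, not the distributed job or its shuffle behaviour. The sample rows and both function names are made up, the field names loosely follow the structure above, and the sketch collects distinct values per key rather than a plain count, which is an assumption about what the validation needs.

```python
from collections import defaultdict

# Hypothetical sample rows; field names loosely follow the structure above.
rows = [
    {"classId": "c1", "StudentId": "s1", "Score": 80, "AreaCode": "A"},
    {"classId": "c1", "StudentId": "s2", "Score": 70, "AreaCode": "A"},
    {"classId": "c2", "StudentId": "s1", "Score": 90, "AreaCode": "A"},
    {"classId": "c2", "StudentId": "s3", "Score": 60, "AreaCode": "B"},
]


def classes_with_mixed_area_codes(rows):
    """Return class ids whose students span more than one area code."""
    areas = defaultdict(set)
    for r in rows:
        areas[r["classId"]].add(r["AreaCode"])
    return sorted(c for c, a in areas.items() if len(a) > 1)


def students_in_multiple_classes(rows):
    """Return student ids that appear in more than one class."""
    classes = defaultdict(set)
    for r in rows:
        classes[r["StudentId"]].add(r["classId"])
    return sorted(s for s, c in classes.items() if len(c) > 1)
```

On the sample, class c2 mixes two area codes and student s1 attends two classes, so those are the rows each check should flag.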