Was my question so dumb? Or is this not a good use case for Spark?
On Sun, Nov 17, 2013 at 11:41 PM, Something Something <mailinglist...@gmail.com> wrote:

> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig
> for quite some time.
>
> We have quite a few ETL processes running in production that use Pig, but
> now we're evaluating Spark to see if they would indeed run faster.
>
> A very common use case in our Pig scripts is joining a file containing
> Facts to a file containing Dimension data. The joins are, of course,
> inner, left & outer.
>
> I thought I would start simple. Let's say I have 2 files:
>
> 1) Students: student_id, course_id, score
> 2) Course: course_id, course_title
>
> We want to produce a file that contains: student_id, course_title, score
>
> (Note: This is a hypothetical case. The real files have millions of
> facts & thousands of dimensions.)
>
> Would something like this work? Note: I did say I am a newbie ;)
>
> val students = sc.textFile("./students.txt")
> val courses = sc.textFile("./courses.txt")
> val s = students.map(x => x.split(','))
> val left = students.map(x => x.split(',')).map(y => (y(1), y))
> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
> val joined = left.join(right)
>
> Any pointers in this regard would be greatly appreciated. Thanks.
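
A join along these lines does work in Spark; what the snippet above leaves out is projecting the joined pairs back into (student_id, course_title, score) rows and writing them out. Below is a minimal sketch, assuming the spark-shell (where sc is already defined), comma-separated files with the columns described above, and a hypothetical output directory "./student_scores":

import org.apache.spark.SparkContext._  // pair-RDD operations such as join (already in scope in the spark-shell)

// students.txt: student_id,course_id,score  -> key each row by course_id
val students = sc.textFile("./students.txt")
  .map(_.split(','))
  .map(s => (s(1), (s(0), s(2))))   // (course_id, (student_id, score))

// courses.txt: course_id,course_title       -> key each row by course_id
val courses = sc.textFile("./courses.txt")
  .map(_.split(','))
  .map(c => (c(0), c(1)))           // (course_id, course_title)

// Inner join on course_id, then drop the key and format the output rows
val joined = students.join(courses)
  .map { case (_, ((studentId, score), courseTitle)) =>
    List(studentId, courseTitle, score).mkString(",")
  }

joined.saveAsTextFile("./student_scores")

For the left and right outer cases mentioned above, leftOuterJoin and rightOuterJoin on the same keyed RDDs play the role of Pig's LEFT/RIGHT OUTER JOIN.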