Was my question so dumb?  Or is this not a good use case for Spark?

On Sun, Nov 17, 2013 at 11:41 PM, Something Something <
mailinglist...@gmail.com> wrote:

> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig
> for quite some time.
>
> We have quite a few ETL processes running in production that use Pig, and
> we're now evaluating Spark to see if they would indeed run faster.
>
> A very common use case in our Pig scripts is joining a file containing
> facts to a file containing dimension data.  The joins are, of course,
> inner, left & outer.
>
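> Reading the API docs, here's my rough guess at how those three flavors
> map onto Spark's pair-RDD operations (facts and dims below are just
> placeholder RDDs of key/value pairs, not real variables from my code):
>
> val inner = facts.join(dims)           // inner join: (K, (V, W))
> val leftJ = facts.leftOuterJoin(dims)  // left outer: (K, (V, Option[W]))
> // a full outer join can be assembled from cogroup, which keeps keys
> // present on either side
> val full  = facts.cogroup(dims)        // (K, (Seq[V], Seq[W]))
>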
> I thought I would start simple.  Let's say I have 2 files:
>
> 1)  Students:  student_id, course_id, score
> 2)  Course:  course_id, course_title
>
> We want to produce a file that contains:  student_id, course_title, score
>
> (Note:  This is a hypothetical case.  The real files have millions of
> facts & thousands of dimensions.)
>
> Would something like this work?  Note:  I did say I am a newbie ;)
>
> val students = sc.textFile("./students.txt")
> val courses = sc.textFile("./courses.txt")
> // key each record by course_id: field 1 in students, field 0 in courses
> val left = students.map(_.split(',')).map(y => (y(1), y))
> val right = courses.map(_.split(',')).map(y => (y(0), y))
> val joined = left.join(right)
> // project to (student_id, course_title, score)
> val result = joined.map { case (_, (s, c)) => (s(0), c(1), s(2)) }
>
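> Since the end goal is an output file, I'm assuming the last step would be
> something like this (the output path here is just made up):
>
> result.map { case (sid, title, score) => sid + "," + title + "," + score }
>   .saveAsTextFile("./student_scores")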
>
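> Also, given that the dimension file is tiny compared to the facts, would
> it be better to broadcast it and skip the shuffle entirely?  A rough
> sketch of what I mean (assuming the courses file fits in memory):
>
> // build a course_id -> course_title map and broadcast it to all workers
> val courseMap = sc.broadcast(
>   courses.map(_.split(',')).map(c => (c(0), c(1))).collect().toMap)
> // map-side join: look up each student's course title locally
> val joined2 = students.map(_.split(',')).flatMap { s =>
>   courseMap.value.get(s(1)).map(title => (s(0), title, s(2)))
> }
>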
> Any pointers in this regard would be greatly appreciated.  Thanks.
>
