Re: Joining files

2013-11-20 Thread Something Something
Questions:

1)  I don't see APIs for LEFT or FULL OUTER joins.  Is that true?
2)  Apache Pig provides specialized join types such as 'replicated' and
'skewed'.  Is 'replicated' even a concern in Spark, given that everything
(possibly) happens in memory?  (Sketched below.)
3)  Does 'join' (which seems to work like an INNER join) guarantee order?
For example, can I assume that columns from the left side will appear
before columns from the right side, and that their order will be preserved?
(Also sketched below.)
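
To make 2) concrete: Pig's 'replicated' join ships the small relation to
every task and joins map-side.  The closest equivalent I can imagine in
Spark is broadcasting the small side and joining inside a plain map -- an
untested sketch, with made-up data:

val courseTitles = sc.broadcast(Map("c1" -> "Algebra", "c2" -> "Biology"))
val facts = sc.parallelize(Seq(("s1", "c1", "95"), ("s2", "c2", "80")))

// Map-side "replicated" join: look each fact's course up in the broadcast
// map.  Facts whose course_id is missing are dropped (inner-join behavior).
val joined = facts.flatMap { case (student, course, score) =>
  courseTitles.value.get(course).map(title => (student, title, score))
}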
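
And to make 3) concrete, here is the kind of guarantee I'm asking about,
sketched in the shell with made-up data (row order in the collected output
presumably isn't guaranteed either):

val left = sc.parallelize(Seq(("c1", "alice"), ("c2", "bob")))
val right = sc.parallelize(Seq(("c1", "Algebra")))

left.join(right).collect()
// Hoping for Array((c1,(alice,Algebra))) -- i.e. the value from the left
// RDD always first inside the result tuple, never (c1,(Algebra,alice)).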

On a side note, it appears that, as of now, Spark cannot be used as a
replacement for Pig without some major coding.  Agree?




On Mon, Nov 18, 2013 at 10:47 PM, Horia  wrote:

> It seems to me that what you want is the following procedure:
> - parse each file line by line
> - generate key, value pairs
> - join by key
>
> I think the following should accomplish what you're looking for
>
> // mapping over these RDDs already maps over lines
> val students = sc.textFile("./students.txt")
> val courses = sc.textFile("./courses.txt")
> val left = students.map( x => {
>   val columns = x.split(",")
>   (columns(1), (columns(0), columns(2)))  // (course_id, (student_id, score))
> } )
> val right = courses.map( x => {
>   val columns = x.split(",")
>   (columns(0), columns(1))  // (course_id, course_title)
> } )
> val joined = left.join(right)
>
>
> The major difference is selectively returning only the fields you
> actually need, rather than all of them. A secondary difference is
> syntactic: you don't need a .map().map(), since you can use a slightly
> more complex function block as illustrated. I think Spark is smart enough
> to optimize the .map().map() into basically what I've written explicitly...
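>
> If you then want the (student_id, course_title, score) rows back out as
> text, something like this should finish the job (untested; the output
> path is just an example):
>
> joined.map { case (_, ((studentId, score), title)) =>
>   studentId + "," + title + "," + score
> }.saveAsTextFile("./student_scores")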
>
> Horia
>
>
>
> On Mon, Nov 18, 2013 at 10:34 PM, Something Something <
> mailinglist...@gmail.com> wrote:
>
>> Was my question so dumb?  Or, is this not a good use case for Spark?
>>
>>
>> On Sun, Nov 17, 2013 at 11:41 PM, Something Something <
>> mailinglist...@gmail.com> wrote:
>>
>>> I am a newbie to both Spark & Scala, but I've been working with
>>> Hadoop/Pig for quite some time.
>>>
>>> We have quite a few ETL processes running in production that use Pig,
>>> but now we're evaluating Spark to see if they would indeed run faster.
>>>
>>> A very common use case in our Pig scripts is joining a file containing
>>> Facts to a file containing Dimension data.  The joins are, of course,
>>> inner, left & outer.
>>>
>>> I thought I would start simple.  Let's say I have 2 files:
>>>
>>> 1)  Students:  student_id, course_id, score
>>> 2)  Course:  course_id, course_title
>>>
>>> We want to produce a file that contains:  student_id, course_title, score
>>>
>>> (Note:  This is a hypothetical case.  The real files have millions of
>>> facts & thousands of dimensions)
>>>
>>> Would something like this work?  Note:  I did say I am a newbie ;)
>>>
>>> val students = sc.textFile("./students.txt")
>>> val courses = sc.textFile("./courses.txt")
>>> val s = students.map(x => x.split(','))
>>> val left = students.map(x => x.split(',')).map(y => (y(1), y))
>>> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
>>> val joined = left.join(right)
>>>
>>>
>>> Any pointers in this regard would be greatly appreciated.  Thanks.
>>>
>>
>>
>


Re: Joining files

2013-11-18 Thread Something Something
Was my question so dumb?  Or, is this not a good use case for Spark?


On Sun, Nov 17, 2013 at 11:41 PM, Something Something <
mailinglist...@gmail.com> wrote:

> I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig
> for quite some time.
>
> We have quite a few ETL processes running in production that use Pig, but
> now we're evaluating Spark to see if they would indeed run faster.
>
> A very common use case in our Pig scripts is joining a file containing
> Facts to a file containing Dimension data.  The joins are, of course,
> inner, left & outer.
>
> I thought I would start simple.  Let's say I have 2 files:
>
> 1)  Students:  student_id, course_id, score
> 2)  Course:  course_id, course_title
>
> We want to produce a file that contains:  student_id, course_title, score
>
> (Note:  This is a hypothetical case.  The real files have millions of
> facts & thousands of dimensions)
>
> Would something like this work?  Note:  I did say I am a newbie ;)
>
> val students = sc.textFile("./students.txt")
> val courses = sc.textFile("./courses.txt")
> val s = students.map(x => x.split(','))
> val left = students.map(x => x.split(',')).map(y => (y(1), y))
> val right = courses.map(x => x.split(',')).map(y => (y(0), y))
> val joined = left.join(right)
>
>
> Any pointers in this regard would be greatly appreciated.  Thanks.
>


Joining files

2013-11-17 Thread Something Something
I am a newbie to both Spark & Scala, but I've been working with Hadoop/Pig
for quite some time.

We have quite a few ETL processes running in production that use Pig, but
now we're evaluating Spark to see if they would indeed run faster.

A very common use case in our Pig scripts is joining a file containing Facts
to a file containing Dimension data.  The joins are, of course, inner, left
& outer.

I thought I would start simple.  Let's say I have 2 files:

1)  Students:  student_id, course_id, score
2)  Course:  course_id, course_title

We want to produce a file that contains:  student_id, course_title, score

(Note:  This is a hypothetical case.  The real files have millions of facts
& thousands of dimensions)

Would something like this work?  Note:  I did say I am a newbie ;)

val students = sc.textFile("./students.txt")
val courses = sc.textFile("./courses.txt")
val s = students.map(x => x.split(','))
val left = students.map(x => x.split(',')).map(y => (y(1), y))
val right = courses.map(x => x.split(',')).map(y => (y(0), y))
val joined = left.join(right)
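
If that join works, I imagine each result pair would look like
(course_id, (Array(student_id, course_id, score), Array(course_id,
course_title))), so I'd guess the last step is something like this
(untested; the output path is made up):

val result = joined.map { case (_, (s, c)) => s(0) + "," + c(1) + "," + s(2) }
result.saveAsTextFile("./student_scores")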


Any pointers in this regard would be greatly appreciated.  Thanks.