Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread randylu
I got it, thanks very much :)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4655.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
To experiment, try this in the Spark shell:

val r0 = sc.makeRDD(1 to 3, 1)
val r1 = r0.map { x =>
  println(x)
  x
}
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect()

You’ll see elements in r1 are printed (thus evaluated) twice. By adding
.cache() to r1, you’ll see those elements are printed only once.
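If a Spark shell isn't handy, the same counting can be sketched in plain
Scala with lazy collection views; the view's deferred map stands in for an
uncached RDD, and forcing it into a Vector stands in for .cache() (the
names and counter here are illustrative only):

```scala
// A lazy view re-runs its map function on every traversal, much like an
// uncached RDD; forcing it once into a Vector plays the role of .cache().
var evaluations = 0
val v1 = (1 to 3).view.map { x => evaluations += 1; x }
val v2 = v1.map(_ * 2)
val v3 = v1.map(_ * 2 + 1)
(v2 ++ v3).toVector               // traverses v1 twice
assert(evaluations == 6)          // each element evaluated twice

evaluations = 0
val cached = v1.toVector          // "cache": evaluate once, keep the results
cached.map(_ * 2) ++ cached.map(_ * 2 + 1)
assert(evaluations == 3)          // each element evaluated only once
```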


On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian  wrote:

> Good question :)
>
> Although the RDD DAG is lazily evaluated, it’s not exactly the same as a
> Scala lazy val. For a Scala lazy val, the evaluated value is automatically
> cached, while evaluated RDD elements are not cached unless you call
> .cache() explicitly, because materializing an RDD can often be expensive.
> Take local file reading as an analogy:
>
> val v0 = sc.textFile("input.log").cache()
>
> is similar to a lazy val
>
> lazy val u0 = Source.fromFile("input.log").mkString
>
> while
>
> val v1 = sc.textFile("input.log")
>
> is similar to a function
>
> def u0 = Source.fromFile("input.log").mkString
>
> Think of it this way: if you want to “reuse” the evaluated elements, you
> have to cache those elements somewhere. Without caching, you have to
> re-evaluate the RDD, so the semantics of an uncached RDD downgrade to those
> of a function rather than a lazy val.
>
>
> On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi wrote:
>
>> Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
>> question :)
>>
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi 
>>
>>
>>
>> On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian wrote:
>>
>>> Without caching, an RDD will be evaluated multiple times if referenced
>>> multiple times by other RDDs. A silly example:
>>>
>>> val text = sc.textFile("input.log")
>>> val r1 = text.filter(_ startsWith "ERROR")
>>> val r2 = text.map(_ split " ")
>>> val r3 = r1.count() + r2.count()
>>>
>>> Here the input file will be scanned twice unless you call .cache() on
>>> text. So if your computation involves nondeterminism (e.g. random
>>> number), you may get different results.
>>>
>>>
>>> On Tue, Apr 22, 2014 at 11:30 AM, randylu  wrote:
>>>
 It works when I call doc_topic_dist.cache() first.




>>>
>>>
>>
>


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Good question :)

Although the RDD DAG is lazily evaluated, it’s not exactly the same as a
Scala lazy val. For a Scala lazy val, the evaluated value is automatically
cached, while evaluated RDD elements are not cached unless you call
.cache() explicitly, because materializing an RDD can often be expensive.
Take local file reading as an analogy:

val v0 = sc.textFile("input.log").cache()

is similar to a lazy val

lazy val u0 = Source.fromFile("input.log").mkString

while

val v1 = sc.textFile("input.log")

is similar to a function

def u0 = Source.fromFile("input.log").mkString

Think of it this way: if you want to “reuse” the evaluated elements, you
have to cache those elements somewhere. Without caching, you have to
re-evaluate the RDD, so the semantics of an uncached RDD downgrade to those
of a function rather than a lazy val.
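The lazy val versus def distinction itself can be checked in plain Scala,
with counters standing in for the expensive file read (the counters and
values are illustrative, not part of any Spark API):

```scala
// lazy val: the body runs once, on first access, and the result is cached.
var lazyReads = 0
lazy val u0 = { lazyReads += 1; "file contents" }
u0; u0; u0
assert(lazyReads == 1)

// def: the body re-runs on every call, like an uncached RDD on every action.
var defReads = 0
def u1 = { defReads += 1; "file contents" }
u1; u1; u1
assert(defReads == 3)
```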


On Wed, Apr 23, 2014 at 4:00 PM, Mayur Rustagi wrote:

> Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
> question :)
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi 
>
>
>
> On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian wrote:
>
>> Without caching, an RDD will be evaluated multiple times if referenced
>> multiple times by other RDDs. A silly example:
>>
>> val text = sc.textFile("input.log")
>> val r1 = text.filter(_ startsWith "ERROR")
>> val r2 = text.map(_ split " ")
>> val r3 = r1.count() + r2.count()
>>
>> Here the input file will be scanned twice unless you call .cache() on
>> text. So if your computation involves nondeterminism (e.g. random
>> number), you may get different results.
>>
>>
>> On Tue, Apr 22, 2014 at 11:30 AM, randylu  wrote:
>>
>>> It works when I call doc_topic_dist.cache() first.
>>>
>>>
>>>
>>>
>>
>>
>


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Mayur Rustagi
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
question :)


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian  wrote:

> Without caching, an RDD will be evaluated multiple times if referenced
> multiple times by other RDDs. A silly example:
>
> val text = sc.textFile("input.log")
> val r1 = text.filter(_ startsWith "ERROR")
> val r2 = text.map(_ split " ")
> val r3 = r1.count() + r2.count()
>
> Here the input file will be scanned twice unless you call .cache() on text.
> So if your computation involves nondeterminism (e.g. random number), you
> may get different results.
>
>
> On Tue, Apr 22, 2014 at 11:30 AM, randylu  wrote:
>
>> It works when I call doc_topic_dist.cache() first.
>>
>>
>>
>>
>
>


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Without caching, an RDD will be evaluated multiple times if referenced
multiple times by other RDDs. A silly example:

val text = sc.textFile("input.log")
val r1 = text.filter(_ startsWith "ERROR")
val r2 = text.map(_ split " ")
val r3 = r1.count() + r2.count()

Here the input file will be scanned twice unless you call .cache() on text.
So if your computation involves nondeterminism (e.g. random number), you
may get different results.
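A plain-Scala sketch of the same pitfall, with scala.util.Random standing
in for any nondeterministic computation inside a transformation and a lazy
view standing in for the uncached RDD (all names here are illustrative):

```scala
import scala.util.Random

val rng    = new Random()
val sample = (1 to 5).view.map(_ => rng.nextInt())  // like an uncached RDD
val first  = sample.toVector   // first "action"
val second = sample.toVector   // second "action": the map re-runs
// fresh random numbers each traversal -- equal only with negligible probability
assert(first != second)

// A strict collection is evaluated once, like a cached RDD, so every
// subsequent read sees the same values.
val cachedSample = (1 to 5).map(_ => rng.nextInt())
assert(cachedSample.toVector == cachedSample.toVector)
```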


On Tue, Apr 22, 2014 at 11:30 AM, randylu  wrote:

> It works when I call doc_topic_dist.cache() first.
>
>
>
>


Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-21 Thread randylu
It works when I call doc_topic_dist.cache() first.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.