Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb question :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Without caching, an RDD will be evaluated multiple times if referenced
> multiple times by other RDDs. A silly example:
>
> val text = sc.textFile("input.log")
> val r1 = text.filter(_ startsWith "ERROR")
> val r2 = text.map(_ split " ")
> val r3 = (r1 ++ r2).collect()
>
> Here the input file will be scanned twice unless you call .cache() on text.
> So if your computation involves nondeterminism (e.g. random number), you
> may get different results.
>
>
> On Tue, Apr 22, 2014 at 11:30 AM, randylu <randyl...@gmail.com> wrote:
>
>> it's ok when i call doc_topic_dist.cache() firstly.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4580.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
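
For reference, here is a rough sketch of the behaviour Cheng describes in the quoted message: an RDD with a nondeterministic step is recomputed by every action unless it is cached. This is only an illustration, not the code from the original report. The names CacheDemo and sample are made up for the example, and it assumes Spark is on the classpath and runs against a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

// Hypothetical demo object; not part of Spark or of the original thread.
object CacheDemo {

  // Runs two actions on the same RDD and returns both results.
  // The map step is deliberately nondeterministic (adds a random offset).
  def sample(sc: SparkContext, useCache: Boolean): (Seq[Int], Seq[Int]) = {
    val base = sc.parallelize(1 to 5, 2).map(i => i + Random.nextInt(1000))

    // With cache(), the partitions computed by the first action are kept
    // and reused; without it, each action re-runs the map from scratch.
    val rdd = if (useCache) base.cache() else base

    val first  = rdd.collect().toSeq
    val second = rdd.collect().toSeq
    (first, second)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption made for the sketch.
    val conf = new SparkConf().setAppName("cache-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (u1, u2) = sample(sc, useCache = false)
      println(s"uncached: identical results = ${u1 == u2}")

      val (c1, c2) = sample(sc, useCache = true)
      println(s"cached:   identical results = ${c1 == c2}")
    } finally {
      sc.stop()
    }
  }
}
```

Without the cache the two results will almost always differ, because the random offsets are drawn again on the second action. Note that cache() is best effort: if cached partitions are evicted or lost, Spark recomputes them, so for strict reproducibility it is safer to make the computation itself deterministic (e.g. seed the random generator per partition).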