You should probably be asking the opposite question: why do you think it *should* be applied immediately? Since the driver program hasn't requested any data back (distinct generates a new RDD; it doesn't return any data), there's no need to actually compute anything yet.

As the documentation describes, if the call returns an RDD, it is a transformation: Spark just keeps track of the operation it will eventually need to perform. Only methods that return data to the driver should trigger any computation.

(The one known exception is sortByKey, which really should be lazy but apparently uses an RDD.count call in its implementation: https://spark-project.atlassian.net/browse/SPARK-1021).
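
To make the distinction concrete, here is a small toy model in plain Python. It is not Spark code and not how Spark is implemented; ToyRDD, load, and reads are hypothetical names made up for this sketch. It only illustrates the idea above: methods that return a new RDD-like object just record the operation, and the source data is read only when a method that returns data is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records operations, computes only on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # zero-argument callable that produces the input data
        self._ops = tuple(ops)   # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work on the data.
    def map(self, f):
        return ToyRDD(self._source, self._ops + (lambda xs: [f(x) for x in xs],))

    def distinct(self):
        # dict.fromkeys drops duplicates and keeps first-seen order
        # (a property of this toy only; assume nothing about ordering in Spark).
        return ToyRDD(self._source, self._ops + (lambda xs: list(dict.fromkeys(xs)),))

    # Actions: return data to the caller, so they run the recorded pipeline.
    def collect(self):
        data = list(self._source())
        for op in self._ops:
            data = op(data)
        return data

    def count(self):
        return len(self.collect())


reads = []  # one entry is appended each time the source data is actually read


def load():
    reads.append("read")
    return [1, 2, 2, 3, 3, 3]


rdd = ToyRDD(load)
deduped = rdd.map(lambda x: x * 10).distinct()  # only bookkeeping happens here
reads_before_action = len(reads)                # source has not been read yet
result = deduped.collect()                      # the action triggers the work
reads_after_action = len(reads)                 # source has now been read once
```

In this sketch, chaining map and distinct never touches the data; the first read happens inside collect. The sortByKey behavior mentioned above would correspond to a transformation that quietly calls an action such as count while it is being set up.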
- Are all transformations lazy? David Thomas
- Re: Are all transformations lazy? Ewen Cheslack-Postava
- Re: Are all transformations lazy? David Thomas
- Re: Are all transformations lazy? Mayur Rustagi
- Re: Are all transformations lazy? Sandy Ryza
- Re: Are all transformations lazy? Ewen Cheslack-Postava
- Re: Are all transformations lazy? David Thomas