Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-17 Thread edmond_huo
Hi Davis,

When I run your code in pyspark, I still get the same error:

    >>> sc.parallelize(range(10)).map(lambda x: (x, str(x))).sortByKey().count()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey'

Is it the matter with
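One way to narrow down an AttributeError like this is to inspect the object itself. The sketch below is not from the thread: `describe_rdd` is a hypothetical helper that reports an object's class, its base classes, and whether a given method is present. In PySpark's source, `PipelinedRDD` is a subclass of `RDD`, so a method defined on `RDD` should also be available on a `PipelinedRDD`. If `sortByKey` is missing on both, one possible explanation (an assumption, not confirmed in this thread) is that the installed PySpark release predates that method.

```python
def describe_rdd(rdd, method="sortByKey"):
    """Summarize the class hierarchy of an RDD-like object (hypothetical helper)."""
    cls = type(rdd)
    return {
        "class": cls.__name__,
        # Every class above `cls` in the method resolution order.
        "bases": [base.__name__ for base in cls.__mro__[1:]],
        # True if the method is defined on the class or inherited.
        "has_method": hasattr(rdd, method),
    }

# In a pyspark shell, where `sc` is already defined:
#   pairs = sc.parallelize(range(10)).map(lambda x: (x, str(x)))
#   print(describe_rdd(pairs))
#   print(getattr(sc, "version", "unknown"))  # attribute may be absent on old releases
```

If `has_method` comes back `False` while `RDD` appears in `bases`, the method is absent from the installed library rather than hidden by the `PipelinedRDD` wrapper.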

Re: The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-17 Thread edmond_huo
Hi Davis,

Thank you for your answer. This is my code. I think it is very similar to the word count example in Spark:

    lines = sc.textFile(sys.argv[2])
    sie = lines.map(lambda l: (l.strip().split(',')[4], 1)).reduceByKey(lambda a, b: a + b)
    sort_sie = sie.sortByKey(False)

Thanks again.
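Note that `sortByKey` orders by the key, which in the code above is the field value rather than its count, so it would not give the most frequent values even where the method exists. A minimal sketch of one alternative follows, assuming a PySpark release that provides `RDD.takeOrdered` with a `key` argument. The function name `top_n_by_count` and its parameters are made up for illustration and are not part of the thread or of PySpark.

```python
def top_n_by_count(lines, n=10, field=4):
    """Return the n most frequent values of one comma-separated field.

    `lines` is expected to be an RDD of text lines, e.g. sc.textFile(path).
    """
    counts = (lines
              .map(lambda l: (l.strip().split(',')[field], 1))
              .reduceByKey(lambda a, b: a + b))
    # takeOrdered returns the n smallest elements under `key`,
    # so negate the count to get the largest counts first.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])
```

With the setup from the message this would be called as `top_n_by_count(sc.textFile(sys.argv[2]))`. It returns a plain Python list of `(value, count)` pairs on the driver and avoids sorting the whole data set.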

The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi, I am new to Spark. I tried to run a job like the word count example in Python, but when I tried to get the 10 most popular words in the file, I got the message: AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey'. So my question is: what is the difference between PipelinedRDD