Hi all, I have a problem regarding 'None' values in RDDs (PySpark).
Take leftOuterJoin as an example of a transformation that produces 'None'. leftOuterJoin(self, other, numPartitions=None) performs a left outer join of self and other. For datasets of (K, V) and (K, W) pairs, it returns a dataset of (K, (V, W)) pairs for each key. Because it is a left outer join, the result RDD also contains None, in pairs of the form (K, (V, None)). That None becomes a problem in subsequent transformations: every transformation has to check for None, otherwise an error will be thrown.

Another example is loading a CSV file:

    MOV = sc.textFile('/movie.csv')
    MOV = MOV.map(lambda strLine: strLine.split(",")).map(lambda data: {"MOVIE_ID": int(data[0]), "MOVIE_NAME": str(data[1]), "MOVIE_DIRECTOR": str(data[2])})

The CSV file is expected to have 3 fields separated by commas. However, some dirty rows may have only 2 fields, and then "MOVIE_DIRECTOR": str(data[2]) is dangerous (IndexError: list index out of range).

In an ordinary program it is common to check for None or illegal input. For big-data programming, though, it is tedious to check for None or malformed records everywhere, because illegal data is expected. Apache Pig has special handling for nulls, which looks nicer: no null checks are needed, and illegal data is taken care of as well. http://pig.apache.org/docs/r0.12.1/basic.html#nulls

For Spark, what is the best practice for handling None and illegal data as in the examples above?
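For the leftOuterJoin case, one common pattern is to substitute a default for the missing side immediately after the join, so that downstream transformations never see None. Here is a minimal sketch; the helper name fill_missing and the default value "UNKNOWN" are my own choices, and the join result is shown with plain Python pairs so the snippet runs without a SparkContext (in PySpark you would apply the same function with .map()):

```python
def fill_missing(pair, default="UNKNOWN"):
    """Replace the None that leftOuterJoin puts in (K, (V, None)) pairs."""
    key, (v, w) = pair
    return (key, (v, w if w is not None else default))

# Shape of a leftOuterJoin result: key 2 had no match on the right side.
joined = [(1, ("Alice", "Engineering")), (2, ("Bob", None))]

cleaned = [fill_missing(p) for p in joined]
# In PySpark this would be: left.leftOuterJoin(right).map(fill_missing)
```

After this step the RDD contains no None values, so later transformations need no special-casing.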
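For the dirty-CSV case, a pattern I have seen suggested is to parse each line with a function that returns a one-element list on success and an empty list on failure, and use it with flatMap so malformed rows are silently dropped. This is a sketch under that assumption; parse_movie and the sample lines are hypothetical, and it runs on plain Python strings rather than an RDD:

```python
def parse_movie(line):
    """Return [record] for a well-formed line, [] for a dirty one,
    so it can be used with flatMap to discard bad rows."""
    fields = line.split(",")
    if len(fields) < 3:
        return []           # dirty row: too few fields
    try:
        return [{"MOVIE_ID": int(fields[0]),
                 "MOVIE_NAME": fields[1],
                 "MOVIE_DIRECTOR": fields[2]}]
    except ValueError:      # e.g. non-numeric MOVIE_ID
        return []

lines = ["1,Alien,Ridley Scott", "2,Heat"]  # second row is missing a field
records = [r for line in lines for r in parse_movie(line)]
# In PySpark: MOV = sc.textFile('/movie.csv').flatMap(parse_movie)
```

Compared with map plus a separate filter, a single flatMap keeps the validity check and the parsing in one place, which avoids the scattered None checks described above.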