Hi all,

I have a serious problem regarding None values in RDDs (PySpark).

Take an example of a transformation that produces None:
        leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other. For each element (K, V) in
self, the result contains all pairs (K, (V, W)) for W in other, or the
pair (K, (V, None)) if no elements in other have key K.

Because it is a left outer join, the result RDD contains None in the
*(K, (V, None))* pairs. That None becomes a nuisance in subsequent
transformations: every transformation needs to check for None, otherwise
an error is thrown.
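To make the problem concrete, here is a minimal sketch in plain Python
(standing in for the RDD API, with a hypothetical joined dataset) of one
common workaround: filling the missing right-side value with a default
before any further transformations. In PySpark the same fill_missing
function would typically be passed to mapValues on the joined RDD.

```python
# Simulated output of leftOuterJoin: (K, (V, W)) pairs, where W may be None
joined = [("u1", (5, "action")), ("u2", (3, None))]

def fill_missing(vw, default="unknown"):
    """Replace a missing right-side value with a default."""
    v, w = vw
    return (v, w if w is not None else default)

# In PySpark this would be: joined.mapValues(fill_missing)
filled = [(k, fill_missing(vw)) for k, vw in joined]
```

After this step, downstream transformations no longer need None checks for
the join result.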


Another example, a CSV load:
MOV = sc.textFile('/movie.csv')
MOV = MOV.map(lambda strLine: strLine.split(",")) \
         .map(lambda data: {"MOVIE_ID": int(data[0]),
                            "MOVIE_NAME": str(data[1]),
                            "MOVIE_DIRECTOR": str(data[2])})

The CSV file is expected to have 3 fields separated by commas. However,
some dirty rows may have only 2 fields, and then
"MOVIE_DIRECTOR": str(data[2]) is dangerous (IndexError: list index out
of range).
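One defensive approach is to parse each line with a function that returns
None for malformed rows and then drop those rows. This is a sketch in
plain Python (the parse_movie helper and the sample lines are my own
illustration, not from the original code); in PySpark the same function
would be used as sc.textFile(path).map(parse_movie).filter(lambda m: m is
not None).

```python
def parse_movie(line):
    """Parse one CSV line into a movie dict, or None for a malformed row."""
    data = line.split(",")
    if len(data) != 3:
        return None  # dirty row: wrong number of fields
    try:
        return {"MOVIE_ID": int(data[0]),
                "MOVIE_NAME": data[1],
                "MOVIE_DIRECTOR": data[2]}
    except ValueError:
        return None  # dirty row: non-numeric id

lines = ["1,Alien,Ridley Scott", "2,Heat", "x,Bad,Row"]
movies = [m for m in (parse_movie(l) for l in lines) if m is not None]
```

The None check is still written once, but it is confined to the parse
step instead of leaking into every downstream transformation.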


In a general-purpose programming language it is common to check for None
or illegal formats. For big data programming, however, it is tedious to
check for None or illegal data in every transformation, even though
illegal data is expected.

Apache Pig has special handling for nulls, which looks better: no null
check is needed, and illegal data is taken care of as well.
http://pig.apache.org/docs/r0.12.1/basic.html#nulls

For Spark, what is the best practice to handle None and illegal data as
in the above examples?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/None-in-RDD-tp12167.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
