With a local Spark instance built with Hive support (-Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver), the following
script/sequence runs in PySpark without any error against 1.6.x, but fails
with 2.x.

import pyspark.sql
from pyspark.sql import SQLContext

# Build a DataFrame from an RDD of Rows using the shell-provided sqlContext
people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
peoplePartsRDD = people.map(lambda p: p.split(","))
peopleRDD = peoplePartsRDD.map(lambda p: pyspark.sql.Row(name=p[0], age=int(p[1])))
peopleDF = sqlContext.createDataFrame(peopleRDD)
peopleDF.first()

# Create a second SQLContext and repeat with a different Row schema
sqlContext2 = SQLContext(sc)
people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
peoplePartsRDD2 = people2.map(lambda l: l.split(","))
peopleRDD2 = peoplePartsRDD2.map(lambda p: pyspark.sql.Row(fname=p[0], age=int(p[1])))
peopleDF2 = sqlContext2.createDataFrame(peopleRDD2)  # <==== error here


The error goes away if sqlContext2 is replaced with sqlContext on the line
marked above. Is this a regression, or has something changed that makes this
the expected behavior in Spark 2.x?
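
For reference, a minimal sketch of the variant that does work for me, simply
reusing the existing sqlContext for the second DataFrame rather than
constructing a new SQLContext (names as in the snippet above):

# Workaround that avoids the failure: reuse the shared context instead of
# the newly constructed sqlContext2.
peopleDF2 = sqlContext.createDataFrame(peopleRDD2)
peopleDF2.first()

# (I believe SQLContext.getOrCreate(sc) would also hand back the existing
# context in 2.x, but I have not verified whether that sidesteps the issue.)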

Regards,
Vinayak


