With a local Spark instance built with Hive support (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver), the following script/sequence works in PySpark without any error against 1.6.x, but fails with 2.x:

    from pyspark.sql import Row, SQLContext

    # sc and sqlContext are the variables predefined by the 1.6.x pyspark shell
    people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])
    peoplePartsRDD = people.map(lambda p: p.split(","))
    peopleRDD = peoplePartsRDD.map(lambda p: Row(name=p[0], age=int(p[1])))
    peopleDF = sqlContext.createDataFrame(peopleRDD)
    peopleDF.first()

    # create a second, separate SQLContext and repeat the same steps
    sqlContext2 = SQLContext(sc)
    people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
    peoplePartsRDD2 = people2.map(lambda l: l.split(","))
    peopleRDD2 = peoplePartsRDD2.map(lambda p: Row(fname=p[0], age=int(p[1])))
    peopleDF2 = sqlContext2.createDataFrame(peopleRDD2)   # <==== error here

The error goes away if sqlContext2 is replaced with sqlContext on the last line. Is this a regression, or has something changed that makes this the expected behavior in Spark 2.x?

Regards,
Vinayak
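P.S. For reference, here is a minimal sketch of the workaround I would try in 2.x, assuming SparkSession.newSession() is the intended replacement for creating a second SQLContext (the 2.x shell predefines the session as `spark`); I have not confirmed that this avoids the error:

    from pyspark.sql import Row

    # Assumption: newSession() shares the same SparkContext but gives a
    # session with isolated SQL configuration and temp views, which is
    # what the second SQLContext was meant to provide in 1.6.x.
    spark2 = spark.newSession()
    people2 = sc.parallelize(["Abcd,40", "Efgh,14", "Ijkl,16"])
    peopleRDD2 = people2.map(lambda l: l.split(",")) \
                        .map(lambda p: Row(fname=p[0], age=int(p[1])))
    peopleDF2 = spark2.createDataFrame(peopleRDD2)
    peopleDF2.first()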