This works as expected in the 1.1 branch:
from pyspark.sql import *

sqlContext = SQLContext(sc)
rdd = sc.parallelize([range(0, 10), range(10, 20), range(20, 30)])
# define the schema
schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10"
fields = [StructField(field_name, IntegerType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
schemaRDD = sqlContext.applySchema(rdd, schema)
applySchema() only accepts an RDD of Row/list/tuple; it does not work with
numpy.array.
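For example, a minimal sketch of the failing case (assuming numpy is available, sc is an existing SparkContext, and np_rdd is a name chosen here for illustration):

import numpy as np

# each row is a numpy ndarray rather than a Row/list/tuple
np_rdd = sc.parallelize([np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)])
# sqlContext.applySchema(np_rdd, schema)  # does not work, since ndarray rows are not accepted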
After applySchema(), the Python RDD will be pickled and unpickled in the
JVM, so you will not get any benefit from using numpy.array.
It will work if you convert the ndarray into a list, e.g.:

# map each ndarray row to a plain Python list before applying the schema
schemaRDD = sqlContext.applySchema(np_rdd.map(lambda a: a.tolist()), schema)
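A quick check of the result (a sketch, assuming the 1.1 SchemaRDD API; the table name "data" is chosen here for illustration):

schemaRDD.registerTempTable("data")
# each collected row comes back as a Row of plain Python ints
sqlContext.sql("SELECT value1, value10 FROM data").collect()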