numpy arrays and spark sql

2014-12-01 Thread Joseph Winston
This works as expected in the 1.1 branch:

from pyspark.sql import *

rdd = sc.parallelize([range(0, 10), range(10, 20), range(20, 30)])

# Define the schema: ten nullable integer columns.
schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10"
fields = [StructField(field_name, IntegerType(), True)
          for field_name in schemaString.split()]
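The archived message breaks off here. A sketch of how the example presumably continues, following the standard Spark 1.1 applySchema() pattern (assumes an existing SQLContext named sqlContext; the schema and schemaRDD names and the final query are illustrative, not part of the original post):

# Assemble the schema and apply it to the RDD of plain Python lists.
schema = StructType(fields)
schemaRDD = sqlContext.applySchema(rdd, schema)

# Register the result so it can be queried with SQL.
schemaRDD.registerTempTable("records")
print(sqlContext.sql("SELECT value1 FROM records").collect())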

Re: numpy arrays and spark sql

2014-12-01 Thread Davies Liu
applySchema() only accepts an RDD of Row/list/tuple; it does not work with numpy.array. Also note that after applySchema() the Python RDD is pickled and unpickled in the JVM anyway, so you would not see any performance benefit from using numpy.array here. It will work if you convert each ndarray into a list:

schemaRDD = sqlContext.applySchema(rdd.map(lambda a: a.tolist()), schema)
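A minimal, self-contained sketch of this workaround against the Spark 1.1 Python API (assumes a running SparkContext named sc; the np_rdd and table names and the final query are illustrative):

import numpy as np
from pyspark.sql import SQLContext, StructType, StructField, IntegerType

sqlContext = SQLContext(sc)

# An RDD of numpy arrays -- applySchema() rejects these directly.
np_rdd = sc.parallelize([np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)])

# The same ten-column integer schema as in the original question.
schemaString = "value1 value2 value3 value4 value5 value6 value7 value8 value9 value10"
fields = [StructField(name, IntegerType(), True) for name in schemaString.split()]
schema = StructType(fields)

# tolist() turns each ndarray into a plain Python list of ints,
# which applySchema() accepts.
schemaRDD = sqlContext.applySchema(np_rdd.map(lambda a: a.tolist()), schema)
schemaRDD.registerTempTable("nums")
print(sqlContext.sql("SELECT value1, value10 FROM nums").collect())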