Hello everyone!

I am new to Spark and am trying to do a task that seems very simple. I want to
read a text file, load the contents into a JavaRDD, and convert it to a
DataFrame so I can feed it to a Word2Vec model later. The code looks pretty
simple, but I cannot make it work:


SparkSession spark = SparkSession.builder().appName("Word2Vec").getOrCreate();
JavaRDD<String> lines = spark.sparkContext()
        .textFile("input.txt", 10)
        .toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
    public Row call(String line) {
        return RowFactory.create(Arrays.asList(line.split(" ")));
    }
});
StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false,
            Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);


It throws an exception at input.show(3):


Caused by: java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.List$SerializationProxy to field 
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD


It seems there is a problem converting the JavaRDD<Row> to a DataFrame, but I
cannot figure out what mistake I am making, and the exception message is hard
to understand. Can anyone help? Thanks!
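P.S. One thing I did double-check: the Arrays.asList(...) wrapper in the map
function is intentional. RowFactory.create takes Object..., so passing the
String[] directly would spread each word into its own column, while wrapping it
in a list keeps the whole array as a single array-typed field. Here is a
plain-Java sketch of that varargs behavior (fieldCount is just a stand-in I
wrote to mimic the varargs signature, not a Spark API):

```java
import java.util.Arrays;

public class VarargsDemo {
    // Stand-in that mimics RowFactory.create(Object... values):
    // it just reports how many column values the row would get.
    static int fieldCount(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        String line = "hello spark world";

        // Passing the String[] directly: the array is spread into
        // three separate column values.
        System.out.println(fieldCount((Object[]) line.split(" "))); // prints 3

        // Wrapping it in a List: the whole list is one column value,
        // which is what the single ArrayType "text" field expects.
        System.out.println(fieldCount(Arrays.asList(line.split(" ")))); // prints 1
    }
}
```

So the schema and the rows should line up; the failure happens somewhere else.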
