Hi,

I am trying to join two DataFrames. I can display the joined results in the console after the join, but when I save the joined data in CSV format using the spark-csv API, only the column names are written, with no data at all.
Below is the sample code for reference:

    spark-shell --packages com.databricks:spark-csv_2.10:1.1.0 --master yarn-client --driver-memory 512m --executor-memory 512m

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.orc._
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import org.apache.spark.sql.types.{StructType, StructField, StringType,
      IntegerType, FloatType, LongType, TimestampType, NullType}

    // Schema for the first file (read with a header row)
    val firstSchema = StructType(Seq(
      StructField("COLUMN1", StringType, true),
      StructField("COLUMN2", StringType, true),
      StructField("COLUMN2", StringType, true),
      StructField("COLUMN3", StringType, true),
      StructField("COLUMN4", StringType, true),
      StructField("COLUMN5", StringType, true)))

    val file1df = hiveContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(firstSchema)
      .load("/tmp/File1.csv")

    // Schema for the second file (read without a header row)
    val secondSchema = StructType(Seq(
      StructField("COLUMN1", StringType, true),
      StructField("COLUMN2", NullType, true),
      StructField("COLUMN3", TimestampType, true),
      StructField("COLUMN4", TimestampType, true),
      StructField("COLUMN5", NullType, true),
      StructField("COLUMN6", StringType, true),
      StructField("COLUMN7", IntegerType, true),
      StructField("COLUMN8", IntegerType, true),
      StructField("COLUMN9", StringType, true),
      StructField("COLUMN10", IntegerType, true),
      StructField("COLUMN11", IntegerType, true),
      StructField("COLUMN12", IntegerType, true)))

    val file2df = hiveContext.read.format("com.databricks.spark.csv")
      .option("header", "false")
      .schema(secondSchema)
      .load("/tmp/file2.csv")

    val joineddf = file1df.join(file2df, file1df("COLUMN1") === file2df("COLUMN6"))
    val selecteddata = joineddf.select(file1df("COLUMN2"), file2df("COLUMN10"))

    // The statement below prints the joined data
    joineddf.collect.foreach(println)

    // This statement saves the CSV file, but only the column names
    // from the select above are written
    selecteddata.write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/JoinedData.csv")

I would really appreciate any pointers or help.

Thanks,
Divya