It seems that running insertInto on a SchemaRDD backed by a ParquetRelation creates an individual file for each item in the RDD. Sometimes a file contains multiple rows, and sometimes a file contains only the column headers.
My question is: is it possible to have it write the entire RDD as one file, but still have it associated and registered as a table? Right now I'm doing the following:

    // Create the Parquet "file"
    createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")

    val rdd = some RDD

    // Insert the RDD's items into the table
    createSchemaRDD[T](rdd).insertInto("table")

However, this ends up with a single file for each row, named in the format "part-r-${partition + offset}.parquet" (snagged from ParquetTableOperations > AppendingParquetOutputFormat).

I know that I can create a single Parquet file from an RDD by using SchemaRDD.saveAsParquetFile, but that prevents me from being able to load a table once and be aware of any changes.

I'm fine with each insertInto call making a new Parquet file in the table directory, but a file per row is a little over the top... Perhaps there are Hadoop configs that I'm missing?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-Parquet-insertInto-makes-many-files-tp13480.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
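One thing worth noting: Parquet output generally writes one part-r-*.parquet file per partition of the RDD, so a file per row suggests the RDD has one row per partition. A possible workaround sketch, assuming the Spark 1.0.x-era SchemaRDD API described above (SchemaRDD, createParquetFile, insertInto) and a hypothetical Record case class, is to coalesce the RDD down to one partition before inserting:

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical schema for illustration; substitute your own type T.
case class Record(id: Int, name: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

// Register the Parquet directory as a table once, as in the post above.
sqlContext.createParquetFile[Record]("hdfs://somewhere/folder")
  .registerAsTable("table")

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// coalesce(1) collapses the RDD to a single partition, so each
// insertInto call should emit a single part-r-*.parquet file
// rather than one file per partition. This serializes the write
// through one task, so it trades parallelism for fewer files.
createSchemaRDD(rdd.coalesce(1)).insertInto("table")
```

This is only a sketch against that old API (later Spark releases replaced SchemaRDD with DataFrame), and coalesce(1) is a general RDD technique rather than anything Parquet-specific; it does not change the part-file naming, only how many part files each insert produces.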