Try calling coalesce(1) on the RDD before the insertInto. That collapses the data into a single partition, so each insertInto call writes a single part file instead of one file per partition.
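A minimal sketch of what I mean, untested, reusing the names from your snippet (T, rdd, and "table" come from your code; onePartition is just a name I made up):

    // Collapse to a single partition so the subsequent insertInto
    // emits one part file rather than one per partition.
    val onePartition = rdd.coalesce(1)
    createSchemaRDD[T](onePartition).insertInto("table")

Note that coalesce(1) funnels the whole write through one task, so it trades parallelism for fewer files; if the RDD is large you may prefer coalesce(n) for some small n instead.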
On Thu, Sep 4, 2014 at 10:40 AM, DanteSama <chris.feder...@sojo.com> wrote:
> It seems that running insertInto on a SchemaRDD with a ParquetRelation
> creates an individual file for each item in the RDD. Sometimes it has
> multiple rows in one file, and sometimes it only writes the column headers.
>
> My question is: is it possible to have it write the entire RDD as 1 file,
> but still be associated and registered as a table? Right now I'm doing the
> following:
>
> // Create the Parquet "file"
> createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")
>
> val rdd = some RDD
>
> // Insert the RDD's items into the table
> createSchemaRDD[T](rdd).insertInto("table")
>
> However, this ends up with a single file for each row, of the format
> "part-r-${partition + offset}.parquet" (snagged from ParquetTableOperations >
> AppendingParquetOutputFormat).
>
> I know that I can create a single Parquet file from an RDD by using
> SchemaRDD.saveAsParquetFile, but that prevents me from being able to load a
> table once and be aware of any changes.
>
> I'm fine with each insertInto call making a new Parquet file in the table
> directory, but a file per row is a little over the top... Perhaps there are
> Hadoop configs that I'm missing?