It seems that running insertInto from a SchemaRDD into a ParquetRelation
creates an individual file for each item in the RDD. Sometimes a file
contains multiple rows, and sometimes it contains only the column headers.

My question is: is it possible to have it write the entire RDD as one file,
but still have it associated with and registered as a table? Right now I'm
doing the following:

// T is a case class; sqlContext is a SQLContext with its members imported
import sqlContext._

// Create the Parquet "file" and register it as a table
createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")

val rdd = ... // some RDD of T

// Insert the RDD's items into the table
createSchemaRDD[T](rdd).insertInto("table")

However, this ends up with a single file for each row, named in the format
"part-r-${partition + offset}.parquet" (snagged from
AppendingParquetOutputFormat in ParquetTableOperations).
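
Since the file names embed the partition id, my guess (just an assumption on
my part) is that the file count tracks the RDD's partitioning rather than its
row count. A quick check, reusing the placeholder rdd from above:

// How many partitions does the RDD have? If this matches the number of
// part-r-*.parquet files in the table directory, the split is per
// partition rather than per row
println(rdd.partitions.length)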

I know that I can create a single Parquet file from an RDD by using
SchemaRDD.saveAsParquetFile, but that prevents me from loading the table
once and having it pick up any later changes.
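
For reference, here is roughly what that alternative looks like (the path
and table name are just placeholders):

// Write the whole RDD in one shot...
createSchemaRDD[T](rdd).saveAsParquetFile("hdfs://somewhere/other-folder")

// ...but the table then has to be re-registered from the file to see new data
parquetFile("hdfs://somewhere/other-folder").registerAsTable("otherTable")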

I'm fine with each insertInto call creating a new Parquet file in the table
directory, but a file per row is a little over the top... Perhaps there are
Hadoop configs that I'm missing?
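
One workaround I've considered (not sure it's the intended approach) is
coalescing the RDD down before the insert, so each insertInto call only
produces a few files:

// Shrink to a single partition before inserting; this should mean one part
// file per insertInto call, at the cost of writing through a single task
createSchemaRDD[T](rdd.coalesce(1)).insertInto("table")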


