Try doing coalesce(1) on the RDD before calling insertInto.
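Something along these lines (a rough sketch against the 1.0/1.1-era SchemaRDD
API from your snippet; Record stands in for your T, and the path and table name
are just the ones from your example):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Placeholder case class standing in for T.
case class Record(id: Int, name: String)

val sc = new SparkContext("local[2]", "coalesce-example")
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[A <: Product] -> SchemaRDD

// Register the Parquet-backed table, as in your code.
sqlContext.createParquetFile[Record]("hdfs://somewhere/folder")
  .registerAsTable("table")

val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// coalesce(1) collapses the RDD to a single partition, so the insert
// writes one part-r-*.parquet file instead of one per partition.
rdd.coalesce(1).insertInto("table")

Keep in mind that coalesce(1) funnels the whole insert through a single task,
so it only makes sense when each batch you insert is small enough for one
worker to handle.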

On Thu, Sep 4, 2014 at 10:40 AM, DanteSama <chris.feder...@sojo.com> wrote:

> It seems that running insertInto on a SchemaRDD with a ParquetRelation
> creates an individual file for each item in the RDD. Sometimes it puts
> multiple rows in one file, and sometimes it only writes the column headers.
>
> My question is: is it possible to have it write the entire RDD as one file,
> but still have it registered as a table? Right now I'm doing the
> following:
>
> // Create the Parquet "file"
> createParquetFile[T]("hdfs://somewhere/folder").registerAsTable("table")
>
> val rdd = ... // some RDD[T]
>
> // Insert the RDD's items into the table
> createSchemaRDD[T](rdd).insertInto("table")
>
> However, this ends up with a single file for each row, of the format
> "part-r-${partition + offset}.parquet" (snagged from
> AppendingParquetOutputFormat in ParquetTableOperations).
>
> I know that I can create a single parquet file from an RDD by using
> SchemaRDD.saveAsParquetFile, but that prevents me from being able to load a
> table once and be aware of any changes.
>
> I'm fine with each insertInto call making a new parquet file in the table
> directory. But a file per row is a little over the top... Perhaps there are
> Hadoop configs that I'm missing?
>
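And if you do go the saveAsParquetFile route instead, you can re-register the
table after each write; roughly (same assumed Record and rdd as in the sketch
above, with a placeholder output path):

// Write the RDD out as Parquet (a single file after coalesce(1)) and point
// the table name back at the new directory.
rdd.coalesce(1).saveAsParquetFile("hdfs://somewhere/snapshot")
sqlContext.parquetFile("hdfs://somewhere/snapshot").registerAsTable("table")

That does mean re-registering after every write, so it doesn't give you the
"load the table once" behaviour you're after; insertInto plus coalesce is
closer to that.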
