Stream writing parquet files

Christopher Piggott Thu, 19 Apr 2018 18:24:07 -0700

I am trying to write some Parquet files and am running out of memory.  I'm
giving each of my workers 16GB, and the data is 102 columns * 65,536 rows,
which is not really all that much.  Each value is a short string.
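
For a sense of scale, here is a rough back-of-envelope estimate of the raw
string payload in one file.  The 16 bytes per value is purely an assumption
(the strings are only described as "short"), and the figure ignores JVM
object overhead for String and Row objects, which would add to it.

```python
# Back-of-envelope size estimate for one file's worth of data.
COLUMNS = 102
ROWS = 65536
ASSUMED_BYTES_PER_CELL = 16  # assumption: "short string" taken as ~16 bytes

cells = COLUMNS * ROWS
raw_bytes = cells * ASSUMED_BYTES_PER_CELL
raw_mib = raw_bytes / (1024 * 1024)

print(f"values per file: {cells:,}")
print(f"raw string payload: {raw_mib:.0f} MiB")
```

Under that assumption the payload is on the order of 100 MiB per file, well
below the 16GB given to each worker.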


I am trying to create the file by dynamically building a StructType from
StructField objects.  I then tried various ways of building an Array or
List of Row objects, one for each of the 65,536 rows.  My last attempt was
to create an ArrayBuffer of the correct length.

In all cases, I run out of memory.

It occurs to me that what I really need is a way to generate and stream the
parquet files directly to an HDFS file.  I have 70,000+ of these files, so
for starters I'm OK with creating 70,000 parquet files as long as there's
some way I can merge them later.

Is there an approach for generating Parquet files from Spark (ultimately to
HDFS) that lets me write each row out one at a time, in a streaming fashion?

BTW I'm using Spark 2.2.1 and whatever Parquet library is bundled with it.

--Chris
