> 1) When I tried to read a huge file from local disk and use Avro + Parquet
> to transform it into Parquet format and store it on HDFS using the API
> "saveAsNewAPIHadoopFile", the JVM ran out of memory, because the file
> is too large to fit in memory.
>

How much memory are you giving the JVM, and how many cores are you giving
Spark?  You do not need to be able to fit the entire dataset in memory, but
since Parquet is a columnar format you are going to need a fairly large
buffer space when writing out a lot of data.  I believe this is especially
true if you have a lot of columns.

Options include increasing the amount of memory you give Spark executors
("spark.executor.memory") or decreasing the number of parallel tasks
("spark.cores.max") so that fewer tasks compete for the same heap.

You might also try asking on the Parquet mailing list.  There are probably
options for configuring how much buffer space it allocates.  However, a lot
of the performance benefit of formats like Parquet comes from large buffers,
so this may not be the best option.
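
If you do want to experiment, Parquet picks up its writer settings from the
Hadoop Configuration you pass to saveAsNewAPIHadoopFile, so something like
the following might reduce the write-side buffering.  Treat the key names
and values as assumptions to confirm with the Parquet folks:

    import org.apache.hadoop.conf.Configuration

    val hadoopConf = new Configuration()
    // Row group size: how much data Parquet buffers per block before
    // flushing.  Smaller means less memory but less efficient files.
    hadoopConf.set("parquet.block.size", (32 * 1024 * 1024).toString)
    // Page size: the unit of buffering/compression within a column chunk.
    hadoopConf.set("parquet.page.size", (512 * 1024).toString)
    // Pass hadoopConf as the conf argument to saveAsNewAPIHadoopFile.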


> 2) When I tried to read a fraction of the data at a time and write it to
> HDFS in Parquet format using the API "saveAsNewAPIHadoopFile", I found
> that each loop iteration generated a directory with a list of files; in
> other words, each iteration was treated as an independent output, which
> was not what I wanted and would cause problems when I tried to process
> them later.
>

This is slightly different, but in Spark SQL (coming with the 1.0 release)
there is experimental support for creating a virtual table that is backed
by a Parquet file.  You can do many insertions into this table, and they
will all be read by a single job pointing at that directory.
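
Roughly, the pattern looks like this (a sketch from memory of the
experimental 1.0 API -- method names like createParquetFile and insertInto
may differ slightly, so check the Spark SQL guide for the exact signatures):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    object ParquetTableSketch {
      def main(args: Array[String]) {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-table"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD

        // Create an empty table backed by one Parquet directory on HDFS.
        val table =
          sqlContext.createParquetFile[Record]("hdfs:///data/records.parquet")
        table.registerAsTable("records")

        // Each chunk of the input is appended to the same table/directory...
        val chunk = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
        chunk.insertInto("records")

        // ...and a later job reads everything back as a single dataset.
        val everything = sqlContext.parquetFile("hdfs:///data/records.parquet")
        println(everything.count())
      }
    }

That way the loop over fractions of the input writes into one directory
instead of producing a separate output directory per iteration.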
