> 1) When I tried to read a huge file from local and used Avro + Parquet to
> transform it into Parquet format and stored them to HDFS using the API
> "saveAsNewAPIHadoopFile", the JVM would be out of memory, because the file
> is too large to be contained by memory.
How much memory are you giving the JVM, and how many cores are you giving Spark? You do not need to fit the entire dataset in memory, but since Parquet is a columnar format you will need a fairly large buffer space when writing out a lot of data. I believe this is especially true if you have a lot of columns. Options include increasing the amount of memory you give Spark executors ("spark.executor.memory") or decreasing the number of parallel tasks ("spark.cores.max"). You might also try asking on the Parquet mailing list; there are probably options for configuring how much buffer space it allocates. However, a lot of the performance benefit of formats like Parquet comes from large buffers, so this may not be the best option.

> 2) When I tried to read a fraction of them and write to HDFS as Parquet
> format using the API "saveAsNewAPIHadoopFile", I found that for each loop,
> it would generate a directory with a list of files, namely, it would be
> deemed as several independent outputs, which was not what I would like and
> would cause some problems when I tried to process them in the future.

This is slightly different, but Spark SQL (coming with the 1.0 release) has experimental support for creating a virtual table that is backed by a Parquet file. You can do many insertions into this table, and they will all be read by a single job pointing at that directory.
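To make the two settings from the first answer concrete, here is a rough sketch of setting them on a SparkConf before creating the context. The specific values ("4g", "4") are placeholders you would tune for your cluster, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: more heap per executor, fewer concurrent tasks,
// so each task has more buffer space for the Parquet writer.
val conf = new SparkConf()
  .setAppName("parquet-write")
  .set("spark.executor.memory", "4g") // heap given to each executor JVM
  .set("spark.cores.max", "4")        // caps total cores, i.e. parallel tasks

val sc = new SparkContext(conf)
```

The trade-off is as described above: fewer parallel tasks means fewer Parquet writers holding column buffers at once, at the cost of throughput.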
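For the second question, the experimental Spark SQL feature mentioned above looks roughly like the following in the 1.0-era API. The path, case class, and table name are placeholders; `createParquetFile` is marked experimental, so check the release docs for your exact version:

```scala
import org.apache.spark.sql.SQLContext

// Schema of the records being written (placeholder)
case class Record(key: Int, value: String)

val sqlContext = new SQLContext(sc)
import sqlContext._ // brings in the implicit RDD -> SchemaRDD conversion

// Create an empty Parquet file at the given path and register it as a table
createParquetFile[Record]("hdfs://namenode/data/records.parquet")
  .registerAsTable("records")

// Each loop iteration inserts into the same table/directory...
sc.parallelize(Seq(Record(1, "a"))).insertInto("records")
sc.parallelize(Seq(Record(2, "b"))).insertInto("records")

// ...and a later job reads it all back as a single table
val everything = parquetFile("hdfs://namenode/data/records.parquet")
```

This avoids the one-directory-per-`saveAsNewAPIHadoopFile`-call problem, since all insertions accumulate under a single Parquet-backed location.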