Hi there,

I am sorry to bother you, but I ran into a problem transforming large
files (hundreds of gigabytes per file) from the local file system to HDFS
in Parquet format using Spark. The problem can be described as follows.

1) When I tried to read a huge file from the local file system, convert it
to Parquet using Avro + Parquet, and store the result to HDFS using the API
"saveAsNewAPIHadoopFile", the JVM ran out of memory, because the file is
too large to fit in memory.
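
To make this concrete, here is a simplified sketch of what case 1 looks
like on my side, assuming an existing SparkContext called sc. The schema,
paths, and field names are placeholders for my real ones, and I am using
the parquet.avro classes from parquet-mr, so the imports may differ for
other versions:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.mapreduce.Job
import parquet.avro.AvroParquetOutputFormat

// toy one-field schema standing in for my real Avro schema
val schemaJson =
  """{"type":"record","name":"Line","fields":[{"name":"text","type":"string"}]}"""

val job = new Job(sc.hadoopConfiguration)
AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(schemaJson))

// this is the step that blows up: the whole local file is pulled into driver memory
val lines = scala.io.Source.fromFile("/data/huge.input").getLines().toArray

val records = sc.parallelize(lines.toSeq).mapPartitions { iter =>
  // re-parse the schema on the executor side instead of closing over it
  val schema = new Schema.Parser().parse(schemaJson)
  iter.map { line =>
    val r = new GenericData.Record(schema)
    r.put("text", line)
    (null.asInstanceOf[Void], r: GenericRecord)
  }
}

records.saveAsNewAPIHadoopFile(
  "hdfs:///user/me/output.parquet",
  classOf[Void],
  classOf[GenericRecord],
  classOf[AvroParquetOutputFormat],
  job.getConfiguration)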

2) When I tried to read a fraction of the file at a time and write each
fraction to HDFS in Parquet format using "saveAsNewAPIHadoopFile", I found
that every loop iteration generated its own directory with a list of files;
in other words, the fractions were treated as several independent outputs,
which is not what I want and will cause problems when I try to process them
later.
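
Continuing the sketch above (reusing schemaJson and job; chunkPaths just
stands for however the big file has been split into pieces on the local
disk), the loop in case 2 looks roughly like this:

// chunkPaths is a placeholder for the pre-split pieces of the huge file
val chunkPaths = Seq("/data/huge.part0", "/data/huge.part1", "/data/huge.part2")

for ((chunkPath, i) <- chunkPaths.zipWithIndex) {
  val records = sc.textFile("file://" + chunkPath).mapPartitions { iter =>
    val schema = new Schema.Parser().parse(schemaJson)
    iter.map { line =>
      val r = new GenericData.Record(schema)
      r.put("text", line)
      (null.asInstanceOf[Void], r: GenericRecord)
    }
  }
  // the output path must not exist yet, so every iteration
  // ends up as its own directory full of part files
  records.saveAsNewAPIHadoopFile(
    "hdfs:///user/me/output/chunk-" + i,
    classOf[Void],
    classOf[GenericRecord],
    classOf[AvroParquetOutputFormat],
    job.getConfiguration)
}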

So, for a huge file that cannot be held entirely in memory, is there any
way to transform it into Parquet format and write all the output files into
the same HDFS directory, so that they are treated as a single dataset?
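
What I would like to end up with is a single read and a single write that
never needs the whole file in memory, roughly like the sketch below. I am
not sure whether sc.textFile on a file:// path really streams the input
partition by partition, or whether every worker would need access to the
local file, so please correct me if this is the wrong direction:

// read the local file lazily instead of pulling it all in with Source.fromFile
val records = sc.textFile("file:///data/huge.input").mapPartitions { iter =>
  val schema = new Schema.Parser().parse(schemaJson)
  iter.map { line =>
    val r = new GenericData.Record(schema)
    r.put("text", line)
    (null.asInstanceOf[Void], r: GenericRecord)
  }
}

// one save call, so all part files land in a single directory
records.saveAsNewAPIHadoopFile(
  "hdfs:///user/me/output.parquet",
  classOf[Void],
  classOf[GenericRecord],
  classOf[AvroParquetOutputFormat],
  job.getConfiguration)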

In addition, does anybody know how, with Spark, to get the directory
structure of HDFS, or how to read a directory recursively so as to read all
the files in that directory and its sub-directories? That might also serve
as a workaround for this problem.
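
For that second question, the kind of thing I have in mind is below, using
the plain Hadoop FileSystem API (paths are placeholders, and I do not know
whether this is the recommended way to do it from Spark):

import org.apache.hadoop.fs.{FileSystem, Path}

// walk an HDFS directory tree and collect every file under it
def listFilesRecursively(fs: FileSystem, dir: Path): Seq[Path] =
  fs.listStatus(dir).toSeq.flatMap { status =>
    if (status.isDir) listFilesRecursively(fs, status.getPath)
    else Seq(status.getPath)
  }

val fs = FileSystem.get(sc.hadoopConfiguration)
val allFiles = listFilesRecursively(fs, new Path("/user/me/output"))

// sc.textFile accepts a comma-separated list of paths,
// so everything found can be read back as a single RDD
val everything = sc.textFile(allFiles.mkString(","))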

I hope someone can help me with this. I would really appreciate it.


