Hi,
I have about 8K files in about 10 directories on HDFS, and I need to add a
column to every file containing that file's name (e.g. file1.txt gets a column
with "file1.txt", file2.txt gets "file2.txt", etc.)

The current approach is to read all the files with sc.wholeTextFiles("myPath"),
which yields the file name as the key, and then add it as a column to each file.
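A minimal sketch of that approach (the paths and the comma delimiter are my assumptions, not something fixed by the job):

```scala
// Each record from wholeTextFiles is (fullPath, fileContents);
// append the bare file name as a trailing column on every line.
val withName = sc.wholeTextFiles("/MySource/dir1/*").flatMap {
  case (path, contents) =>
    val fileName = path.substring(path.lastIndexOf('/') + 1)
    contents.split("\n").map(line => s"$line,$fileName")
}
withName.saveAsTextFile("/MyTarget/dir1") // hypothetical output path
```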

1) I run this on 5 servers, each with 24 cores and 24GB RAM, with the config:
spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G
But when we run this on all directories at once
(sc.wholeTextFiles("/MySource/*/*")) I get java.lang.OutOfMemoryError:
Java heap space.
When running on a single directory all works well:
sc.wholeTextFiles("/MySource/dir1/*")
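For what it's worth, wholeTextFiles also accepts a minPartitions hint that spreads the files over more tasks; a sketch (the value 80 is an arbitrary assumption):

```scala
// Note: wholeTextFiles materializes each file as a single
// (path, contents) record, so every individual file must still fit
// in an executor's heap; minPartitions only reduces how many files
// a single task holds at once.
val files = sc.wholeTextFiles("/MySource/*/*", minPartitions = 80)
```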

2) Another option is to avoid wholeTextFiles and read each line with
sc.textFile instead, but how can I get the file name when using textFile?
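One technique I've seen for this (a sketch, not tested against your setup): drop down to sc.hadoopFile, whose partitions know which input split, and hence which file, they came from:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

// Read line by line, but recover the source file name per partition
// from the Hadoop input split.
val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("/MySource/*/*")
val withName = rdd.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val fileName = split.asInstanceOf[FileSplit].getPath.getName
    iter.map { case (_, line) => s"${line.toString},$fileName" }
  }
```

This keeps records line-sized, so memory pressure per task is much lower than with wholeTextFiles.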

Eran
