Hi, I have about 8K files in about 10 directories on HDFS, and I need to add a column to every file containing that file's name (e.g. every row of file1.txt gets the value "file1.txt", every row of file2.txt gets "file2.txt", etc.).
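
To make the goal concrete, here is a minimal sketch of the per-file transformation in plain Scala, without Spark. The helper name addFileNameColumn, the comma delimiter, and the choice to append the base name as the last column are all assumptions for illustration, not the actual job.

```scala
// Hypothetical helper (name and delimiter are assumptions): appends the
// file's base name as a trailing column to every non-empty line of one
// file's content.
def addFileNameColumn(path: String, content: String, sep: String = ","): List[String] = {
  // Keep only the last path segment, e.g. "file1.txt"
  val fileName = path.substring(path.lastIndexOf('/') + 1)
  content.split("\n").toList
    .map(_.stripSuffix("\r"))   // tolerate Windows line endings
    .filter(_.nonEmpty)         // skip blank lines
    .map(line => line + sep + fileName)
}

// In spark-shell this would be applied to the (path, content) pairs, e.g.
//   pairs.flatMap { case (p, c) => addFileNameColumn(p, c) }
```

For example, addFileNameColumn("/MySource/dir1/file1.txt", "a,b\nc,d\n") gives the two rows "a,b,file1.txt" and "c,d,file1.txt".
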
My current approach is to read all the files with *sc.wholeTextFiles("myPath")*, which gives the file name as the key, and then add that key as a column to each file's contents.

1) I run this on 5 servers, each with 24 cores and 24GB RAM, with this config: *spark-shell --master yarn-client --executor-cores 5 --executor-memory 5G*
When I run it on all directories at once (sc.wholeTextFiles("/MySource/*/*")) I get *java.lang.OutOfMemoryError: Java heap space*. When I run it on a single directory, *sc.wholeTextFiles("/MySource/dir1/*")*, everything works fine.

2) The other option is to not use wholeTextFiles and instead read each line with sc.textFile, but how can I get the file name with textFile?

Eran