Hi,

I have 1 master and 4 slave nodes. The input data size is 14 GB. Slave node config: 32 GB RAM, 16 cores.

I am trying to train a word embedding model using Spark, and it keeps going out of memory. How much memory do I need to train on 14 GB of data? I have given 20 GB per executor, but the log line below shows the block manager with only 11.8 GB of that 20 GB free:

BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-.-.-.dev:35035 (size: 4.6 KB, free: 11.8 GB)

This is the code:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

if __name__ == "__main__":
    sc = SparkContext(appName="Word2VecExample")
    inp = sc.textFile("s3://word2vec/data/word2vec_word_data.txt/") \
            .map(lambda row: row.split(" "))
    word2vec = Word2Vec()
    model = word2vec.fit(inp)
    model.save(sc, "s3://pysparkml/word2vecresult2/")
    sc.stop()

Spark-submit command:

spark-submit --master yarn \
  --conf 'spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/tmp -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark' \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 20g \
  Word2VecExample.py
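For sizing, here is my rough back-of-envelope estimate. It assumes, and I may be wrong about this, that mllib's Word2Vec holds two float matrices of vocabSize x vectorSize (the input and output vectors) on the driver and broadcasts them to every executor; the vocabulary size below is a guess, not a measurement:

# Rough in-memory model-size estimate. vocab_size is a guess for my
# corpus; vector_size is the mllib default. Assuming two float arrays
# of vocab_size * vector_size each (input and output vectors):
vocab_size = 1000000
vector_size = 100
gb = 2.0 * vocab_size * vector_size * 4 / 1024 ** 3  # 4-byte floats
print("approx. model size: %.2f GB" % gb)            # ~0.75 GB for these numbers

With those numbers the model itself would fit easily, so either my vocabulary guess is far off or the memory is going somewhere else.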
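In case it is relevant, this is the variant I plan to try next: raising minCount to shrink the vocabulary and spreading the fit over more partitions. setVectorSize, setMinCount, and setNumPartitions are the standard pyspark.mllib.feature.Word2Vec setters; the values are untuned guesses:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

sc = SparkContext(appName="Word2VecExample")
inp = sc.textFile("s3://word2vec/data/word2vec_word_data.txt/") \
        .map(lambda row: row.split(" "))

word2vec = (Word2Vec()
            .setVectorSize(100)    # the default; smaller vectors give a smaller model
            .setMinCount(25)       # default is 5; a higher cutoff shrinks the vocabulary
            .setNumPartitions(8))  # default is 1; more partitions spread the training
model = word2vec.fit(inp)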
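I also plan to test a submit variant that gives the driver more room, since as far as I understand fit() aggregates the vectors on the driver and model.save() runs there as well. The values below are guesses, and spark.yarn.executor.memoryOverhead is the pre-Spark-2.3 name of the YARN overhead setting:

spark-submit --master yarn \
  --driver-memory 10g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 20g \
  Word2VecExample.py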
--
Selvam Raman
"Shun bribery; hold your head high."