Hi,

I have 1 master and 4 slave nodes. The input data size is 14 GB.
Slave node config: 32 GB RAM, 16 cores


I am trying to train a word embedding model using Spark, and it is going out
of memory. How much memory do I need to train on 14 GB of data?


I have given 20 GB per executor, but the log below shows only 11.8 GB
available out of the 20 GB:

BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-.-.-.dev:35035
(size: 4.6 KB, free: 11.8 GB)
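
For what it's worth, if this is Spark 2.x with the default unified memory
settings, that 11.8 GB looks like the normal storage/execution region rather
than the full heap: roughly (heap - 300 MB reserved) * spark.memory.fraction
(default 0.6). A quick sketch of that arithmetic, assuming those defaults and
not values I verified on this cluster:

# Rough sketch of Spark 2.x's unified memory arithmetic; the 0.6 fraction and
# the 300 MB reserve are the documented defaults, assumed here.
executor_heap_mb = 20 * 1024        # --executor-memory 20g
reserved_mb = 300                   # fixed reserved memory
memory_fraction = 0.6               # spark.memory.fraction default

usable_mb = (executor_heap_mb - reserved_mb) * memory_fraction
print("%.1f GB" % (usable_mb / 1024))   # ~11.8 GB, matching the log line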


This is the code:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

if __name__ == "__main__":
    sc = SparkContext(appName="Word2VecExample")

    # Load the corpus from S3 and split each line into a list of tokens
    inp = sc.textFile("s3://word2vec/data/word2vec_word_data.txt/") \
            .map(lambda row: row.split(" "))

    # Train the word embedding model and save it back to S3
    word2vec = Word2Vec()
    model = word2vec.fit(inp)

    model.save(sc, "s3://pysparkml/word2vecresult2/")
    sc.stop()
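
For reference, a minimal sketch of the knobs on pyspark.mllib's Word2Vec that
drive memory use (the model is roughly vocabulary size x vectorSize floats,
held on the driver and broadcast to the executors). The concrete values below
are only illustrative assumptions, not something I have tested:

# Hedged sketch only: these setters exist on pyspark.mllib.feature.Word2Vec,
# but the values are illustrative, not recommendations.
word2vec = (Word2Vec()
            .setVectorSize(100)    # smaller vectors -> smaller model
            .setMinCount(5)        # drop rare words to shrink the vocabulary
            .setNumPartitions(8)   # spread training across more tasks
            .setNumIterations(1))
model = word2vec.fit(inp)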


Spark-submit command:

spark-submit --master yarn \
  --num-executors 4 --executor-cores 2 --executor-memory 20g \
  --conf 'spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/tmp -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark' \
  Word2VecExample.py
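
One thing I am not sure about: the PySpark worker processes run outside the
20 GB JVM heap, so the YARN container may also need a larger memory overhead.
A hedged sketch of an alternative way to create the SparkContext with that
setting, assuming yarn-client mode and the spark.yarn.executor.memoryOverhead
property name; the 4096 MB value is just an illustration:

# Hedged sketch, assuming yarn-client mode: spark.yarn.executor.memoryOverhead
# reserves extra container memory for the Python workers outside the JVM heap.
# 4096 MB is an illustrative assumption, not a tested value.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("Word2VecExample")
        .set("spark.yarn.executor.memoryOverhead", "4096"))
sc = SparkContext(conf=conf)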


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
