Date: May 19, 2015 1:25 PM
To: Shilad Sen
Cc: user
Subject: Re: Word2Vec with billion-word corpora

With a vocabulary size of 4M and a vector size of 400, you need
400 * 4M = 1.6B floats to store the model. At 4 bytes per float, that
is 6.4GB for the word vectors alone, and we store the model on the
driver node in the current implementation. So I don't think it would
work. You might try increasing minCount to decrease the vocabulary
size and reduce the model size.
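For concreteness, a minimal Scala sketch against the MLlib Word2Vec
API, showing the size arithmetic and the minCount knob (the app name,
corpus path, and minCount value below are placeholders, not
recommendations):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

object Word2VecSizing {
  // Rough driver-side model size: one 4-byte float per dimension
  // per vocabulary word.
  def modelBytes(vocabSize: Long, vectorSize: Int): Long =
    vocabSize * vectorSize * 4L

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("w2v-sizing"))

    // 4M words x 400 dims x 4 bytes ~= 6.4GB, before training overhead.
    println(s"Estimated model size: ${modelBytes(4000000L, 400) / 1e9} GB")

    // Hypothetical input: one sentence per line, whitespace-tokenized.
    val sentences: RDD[Seq[String]] =
      sc.textFile("hdfs:///corpus/*.txt").map(_.split(" ").toSeq)

    // Raising minCount prunes rare words, shrinking the vocabulary
    // and therefore the model that must fit on the driver.
    val model: Word2VecModel = new Word2Vec()
      .setVectorSize(400)
      .setMinCount(100) // placeholder; tune against your term frequencies
      .fit(sentences)
  }
}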
Hi all,

I'm experimenting with Spark's Word2Vec implementation on a relatively
large corpus (5B words, vocabulary size 4M, 400-dimensional vectors).
Has anybody had success running it at this scale?

Thanks in advance for your guidance!

-Shilad

--
Shilad W. Sen
Associate Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College