Re: Word2Vec with billion-word corpora

2015-05-19 Thread Xiangrui Meng
With a vocabulary size of 4M and a vector size of 400, you need 4M * 400 = 1.6B floats to store the model. That is about 6.4 GB (at 4 bytes per float). We store the model on the driver node in the current implementation, so I don't think it would work. You might try increasing minCount to decrease the vocabulary size and reduce the…
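The sizing argument above is simple arithmetic: the model holds one float per (word, dimension) pair, all resident on the driver. A minimal sketch of that estimate (plain Python, not Spark API; the function name is illustrative):

```python
def word2vec_model_bytes(vocab_size, vector_size, bytes_per_float=4):
    """Rough lower bound on driver memory for a trained Word2Vec model:
    one float per (word, dimension) pair."""
    return vocab_size * vector_size * bytes_per_float

# 4M-word vocabulary, 400-dimensional vectors:
# 4M * 400 = 1.6B floats -> 6.4 GB, before any JVM object overhead.
print(word2vec_model_bytes(4_000_000, 400))  # 6400000000
```

Raising minCount shrinks vocab_size directly, which is why it is the first knob to try: halving the vocabulary halves the model.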

RE: Word2Vec with billion-word corpora

2015-05-19 Thread nate
[Preview contains only quoted text from Xiangrui Meng's reply above.]

Word2Vec with billion-word corpora

2015-05-14 Thread shilad
Archived at http://apache-spark-user-list.1001560.n3.nabble.com/Word2Vec-with-billion-word-corpora-tp22895.html

Word2Vec with billion-word corpora

2015-05-13 Thread Shilad Sen
Hi all, I'm experimenting with Spark's Word2Vec implementation on a relatively large corpus (5B words, vocabulary size 4M, 400-dimensional vectors). Has anybody had success running it at this scale? Thanks in advance for your guidance! -Shilad -- Shilad W. Sen, Associate Professor, Mathematics, Statistics, and Computer Science Dept., Macalester College