Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Two billion words is a very large vocabulary… You can try solving this issue by setting the minimum number of times a word must occur in order to be included in the vocabulary, using setMinCount. This will prevent common misspellings, websites, and other noise from being included and may improve
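A plain-Java sketch of what min-count filtering does conceptually (the corpus, threshold, and class name below are illustrative, not from the thread; in MLlib builds that include it, the equivalent is `new Word2Vec().setMinCount(n)`):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MinCountSketch {
    public static void main(String[] args) {
        // Toy corpus: a misspelling and a URL each occur only once.
        List<String> corpus = List.of(
            "spark", "spark", "spark", "sprak", "spark", "www.example.com");
        int minCount = 2;  // drop tokens seen fewer than 2 times

        // Count occurrences, then keep only tokens at or above the threshold.
        Map<String, Long> vocab = corpus.stream()
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()))
            .entrySet().stream()
            .filter(e -> e.getValue() >= minCount)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        // "sprak" and the URL fall below the threshold and are excluded.
        System.out.println(vocab);  // {spark=4}
    }
}
```

Filtering this way shrinks the vocabulary before the model allocates its per-word arrays, which is why it helps with a two-billion-word raw vocabulary.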

Re: Driver hangs on running mllib word2vec

2015-01-06 Thread Ganon Pierce
Oops, just kidding, this method is not in the current release. However, it is included in the latest commit on git if you want to do a build. On Jan 6, 2015, at 2:56 PM, Ganon Pierce ganon.pie...@me.com wrote: Two billion words is a very large vocabulary… You can try solving this issue by

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Thanks Zhan, I'm also confused about the jstack output: why does the driver get stuck at org.apache.spark.SparkContext.clean? On Tue, Jan 6, 2015 at 2:10 PM, Zhan Zhang zzh...@hortonworks.com wrote: I think it is overflow. The training data is quite big. The algorithm's scalability highly

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Xiangrui Meng
How big is your dataset, and what is the vocabulary size? -Xiangrui On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi, when we run MLlib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output: main prio=10 tid=0x40112800

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Eric Zhen
Hi Xiangrui, our dataset is about 80 GB (10B lines). In the driver's log, we found this: *INFO Word2Vec: trainWordsCount = -1610413239*. It seems there is an integer overflow? On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng men...@gmail.com wrote: How big is your dataset, and what is the
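The negative value in the log is exactly what 32-bit wraparound looks like: in spark-1.1.0, trainWordsCount is an Int, so once the running total of training words passes 2^31 - 1 it goes negative. Reversing the wrap (an inference from the logged value, not something stated in the thread) recovers the count that was actually accumulated:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int logged = -1610413239;  // value from Eric's driver log

        // Undo the 32-bit wrap: add 2^32 to the negative value.
        long trueCount = (long) logged + (1L << 32);
        System.out.println(trueCount);  // 2684554057, i.e. ~2.7B words counted

        // Sanity check: truncating back to 32 bits reproduces the log line.
        assert (int) trueCount == logged;
    }
}
```

So the counter had already accumulated roughly 2.7 billion words when it wrapped, consistent with a 10B-line corpus overflowing an int.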

Re: Driver hangs on running mllib word2vec

2015-01-05 Thread Zhan Zhang
I think it is overflow. The training data is quite big. The algorithm's scalability highly depends on the vocabSize. Even without overflow, there are still other bottlenecks, for example syn0Global and syn1Global: each of them has vocabSize * vectorSize elements. Thanks. Zhan Zhang On Jan
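Back-of-the-envelope arithmetic for the two driver-side arrays Zhan mentions, each holding vocabSize * vectorSize floats (the vocabSize and vectorSize values below are illustrative assumptions, not figures from the thread):

```java
public class MemorySketch {
    public static void main(String[] args) {
        long vocabSize = 10_000_000L;  // e.g. 10M words surviving min-count
        long vectorSize = 100L;        // a common embedding dimension

        // Each array holds vocabSize * vectorSize floats, 4 bytes apiece.
        long bytesPerArray = vocabSize * vectorSize * 4L;
        double totalGB = 2.0 * bytesPerArray / (1L << 30);

        // With these numbers: roughly 7.5 GB for the two arrays combined.
        System.out.printf("syn0Global + syn1Global ~ %.1f GB%n", totalGB);
    }
}
```

Note also that each array has vocabSize * vectorSize entries; since a Java/Scala array index is an int, pushing that product past Integer.MAX_VALUE (about 2.1 billion) is itself a hard limit, independent of heap size.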

Driver hangs on running mllib word2vec

2015-01-04 Thread Eric Zhen
Hi, when we run MLlib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output: main prio=10 tid=0x40112800 nid=0x46f2 runnable [0x4162e000] java.lang.Thread.State: RUNNABLE at