I think it is an overflow. The training data is quite big, and the algorithm's scalability depends heavily on the vocabulary size (vocabSize). Even without the overflow, there are still other bottlenecks: for example, syn0Global and syn1Global, each of which has vocabSize * vectorSize elements.
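The negative trainWordsCount reported below is consistent with a 32-bit wraparound when the word count is accumulated into an Int. A minimal plain-Scala sketch of both issues; the vocabSize and vectorSize values here are hypothetical, chosen only to illustrate the arithmetic:

```scala
object Word2VecScaleSketch {
  def main(args: Array[String]): Unit = {
    // Int.MaxValue is ~2.15e9, so a corpus with more words than that
    // wraps around when summed into an Int. For example, a true count of
    // 2,684,554,057 words wraps to exactly the value seen in the log:
    val trueWordCount = 2684554057L
    println(s"as Int: ${trueWordCount.toInt}")   // prints -1610413239

    // Driver-side memory for the two weight arrays grows with the vocabulary.
    // Hypothetical sizes for illustration:
    val vocabSize  = 10000000L                    // assume 10M distinct words
    val vectorSize = 100L                         // assume 100-dim vectors
    val floats = 2L * vocabSize * vectorSize      // syn0Global + syn1Global
    val gib = floats * 4.0 / (1L << 30)           // 4 bytes per Float
    println(f"weight arrays: $gib%.1f GiB")
  }
}
```

Note that 2,684,554,057 is only one preimage of the logged value; any true count congruent to it modulo 2^32 would produce the same negative number.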
Thanks.
Zhan Zhang

On Jan 5, 2015, at 7:47 PM, Eric Zhen <zhpeng...@gmail.com> wrote:

> Hi Xiangrui,
>
> Our dataset is about 80 GB (10B lines).
>
> In the driver's log, we found this:
>
>   INFO Word2Vec: trainWordsCount = -1610413239
>
> It seems that there is an integer overflow?
>
> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:
> How big is your dataset, and what is the vocabulary size? -Xiangrui
>
> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
> > Hi,
> >
> > When we run MLlib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
> > usage. Here is the jstack output:
> >
> > "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable [0x000000004162e000]
> >    java.lang.Thread.State: RUNNABLE
> >     at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> >     at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> >     at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> >     at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> >     at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> >     at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >     at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >     at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> >     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> >     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> >     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> >     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> >     at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> >     at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> >     at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> >     at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >     at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> >     at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> >     at com.baidu.inf.WordCount.main(WordCount.scala)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> >     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> >     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > --
> > Best Regards
>
> --
> Best Regards
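For reference, the jstack above shows the driver inside ObjectOutputStream.writeFloats, reached via ClosureCleaner: the weight arrays captured by the training closure are Java-serialized on the driver, on a single thread. A scaled-down, Spark-free stand-in for that cost (array sizes here are hypothetical, much smaller than a real vocabulary):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureSerializationCost {
  def main(args: Array[String]): Unit = {
    // Stand-in for syn0Global: a flat Float array of vocabSize * vectorSize.
    // Assumed demo sizes: 100k words x 100 dims = 10M floats (~40 MB).
    val vocabSize  = 100000
    val vectorSize = 100
    val syn0 = new Array[Float](vocabSize * vectorSize)

    // Java serialization, as JavaSerializerInstance does for the closure.
    val bytes = new ByteArrayOutputStream()
    val out   = new ObjectOutputStream(bytes)
    out.writeObject(syn0)
    out.close()

    // Roughly 4 bytes per Float plus a small object header; with a real
    // vocabulary this work repeats every iteration, pinning one driver core.
    println(s"serialized size: ${bytes.size()} bytes")
  }
}
```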