I think it is an integer overflow. The training data is quite big, and the
algorithm's scalability depends heavily on vocabSize. Even without the
overflow there are other bottlenecks: for example, syn0Global and syn1Global
each have vocabSize * vectorSize elements.
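
To make both points concrete, here is a rough sketch in plain Java. It is not
the MLlib code; the class name, method names, and the sample counts and
vocabulary size are hypothetical and chosen only for illustration. It shows
(1) how summing word counts into a 32-bit int can wrap around to the negative
value in the driver log, and (2) how large one vocabSize * vectorSize array
of floats becomes. Note that the logged value only tells us the true total
modulo 2^32, so 2,684,554,057 is just one total that would produce it.

```java
public class Word2VecSizing {

    // Sums per-word counts into a 32-bit int. Java int arithmetic wraps
    // silently on overflow, so a large corpus can yield a negative total.
    static int sumAsInt(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Same sum with a 64-bit accumulator, which does not wrap at this scale.
    static long sumAsLong(int[] counts) {
        long total = 0L;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Number of elements in one vocabSize * vectorSize array. The cast to
    // long happens before the multiplication so the product cannot wrap.
    static long matrixElements(int vocabSize, int vectorSize) {
        return (long) vocabSize * vectorSize;
    }

    // Raw size of one such array of 4-byte floats, ignoring JVM overhead.
    static long matrixBytes(int vocabSize, int vectorSize) {
        return matrixElements(vocabSize, vectorSize) * 4L;
    }

    public static void main(String[] args) {
        // Hypothetical counts whose true sum is 2,684,554,057.
        int[] counts = {2_000_000_000, 684_554_057};
        System.out.println("int sum  = " + sumAsInt(counts));
        System.out.println("long sum = " + sumAsLong(counts));

        // Hypothetical vocabulary of 10 million words, 100-dimensional vectors.
        int vocabSize = 10_000_000;
        int vectorSize = 100;
        System.out.println("elements = " + matrixElements(vocabSize, vectorSize));
        System.out.println("bytes    = " + matrixBytes(vocabSize, vectorSize));

        // A single Java array is indexed by int, so it cannot hold more
        // than Integer.MAX_VALUE elements.
        System.out.println("fits in one array = "
                + (matrixElements(30_000_000, vectorSize) <= Integer.MAX_VALUE));
    }
}
```

With those assumed numbers, each of the two arrays is about 4 GB of raw
floats. The jstack output also shows float arrays being written while the
closure is serialized on the driver, which would be consistent with arrays
of this size making that step very slow.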

Thanks.

Zhan Zhang


On Jan 5, 2015, at 7:47 PM, Eric Zhen <zhpeng...@gmail.com> wrote:

> Hi Xiangrui,
> 
> Our dataset is about 80 GB (10B lines).
> 
> In the driver's log, we found this:
> 
> INFO Word2Vec: trainWordsCount = -1610413239
> 
> It seems that there is an integer overflow?
> 
> 
> On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:
> How big is your dataset, and what is the vocabulary size? -Xiangrui
> 
> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
> > Hi,
> >
> > When we run MLlib Word2Vec (spark-1.1.0), the driver gets stuck at 100% CPU
> > usage. Here is the jstack output:
> >
> > "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable [0x000000004162e000]
> >    java.lang.Thread.State: RUNNABLE
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> >         at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> >         at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> >         at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> >         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> >         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> >         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> >         at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> >         at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> >         at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> >         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >         at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> >         at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> >         at com.baidu.inf.WordCount.main(WordCount.scala)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> >         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> >         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > --
> > Best Regards
> 
> 
> 
> -- 
> Best Regards


