Hi Xiangrui,

Our dataset is about 80 GB (10B lines).

In the driver's log, we found this:

*INFO Word2Vec: trainWordsCount = -1610413239*

It seems that there is an integer overflow?
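
If trainWordsCount is accumulated in a 32-bit Int, any corpus with more than ~2.1 billion training words would wrap it negative. A minimal, self-contained sketch of that wrap-around (not Spark's actual code; the word counts below are hypothetical):

object TrainWordsCountOverflow {
  def main(args: Array[String]): Unit = {
    // Hypothetical counts: 3,000 vocabulary words, each occurring
    // 1,000,000 times -> 3,000,000,000 total, which exceeds
    // Int.MaxValue (2,147,483,647).
    val wordCounts = Array.fill(3000)(1000000)

    var asInt: Int   = 0   // 32-bit accumulator, like an Int trainWordsCount
    var asLong: Long = 0L  // 64-bit accumulator for comparison
    for (c <- wordCounts) {
      asInt  += c  // silently wraps past Int.MaxValue
      asLong += c  // Long holds counts well into the billions
    }
    println(s"Int accumulator:  $asInt")   // -1294967296 (i.e. 3e9 - 2^32)
    println(s"Long accumulator: $asLong")  // 3000000000 (correct)
  }
}

With 10B lines our total word count is far past Int.MaxValue either way, so accumulating into a Long looks like the obvious fix.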


On Tue, Jan 6, 2015 at 5:44 AM, Xiangrui Meng <men...@gmail.com> wrote:

> How big is your dataset, and what is the vocabulary size? -Xiangrui
>
> On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen <zhpeng...@gmail.com> wrote:
> > Hi,
> >
> > When we run MLlib Word2Vec (spark-1.1.0), the driver gets stuck with 100%
> > CPU usage. Here is the jstack output:
> >
> > "main" prio=10 tid=0x0000000040112800 nid=0x46f2 runnable
> > [0x000000004162e000]
> >    java.lang.Thread.State: RUNNABLE
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1847)
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1778)
> >         at java.io.DataOutputStream.writeInt(DataOutputStream.java:182)
> >         at java.io.DataOutputStream.writeFloat(DataOutputStream.java:225)
> >         at java.io.ObjectOutputStream$BlockDataOutputStream.writeFloats(ObjectOutputStream.java:2064)
> >         at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1310)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1154)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1518)
> >         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1483)
> >         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1400)
> >         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1158)
> >         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:330)
> >         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
> >         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
> >         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
> >         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
> >         at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
> >         at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
> >         at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
> >         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >         at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)
> >         at com.baidu.inf.WordCount$.main(WordCount.scala:31)
> >         at com.baidu.inf.WordCount.main(WordCount.scala)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> >         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> >         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >
> > --
> > Best Regards
>



-- 
Best Regards
