That doesn't work. I don't think it is just slow; it never ends (I killed it
after 30+ hours).
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
is typically less than one second.
Thanks for the help. :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
reduceByKey(_+_).countByKey instead of countByKey seems to be fast.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4870.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
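For archive readers, the pattern suggested above can be sketched as follows. This is a hedged illustration, not the poster's exact code: the name pairs is hypothetical, assumed to be an RDD of (key, value) tuples. Pre-aggregating with reduceByKey on the executors means only one record per distinct key crosses the shuffle and reaches the driver, instead of one per input record.

    // Sketch: `pairs` is a hypothetical RDD[(String, Int)].
    // Pre-aggregate per key on the executors first.
    val counts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
    // Only one entry per distinct key is moved to the driver.
    val result = counts.collectAsMap()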
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
spark.stop()
}
}
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4871.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
)
at
org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91)
... 10 more
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4868.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
This line is too slow. There are about 2 million elements in word_mapping.
*Is there a good style for writing a large collection to HDFS?*
import org.apache.spark._
import SparkContext._
import
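One pattern that may help with the question above, sketched rather than verified: pass an explicit numSlices to parallelize (a real parameter of SparkContext.parallelize) so the 2 million driver-side elements are split across many tasks before being written, instead of being serialized as a few very large partitions. The slice count of 100 here is an arbitrary assumption.

    // Sketch: split the driver-side Seq into an explicit number of slices.
    val numSlices = 100 // assumption; tune for your cluster
    spark.parallelize(word_mapping.value.toSeq, numSlices)
      .saveAsTextFile("hdfs://ns1/nlp/word_mapping")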
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see
http://spark.apache.org/docs/0.9.1/tuning.html); it should be considerably
faster.
Matei
On Apr 24, 2014, at 8:01 PM, Earthson Lu <earthson...@gmail.com> wrote:
)
mapped.saveAsTextFile("hdfs://ns1/nlp/lda/wiki.docs")
spark.stop()
}
}
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4809.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
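For readers of the archive, Matei's suggestion amounts to something like the following sketch against the Spark 0.9.x SparkConf API; the application name is a placeholder.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("word-mapping") // placeholder name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val spark = new SparkContext(conf)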