Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
That's not work. I don't think it is just slow, It never ends(with 30+ hours, and I killed it). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html Sent from the Apache Spark User List mailing list

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
is typically less than one second. Thanks for the help:) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4914.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Earthson
reduceByKey(_+_).countByKey instead of countByKey seems to be fast. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4870.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Earthson
(word_mapping.value.toSeq).saveAsTextFile(hdfs://ns1/nlp/word_mapping) mapped.saveAsTextFile(hdfs://ns1/nlp/lda/wiki.docs) spark.stop() } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4871.html Sent from

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Aaron Davidson
.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4871.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:91) ... 10 more -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4868.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Earthson Lu
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile(hdfs://ns1/nlp/word_mapping) this line is too slow. There are about 2 million elements in word_mapping. *Is there a good style for writing a large collection to hdfs?* import org.apache.spark._ import SparkContext._ import

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:

Re: parallelize for a large Seq is extreamly slow.

2014-04-24 Thread Earthson
) mapped.saveAsTextFile(hdfs://ns1/nlp/lda/wiki.docs) spark.stop() } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4809.html Sent from the Apache Spark User List mailing list archive at Nabble.com.