Re: [Yarn-Client] Cannot access SparkUI

2015-10-26 Thread Earthson Lu
…ctException) caught when processing request: connection timeout. 2015-10-26 11:45:36,600 INFO org.apache.commons.httpclient.HttpMethodDirector: Retrying request -- Earthson Lu. On October 26, 2015 at 15:30:21, Deng Ching-Mallete (och...@apache.org) wrote: Hi Earthson, Unfortunately, attachments aren't allowed in the list s

[Yarn-Client] Cannot access SparkUI

2015-10-26 Thread Earthson
We are using Spark 1.5.1 with `--master yarn`; the YARN RM is running in HA mode. (The original message referenced attachments: a direct visit to the UI, clicking the ApplicationMaster link, and the YARN RM log.)

Re: [Spark-1.4.0]jackson-databind conflict?

2015-06-14 Thread Earthson Lu
I've recompiled spark-1.4.0 with fasterxml 2.5.x; it works fine now :) -- Earthson Lu. On June 12, 2015 at 23:24:32, Sean Owen (so...@cloudera.com) wrote: I see the same thing in an app that uses Jackson 2.5. Downgrading to 2.4 made it work. I meant to go back and figure out if there's
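
One alternative to rebuilding Spark, sketched below for an sbt 0.13 build (the pinned version 2.4.4 and the exact module list are assumptions, not from the thread): force a single Jackson version across the application and Spark's transitive dependencies instead of letting the two pull in conflicting ones.

    // build.sbt (sbt 0.13 syntax; newer sbt versions use a Seq here)
    // Pin one Jackson version for both the app and Spark's transitive deps.
    dependencyOverrides ++= Set(
      "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.4.4",
      "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.4.4",
      "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"
    )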

[Spark-1.4.0]jackson-databind conflict?

2015-06-12 Thread Earthson
I'm using Play 2.4 with play-json 2.4. It works fine with spark-1.3.1, but it fails after I upgrade Spark to spark-1.4.0 :( Even `sc.parallelize(1 to 1).count` fails: [info] com.fasterxml.jackson.databind.JsonMappingException: Could not find creator property with name 'id' (in class

Re: what is the best way to implement mini batches?

2014-12-15 Thread Earthson Lu
…large batch, with parallelism inside each batch (it seems to be the way SGD is implemented in MLlib?). -- Earthson Lu. On December 16, 2014 at 04:02:22, Imran Rashid (im...@therashids.com) wrote: I'm a little confused by some of the responses. It seems like there are two different issues being

Re: what is the best way to implement mini batches?

2014-12-14 Thread Earthson
I think it could be done like this: 1. use mapPartitions to randomly drop some partitions; 2. randomly drop some elements (within the selected partitions); 3. calculate the gradient step for the selected elements. I don't think a fixed step is needed, but a fixed step could be done with: 1. zipWithIndex; 2. create a ShuffledRDD
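
A minimal sketch of the random-drop idea above (the function and parameter names are illustrative, not from the thread): keep each partition with some probability, keep each element of a kept partition with some probability, then sum a user-supplied per-element gradient over the resulting mini batch.

    import scala.util.Random
    import org.apache.spark.rdd.RDD

    // Sample a mini batch by dropping partitions and elements at random,
    // then aggregate a per-element gradient over what survives.
    // Note: if nothing survives the sampling, reduce() will fail, so the
    // fractions need to be large enough for the data at hand.
    def miniBatchGradient(data: RDD[Array[Double]],
                          keepPartition: Double,
                          keepElement: Double,
                          grad: Array[Double] => Array[Double]): Array[Double] =
      data.mapPartitions { iter =>
        if (Random.nextDouble() < keepPartition)
          iter.filter(_ => Random.nextDouble() < keepElement)
        else
          Iterator.empty
      }.map(grad)
       .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })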

How to get applicationId for yarn mode (both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Is there any way to get the yarn application_id inside the program?

Re: How to get applicationId for yarn mode (both yarn-client and yarn-cluster mode)

2014-11-21 Thread Earthson
Finally, I've found two ways: 1. search the output for something like `Submitted application application_1416319392519_0115`; 2. use a specific AppName and query the (YARN) ApplicationID by it.
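
For reference, later Spark releases expose the id directly on the context; a minimal sketch (assuming a Spark version that provides SparkContext.applicationId):

    import org.apache.spark.{SparkConf, SparkContext}

    // On YARN this returns the id in the familiar
    // application_<clusterTimestamp>_<sequence> form.
    val sc = new SparkContext(new SparkConf().setAppName("app-id-demo"))
    println(s"applicationId = ${sc.applicationId}")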

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Earthson
I'm trying to provide an API for Java users, so I need to accept their JavaSchemaRDDs and convert them to SchemaRDD for the Scala side.

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
I don't know why JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I found this workaround: `jrdd.registerTempTable("transform_tmp"); jrdd.sqlContext.sql("select * from transform_tmp")`. Could anyone tell me
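
A sketch of that workaround as a helper, assuming the Spark 1.x SchemaRDD API referenced in the post ("transform_tmp" is the temp-table name used there; a real API would want a unique name):

    import org.apache.spark.sql.SchemaRDD
    import org.apache.spark.sql.api.java.JavaSchemaRDD

    // Register the JavaSchemaRDD under a temporary table name, then
    // re-select it through the Scala SQLContext to obtain a SchemaRDD.
    def toSchemaRDD(jrdd: JavaSchemaRDD): SchemaRDD = {
      jrdd.registerTempTable("transform_tmp")
      jrdd.sqlContext.sql("SELECT * FROM transform_tmp")
    }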

[PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I am using PySpark with an IPython notebook. `data = sc.parallelize(range(1000), 10)`; `data.map(lambda x: x+1).collect()` succeeds, but `data.count()` raises the error. Something similar: http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html, but it does not

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
I'm running pyspark with Python 2.7.8 under a virtualenv; the system Python version is 2.6.x.

Re: [PySpark][Python 2.7.8][Spark 1.0.2] count() with TypeError: an integer is required

2014-08-22 Thread Earthson
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work correctly?

[Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
I'm using SparkSQL with Hive 0.13; here is the SQL for inserting into a partition with 2048 buckets: sqlsc.set("spark.sql.shuffle.partitions", "2048") hql(|insert %s table mz_log |PARTITION (date='%s') |select * from tmp_mzlog

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
The log line `spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to ...` takes too much time; what should I do? What is the correct configuration? The blockManager times out if I use a small number of reduce partitions.

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
I've just hit the same problem. I'm using `$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client $JOBJAR --class $JOBCLASS`. It's really strange, because the log shows: 14/07/22 16:16:58 INFO ui.SparkUI: Started SparkUI at http://k1227.mzhen.cn:4040 14/07/22 16:16:58

Re: Why spark-submit command hangs?

2014-07-22 Thread Earthson
That's what my problem is :)

How could I set the number of executors?

2014-06-20 Thread Earthson
spark-submit has an argument, --num-executors, to set the number of executors, but how could I set it from anywhere else? We're using Shark and want to change the number of executors. The number of executors seems to be the same as the number of workers by default? Shall we configure the executor number manually (Is

Re: How could I set the number of executors?

2014-06-20 Thread Earthson
--num-executors seems to be available for YARN only.
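
For completeness, a minimal sketch of setting the same knob through configuration rather than the command line (this assumes a YARN deployment and a Spark version that honors spark.executor.instances):

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.instances is the configuration-property form of the
    // --num-executors flag; it can also go in spark-defaults.conf.
    val conf = new SparkConf()
      .setAppName("fixed-executor-count")
      .set("spark.executor.instances", "8")
    val sc = new SparkContext(conf)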

How to add jar with SparkSQL HiveContext?

2014-06-17 Thread Earthson
I have a problem with the add jar command: hql("add jar /.../xxx.jar") Error: Exception in thread "main" java.lang.AssertionError: assertion failed: No plan for AddJar ... How could I do this with HiveContext? I can't find any API for it. Does SparkSQL with Hive support UDF/UDAF?
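
One thing that is sometimes tried, sketched below, though the thread does not confirm it as a fix: ship the jar through the SparkContext rather than through Hive's ADD JAR (the jar path is the placeholder from the post).

    import org.apache.spark.{SparkConf, SparkContext}

    // Distribute the jar to the executors via the SparkContext.
    val sc = new SparkContext(new SparkConf().setAppName("add-jar-demo"))
    sc.addJar("/.../xxx.jar")
    // Alternatively, pass jars at submit time with `spark-submit --jars`,
    // which puts them on both the driver and executor classpaths.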

Re: problem about broadcast variable in iteration

2014-05-15 Thread Earthson
Is the RDD not cached? Because recomputation may be required, every broadcast object is included in the dependencies of the RDDs; this may also cause a memory issue (when n and kv are too large, as in your case).

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
.set("spark.cleaner.ttl", "120") drops broadcast_0, which causes the exception below. It is strange, because broadcast_0 is not needed (I have broadcast_3 instead) and the recent RDD is persisted, so there is no need for recomputation... What is the problem? Need help. ~~~ 14/05/05 17:03:12 INFO

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Using checkpoint. It removes the dependencies :)

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe broadcasts could be removed automatically when there are no dependencies left.

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Earthson
Yes, I've tried. The problem is that a new broadcast object is generated at every step until all the memory is eaten up. I solved it by using RDD.checkpoint to remove the dependencies on old broadcast objects, and cleaner.ttl to clean these broadcast objects up automatically. If there's a more elegant way to
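
A sketch of the overall pattern discussed in this thread (all names, sizes, and the update itself are illustrative): re-broadcast the model each iteration, checkpoint the RDD so the old lineage, and with it the old broadcast, is no longer needed, and unpersist the broadcast explicitly instead of relying on spark.cleaner.ttl.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-iteration"))
    sc.setCheckpointDir("/tmp/checkpoints")        // assumes a writable directory

    var model: Double = 0.0
    var data = sc.parallelize(1 to 1000000).map(_.toDouble).persist()

    for (_ <- 0 until 10) {
      val bc = sc.broadcast(model)
      val next = data.map(x => x + bc.value).persist()
      next.checkpoint()
      next.count()           // action to force evaluation and the checkpoint
      model = next.mean()
      data.unpersist()       // previous iteration's RDD is no longer needed
      bc.unpersist()         // this step's broadcast, freed once next is checkpointed
      data = next
    }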

Re: Incredibly slow iterative computation

2014-05-05 Thread Earthson
checkpoint seems to just add a checkpoint mark? You need an action after marking it. I have tried it with success :) newRdd = oldRdd.map(myFun).persist(myStorageLevel) newRdd.checkpoint // checkpoint here newRdd.isCheckpointed // still false here newRdd.foreach(x => {}) // force evaluation

Re: cache not working as expected for iteration?

2014-05-04 Thread Earthson
Thanks for the help, unpersist is exactly what I want :) I see that Spark will remove some cached data automatically when memory is full; it would be much more helpful if the eviction rule were something like LRU. It seems that persist and cache are somewhat lazy?

Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
A new broadcast object is generated for every iteration step; it may eat up the memory and make persist fail. The broadcast objects should not simply be removed, because the RDD may be recomputed. And I am trying to prevent recomputing the RDD, which needs the old broadcasts to release some memory. I've tried to set

Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
Code here: https://github.com/Earthson/sparklda/blob/dev/src/main/scala/net/earthson/nlp/lda/lda.scala#L121 Finally, the iteration still runs into recomputation...

Re: Cache issue for iteration with broadcast

2014-05-04 Thread Earthson
I tried using serialization instead of broadcast, and my program exited with an error (beyond physical memory limits). Can the large object not be released by GC because it is needed for recomputation? So what is the recommended way to solve this problem?

cache not working as expected for iteration?

2014-05-03 Thread Earthson
:) Code: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala#L99 Screenshots: http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache1.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache2.png

Re: Why does Spark require this object to be serializable?

2014-04-29 Thread Earthson
The code is here: https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala I've changed it from Broadcast to Serializable. Now it works :) But there are too many RDD caches; is that the problem?

Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
The problem is that this object can't be Serializable: it holds an RDD field and a SparkContext. But Spark shows an error saying it needs serialization. The order of my debug output is really strange. ~ Training Start! Round 0 Hehe? Hehe? started? failed? Round 1 Hehe? ~ here is my code 69

Re: Why does Spark require this object to be serializable?

2014-04-28 Thread Earthson
I've moved the SparkContext and RDD to be parameters of train. And now it tells me that the SparkContext needs to be serialized! I think the problem is that the RDD is trying to stay lazy, and some broadcast objects need to be generated dynamically, so the closure has the SparkContext inside, so the task complete
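
The usual fix for this kind of task-serialization problem is sketched below (the class and field names are illustrative, not from the thread): a class holding a SparkContext cannot be shipped to executors, so copy whatever a closure needs into local vals and let the closure capture only those, never `this`.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    class Trainer(sc: SparkContext, alpha: Double) {
      def train(data: RDD[Double]): Double = {
        // Local copy: the closure below captures this plain value instead
        // of `this` (which would drag in the SparkContext field).
        val localAlpha = alpha
        data.map(x => x * localAlpha).reduce(_ + _)
      }
    }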

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (30+ hours, and I killed it).

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
It's my fault! I uploaded the wrong jar when I changed the number of partitions, and now it just works fine :) The size of word_mapping is 2444185. So does serialization of a large object take a very long time? I don't think two million is very large, because the local cost for such a size is

Re: parallelize for a large Seq is extremely slow.

2014-04-26 Thread Earthson
reduceByKey(_+_).countByKey instead of countByKey seems to be fast.

Re: parallelize for a large Seq is extremely slow.

2014-04-26 Thread Earthson
parallelize is still so slow. package com.semi.nlp import org.apache.spark._ import SparkContext._ import scala.io.Source import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator class MyRegistrator extends KryoRegistrator { override def
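
The MyRegistrator class above is cut off by the archive; a sketch of what such a registrator typically looks like (the registered classes are illustrative, not the ones from the original post):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Registering the classes that will actually be serialized lets Kryo
    // write compact class ids instead of full class names.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[Array[String]])
        kryo.register(classOf[scala.collection.mutable.HashMap[String, Int]])
      }
    }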

Re: parallelize for a large Seq is extremely slow.

2014-04-25 Thread Earthson
I've tried to set a larger buffer, but reduceByKey seems to fail. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO

parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson Lu
This line is too slow: spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping"). There are about 2 million elements in word_mapping. *Is there a good way to write a large collection to HDFS?* import org.apache.spark._ import SparkContext._ import

Re: parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson
Kryo fails with the exception below: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
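
The usual response to this Kryo buffer overflow is to raise the serializer buffer limit; a sketch follows. The property names are the current ones (spark.kryoserializer.buffer.max); very old releases spelled them with a ".mb" suffix, so treat the exact keys for Spark 1.0 as an assumption.

    import org.apache.spark.SparkConf

    // Enable Kryo and raise the maximum buffer so large objects
    // (such as the ~2.4M-entry word_mapping in this thread) fit.
    val conf = new SparkConf()
      .setAppName("kryo-buffer")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "256m")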