ctException) caught when processing request: connection timed out
2015-10-26 11:45:36,600 INFO org.apache.commons.httpclient.HttpMethodDirector:
Retrying request
--
Earthson Lu
On October 26, 2015 at 15:30:21, Deng Ching-Mallete (och...@apache.org) wrote:
Hi Earthson,
Unfortunately, attachments aren't allowed in the list s
We are using Spark 1.5.1 with `--master yarn`; the YARN RM is running in HA mode.
direct visit
click ApplicationMaster link
YARN RM log
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-Client-Can-not-access-SparkUI-tp25197.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I’ve recompiled spark-1.4.0 with fasterxml-2.5.x, it works fine now:)
--
Earthson Lu
On June 12, 2015 at 23:24:32, Sean Owen (so...@cloudera.com) wrote:
I see the same thing in an app that uses Jackson 2.5. Downgrading to
2.4 made it work. I meant to go back and figure out if there's
I'm using Play-2.4 with play-json-2.4. It works fine with spark-1.3.1, but it
fails after I upgrade Spark to spark-1.4.0 :(
sc.parallelize(1 to 1).count
[info] com.fasterxml.jackson.databind.JsonMappingException: Could not find
creator property with name 'id' (in class
large batches with parallelism inside each batch (it seems to be the
way SGD is implemented in MLlib?).
--
Earthson Lu
On December 16, 2014 at 04:02:22, Imran Rashid (im...@therashids.com) wrote:
I'm a little confused by some of the responses. It seems like there are two
different issues being
I think it could be done like:
1. use mapPartitions to randomly drop some partitions
2. randomly drop some elements (within each selected partition)
3. calculate the gradient step for the selected elements
I don't think a fixed step is needed, but a fixed step could be done:
1. zipWithIndex
2. create a ShuffledRDD
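The three sampling steps above can be sketched in plain Python. This is only an illustration of the logic, not the actual Spark mapPartitions API; the toy 1-D least-squares model, the function name, and all parameters are mine:

```python
import random

def minibatch_gradient_step(partitions, w, lr=0.5,
                            partition_frac=0.5, element_frac=0.3):
    # 1. randomly drop some partitions
    kept = [p for p in partitions if random.random() < partition_frac]
    # 2. randomly drop some elements within each selected partition
    sample = [xy for p in kept for xy in p if random.random() < element_frac]
    if not sample:
        return w  # nothing sampled this round; keep the current weight
    # 3. calculate the gradient step for the selected elements
    #    (1-D least squares: loss = (w*x - y)^2, grad = 2*(w*x - y)*x)
    grad = sum(2 * (w * x - y) * x for x, y in sample) / len(sample)
    return w - lr * grad

random.seed(0)
# toy data: y = 3x with x in [0, 1), split into 10 "partitions" of 10 elements
xs = [i / 100 for i in range(100)]
data = [[(x, 3 * x) for x in xs[p * 10:(p + 1) * 10]] for p in range(10)]
w = 0.0
for _ in range(300):
    w = minibatch_gradient_step(data, w)
print(round(w, 3))  # converges to 3.0 (labels are noise-free)
```

Because the labels are exact, w = 3 is a fixed point no matter which subset is sampled, so the randomness only affects convergence speed.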
Is there any way to get the yarn application_id inside the program?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-applicationId-for-yarn-mode-both-yarn-client-and-yarn-cluster-mode-tp19462.html
Finally, I've found two ways:
1. search the output for something like `Submitted application
application_1416319392519_0115`
2. use a specific AppName, so we can query the ApplicationID from YARN
--
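The first approach (searching the submission output) can be sketched like this. The log line below is a made-up example in the shape the YARN client prints, and the regex is an assumption about that format:

```python
import re

# Hypothetical spark-submit/yarn client output; the "Submitted application"
# line is the one to search for.
output = """15/10/26 11:45:36 INFO impl.YarnClientImpl:
Submitted application application_1416319392519_0115"""

# Extract the application id with a regex over the captured output.
m = re.search(r"Submitted application (application_\d+_\d+)", output)
app_id = m.group(1) if m else None
print(app_id)  # application_1416319392519_0115
```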
I'm trying to provide an API for Java users: I need to accept their
JavaSchemaRDDs and convert them to SchemaRDD for Scala users.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Convert-JavaSchemaRDD-to-SchemaRDD-tp16482p16641.html
I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].
Finally I found this solution:
jrdd.registerTempTable("transform_tmp")
jrdd.sqlContext.sql("select * from transform_tmp")
Could anyone tell me
I am using PySpark with IPython notebook.
data = sc.parallelize(range(1000), 10)
# successful
data.map(lambda x: x + 1).collect()
# Error
data.count()
Something similar:
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-on-simple-pyspark-script-td3415.html
But it does not
I'm running pyspark with Python 2.7.8 under Virtualenv
System Python Version: Python 2.6.x
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Python-2-7-8-Spark-1-0-2-count-with-TypeError-an-integer-is-required-tp12643p12645.html
Do I have to deploy Python to every machine to make $PYSPARK_PYTHON work
correctly?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-Python-2-7-8-Spark-1-0-2-count-with-TypeError-an-integer-is-required-tp12643p12651.html
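As far as I know, PYSPARK_PYTHON must resolve to an interpreter that exists at the same path on every worker node (for example a virtualenv deployed to each machine, or one on a shared mount). A minimal sketch; the interpreter path below is a hypothetical example:

```python
import os

# Hypothetical interpreter path; it must exist at this same path on every
# worker node for PySpark tasks to start with that Python.
os.environ["PYSPARK_PYTHON"] = "/opt/venvs/py27/bin/python"
print(os.environ["PYSPARK_PYTHON"])
```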
I'm using SparkSQL with Hive 0.13; here is the SQL for inserting a partition
with 2048 buckets.
sqlsc.set("spark.sql.shuffle.partitions", "2048")
hql("""|insert %s table mz_log
       |PARTITION (date='%s')
       |select * from tmp_mzlog
The step
spark.MapOutputTrackerMasterActor: Asked to send map output locations for
shuffle 0 to ...
takes too much time. What should I do? What is the correct configuration?
The BlockManager times out if I use a small number of reduce partitions.
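In Spark versions of this era the map output statuses were shipped through Akka, so a very large number of shuffle partitions can push against the Akka frame size and timeouts. A hedged spark-defaults.conf sketch; the property names and values are Spark 1.x-era assumptions and should be checked against your version's configuration docs:

```
# spark-defaults.conf (Spark 1.x property names; verify for your version)
spark.akka.frameSize   64     # MB; map output statuses grow with partition count
spark.akka.timeout     300    # seconds
```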
I've just had the same problem.
I'm using
$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client $JOBJAR
--class $JOBCLASS
It's really strange, because the log shows that
pre
14/07/22 16:16:58 INFO ui.SparkUI: Started SparkUI at
http://k1227.mzhen.cn:4040
14/07/22 16:16:58
That's exactly what my problem is :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-spark-submit-command-hangs-tp10308p10394.html
spark-submit has an argument --num-executors to set the number of
executors, but how could I set it from anywhere else?
We're using Shark, and want to change the number of executors. The number of
executors seems to be the same as the number of workers by default?
Shall we configure the executor number manually? (Is
--num-executors seems to be available with YARN only.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-set-the-number-of-executor-tp7990p7992.html
I have a problem with the add jar command:
hql("add jar /.../xxx.jar")
Error:
Exception in thread main java.lang.AssertionError: assertion failed: No
plan for AddJar ...
How could I do this job with HiveContext? I can't find any API to do it.
Does SparkSQL with Hive support UDF/UDAF?
--
RDD is not cached? Because recomputing may be required, every broadcast
object is included in the dependencies of RDDs; this may also cause memory
issues (when n and kv are too large, as in your case).
--
.set("spark.cleaner.ttl", "120") drops broadcast_0, which causes the exception
below. It is strange, because broadcast_0 is no longer needed (I have
broadcast_3 instead), and the recent RDD is persisted, so there should be no
need for recomputing... What is the problem? I need help.
~~~
14/05/05 17:03:12 INFO
Using checkpoint works. It removes the dependencies :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5368.html
RDD.checkpoint works fine, but spark.cleaner.ttl is really ugly for broadcast
cleaning. Maybe a broadcast could be removed automatically when nothing
depends on it anymore.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast-tp5350p5369.html
Yes, I've tried.
The problem is that a new broadcast object is generated by every step, until
they eat up all of the memory.
I solved it by using RDD.checkpoint to remove the dependences on old
broadcast objects, and using cleaner.ttl to clean up these broadcast objects
automatically.
If there's a more elegant way to
checkpoint seems to just add a checkpoint mark? You need an action after
marking it. I have tried it with success :)
newRdd = oldRdd.map(myFun).persist(myStorageLevel)
newRdd.checkpoint // checkpoint here
newRdd.isCheckpointed // false here
newRdd.foreach(x => {}) // force evaluation
thx for the help, unpersist is exactly what I want :)
I see that Spark will remove some cached data automatically when memory is
full; it would be much more helpful if the eviction rule were something like
LRU.
It seems that persist and cache are some kind of lazy?
--
A new broadcast object will be generated for every iteration step; it may eat
up the memory and make persist fail.
The broadcast objects should not be removed, because the RDD may be
recomputed. But I am trying to prevent recomputing the RDD, and that needs
the old broadcasts to release some memory.
I've tried to set
Code Here
https://github.com/Earthson/sparklda/blob/dev/src/main/scala/net/earthson/nlp/lda/lda.scala#L121
Finally, the iteration still runs into recomputing...
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cache-issue-for-iteration-with-broadcast
I tried using serialization instead of broadcast, and my program exits with
an error (beyond physical memory limits).
The large object cannot be released by GC because it is needed for
recomputing? So what is the recommended way to solve this problem?
--
:)
https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala#L99
http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache1.png
http://apache-spark-user-list.1001560.n3.nabble.com/file/n5292/sparklda_cache2.png
--
The code is here:
https://github.com/Earthson/sparklda/blob/master/src/main/scala/net/earthson/nlp/lda/lda.scala
I've changed it from Broadcast to Serializable, and now it works :) But there
are too many RDD caches; is that the problem?
--
The problem is that this object can't be Serializable: it holds an RDD field
and a SparkContext, but Spark shows an error saying it needs serialization.
The order of my debug output is really strange.
~
Training Start!
Round 0
Hehe?
Hehe?
started?
failed?
Round 1
Hehe?
~
here is my code
I've moved the SparkContext and RDD to be parameters of train. And now it
tells me that the SparkContext needs to be serialized!
I think the problem is that the RDD is trying to make itself lazy, and some
Broadcast objects need to be generated dynamically, so the closure has the
SparkContext inside, so the task
That doesn't work. I don't think it is just slow; it never ends (30+ hours,
and I killed it).
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html
It's my fault! I uploaded the wrong jar when I changed the number of
partitions, and now it just works fine :)
The size of word_mapping is 2,444,185.
Will it really take a very long time for large object serialization? I don't
think two million is very large, because the cost at local for such a size is
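A rough local baseline supports that intuition: serializing a couple of million small elements on one machine takes seconds, so the collection size alone doesn't explain a multi-hour parallelize. A plain-Python illustration, not a Spark benchmark; the stand-in element shape is made up:

```python
import pickle
import time

n = 2_444_185  # size of word_mapping from the thread
# stand-in elements; the real word_mapping pairs are not reproduced here
data = [(i, i * 3) for i in range(n)]

t0 = time.time()
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
elapsed = time.time() - t0

# ~2.4M pairs pickle locally in well under a minute on typical hardware
print(len(data), len(blob) > 0, elapsed < 60)
```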
reduceByKey(_+_).countByKey instead of countByKey seems to be fast.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4870.html
parallelize is still so slow.
package com.semi.nlp
import org.apache.spark._
import SparkContext._
import scala.io.Source
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
class MyRegistrator extends KryoRegistrator {
override def
I've tried to set a larger buffer, but reduceByKey still seems to fail. Need
help :)
14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down
all executors
14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each
executor to shut down
14/04/26 12:31:12 INFO
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
This line is too slow. There are about 2 million elements in word_mapping.
*Is there a good style for writing a large collection to HDFS?*
import org.apache.spark._
import SparkContext._
import
Kryo fails with the exception below:
com.esotericsoftware.kryo.KryoException
(com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
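The "Buffer overflow. Available: 0, required: 1" raised from Output.require usually means the Kryo output buffer has hit its configured cap. In early Spark 1.x the relevant property was spark.kryoserializer.buffer.mb (later versions split it into spark.kryoserializer.buffer and spark.kryoserializer.buffer.max). A hedged spark-defaults.conf sketch; check the property name and a sensible value against your version's docs:

```
# spark-defaults.conf (early Spark property name; verify for your version)
spark.kryoserializer.buffer.mb   256
```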