Hi,
According to the Spark UI, one worker was lost after a failed job. It is not
a "lost executor" error; rather, the UI now shows only 8 workers (I have 9
workers). However, the EC2 console shows the machine as "running" with no
status-check alarms. So I am confused about how I could reconnect the lost
Hi,
I have a question about Array[T].distinct on a customized class T. My data is
an RDD[(String, Array[T])] in which T is a class I wrote myself.
There are some duplicates in each Array[T] that I want to remove, so I
overrode the equals() method in T and used
val dataNoDuplicates = dataDu
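(A side note with a minimal, hypothetical T rather than the real class: Scala's
Array.distinct deduplicates through a hash set, so it consults hashCode() as well
as equals(); overriding equals() alone is usually not enough.)

class T(val id: String, val value: Double) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: T => this.id == that.id && this.value == that.value
    case _       => false
  }
  // keep hashCode consistent with equals, otherwise distinct won't deduplicate
  override def hashCode(): Int = (id, value).hashCode()
}

// hypothetical RDD name; dedupe each array with mapValues
// val dataNoDuplicates = data.mapValues(_.distinct)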
Hi,
For security reasons, we added a server between my AWS Spark cluster and my
local machine, so I can't connect to the cluster directly. To see the Spark UI
and the related workers' stdout and stderr, I used dynamic forwarding and
configured a SOCKS proxy. Now I can see the Spark UI using the internal
Hi,
I am wondering whether there is a way to direct some of the workers' stdout to
one place, instead of having it scattered across each worker's own stdout. For
example, I have the following code:
rdd.foreach { line =>
  try {
    // do something with the line
  } catch {
    case e: Exception => println(line)
  }
}
Every time I want to check what's causing t
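(One possible pattern, sketched with a hypothetical process() standing in for
"do something": return the failing lines from the closure and collect them on
the driver, so they show up in one place instead of each worker's stdout.)

val failedLines = rdd.flatMap { line =>
  try {
    process(line)          // hypothetical stand-in for "do something"
    Seq.empty[String]      // nothing to report for this line
  } catch {
    case e: Exception => Seq(line)   // keep the offending line
  }
}
failedLines.collect().foreach(println)   // printed once, on the driver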
Hi,
I am using Spark on AWS and want to write the output to S3. It is a
relatively small file and I don't want it written out as multiple parts, so
I use
result.repartition(1).saveAsTextFile("s3://...")
However, as long as I am using the saveAsTextFile method, the output doesn't
keep the original
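(For reference, a sketch with a hypothetical bucket path: saveAsTextFile always
writes a directory of part files plus a _SUCCESS marker, even with a single
partition, so a single part-00000 inside that directory is the best it can do.)

result
  .coalesce(1, shuffle = true)              // equivalent to repartition(1)
  .saveAsTextFile("s3://my-bucket/output")  // writes output/part-00000, not one plain file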
> spark 1.1.0 and having
> the same error.
> I tried to add the dependency to math3 with versions 3.11, 3.2, 3.3 and it
> didn't help.
>
> Any ideas what might be the problem?
>
> Thanks,
> Lev.
>
> anny9699 wrote
> I use the breeze.stats.di
Hi,
I have a question about the SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES
settings in an AWS EC2 cluster. I understand it is a cluster, and the default
setting in the cluster is
SPARK_WORKER_CORES = 8
SPARK_WORKER_INSTANCES = 1
However, after I changed it to
SPARK_WORKER_CORES = 8
SPARK_WORKER_INS
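(For reference, a hypothetical spark-env.sh example assuming an 8-core machine:
the usual guidance is to keep SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES at or
below the machine's physical core count, and to split the worker memory across
the instances.)

# spark-env.sh, illustrative values only
SPARK_WORKER_INSTANCES=2
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=6g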
Hi,
I found that writing output back to S3 using rdd.saveAsTextFile() is extremely
slow, much slower than reading from S3. Is there a way to make it faster?
The RDD has 150 partitions, so I assume parallelism is sufficient.
Thanks a lot!
Anny
Hi,
I see that this type of question has been asked before; however, I am still a
little confused about it in practice. For example, there are two ways I could
handle a series of RDD transformations before I do an RDD action. Which way
is faster:
Way 1:
val data = sc.textFile()
val data1 = data.map(x => f1
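(A sketch with hypothetical f1/f2 and input path: transformations are lazy, so
both styles build the same lineage and nothing runs until an action, meaning
there should be no speed difference between them.)

// Way 1: intermediate vals
val data  = sc.textFile("input")     // hypothetical path
val data1 = data.map(x => f1(x))
val data2 = data1.map(x => f2(x))
data2.count()                        // the work happens here

// Way 2: one chained expression, same DAG, same cost
sc.textFile("input").map(f1).map(f2).count()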
Thanks Ted, this is working now!
Previously I added another commons-math3 jar to my classpath and that one
didn't work. The one included by Maven seems to work.
Thanks a lot!
, 2014 at 1:48 PM, 陈韵竹 wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> So according to previous posts, the problem should be solved by
>>>> changing
Hi,
I use breeze.stats.distributions.Bernoulli in my code; however, I met this
problem:
java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator
I read the posts about this problem before, and if I added
org.apache.commons : commons-math3 : 3.3
run
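(If the build is sbt rather than Maven, the equivalent line, with the same 3.3
version as above, would be:)

// build.sbt
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"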
Hi,
Sorry, I am not very familiar with Java. I found that if I set the RDD
partition number higher, I get the error
message "java.lang.OutOfMemoryError: Requested array size exceeds VM limit";
however, if I set the RDD partition number lower, the error goes away.
My AWS EC2 cluster has 7
> Hi
>
> How many nodes are in your cluster? It seems to me 64g does not help if each
> of your nodes doesn't have that much memory.
>
> Liquan
>
> On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote:
Hi,
After reading some previous posts about this issue, I have increased the
Java heap space to "-Xms64g -Xmx64g", but I still get the
"java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone
have other suggestions?
I am reading 200 GB of data and my total memory is 120 GB, so
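(One thing that may be worth checking, as an assumption rather than a diagnosis:
-Xms/-Xmx on the driver JVM do not change the executors' heap size; in a
standalone cluster that is governed by spark.executor.memory. A sketch with a
hypothetical value:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("myApp")                  // hypothetical app name
  .set("spark.executor.memory", "12g")  // per-executor heap, illustrative value
val sc = new SparkContext(conf)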
Hi,
Is there any guidance on how much total memory is needed, for data of a certain
size, to achieve reasonably good speed?
I have around 200 GB of data, and the current total memory across my 8
machines is around 120 GB. Is that too small to run on data this big?
Even the read i
Hi,
I have read the past posts about partition number, but I am still a little
confused about partitioning strategy.
I have a cluster with 8 workers and 2 cores per worker. Is it true that the
optimal partition number should be 2-4 x the total number of cores, or should
it be approximately equal to the total number of cores?
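(A tiny sketch with a hypothetical rdd, using the numbers above; the 2-4x figure
is a rule of thumb rather than a hard rule:)

val totalCores = 8 * 2                                // 8 workers x 2 cores each
val repartitioned = rdd.repartition(totalCores * 3)   // roughly 2-4x total cores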
Thanks a lot Sean! It works for me now~~
Hi,
I need to parse a file that is separated by a series of separators. I used
SparkContext.textFile and ran into two problems:
1) One of the separators is '\004', which can be recognized by Python, R,
or Hive; however, Spark doesn't seem to recognize it and returns a symbol
that looks like '?'. A
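(A sketch with a hypothetical path, in case it is only a display issue: the
'\004' byte is usually still present in each line even if it prints as '?', so
splitting on the unicode escape may work:)

val fields = sc.textFile("path/to/file").map(_.split('\u0004'))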
Thanks so much Sonal! I am much clearer now.
Thanks a lot Ognen!
It's not a fancy class that I wrote, and now I realize I neither extend
Serializable nor register with Kryo, and that's why it isn't working.
Hi,
I am sorry if this has been asked before. I found that if I wrap some
methods in a class with parameters, Spark throws a "Task not serializable"
exception; however, if they are wrapped in an object or a case class without
parameters, it works fine. Is it true that all classes involving RDD
o
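(A minimal sketch with hypothetical class names, not the actual code: whatever an
RDD closure references gets serialized and shipped to the workers, so a plain
class with constructor parameters must be made serializable or registered with
Kryo, while case classes are already Serializable.)

// Fails inside a closure: Helper is not serializable
class Helper(val factor: Double) {
  def scale(x: Double): Double = x * factor
}

// Works: mark it Serializable (or register it with Kryo) ...
class SerializableHelper(val factor: Double) extends Serializable {
  def scale(x: Double): Double = x * factor
}

// ... or use a case class, which is Serializable by default
case class CaseHelper(factor: Double) {
  def scale(x: Double): Double = x * factor
}

// val h = new SerializableHelper(2.0)
// rdd.map(x => h.scale(x))   // hypothetical rdd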
Hi Andrew,
Thanks for the reply. However, I did almost the same thing in another
closure:
val simi = dataByRow.map(point => {
  val corrs = dataByRow.map(x => arrCorr(point._2, x._2))
  (point._1, corrs)
})
Here dataByRow is of the format RDD[(Int, Array[Double])] and arrCorr is a
function that I wrote to compu
Hi,
I met this exception when computing a new RDD from an existing RDD or calling
.count on some RDDs. The following is the situation:
val DD1 = D.map(d => {
  (d._1, D.map(x => math.sqrt(x._2 * d._2)).toArray)
})
D is in the format RDD[(Int, Double)] and the error message is:
org.apache.spark.SparkExcept
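(A sketch reworking the snippet above, under the assumption that D is small
enough to collect: an RDD cannot be used inside another RDD's transformation,
which is what triggers the SparkException; broadcasting a local copy avoids the
nesting.)

val localD = sc.broadcast(D.collect())   // Array[(Int, Double)], shipped to every worker
val DD1 = D.map { d =>
  (d._1, localD.value.map { case (_, v) => math.sqrt(v * d._2) })
}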