Hi,
According to the Spark UI, one worker was lost after a failed job. It is not
a lost-executor error; rather, the UI now shows only 8 workers (I have 9
workers). However, the EC2 console shows the machine running with no check
alarms. So I am confused about how I could reconnect the lost
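If the worker process died while the machine stayed up, one option is to restart the worker daemon on that machine and point it back at the master. A sketch, assuming a standalone cluster and default spark-ec2 paths (older versions of the script also take a worker number before the master URL):

  # run on the lost worker machine; the master hostname is a placeholder
  ~/spark/sbin/start-slave.sh spark://<master-hostname>:7077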
Hi,
I have a question about Array[T].distinct on a customized class T. My data
is an RDD[(String, Array[T])] in which T is a class I wrote myself. There
are some duplicates in each Array[T], so I want to remove them. I overrode
the equals() method in T and used
val dataNoDuplicates =
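Note that Array[T].distinct hashes elements before comparing them, so overriding equals() alone is not enough; hashCode() must be overridden consistently or duplicates slip through. A minimal sketch with a made-up class:

  // Hypothetical element class; distinct needs both equals and hashCode
  class Record(val id: String) extends Serializable {
    override def equals(other: Any): Boolean = other match {
      case r: Record => r.id == this.id
      case _         => false
    }
    // must agree with equals, or the hash set behind distinct misses duplicates
    override def hashCode: Int = id.hashCode
  }

  // or simply use a case class, which generates both automatically:
  // case class Record(id: String)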
Hi,
For security reasons, we added a server between my AWS Spark cluster and my
local machine, so I can't connect to the cluster directly. To see the Spark
UI and its workers' stdout and stderr, I used dynamic forwarding and
configured a SOCKS proxy. Now I can see the Spark UI using the
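For anyone setting this up, the dynamic forwarding side is typically a single ssh flag plus pointing the browser's SOCKS proxy at that local port. A sketch; the host and port are placeholders:

  # open a SOCKS tunnel through the intermediate server
  ssh -N -D 8157 user@bastion-host
  # then set the browser's SOCKS proxy to localhost:8157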
Hi,
I am wondering if there is some way to send some of the workers' stdout to
one place instead of to each worker's own stdout. For example, I have the
following code:

RDD.foreach { line =>
  try {
    // do something
  } catch {
    case e: Exception => println(line)
  }
}
Every time I want to check what's causing
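One pattern that keeps all the failures in one place is to collect the failing lines as data rather than printing them on each worker. A sketch; doSomething stands in for the real per-line processing:

  // keep the lines that fail, as an RDD, instead of printing on workers
  val failed = rdd.flatMap { line =>
    try {
      doSomething(line) // hypothetical processing step
      None
    } catch {
      case e: Exception => Some(line)
    }
  }
  failed.collect().foreach(println) // inspect every failure on the driver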
Hi,
I am using Spark on AWS and want to write the output to S3. It is a
relatively small file and I don't want it to be output as multiple parts,
so I use

result.repartition(1).saveAsTextFile("s3://...")

However, as long as I am using the saveAsTextFile method, the output doesn't
keep the original
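Note that even with repartition(1), saveAsTextFile always writes a directory containing part-00000 (plus _SUCCESS); it never produces a single file under the exact name given. One workaround, sketched with Hadoop's FileSystem API and made-up paths, is to rename the part file afterwards:

  import org.apache.hadoop.fs.{FileSystem, Path}

  result.repartition(1).saveAsTextFile("s3n://bucket/tmp-output")

  // move the single part file to the name we actually want
  val fs = FileSystem.get(new java.net.URI("s3n://bucket"), sc.hadoopConfiguration)
  fs.rename(new Path("s3n://bucket/tmp-output/part-00000"),
            new Path("s3n://bucket/output.txt"))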
the same error.
I tried adding the dependency on commons-math3 with versions 3.1.1, 3.2,
and 3.3, and it didn't help.
Any ideas what might be the problem?
Thanks,
Lev.
anny9699 wrote:
I use the breeze.stats.distributions.Bernoulli in my code, but ran into
this problem:
java.lang.NoClassDefFoundError:
org
Hi,
I have a question about the SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES
settings in an AWS EC2 cluster. I understand it is a cluster, and the
default setting in the cluster is

SPARK_WORKER_CORES = 8
SPARK_WORKER_INSTANCES = 1

However, after I changed it to

SPARK_WORKER_CORES = 8
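For context, these settings live in conf/spark-env.sh on each node. The cores a machine offers is SPARK_WORKER_CORES x SPARK_WORKER_INSTANCES, and SPARK_WORKER_MEMORY is per worker instance, so it usually needs to be divided when instances go up. A sketch with illustrative values, not recommendations:

  # conf/spark-env.sh
  export SPARK_WORKER_INSTANCES=2   # two worker JVMs per machine
  export SPARK_WORKER_CORES=4       # cores per worker, so 8 per machine total
  export SPARK_WORKER_MEMORY=8g     # memory per worker instance, not per machine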
Hi,
I found that writing output back to S3 using rdd.saveAsTextFile() is
extremely slow, much slower than reading from S3. Is there a way to make it
faster? The RDD has 150 partitions, so I assume parallelism is sufficient.
Thanks a lot!
Anny
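One commonly suggested workaround, sketched below under the assumption that the cluster also runs HDFS: write to HDFS first, where the output-commit renames are cheap, then copy to S3 in a second pass. On S3 a rename is a full copy plus delete, which is usually what makes saveAsTextFile slow there.

  // write to the cluster's HDFS first
  rdd.saveAsTextFile("hdfs:///tmp/output")

  // then push the directory to S3 from the shell, e.g.:
  //   hadoop distcp hdfs:///tmp/output s3n://bucket/output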
Hi,
I see that this type of question has been asked before, but I am still a
little confused about it in practice. For example, there are two ways I
could handle a series of RDD transformations before an RDD action; which
way is faster?

Way 1:

val data = sc.textFile()
val data1 = data.map(x =>
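For what it's worth, transformations are lazy, so naming each intermediate RDD and chaining the calls build exactly the same lineage and run at the same speed; the choice is purely stylistic. A sketch (the original message's transformations are truncated, so the path and functions here are placeholders):

  // Way 1: named intermediates
  val data  = sc.textFile("hdfs:///path/to/input")
  val step1 = data.map(_.toUpperCase)
  val step2 = step1.filter(_.nonEmpty)
  println(step2.count())

  // Way 2: one chained expression; identical lineage, identical plan
  val count = sc.textFile("hdfs:///path/to/input")
    .map(_.toUpperCase)
    .filter(_.nonEmpty)
    .count()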
Hi,
I use breeze.stats.distributions.Bernoulli in my code, but ran into this
problem:

java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator

I read previous posts about this problem, and if I add the dependency

<groupId>org.apache.commons</groupId>
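For completeness, the full Maven coordinates of the library providing RandomGenerator look like this (version 3.3 shown, one of the versions tried later in this thread):

  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.3</version>
  </dependency>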
Ted Yu wrote:
Cycling bits:
http://search-hadoop.com/m/JW1q5UX9S1/breeze+sparksubj=Build+error+when+using+spark+with+breeze
On Sat, Oct 4, 2014 at 12:59 PM, anny9699 wrote:
Thanks Ted, this is working now!
Previously I added a different commons-math3 jar to my classpath, and that
one didn't work. The one included by Maven seems to work.
Thanks a lot!
Hi,
Sorry, I am not very familiar with Java. I found that if I set the RDD
partition number higher, I get this error message:

java.lang.OutOfMemoryError: Requested array size exceeds VM limit

However, if I set the RDD partition number lower, the error goes away.
My AWS EC2 cluster has 72
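For reference, the partition count can be requested at read time or adjusted afterwards; a sketch with placeholder path and counts:

  // ask for a minimum number of partitions when reading
  val data = sc.textFile("hdfs:///path/to/input", 200)

  // or adjust later; coalesce avoids a shuffle when reducing the count
  val fewer = data.coalesce(50)
  val more  = data.repartition(400)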
Hi,
After reading some previous posts about this issue, I increased the Java
heap space to -Xms64g -Xmx64g, but still hit the
java.lang.OutOfMemoryError: GC overhead limit exceeded error. Does anyone
have other suggestions?
I am reading 200 GB of data and my total memory is 120 GB, so I
Liquan wrote:
Hi,
How many nodes are in your cluster? It seems to me that 64g will not help
if your individual nodes don't have that much memory.
Liquan
On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote:
Hi,
After reading some previous posts about
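One point worth separating out here: the driver JVM's -Xms/-Xmx does not control the executors' heaps; those are set through Spark's own configuration. A sketch with an illustrative value:

  import org.apache.spark.{SparkConf, SparkContext}

  // executor heap is set via spark.executor.memory, not the driver's -Xmx,
  // and must fit within each machine's physical RAM
  val conf = new SparkConf()
    .setAppName("MyApp") // placeholder name
    .set("spark.executor.memory", "12g")
  val sc = new SparkContext(conf)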
Hi,
Is there any guidance on how much total memory is needed for a given data
size to achieve reasonably good speed?
I have about 200 GB of data, and the total memory across my 8 machines is
around 120 GB. Is that too small for data this big? Even the read
Hi,
I have read past posts about partition numbers, but I am still a little
confused about partitioning strategy.
I have a cluster with 8 workers and 2 cores per worker. Is it true that the
optimal partition number should be 2-4 * total_coreNumber, or should it be
approximately equal to total_coreNumber?
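As a concrete reading of the 2-4x rule for this cluster: 8 workers x 2 cores = 16 total cores, suggesting roughly 32-64 partitions. A sketch, with the numbers picked only for illustration:

  // cluster-wide default for shuffles and parallel operations
  val conf = new SparkConf().set("spark.default.parallelism", "48")

  // or request a partition count for a single RDD at read time
  val data = sc.textFile("hdfs:///path/to/input", 48)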
Thanks a lot Sean! It works for me now~~
Hi,
I need to parse a file that uses a series of separators. I used
SparkContext.textFile and ran into two problems:
1) One of the separators is '\004', which is recognized by Python, R, and
Hive; however, Spark doesn't seem to recognize it and returns a symbol that
looks like '?'.
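In Scala source the \004 byte is written as the unicode escape '\u0004', so one way to split on it after textFile (the path is a placeholder):

  // '\u0004' is the same control character Python/R/Hive write as '\004'
  val rows = sc.textFile("hdfs:///path/to/input")
    .map(_.split('\u0004'))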
Thanks so much Sonal! I am much clearer now.
Thanks a lot Ognen!
It's not a fancy class that I wrote, and I now realize I neither extended
Serializable nor registered it with Kryo, which is why it wasn't working.
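For anyone hitting the same error, the two fixes look roughly like this; the class name is made up, and registerKryoClasses assumes Spark 1.2 or later:

  // Option 1: make the class Java-serializable
  class MyClass(val x: Int) extends Serializable

  // Option 2: switch to Kryo and register the class explicitly
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyClass]))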