Spark ec2 cluster lost worker

2015-06-24 Thread anny9699
Hi, According to the Spark UI, one worker is lost after a failed job. It is not a "lost executor" error; the UI now only shows 8 workers (I have 9 workers). However, the ec2 console shows the machine is "running" with no check alarms. So I am confused about how I could reconnect the lost …

Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, I have a question about Array[T].distinct on a customized class T. My data is like RDD[(String, Array[T])], in which T is a class I wrote myself. There are some duplicates in each Array[T], so I want to remove them. I override the equals() method in T and use val dataNoDuplicates = dataDu…
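
A sketch of the likely issue (my reading, not the thread's confirmed answer): Array.distinct deduplicates through a hash set, so hashCode must agree with the overridden equals(); overriding equals() alone leaves the duplicates in place. The Point class below is hypothetical, standing in for T.

```scala
// Hypothetical element class standing in for T; field names are illustrative.
class Point(val x: Int, val y: Int) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case p: Point => p.x == x && p.y == y
    case _        => false
  }
  // distinct hashes elements before comparing them, so hashCode must match equals.
  override def hashCode: Int = 31 * x + y
}

val deduped = Array(new Point(1, 2), new Point(1, 2)).distinct   // length 1 once hashCode is consistent
```

Inside an RDD the class also has to be serializable (or registered with Kryo) so the arrays can be shipped between nodes.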

How to configure SparkUI to use internal ec2 ip

2015-03-30 Thread anny9699
Hi, For security reasons, we added a server between my aws Spark cluster and my local machine, so I couldn't connect to the cluster directly. To see the SparkUI and the related workers' stdout and stderr, I used dynamic forwarding and configured a SOCKS proxy. Now I can see the SparkUI using the internal …
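
One hedged possibility for the question in the subject (not a confirmed answer from the thread): the hostname the UI and its worker links advertise is controlled by SPARK_PUBLIC_DNS in conf/spark-env.sh on each node, so pointing it at the internal EC2 address keeps every link resolvable through the SOCKS proxy. The address below is a placeholder.

```
# conf/spark-env.sh on the master and each worker -- sketch only;
# ip-10-0-0-12.ec2.internal is a placeholder for the node's internal hostname.
export SPARK_PUBLIC_DNS=ip-10-0-0-12.ec2.internal
```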

output worker stdout to one place

2015-02-20 Thread anny9699
Hi, I am wondering if there's some way to redirect some of the workers' stdout to one place instead of leaving it in each worker's stdout. For example, I have the following code: RDD.foreach{ line => try { do something } catch { case e: Exception => println(line) } } Every time I want to check what's causing t…
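
A sketch of one common workaround (my suggestion, not quoted from the thread): instead of println, which lands in each worker's stdout, return the failing lines from the transformation so they arrive in one place. `process` and `rdd` are placeholders for the "do something" and the RDD in the post.

```scala
def process(line: String): Unit = { /* the post's "do something" */ }

// Keep the lines that throw, instead of printing them on scattered worker stdouts.
val failed = rdd.mapPartitions { iter =>
  iter.flatMap { line =>
    try { process(line); Iterator.empty }
    catch { case e: Exception => Iterator.single(line) }
  }
}

failed.take(100).foreach(println)                    // inspect a sample on the driver
// failed.saveAsTextFile("s3n://bucket/bad-lines")   // or write them to one location (hypothetical path)
```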

How to output to S3 and keep the order

2015-01-19 Thread anny9699
Hi, I am using Spark on AWS and want to write the output to S3. It is a relatively small file and I don't want it to be output as multiple parts, so I use result.repartition(1).saveAsTextFile("s3://...") However, as long as I am using the saveAsTextFile method, the output doesn't keep the original …
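
A hedged sketch of the usual explanation and fix (not quoted from the thread): repartition(1) does a shuffle, which scrambles record order, whereas coalesce(1) simply concatenates the existing partitions in order; alternatively, sort explicitly before writing. The paths are placeholders and the records are assumed to be strings.

```scala
// Option 1: collapse to one part file without a shuffle, preserving the existing order.
result.coalesce(1).saveAsTextFile("s3n://bucket/output-ordered")                  // hypothetical path

// Option 2: impose the desired order explicitly, then write a single part file.
result.sortBy(line => line).coalesce(1).saveAsTextFile("s3n://bucket/output-sorted")
```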

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-08 Thread anny9699
… spark 1.1.0 and having the same error. I tried to add the dependency to math3 with versions 3.11, 3.2, 3.3 and it didn't help. Any ideas what might be the problem? Thanks, Lev. anny9699 wrote: I use the breeze.stats.di…

worker_instances vs worker_cores

2014-10-20 Thread anny9699
Hi, I have a question about the worker_instances and worker_cores settings in an aws ec2 cluster. I understand it is a cluster, and the default setting in the cluster is *SPARK_WORKER_CORES = 8, SPARK_WORKER_INSTANCES = 1*. However, after I changed it to *SPARK_WORKER_CORES = 8, SPARK_WORKER_INS…*
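
For reference, these variables normally live in conf/spark-env.sh on each node, where SPARK_WORKER_INSTANCES counts separate worker JVMs per machine. The preview cuts off the new value, so the sketch below uses an assumed illustration rather than the poster's actual setting.

```
# conf/spark-env.sh -- sketch; the "2" is a hypothetical value, not the one from the post.
export SPARK_WORKER_CORES=8        # cores each worker JVM may hand to executors
export SPARK_WORKER_INSTANCES=2    # number of worker JVMs started per machine
```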

Spark output to s3 extremely slow

2014-10-14 Thread anny9699
Hi, I found writing output back to s3 using rdd.saveAsTextFile() is extremely slow, much slower than reading from s3. Is there a way to make it faster? The rdd has 150 partitions so parallelism is enough I assume. Thanks a lot! Anny

lazy evaluation of RDD transformation

2014-10-06 Thread anny9699
Hi, I see that this type of question has been asked before; however, I am still a little confused about it in practice. For example, there are two ways I could handle a series of RDD transformations before I do an RDD action; which way is faster? Way 1: val data = sc.textFile() val data1 = data.map(x => f1…
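
A sketch of the two styles the post is contrasting (f1, f2, and the input path are placeholders). Both are lazy: nothing runs until the action, and Spark pipelines the chained maps into the same tasks, so the two forms usually perform the same.

```scala
// Placeholders standing in for the post's f1/f2 and input path.
def f1(x: String): String = x.trim
def f2(x: String): Int    = x.length

// Way 1: separate, named transformations -- each map is still lazy.
val data   = sc.textFile("hdfs:///tmp/input")   // hypothetical path
val data1  = data.map(f1)
val data2  = data1.map(f2)
val total1 = data2.reduce(_ + _)                // only this action triggers work

// Way 2: one composed transformation -- evaluated identically at run time.
val total2 = sc.textFile("hdfs:///tmp/input").map(x => f2(f1(x))).reduce(_ + _)
```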

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Thanks Ted, this is working now! Previously I added another commons-math3 jar to my classpath and that one didn't work. The one included by Maven seems to work. Thanks a lot!

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
… 2014 at 1:48 PM, 陈韵竹 wrote: Hi Ted, So according to previous posts, the problem should be solved by changing …

org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Hi, I use breeze.stats.distributions.Bernoulli in my code; however, I met this problem: java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator. I read the posts about this problem before, and if I added org.apache.commons commons-math3 3.3 run…
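
The flattened "org.apache.commons commons-math3 3.3" in the preview is the groupId, artifactId, and version of a Maven dependency. In sbt form it is the line below; the later replies in this thread indicate the Maven-managed jar resolved the NoClassDefFoundError.

```scala
// build.sbt -- commons-math3 provides the RandomGenerator class breeze needs at run time.
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"
```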

array size limit vs partition number

2014-10-03 Thread anny9699
Hi, Sorry, I am not very familiar with Java. I found that if I set the RDD partition number higher, I meet this error message: "java.lang.OutOfMemoryError: Requested array size exceeds VM limit"; however, if I set the RDD partition number lower, the error is gone. My aws ec2 cluster has 7…

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
… wrote: Hi, How many nodes are in your cluster? It seems to me 64g does not help if each of your nodes doesn't have that much memory. Liquan. On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote: …

still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi, After reading some previous posts about this issue, I have increased the Java heap space to "-Xms64g -Xmx64g", but I still met the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone have other suggestions? I am reading data of about 200 GB and my total memory is 120 GB, so…
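
A hedged sketch of the usual first steps (not the thread's confirmed resolution): -Xms/-Xmx set on the driver JVM doesn't grow the executors' heaps, which are controlled by spark.executor.memory, and raising the partition count shrinks each task's working set. The values and path below are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-overhead-sketch")
  .set("spark.executor.memory", "12g")    // per-executor heap; illustrative value

val sc = new SparkContext(conf)

// More, smaller partitions mean less data held live per task during maps and shuffles.
val data = sc.textFile("s3n://bucket/input", 2000)   // hypothetical path and partition count
```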

memory vs data_size

2014-09-30 Thread anny9699
Hi, Is there any guidance on how much total memory is needed for data of a certain size to achieve relatively good speed? I have data of around 200 GB, and the current total memory for my 8 machines is around 120 GB. Is that too small to run data this big? Even the read i…
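
A hedged note and sketch (my suggestion, not from the thread): an RDD larger than cluster memory can still be processed, since partitions are streamed through; only cached data has to fit. If the 200 GB RDD is reused, a disk-spilling, serialized storage level avoids recomputation without requiring it all in the 120 GB of RAM. The path is a placeholder.

```scala
import org.apache.spark.storage.StorageLevel

val data = sc.textFile("s3n://bucket/big-input")    // hypothetical path
data.persist(StorageLevel.MEMORY_AND_DISK_SER)      // spill what doesn't fit, keep it serialized
println(data.count())                               // the first action materializes the cache
```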

about partition number

2014-09-29 Thread anny9699
Hi, I read the past posts about partition number, but am still a little confused about partitioning strategy. I have a cluster with 8 workers and 2 cores for each worker. Is it true that the optimal partition number should be 2-4 * total_coreNumber, or should it be approximately equal to total_coreNumber?
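
A sketch of the commonly recommended rule of thumb (2-4 partitions per core, so 32-64 partitions here for 8 workers x 2 cores); the input path is a placeholder.

```scala
val totalCores    = 8 * 2                 // 8 workers x 2 cores each
val numPartitions = 3 * totalCores        // within the usual 2-4x-cores rule of thumb

val data = sc.textFile("hdfs:///tmp/input", numPartitions)   // hypothetical path
// or, for an existing RDD:
val rebalanced = data.repartition(numPartitions)
```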

Re: sc.textFile can't recognize '\004'

2014-06-21 Thread anny9699
Thanks a lot Sean! It works for me now~~

sc.textFile can't recognize '\004'

2014-06-20 Thread anny9699
Hi, I need to parse a file which is separated by a series of separators. I used SparkContext.textFile and I met two problems: 1) One of the separators is '\004', which can be recognized by Python, R, or Hive; however, Spark doesn't seem to recognize this one and returns a symbol looking like '?'. A…
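
A sketch of splitting on that control character with an unambiguous Unicode escape (the actual resolution isn't shown in this preview, so this is only an illustration): textFile itself splits records on newlines, and the field splitting happens in a map.

```scala
// '\u0004' is the same control character as octal \004 (Ctrl-D / EOT),
// written as a Unicode escape so the literal is unambiguous in Scala.
val rows = sc.textFile("hdfs:///tmp/ctrl-separated")    // hypothetical path
  .map(_.split('\u0004'))                               // split each line into fields
rows.take(3).foreach(fields => println(fields.mkString(" | ")))
```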

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread anny9699
Thanks so much Sonal! I am much clearer now.

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen! It's not a fancy class that I wrote, and now I realize I neither extended Serializable nor registered it with Kryo, and that's why it is not working.

Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Hi, I am sorry if this has been asked before. I found that if I wrap up some methods in a class with parameters, Spark will throw a "Task not serializable" exception; however, if they are wrapped up in an object or a case class without parameters, it works fine. Is it true that all classes involving RDD o…
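
A sketch contrasting the two cases the post describes: using a constructor parameter inside a closure drags the enclosing `this` into the closure, and if the class isn't serializable the task fails; making the class Serializable, or copying the field into a local val, avoids it. Names are illustrative.

```scala
import org.apache.spark.rdd.RDD

// Fails at runtime with "Task not serializable": `factor` is reached through `this`,
// so the whole (non-serializable) instance is captured by the closure.
class Scaler(factor: Double) {
  def scale(rdd: RDD[Double]): RDD[Double] = rdd.map(_ * factor)
}

// Works: the class is Serializable, and the local copy means the closure
// captures only the value `f`, not `this`. Kryo registration is an optional extra.
class SafeScaler(factor: Double) extends Serializable {
  def scale(rdd: RDD[Double]): RDD[Double] = {
    val f = factor
    rdd.map(_ * f)
  }
}
```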

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi Andrew, Thanks for the reply. However, I did almost the same thing in another closure: val simi = dataByRow.map(point => { val corrs = dataByRow.map(x => arrCorr(point._2, x._2)); (point._1, corrs) }) Here dataByRow is of format RDD[(Int, Array[Double])] and arrCorr is a function that I wrote to compu…

java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread anny9699
Hi, I met this exception when computing a new RDD from an existing RDD or using .count on some RDDs. The following is the situation: val DD1 = D.map(d => { (d._1, D.map(x => math.sqrt(x._2 * d._2)).toArray) }) D is in the format RDD[(Int, Double)] and the error message is: org.apache.spark.SparkExcept…
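
A sketch of the usual fix for this snippet and for the closure in the reply above: an RDD cannot be referenced inside another RDD's transformation (the nested reference is unusable on the executors, hence the NullPointerException); collect or broadcast the inner data first. Variable names follow the post; the data is assumed small enough to collect.

```scala
// D: RDD[(Int, Double)] as in the post. Referencing D inside D.map(...) does not work,
// because the nested D is not available on the executors.

// Bring the inner data to the driver once and broadcast it to every task.
val dArray = D.collect()
val dBcast = sc.broadcast(dArray)

val DD1 = D.map { d =>
  (d._1, dBcast.value.map(x => math.sqrt(x._2 * d._2)))
}
// If the data is too large to collect, a join or cartesian between the two RDDs
// is the alternative to broadcasting.
```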