Spark ec2 cluster lost worker

2015-06-24 Thread anny9699
Hi, According to the Spark UI, one worker is lost after a failed job. It is not a lost-executor error; rather, the UI now shows only 8 workers (I have 9). However, the EC2 console shows the machine is running with no status-check alarms. So I am confused how I could reconnect the lost

Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, I have a question about Array[T].distinct on a customized class T. My data is an RDD[(String, Array[T])] in which T is a class I wrote myself. There are some duplicates in each Array[T] that I want to remove. I override the equals() method in T and use val dataNoDuplicates =
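
The usual culprit with a hand-rolled equals(): Scala's distinct collects elements through a hash set, so hashCode must be overridden consistently with equals or duplicates survive. A minimal sketch, with Point as a hypothetical stand-in for T:

    // Hypothetical stand-in for the custom class T inside the RDD.
    class Point(val x: Int, val y: Int) extends Serializable {
      override def equals(other: Any): Boolean = other match {
        case p: Point => p.x == x && p.y == y
        case _        => false
      }
      // Without this, equal Points land in different hash buckets and
      // Array[T].distinct keeps both of them.
      override def hashCode: Int = 31 * x + y
    }

A case class would generate both methods (and Serializable) for free.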

How to configure SparkUI to use internal ec2 ip

2015-03-30 Thread anny9699
Hi, For security reasons, we added a server between my AWS Spark cluster and my local machine, so I couldn't connect to the cluster directly. To see the SparkUI and its workers' stdout and stderr, I used dynamic forwarding and configured the SOCKS proxy. Now I could see the SparkUI using the
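
One knob that may be relevant here, offered as an assumption rather than the thread's confirmed answer: SPARK_PUBLIC_DNS controls the hostname the UI embeds in its links, so pointing it at each node's internal EC2 address keeps every link resolvable through the SOCKS proxy. A sketch for conf/spark-env.sh on each node:

    # Advertise the internal EC2 hostname in SparkUI links (illustrative;
    # the metadata endpoint returns this instance's private hostname).
    SPARK_PUBLIC_DNS=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)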

output worker stdout to one place

2015-02-20 Thread anny9699
Hi, I am wondering if there's some way to redirect some of the worker stdout to one place instead of to each worker's own stdout. For example, I have the following code: RDD.foreach{ line => try{ do something } catch { case e: Exception => println(line) } } Every time I want to check what's causing
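
One way to get those lines into a single place, sketched under the assumption that rdd is the RDD[String] in question and that the per-line work can be wrapped (process below is a hypothetical stand-in for "do something"): keep the failing inputs as an RDD instead of printing them on the workers, then save that one RDD.

    import scala.util.Try

    def process(line: String): Unit = { /* do something */ } // stand-in

    // Keep the lines whose processing throws, instead of println-ing
    // them into each worker's local stdout.
    val failedLines = rdd.flatMap { line =>
      if (Try(process(line)).isFailure) Some(line) else None
    }
    failedLines.saveAsTextFile("hdfs:///logs/failed-lines") // one place to inspect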

How to output to S3 and keep the order

2015-01-19 Thread anny9699
Hi, I am using Spark on AWS and want to write the output to S3. It is a relatively small file and I don't want it split into multiple parts, so I use result.repartition(1).saveAsTextFile("s3://..."). However, as long as I am using the saveAsTextFile method, the output doesn't keep the original
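
Worth noting: repartition(1) performs a full shuffle, which is exactly what scrambles the row order. coalesce(1) merges the existing partitions without a shuffle, preserving the order the RDD already has. A sketch, assuming result is already in the desired order:

    // coalesce avoids repartition's shuffle, so the single output part
    // keeps the existing partition order.
    result.coalesce(1).saveAsTextFile("s3://my-bucket/output") // bucket name is illustrative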

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-08 Thread anny9699
the same error. I tried to add the dependency on math3 with versions 3.11, 3.2, and 3.3, and it didn't help. Any ideas what might be the problem? Thanks, Lev. anny9699 wrote: I use the breeze.stats.distributions.Bernoulli in my code, however I met this problem: java.lang.NoClassDefFoundError: org

worker_instances vs worker_cores

2014-10-20 Thread anny9699
Hi, I have a question about the worker_instances setting and the worker_cores setting on an AWS EC2 cluster. I understand the default setting in the cluster is SPARK_WORKER_CORES = 8, SPARK_WORKER_INSTANCES = 1. However, after I changed it to SPARK_WORKER_CORES = 8
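
For reference, both knobs live in conf/spark-env.sh: SPARK_WORKER_INSTANCES is how many worker JVMs run per machine, SPARK_WORKER_CORES is how many cores each of those workers offers. A sketch of two layouts that expose the same 8 cores per machine (values illustrative):

    # One worker JVM offering all 8 cores:
    SPARK_WORKER_CORES=8
    SPARK_WORKER_INSTANCES=1

    # ...or two worker JVMs with 4 cores each; note SPARK_WORKER_MEMORY
    # then applies to each of the two JVMs separately.
    # SPARK_WORKER_CORES=4
    # SPARK_WORKER_INSTANCES=2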

Spark output to s3 extremely slow

2014-10-14 Thread anny9699
Hi, I found that writing output back to S3 using rdd.saveAsTextFile() is extremely slow, much slower than reading from S3. Is there a way to make it faster? The RDD has 150 partitions, so parallelism should be sufficient, I assume. Thanks a lot! Anny
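
One workaround from that era, offered as an assumption rather than the thread's conclusion: the S3 output committers of the time finished each job with slow server-side copies, so writing to the cluster's HDFS first and bulk-copying to S3 afterwards was often much faster.

    // Write to local HDFS at full speed first (path is illustrative)...
    rdd.saveAsTextFile("hdfs:///tmp/job-output")
    // ...then push to S3 in one bulk step outside the job, e.g.
    //   hadoop distcp hdfs:///tmp/job-output s3n://my-bucket/output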

lazy evaluation of RDD transformation

2014-10-06 Thread anny9699
Hi, I see that this type of question has been asked before; however, I am still a little confused about it in practice. Suppose there are two ways I could handle a series of RDD transformations before an RDD action; which way is faster? Way 1: val data = sc.textFile() val data1 = data.map(x =>
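
Because transformations are lazy, both styles build the same lineage and nothing runs until the action, so neither is inherently faster. A sketch (the trim/filter steps are illustrative):

    val data = sc.textFile("hdfs:///input") // path is illustrative

    // Way 1: named intermediates
    val data1 = data.map(_.trim)
    val data2 = data1.filter(_.nonEmpty)
    println(data2.count()) // the action triggers the whole chain

    // Way 2: one chained expression -- same lineage, same speed
    println(data.map(_.trim).filter(_.nonEmpty).count())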

org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Hi, I use the breeze.stats.distributions.Bernoulli in my code, however I met this problem: java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator. I read the posts about this problem before, and if I added the dependency <groupId>org.apache.commons</groupId>
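
For an sbt build, the equivalent of that Maven dependency is a single line in build.sbt (3.3 shown as an example version; match it to what your breeze release expects):

    // build.sbt: make the commons-math3 classes breeze uses available at runtime
    libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"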

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Ted Yu wrote: Cycling bits: http://search-hadoop.com/m/JW1q5UX9S1/breeze+sparksubj=Build+error+when+using+spark+with+breeze On Sat, Oct 4, 2014 at 12:59 PM, anny9699

Re: org/apache/commons/math3/random/RandomGenerator issue

2014-10-04 Thread anny9699
Thanks Ted, this is working now! Previously I added another commons-math3 jar to my classpath and that one didn't work. The one included by Maven seems to work. Thanks a lot!

array size limit vs partition number

2014-10-03 Thread anny9699
Hi, Sorry, I am not very familiar with Java. I found that if I set the RDD partition number higher, I get this error message: java.lang.OutOfMemoryError: Requested array size exceeds VM limit; however, if I set the RDD partition number lower, the error is gone. My AWS EC2 cluster has 72

still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread anny9699
Hi, After reading some previous posts about this issue, I have increased the Java heap space to -Xms64g -Xmx64g, but still met the java.lang.OutOfMemoryError: GC overhead limit exceeded error. Does anyone have other suggestions? I am reading 200 GB of data and my total memory is 120 GB, so I
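
Two common mitigations when the data (200 GB) outsizes cluster memory (120 GB), sketched as suggestions rather than a confirmed fix: store cached partitions in serialized form with spill-to-disk, and raise the partition count so each task's working set shrinks.

    import org.apache.spark.storage.StorageLevel

    // More, smaller partitions (the count is illustrative)...
    val data = sc.textFile("hdfs:///big-input", 1000)
    // ...and compact serialized caching; partitions that don't fit in
    // memory spill to disk instead of thrashing the garbage collector.
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)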

Re: still GC overhead limit exceeded after increasing heap space

2014-10-01 Thread anny9699
wrote: Hi, how many nodes are in your cluster? It seems to me 64g does not help if each of your nodes doesn't have that much memory. Liquan On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote: Hi, After reading some previous posts about

memory vs data_size

2014-09-30 Thread anny9699
Hi, Is there any guidance on how much total memory is needed, for data of a certain size, to achieve relatively good speed? I have around 200 GB of data and the current total memory across my 8 machines is around 120 GB. Is that too small to run data this big? Even the read

about partition number

2014-09-29 Thread anny9699
Hi, I read the past posts about partition number, but I am still a little confused about partitioning strategy. I have a cluster with 8 workers and 2 cores per worker. Is it true that the optimal partition number should be 2-4 * total_coreNumber, or should it approximately equal total_coreNumber?
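
The commonly cited rule of thumb is 2-4 tasks per core rather than exactly one, so faster tasks can keep cores busy while slower ones finish. For this cluster, a sketch:

    // 8 workers x 2 cores = 16 cores; aim for roughly 32-64 partitions.
    val totalCores    = 8 * 2
    val numPartitions = 3 * totalCores // anywhere in the 2-4x band
    val data = sc.textFile("hdfs:///input", numPartitions) // path is illustrative
    // or rebalance an RDD that already exists:
    val rebalanced = data.repartition(numPartitions)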

Re: sc.textFile can't recognize '\004'

2014-06-21 Thread anny9699
Thanks a lot Sean! It works for me now~~

sc.textFile can't recognize '\004'

2014-06-20 Thread anny9699
Hi, I need to parse a file which is separated by a series of separators. I used SparkContext.textFile and I met two problems: 1) One of the separators is '\004', which can be recognized by Python, R, or Hive; however, Spark seems unable to recognize this one and returns a symbol looking like '?'.
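
A sketch of one way through the separator problem, assuming the records themselves are newline-delimited: textFile reads the bytes intact (the '?' is likely only how the control character renders), and the fields come apart by splitting on '\u0004', which is the same byte as '\004':

    // Split each line on the Ctrl-D (0x04) field separator.
    val fields = sc.textFile("hdfs:///input").map(_.split('\u0004'))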

Re: Do all classes involving RDD operation need to be registered?

2014-03-29 Thread anny9699
Thanks so much Sonal! I am much clearer now.

Re: Do all classes involving RDD operation need to be registered?

2014-03-28 Thread anny9699
Thanks a lot Ognen! It's not a fancy class that I wrote, and now I realize it neither extends Serializable nor is registered with Kryo, and that's why it is not working.
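
For the record, a sketch of the registration route in the Spark of that era (MyClass is a hypothetical stand-in for the class used inside the RDD): point spark.kryo.registrator at a KryoRegistrator that registers every custom class crossing an RDD boundary.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyClass(val id: Int) // hypothetical stand-in

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass]) // one line per custom class
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")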