Re: sequenceFile and groupByKey

2014-03-09 Thread Shixiong Zhu
Hi Kane, in the sequence file the key/value class is org.apache.hadoop.io.Text, so you need to convert Text to String. There are two approaches: 1. Use implicit conversions to convert Text to String automatically. I recommend this one. E.g., val t2 = sc.sequenceFile[String, String]("/user/hdfs/e1Mseq")
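The implicit-conversion mechanism Shixiong recommends can be sketched in plain Scala. Note this is a self-contained illustration: MyText below is a made-up stand-in for org.apache.hadoop.io.Text, and in real Spark code the conversion is supplied by Spark's built-in Writable converters rather than a hand-written implicit.

```scala
// Stand-in for org.apache.hadoop.io.Text (hypothetical, for illustration only).
class MyText(val bytes: Array[Byte]) {
  override def toString: String = new String(bytes, "UTF-8")
}

object TextConversion {
  // Implicit conversion: any MyText can be used where a String is expected,
  // so callers never have to call toString explicitly.
  implicit def textToString(t: MyText): String = t.toString

  def main(args: Array[String]): Unit = {
    val raw = new MyText("hello".getBytes("UTF-8"))
    val s: String = raw          // conversion applied automatically
    println(s.toUpperCase)       // String methods are now available
  }
}
```

With the conversion in scope, code that receives Text-like values can treat them as Strings directly, which is why sc.sequenceFile[String, String](...) works once the appropriate implicits are imported.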

Aggregators in GraphX

2014-03-09 Thread Sebastian Schelter
Hi, does GraphX currently support Giraph/Pregel's aggregator feature? I was thinking of implementing a PageRank version that correctly handles dangling vertices (i.e. vertices with no outlinks). Therefore I would have to globally sum up the rank associated with them in every iteration,
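The dangling-vertex correction Sebastian describes can be sketched in plain Scala, without GraphX: in each iteration, globally sum the rank sitting on vertices with no outlinks (the aggregator-style step) and redistribute it uniformly so ranks keep summing to 1. The tiny adjacency-map representation and parameter names here are illustrative assumptions, not GraphX API.

```scala
// One PageRank iteration that handles dangling vertices.
// links: vertex -> outlink targets (empty Seq = dangling vertex)
// ranks: vertex -> current rank (assumed to sum to 1)
// n: number of vertices, d: damping factor
object DanglingPageRank {
  def iterate(links: Map[Int, Seq[Int]], ranks: Map[Int, Double],
              n: Int, d: Double = 0.85): Map[Int, Double] = {
    // Contributions from vertices that do have outlinks.
    val contribs = links.toSeq.flatMap { case (v, outs) =>
      if (outs.nonEmpty) outs.map(u => u -> ranks(v) / outs.size)
      else Seq.empty
    }.groupBy(_._1).map { case (u, cs) => u -> cs.map(_._2).sum }

    // Global sum of rank on dangling vertices -- the value a Pregel-style
    // aggregator would compute in each superstep.
    val danglingMass =
      links.collect { case (v, outs) if outs.isEmpty => ranks(v) }.sum

    // Redistribute the dangling mass uniformly over all vertices.
    ranks.keys.map { v =>
      v -> ((1 - d) / n + d * (contribs.getOrElse(v, 0.0) + danglingMass / n))
    }.toMap
  }
}
```

In GraphX terms, the danglingMass step is just a global reduction over the vertices with zero out-degree, which can stand in for a Pregel aggregator even though GraphX has no aggregator primitive by that name.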

Re: Explain About Logs NetworkWordcount.scala

2014-03-09 Thread Eduardo Costa Alfaia
Yes TD, I can use tcpdump to see whether the data are being accepted by the receiver and whether they are arriving in the IP packets. Thanks. On 3/8/14, 4:19, Tathagata Das wrote: I am not sure how to debug this without any more information about the source. Can you monitor on the receiver side

RE: major Spark performance problem

2014-03-09 Thread Livni, Dana
YARN also has this scheduling option. The problem is that all of our applications have the same flow, where the first stage is the heaviest and the rest are very small. The problem is that when several requests (applications) start to run at the same time, the first stages of all of them are scheduled in parallel, and

State of spark docker script

2014-03-09 Thread Aureliano Buendia
Hi, is the Spark docker script now mature enough to substitute for the spark-ec2 script? Is anyone here using the docker script in production?

Re: major Spark performance problem

2014-03-09 Thread Matei Zaharia
Hi Dana, It’s hard to tell exactly what is consuming time, but I’d suggest starting by profiling the single application first. Three things to look at there: 1) How many stages and how many tasks per stage is Spark launching (in the application web UI at http://driver:4040)? If you have

Spark on YARN use only one node

2014-03-09 Thread Assaf
Hi, I've installed Spark 0.8.1 on IDH 3.0.2 to run on YARN. My cluster has 3 servers: 1 is both NN and DN, the other 2 are DN only. I managed to launch spark-shell and execute the MLlib k-means. The problem is that it uses only one node (the NN) and does not run on the other 2 DNs. Please advise. My spark-env.sh

Re: State of spark docker script

2014-03-09 Thread Aaron Davidson
Whoa, wait, the docker scripts are only used for testing purposes right now. They have not been designed with the intention of replacing the spark-ec2 scripts. For instance, there isn't an ssh server running so you can stop and restart the cluster (like sbin/stop-all.sh). Also, we currently mount

no stdout output from worker

2014-03-09 Thread Sen, Ranjan [USA]
Hi, I have some System.out.println calls in my Java code that work OK in a local environment. But when I run the same code in standalone mode on an EC2 cluster, I do not see the output in the worker stdout (on the worker node under the Spark location/work directory) or at the driver console. Could you help me

CDH5b2, Spark 0.9.0 and shark

2014-03-09 Thread danoomistmatiste
Hi, I am running cdh5b2. I have installed the hadoop2 version of Shark 0.9.0 for cdh5. I want to know if there is a compatible version of Shark that will run with this combination.

Re: Sbt Permgen

2014-03-09 Thread Sandy Ryza
There was an issue related to this fixed recently: https://github.com/apache/spark/pull/103 On Sun, Mar 9, 2014 at 8:40 PM, Koert Kuipers ko...@tresata.com wrote: I edit the last line of sbt/sbt, after which I run: sbt/sbt test On Sun, Mar 9, 2014 at 10:24 PM, Sean Owen so...@cloudera.com wrote:

Re: no stdout output from worker

2014-03-09 Thread Patrick Wendell
Hey Sen, is your code in the driver or inside one of the tasks? If it's in the tasks, the place you would expect the output is the stdout file under spark/appid/work/[stdout/stderr]. Are you seeing at least stderr logs in that folder? If not, then the tasks might not be running on the