Re: how to solve this problem?

2014-04-22 Thread Akhil Das
Hi, Would you mind sharing the piece of code that caused this exception? As per the Javadoc, NoSuchElementException is thrown if you call the nextElement() method of an Enumeration and there are no more elements in the Enumeration. Thanks Best Regards. On Tue, Apr 22, 2014 at 8:50 AM, gogototo
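
For illustration, a minimal sketch of the guard that Javadoc implies, in Scala, assuming a java.util.Enumeration named elems (the helper name is made up):

    // Drain an Enumeration safely: hasMoreElements guards against the
    // NoSuchElementException that nextElement() throws when nothing is left.
    def drain[A](elems: java.util.Enumeration[A]): List[A] = {
      val buf = scala.collection.mutable.ListBuffer.empty[A]
      while (elems.hasMoreElements) buf += elems.nextElement()
      buf.toList
    }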

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Praveen R
Do you have the cluster deployed on AWS? Could you try checking if port 7077 is accessible from the worker nodes? On Tue, Apr 22, 2014 at 2:56 AM, jaeholee jho...@lbl.gov wrote: Hi, I am trying to set up my own standalone Spark, and I started the master node and worker nodes. Then I ran

Re: how to solve this problem?

2014-04-22 Thread Ankur Dave
This is a known bug in GraphX, and the fix is in PR #367: https://github.com/apache/spark/pull/367 Applying that PR should solve the problem. Ankur http://www.ankurdave.com/ On Mon, Apr 21, 2014 at 8:20 PM, gogototo wangbi...@gmail.com wrote: java.util.NoSuchElementException: End of stream

Efficient Aggregation over DB data

2014-04-22 Thread Sai Prasanna
Hi All, I want to access a particular column of a DB table stored in a CSV format and perform some aggregate queries over it. I wrote the following query in scala as a first step. var add = (x: String) => x.split("\\s+")(2).toInt var result = List[Int]() input.split("\n").foreach(x => result ::= add(x))
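
A sketch of the same idea applied to an RDD rather than a single in-memory string, assuming a SparkContext named sc and a hypothetical file path; the column index 2 is carried over from the snippet above:

    // Extract the third whitespace-separated column and sum it.
    val rows = sc.textFile("table.csv")              // hypothetical path
    val column = rows.map(_.split("\\s+")(2).toInt)
    val total = column.reduce(_ + _)                 // example aggregate: sum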

Re: PySpark still reading only text?

2014-04-22 Thread Bertrand Dechoux
Cool, thanks for the link. Bertrand Dechoux On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Also see: https://github.com/apache/spark/pull/455 This will add support for reading sequencefile and other inputformat in PySpark, as long as the Writables are either

Question about running spark on yarn

2014-04-22 Thread Gordon Wang
In this page http://spark.apache.org/docs/0.9.0/running-on-yarn.html We have to use spark assembly to submit spark apps to yarn cluster. And I checked the assembly jars of spark. It contains some yarn classes which are added during compile time. The yarn classes are not what I want. My question

Bind exception while running FlumeEventCount

2014-04-22 Thread NehaS Singh
Hi, I have installed spark-0.9.0-incubating-bin-cdh4 and I am using apache flume for streaming. I have used the streaming.examples.FlumeEventCount. Also I have written an Avro conf file for flume. When I try to do streaming in Spark and I run the following command

Re: Using google cloud storage for spark big data

2014-04-22 Thread Andras Nemeth
We don't have anything fancy. It's basically some very thin layer of google specifics on top of a stand alone cluster. We basically created two disk snapshots, one for the master and one for the workers. The snapshots contain initialization scripts so that the master/worker daemons are started on

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-22 Thread Andre Bois-Crettez
The data partitioning is done by default *according to the number of HDFS blocks* of the source. You can change the partitioning with .repartition, either to increase or decrease the level of parallelism: val recordsRDD = sc.sequenceFile[NullWritable,BytesWritable](FilePath,256) val
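
A minimal sketch of the point, assuming a SparkContext named sc and a hypothetical HDFS path:

    import org.apache.spark.SparkContext._
    import org.apache.hadoop.io.{BytesWritable, NullWritable}

    // By default there is roughly one partition per HDFS block; the second
    // argument asks for at least 256 input partitions instead.
    val recordsRDD =
      sc.sequenceFile[NullWritable, BytesWritable]("hdfs:///data/records", 256)
    println(recordsRDD.partitions.size)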

'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
Hi I have a Spark job that reads files from HDFS, does some pretty basic transformations, then writes it to some other location on hdfs. I'm running this job with spark-0.9.1-rc3, on Hadoop Yarn with Kerberos security enabled. One of my approaches to fixing this issue was changing SparkConf,

Re: Spark-ec2 asks for password

2014-04-22 Thread Pierre Borckmans
We’ve been experiencing this as well, and our simple solution is to actually keep trying the ssh connection instead of just waiting: Something like this: def wait_for_ssh_connection(opts, host): u.message("Waiting for ssh connection to host {}".format(host)) connected = False while

help me

2014-04-22 Thread Joe L
I got the following performance, is it normal in spark to be like this? Sometimes spark switches into node_local mode from process_local and it becomes 10x faster. I am very confused. scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000") scala> f.count() Long = 137805557 took 130.809661618 s

Re: 'Filesystem closed' while running spark job

2014-04-22 Thread Marcin Cylke
On Tue, 22 Apr 2014 12:28:15 +0200 Marcin Cylke marcin.cy...@ext.allegro.pl wrote: Hi I have a Spark job that reads files from HDFS, does some pretty basic transformations, then writes it to some other location on hdfs. I'm running this job with spark-0.9.1-rc3, on Hadoop Yarn with

Re: stdout in workers

2014-04-22 Thread Daniel Darabos
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll jimfcarr...@gmail.com wrote: I'm experimenting with a few things trying to understand how it's working. I took the JavaSparkPi example as a starting point and added a few System.out lines. I added a system.out to the main body of the driver

Spark runs applications in an inconsistent way

2014-04-22 Thread Aureliano Buendia
Hi, Sometimes running the very same spark application binary, behaves differently with every execution. - The Ganglia profile is different with every execution: sometimes it takes 0.5 TB of memory, the next time it takes 1 TB of memory, the next time it is 0.75 TB... - Spark UI shows

Some questions in using Graphx

2014-04-22 Thread wu zeming
Hi all, I am using Graphx in spark-0.9.0-incubating. The number of vertices can be 100 million and the number of edges can be 1 billion in our graph. As a result, I must carefully use my limited memory. So I have some questions about the Graphx module. Why do some transformations like partitionBy,

Re: Spark is slow

2014-04-22 Thread Nicholas Chammas
How long are the count() steps taking? And how many partitions are pairs1 and triples initially divided into? You can see this by doing pairs1._jrdd.splits().size(), for example. If you just need to count the number of distinct keys, is it faster if you did the following instead of
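
For instance, the distinct-key count alone can be computed without materialising the grouped data; a Scala sketch of the idea, assuming a pair RDD named pairs1:

    import org.apache.spark.SparkContext._

    // Count distinct keys directly rather than grouping the full dataset first.
    val numKeys = pairs1.keys.distinct().count()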

internship opportunity

2014-04-22 Thread Tom Vacek
Thomson Reuters is looking for a graduate (or possibly advanced undergraduate) summer intern in Eagan, MN. This is a chance to work on an innovative project exploring how big data sets can be used by professionals such as lawyers, scientists and journalists. If you're subscribed to this mailing

Re: Question about running spark on yarn

2014-04-22 Thread Sandy Ryza
Hi Gordon, We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to pass -Phadoop-provided to Maven and avoid including Hadoop and its dependencies in the assembly jar. -Sandy On Tue, Apr 22, 2014 at 2:43 AM, Gordon Wang gw...@gopivotal.com wrote: In this page

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
No, I am not using AWS. I am using one of the national lab's clusters. But as I mentioned, I am pretty new to computer science, so I might not be answering your question right... but 7077 is accessible. Maybe I got it wrong from the get-go? I will just write down what I did... Basically I

Re: Some questions in using Graphx

2014-04-22 Thread Ankur Dave
These are excellent questions. Answers below: On Tue, Apr 22, 2014 at 8:20 AM, wu zeming zemin...@gmail.com wrote: 1. Why do some transformations like partitionBy, mapVertices cache the new graph and some like outerJoinVertices not? In general, we cache RDDs that are used more than once to
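
A sketch of that caching pattern, assuming a GraphX Graph named graph that is used more than once:

    // Cache a graph that is reused so it is not recomputed on each access.
    val g = graph.cache()
    println(g.numVertices)
    println(g.numEdges)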

Re: java.net.SocketException on reduceByKey() in pyspark

2014-04-22 Thread benlaird
I was getting this error after upgrading my nodes to Python2.7. I suspected the problem was due to conflicting Python versions, but my 2.7 install seemed correct on my nodes. I set the PYSPARK_PYTHON variable to my 2.7 install (as I still had 2.6 installed and linked to the 'python' executable,

Running large join in ALS example through PySpark

2014-04-22 Thread Laird, Benjamin
Hello all - I'm running the ALS/Collaborative Filtering code through pySpark on spark0.9.0. (http://spark.apache.org/docs/0.9.0/mllib-guide.html#using-mllib-in-python) My data file has about 27M tuples (User, Item, Rating). ALS.train(ratings,1,30) runs on my 3 node cluster (24 cores, 60GB RAM)

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Praveen R
Could you try setting the MASTER variable in spark-env.sh: export MASTER=spark://master-ip:7077 For starting the standalone cluster, ./sbin/start-all.sh should work as long as you have passwordless access to all machines. Any error here? On Tue, Apr 22, 2014 at 10:10 PM, jaeholee jho...@lbl.gov

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
wow! it worked! thank you so much! so now, all I need to do is to put the number of workers that I want to use when I read the data right? e.g. val numWorkers = 10 val data = sc.textFile("somedirectory/data.csv", numWorkers) -- View this message in context:

Re: Bind exception while running FlumeEventCount

2014-04-22 Thread Tathagata Das
Hello Neha, This is the result of a known bug in 0.9. Can you try running the latest Spark master branch to see if this problem is resolved? TD On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh nehas.si...@lntinfotech.com wrote: Hi, I have installed

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
Most likely it is not simply that the data is too big. For most operations the data is processed partition by partition. The partitions may be too big. This is what your last question hints at too: val numWorkers = 10 val data = sc.textFile("somedirectory/data.csv", numWorkers) This will work, but not quite
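
A sketch of the distinction, with the path kept as in the question; the second argument to textFile is a minimum number of partitions, not a worker count:

    // Ask for at least 128 input partitions when reading the file...
    val data = sc.textFile("somedirectory/data.csv", 128)
    // ...or reshuffle an already-loaded RDD into a different partition count.
    val reparted = data.repartition(128)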

Joining large dataset causes failure on Spark!

2014-04-22 Thread Hasan Asfoor
Greetings, I have created an RDD of 60 rows and then I joined it with itself. For some reason Spark consumes all of my storage which is more than 20GB of free storage! Is this the expected behavior of Spark!? Am I doing something wrong here? The code is shown below (done in Java). I tried to
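
One common cause is key multiplicity: a self-join produces n * n output rows for a key that appears n times. A small Scala sketch of the effect (the Java API behaves the same way), assuming a SparkContext named sc:

    import org.apache.spark.SparkContext._

    // A key that appears 1000 times yields 1000 * 1000 joined rows.
    val pairs = sc.parallelize(Seq.fill(1000)(("sameKey", 1)))
    val joined = pairs.join(pairs)
    println(joined.count())   // 1000000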

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
How do you determine the number of partitions? For example, I have 16 workers, and the number of cores and the worker memory set in spark-env.sh are: CORE = 8 MEMORY = 16g The .csv data I have is about 500MB, but I am eventually going to use a file that is about 15GB. Is the MEMORY variable in

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread Daniel Darabos
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee jho...@lbl.gov wrote: How do you determine the number of partitions? For example, I have 16 workers, and the number of cores and the worker memory set in spark-env.sh are: CORE = 8 MEMORY = 16g So you have the capacity to work on 16 * 8 = 128

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-22 Thread jaeholee
Ok. I tried setting the partition number to 128 and numbers greater than 128, and now I get another error message about Java heap space. Is it possible that there is something wrong with the setup of my Spark cluster to begin with? Or is it still an issue with partitioning my data? Or do I just

GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like 394365859 -- 136153151 589404147 -- 1361045425 I read it into a Graph via: val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName) val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t")) .map(x
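
For comparison, GraphX also ships a loader for whitespace-separated edge lists; a sketch assuming that kind of file (the path is hypothetical):

    import org.apache.spark.graphx.GraphLoader

    // Builds a Graph[Int, Int] from "srcId dstId" lines separated by whitespace.
    val g = GraphLoader.edgeListFile(sc, "hdfs:///data/unidir-edges.txt")
    println(g.numEdges)
    println(g.numVertices)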

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ankur Dave
I wasn't able to reproduce this with a small test file, but I did change the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to take the third column rather than the second? If so, would you mind posting a larger sample of the file, or even the whole file if possible? Here's

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote: I wasn't able to reproduce this

Re: Question about running spark on yarn

2014-04-22 Thread Gordon Wang
Hi Sandy, Thanks for your reply! Does this work for sbt? I checked the commit; it looks like only the Maven build has such an option. On Wed, Apr 23, 2014 at 12:38 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Gordon, We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to

No configuration setting found for key 'akka.version'

2014-04-22 Thread mbaryu
I can't seem to instantiate a SparkContext. What am I doing wrong? I tried using a SparkConf and the 2-string constructor with identical results. (Note that the project is configured for eclipse in the pom, but I'm compiling and running on the command line.) Here's the exception:

no response in spark web UI

2014-04-22 Thread wxhsdp
Hi all, I used to run my app using sbt run, but now I want to see the job information in the Spark web UI. I'm in local mode; I start the spark shell and access the web UI at http://ubuntu.local:4040/stages/, but when I sbt run some application, there is no response in the web UI. How to make

Re: Some questions in using Graphx

2014-04-22 Thread Wu Zeming
Hi Ankur, Thanks for your reply! I think these are useful to me. I hope these can be improved in spark-0.9.2 or spark-1.0. Another thing I forgot: I think the persist api for Graph, VertexRDD and EdgeRDD should not be public now, because it will lead to UnsupportedOperationException when

Re: Question about running spark on yarn

2014-04-22 Thread sandy . ryza
I currently don't have plans to work on that. -Sandy On Apr 22, 2014, at 8:06 PM, Gordon Wang gw...@gopivotal.com wrote: Thanks I see. Do you guys have plan to port this to sbt? On Wed, Apr 23, 2014 at 10:24 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Right, it only works for Maven

Re: no response in spark web UI

2014-04-22 Thread Akhil Das
Hi SparkContext launches the web interface at 4040; if you have multiple SparkContexts on the same machine then they will bind to successive ports beginning with 4040. Here's the documentation: https://spark.apache.org/docs/0.9.0/monitoring.html And here's a simple scala program to
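
A minimal sketch of a program that keeps its SparkContext (and hence the UI on port 4040) alive long enough to inspect, assuming local mode; the app name and sleep duration are arbitrary:

    import org.apache.spark.{SparkConf, SparkContext}

    object UiExample {
      def main(args: Array[String]) {
        val conf = new SparkConf().setMaster("local").setAppName("ui-example")
        val sc = new SparkContext(conf)
        sc.parallelize(1 to 1000000).count()
        // The UI for this context is served at http://localhost:4040 while sc is alive.
        Thread.sleep(60 * 1000)
        sc.stop()
      }
    }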