Hi,
Would you mind sharing the piece of code that caused this exception? Per the
Javadoc, NoSuchElementException is thrown if you call the nextElement() method
of an Enumeration when there are no more elements in the Enumeration.
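As an illustration, a minimal sketch that triggers exactly this behavior (using an empty Enumeration from java.util.Collections):

  import java.util.Collections
  // An Enumeration with no elements; calling nextElement() on it
  // throws java.util.NoSuchElementException.
  val e = Collections.emptyEnumeration[String]()
  e.nextElement()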
Thanks
Best Regards.
On Tue, Apr 22, 2014 at 8:50 AM, gogototo
Do you have the cluster deployed on AWS? Could you try checking whether port
7077 is accessible from the worker nodes?
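As a quick sketch of one way to check from a worker node (the master hostname here is a placeholder):

  import java.net.{InetSocketAddress, Socket}
  // Try a TCP connection to the master port with a 5 second timeout;
  // an exception here means the port is not reachable from this node.
  val sock = new Socket()
  try {
    sock.connect(new InetSocketAddress("master-ip", 7077), 5000)
    println("port 7077 is reachable")
  } finally sock.close()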
On Tue, Apr 22, 2014 at 2:56 AM, jaeholee jho...@lbl.gov wrote:
Hi, I am trying to set up my own standalone Spark, and I started the master
node and worker nodes. Then I ran
This is a known bug in GraphX, and the fix is in PR #367:
https://github.com/apache/spark/pull/367
Applying that PR should solve the problem.
Ankur http://www.ankurdave.com/
On Mon, Apr 21, 2014 at 8:20 PM, gogototo wangbi...@gmail.com wrote:
java.util.NoSuchElementException: End of stream
Hi All,
I want to access a particular column of a DB table stored in a CSV format
and perform some aggregate queries over it. I wrote the following query in
Scala as a first step.
var add = (x: String) => x.split("\\s+")(2).toInt
var result = List[Int]()
input.split("\\n").foreach(x => result ::= add(x))
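For comparison, a sketch of the same column extraction done as an RDD transformation instead of a local loop (the path is a placeholder; whitespace-delimited columns are assumed):

  // Pull the third whitespace-separated field out of each line as an Int,
  // then aggregate; reduce is one example of an aggregate over the RDD.
  val values = sc.textFile("somedirectory/data.csv").map(_.split("\\s+")(2).toInt)
  val total = values.reduce(_ + _)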
Cool, thanks for the link.
Bertrand Dechoux
On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
Also see: https://github.com/apache/spark/pull/455
This will add support for reading SequenceFiles and other InputFormats in
PySpark, as long as the Writables are either
On this page http://spark.apache.org/docs/0.9.0/running-on-yarn.html,
we have to use the Spark assembly to submit Spark apps to a YARN cluster.
I checked the assembly jars of Spark. They contain some YARN classes,
which are added at compile time. Those YARN classes are not what I want.
My question
Hi,
I have installed
spark-0.9.0-incubating-bin-cdh4 and I am using Apache Flume for streaming. I
have used streaming.examples.FlumeEventCount. I have also written an Avro conf
file for Flume. When I try to do streaming in Spark and I run the following
command
We don't have anything fancy. It's basically a very thin layer of Google
specifics on top of a standalone cluster. We basically created two disk
snapshots, one for the master and one for the workers. The snapshots
contain initialization scripts so that the master/worker daemons are
started on
The data partitioning is done by default according to the number of
HDFS blocks of the source.
You can change the partitioning with .repartition, either to increase or
decrease the level of parallelism:
val recordsRDD =
  sc.sequenceFile[NullWritable, BytesWritable](FilePath, 256)
val
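As a sketch of both directions (the partition counts are illustrative):

  // Increase parallelism; repartition always performs a full shuffle.
  val morePartitions = recordsRDD.repartition(512)
  // Decrease parallelism; coalesce can avoid a shuffle when shrinking.
  val fewerPartitions = recordsRDD.coalesce(64)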
Hi
I have a Spark job that reads files from HDFS, does some pretty basic
transformations, then writes them to some other location on HDFS.
I'm running this job with spark-0.9.1-rc3, on Hadoop YARN with Kerberos
security enabled.
One of my approaches to fixing this issue was changing SparkConf,
We've been experiencing this as well, and our simple solution is to actually
keep retrying the ssh connection instead of just waiting.
Something like this:
def wait_for_ssh_connection(opts, host):
    u.message("Waiting for ssh connection to host {}".format(host))
    connected = False
    while not connected:
I got the following performance; is it normal for Spark to behave like this?
Sometimes Spark switches from PROCESS_LOCAL into NODE_LOCAL mode and it becomes
10x faster. I am very confused.
scala> val a = sc.textFile("/user/exobrain/batselem/LUBM1000")
scala> a.count()
Long = 137805557
took 130.809661618 s
On Tue, 22 Apr 2014 12:28:15 +0200
Marcin Cylke marcin.cy...@ext.allegro.pl wrote:
Hi
I have a Spark job that reads files from HDFS, does some pretty basic
transformations, then writes them to some other location on HDFS.
I'm running this job with spark-0.9.1-rc3, on Hadoop YARN with
On Mon, Apr 21, 2014 at 7:59 PM, Jim Carroll jimfcarr...@gmail.com wrote:
I'm experimenting with a few things, trying to understand how it's working. I
took the JavaSparkPi example as a starting point and added a few System.out
lines.
I added a System.out to the main body of the driver
Hi,
Sometimes running the very same Spark application binary behaves
differently with every execution:
- The Ganglia profile is different with every execution: sometimes it
takes 0.5 TB of memory, the next time it takes 1 TB, the next
time it is 0.75 TB...
- Spark UI shows
Hi all,
I am using GraphX in spark-0.9.0-incubating. The number of vertices can be 100
million and the number of edges can be 1 billion in our graph. As a result, I
must use my limited memory carefully. So I have some questions about the GraphX
module.
Why do some transformations like partitionBy,
How long are the count() steps taking? And how many partitions are pairs1 and
triples initially divided into? You can see this by doing
pairs1._jrdd.splits().size(), for example.
If you just need to count the number of distinct keys, is it faster if you
did the following instead of
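As a sketch of that idea in the Scala API (pairs1 is assumed to be an RDD of key-value pairs, as in the thread):

  // Counting distinct keys directly avoids materializing grouped values.
  val distinctKeyCount = pairs1.map(_._1).distinct().count()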
Thomson Reuters is looking for a graduate (or possibly advanced
undergraduate) summer intern in Eagan, MN. This is a chance to work on an
innovative project exploring how big data sets can be used by professionals
such as lawyers, scientists and journalists. If you're subscribed to this
mailing
Hi Gordon,
We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to
pass -Phadoop-provided to Maven and avoid including Hadoop and its
dependencies in the assembly jar.
-Sandy
On Tue, Apr 22, 2014 at 2:43 AM, Gordon Wang gw...@gopivotal.com wrote:
In this page
No, I am not using AWS. I am using one of the national labs' clusters. But
as I mentioned, I am pretty new to computer science, so I might not be
answering your question right... but port 7077 is accessible.
Maybe I got it wrong from the get-go? I will just write down what I did...
Basically I
These are excellent questions. Answers below:
On Tue, Apr 22, 2014 at 8:20 AM, wu zeming zemin...@gmail.com wrote:
1. Why do some transformations like partitionBy, mapVertices cache the new
graph and some like outerJoinVertices not?
In general, we cache RDDs that are used more than once to
I was getting this error after upgrading my nodes to Python2.7. I suspected
the problem was due to conflicting Python versions, but my 2.7 install
seemed correct on my nodes.
I set the PYSPARK_PYTHON variable to my 2.7 install (as I still had 2.6
installed and linked to the 'python' executable,
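For reference, a sketch of that setting (the interpreter path is a placeholder) in conf/spark-env.sh on each node:

  export PYSPARK_PYTHON=/usr/bin/python2.7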
Hello all -
I'm running the ALS/Collaborative Filtering code through PySpark on Spark 0.9.0.
(http://spark.apache.org/docs/0.9.0/mllib-guide.html#using-mllib-in-python)
My data file has about 27M tuples (User, Item, Rating). ALS.train(ratings, 1, 30)
runs on my 3-node cluster (24 cores, 60GB RAM)
Could you try setting the MASTER variable in spark-env.sh?
export MASTER=spark://master-ip:7077
For starting the standalone cluster, ./sbin/start-all.sh should work as long
as you have passwordless SSH access to all machines. Any error here?
On Tue, Apr 22, 2014 at 10:10 PM, jaeholee jho...@lbl.gov
wow! it worked! thank you so much!
so now, all I need to do is to put the number of workers that I want to use
when I read the data right?
e.g.
val numWorkers = 10
val data = sc.textFile("somedirectory/data.csv", numWorkers)
Hello Neha,
This is the result of a known bug in 0.9. Can you try running the latest
Spark master branch to see if this problem is resolved?
TD
On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh nehas.si...@lntinfotech.com wrote:
Hi,
I have installed
Most likely the problem is not that the data as a whole is too big. For most
operations the data is processed partition by partition. The partitions may be
too big. This is what your last question hints at too:
val numWorkers = 10
val data = sc.textFile("somedirectory/data.csv", numWorkers)
This will work, but not quite
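As a sketch (the path and count are illustrative), the second argument to textFile is a minimum number of partitions, and you can inspect what you actually got:

  val data = sc.textFile("somedirectory/data.csv", 128)
  // partitions.size reports how many partitions the RDD ended up with.
  println(data.partitions.size)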
Greetings,
I have created an RDD of 60 rows and then I joined it with itself. For
some reason Spark consumes all of my storage, which is more than 20GB of
free storage! Is this the expected behavior of Spark? Am I doing something
wrong here? The code is shown below (done in Java). I tried to
How do you determine the number of partitions? For example, I have 16
workers, and the number of cores and the worker memory set in spark-env.sh
are:
CORE = 8
MEMORY = 16g
The .csv data I have is about 500MB, but I am eventually going to use a file
that is about 15GB.
Is the MEMORY variable in
On Wed, Apr 23, 2014 at 12:06 AM, jaeholee jho...@lbl.gov wrote:
How do you determine the number of partitions? For example, I have 16
workers, and the number of cores and the worker memory set in spark-env.sh
are:
CORE = 8
MEMORY = 16g
So you have the capacity to work on 16 * 8 = 128
Ok. I tried setting the partition number to 128 and numbers greater than 128,
and now I get another error message about Java heap space. Is it possible
that there is something wrong with the setup of my Spark cluster to begin
with? Or is it still an issue with partitioning my data? Or do I just
I am trying to read an edge list into a Graph. My data looks like
394365859 -- 136153151
589404147 -- 1361045425
I read it into a Graph via:
val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName)
val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t"))
.map(x
I wasn't able to reproduce this with a small test file, but I did change
the file parsing to use x(1).toLong instead of x(2).toLong. Did you mean to
take the third column rather than the second?
If so, would you mind posting a larger sample of the file, or even the
whole file if possible?
Here's
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt
This code:
println(g.numEdges)
println(g.numVertices)
println(g.edges.distinct().count())
gave me
1
9294
2
On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote:
I wasn't able to reproduce this
Hi Sandy,
Thanks for your reply !
Does this work for sbt ?
I checked the commit, looks like only maven build has such option.
On Wed, Apr 23, 2014 at 12:38 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Gordon,
We recently handled this in SPARK-1064. As of 1.0.0, you'll be able to
I can't seem to instantiate a SparkContext. What am I doing wrong? I tried
using a SparkConf and the 2-string constructor with identical results.
(Note that the project is configured for eclipse in the pom, but I'm
compiling and running on the command line.)
Here's the exception:
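For reference, a minimal sketch of both construction styles that should work (the master URL and app name are placeholders, not from the original post):

  import org.apache.spark.{SparkConf, SparkContext}
  // Style 1: the two-string constructor.
  val sc1 = new SparkContext("local[2]", "MyApp")
  sc1.stop()
  // Style 2: via SparkConf.
  val conf = new SparkConf().setMaster("local[2]").setAppName("MyApp")
  val sc2 = new SparkContext(conf)
  sc2.stop()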
Hi, all
I used to run my app using sbt run, but now I want to see the job
information in the Spark web UI.
I'm in local mode; I start the spark shell and access the web UI at
http://ubuntu.local:4040/stages/.
But when I sbt run some application, there is no response in the web UI.
How to make
Hi Ankur,
Thanks for your reply! I think these will be useful for me. I hope these can
be improved in spark-0.9.2 or spark-1.0.
Another thing I forgot: I think the persist API for Graph, VertexRDD and
EdgeRDD should not be made public for now, because it will lead to
UnsupportedOperationException when
I currently don't have plans to work on that.
-Sandy
On Apr 22, 2014, at 8:06 PM, Gordon Wang gw...@gopivotal.com wrote:
Thanks I see. Do you guys have plan to port this to sbt?
On Wed, Apr 23, 2014 at 10:24 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Right, it only works for Maven
Hi
SparkContext launches the web interface on port 4040. If you have multiple
SparkContexts on the same machine, they will bind to successive ports
beginning with 4040.
Here's the documentation:
https://spark.apache.org/docs/0.9.0/monitoring.html
And here's a simple Scala program to
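A sketch of such a program (the app name is a placeholder, not the original post's code): it runs a small job, then sleeps so the UI on port 4040 stays up long enough to inspect.

  import org.apache.spark.SparkContext

  object WebUIDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local", "WebUIDemo")
      // Run a small job so something shows up under /stages.
      println(sc.parallelize(1 to 1000).reduce(_ + _))
      // Keep the process, and the web UI on port 4040, alive briefly.
      Thread.sleep(60000)
      sc.stop()
    }
  }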