Hi, I am a beginner with Hadoop and Spark, and I'd like some help understanding
how Hadoop works.
Suppose we have a cluster of 5 computers and install Spark on the cluster
WITHOUT Hadoop. Then we run this code on one computer:
val doc = sc.textFile("/home/scalatest.txt", 5)
doc.count
Can the count task
Without caching, an RDD will be evaluated multiple times if referenced
multiple times by other RDDs. A silly example:
val text = sc.textFile("input.log")
val r1 = text.filter(_.startsWith("ERROR"))
val r2 = text.map(_.split(" ")(0)) // first token, so r1 and r2 share an element type
val r3 = (r1 ++ r2).collect()
Here the input file will be scanned twice.
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb
question :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.com wrote:
Without caching,
As long as the path is available on all machines, you should be able to
leverage distribution. HDFS is one way to make that happen; NFS is another;
simple replication is another.
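For example, if the file from the original question is replicated to the same path on every node, a plain local read should work. A minimal sketch, assuming the file exists on each worker:
val doc = sc.textFile("file:///home/scalatest.txt", 5) // no HDFS required
doc.count()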
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Very abstract.
EC2 is an unlikely culprit.
What are you trying to do? Spark is typically not inconsistent like that,
but huge intermediate data or reduce-size issues could be involved; it's hard
to help without more detail about what you are trying to achieve.
Mayur Rustagi
Ph: +1 (760) 203 3257
Good question :)
Although the RDD DAG is lazily evaluated, it's not exactly the same as a Scala
lazy val. For a Scala lazy val, the evaluated value is automatically cached,
while evaluated RDD elements are not cached unless you call .cache() explicitly,
because materializing an RDD can often be expensive.
To experiment, try this in the Spark shell:
val r0 = sc.makeRDD(1 to 3, 1)
val r1 = r0.map { x =>
  println(x)
  x
}
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect()
You'll see the elements of r1 printed (thus evaluated) twice. By adding
.cache() to r1, you'll see them printed only once.
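A minimal sketch of the cached variant (same shell session assumed; with one partition the elements are typically printed just once):
val r1 = r0.map { x => println(x); x }.cache()
val r2 = r1.map(_ * 2)
val r3 = r1.map(_ * 2 + 1)
(r2 ++ r3).collect() // r1 is materialized once and reused from the cache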
I have a similar question.
I'm testing in standalone mode on only one PC.
I use ./sbin/start-master.sh to start a master and
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077
to connect to the master.
From the web UI, I can see the local worker registered.
Hello everyone,
I'm a newbie in both Hadoop and Spark, so please forgive any obvious
mistakes; I'm posting because my google-fu has failed me.
I'm trying to run a test Spark script in order to connect Spark to Hadoop.
The script is the following:
from pyspark import SparkContext
sc =
With the right program you can always exhaust any amount of memory :).
There is no silver bullet. You have to figure out what in your code causes
the high memory use and address that. I spent all of last week doing this
for a simple program of my own. Lessons I learned that may or
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think
the fix will be in the next release. But until then, do:
g.edges.map(_.copy()).distinct.count
On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.com wrote:
Try this:
I got it, thanks very much :)
- Spark UI shows the number of succeeded tasks as more than the total number
of tasks, e.g. 3500/3000. There are no failed tasks. At this stage the
computation keeps carrying on for a long time without returning an answer.
There is no sign of resubmitted tasks in the command-line logs either.
You
My code is like:
val rdd2 = rdd1.filter(_._2.length > 1)
rdd2.collect()
It works well, but if I use a variable num instead of 1:
var num = 1
val rdd2 = rdd1.filter(_._2.length > num)
rdd2.collect()
it fails at rdd2.collect(). So strange?
I have a small program which I can launch successfully via the YARN client
in yarn-standalone mode.
The commands look like this:
(javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar
LoadTest.java)
(jar cvf loadtest.jar LoadTest.class)
You need to set SPARK_MEM or SPARK_EXECUTOR_MEMORY (for Spark 1.0) to the
amount of memory your application needs to consume on each node. Try
setting those variables (example: export SPARK_MEM=10g) or set it via
SparkConf.set as suggested by jholee.
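A minimal sketch of the SparkConf route; the 10g value mirrors the export example, and the app name and master URL are assumptions:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MyApp")              // hypothetical
  .setMaster("spark://master:7077") // hypothetical
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(conf)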
On Tue, Apr 22, 2014 at 4:25 PM, jaeholee
Hi there,
I am new to Spark and new to Scala, although I have lots of experience on the
Java side. I am experimenting with Spark for a new project where it seems
like it could be a good fit. As I go through the examples, there is one
scenario that I am trying to figure out, comparing the
Yes, things get more unstable with larger data. But that's the whole point
of my question:
Why should Spark get unstable when data gets larger?
When data gets larger, Spark should get *slower*, not more unstable. Lack
of stability makes parameter tuning very difficult, time-consuming and a
This could happen if the variable is defined in such a way that it pulls its
own class reference into the closure. Serialization then tries to serialize
the whole outer class reference, which is not serializable, so the whole
thing fails.
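A hedged sketch of that failure mode and the usual fix; the class and field names are hypothetical:
import org.apache.spark.rdd.RDD
class Job {                            // not Serializable
  var num = 1
  def bad(rdd: RDD[(String, String)]) =
    rdd.filter(_._2.length > num)      // `num` is a field, so `this` is captured
  def good(rdd: RDD[(String, String)]) = {
    val n = num                        // copy the field to a local val first
    rdd.filter(_._2.length > n)        // the closure now captures only `n`
  }
}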
On Wed, Apr 23, 2014 at 3:15 PM, randylu randyl...@gmail.com
Hi,
We got Spork working on Spark 0.9.0.
Repository available at:
https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
Please share your feedback.
-
Lalit Yadav
la...@sigmoidanalytics.com
Hi,
What is the easiest way to skip the first n lines in an RDD?
I am not able to figure this one out.
Thanks
Spark hangs after I perform the following operations:
ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
/*
add 40k entries to bytesList
*/
JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
System.out.println("Count = " + rdd.count());
If I add just one entry it works. It works if I
Good question; I am wondering too how it is possible to add a line
number to distributed data.
I thought it was a job for mapPartitionsWithIndex, but it seems difficult.
Something similar here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html#a995
Maybe at the
sorry...added a subject now
On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote:
I am trying to run the example linear regression code from
http://spark.apache.org/docs/latest/mllib-guide.html
but I am getting the following error... am I missing an import?
If the first partition doesn't have enough records, then it may not
drop enough lines. Try
rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1)
It might trigger a job.
Best,
Xiangrui
On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote:
Hi Chengi,
If you just want to skip first
How big is each entry, and how much memory do you have on each
executor? You generated all the data on the driver, and
sc.parallelize(bytesList) will send the entire dataset to a single
executor. You may run into I/O or memory issues. If the entries are
generated, you should create a simple RDD
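A hedged sketch of that alternative: parallelize indices and generate each entry on the executors instead of shipping 40k byte arrays from the driver (makeEntry is a hypothetical generator):
def makeEntry(i: Int): Array[Byte] = Array.fill(1024)(i.toByte) // hypothetical
val rdd = sc.parallelize(0 until 40000, 100).map(makeEntry)
println("Count = " + rdd.count())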
Sorry, I didn't realize that zipWithIndex() is not in v0.9.1. It is in
the master branch and will be included in v1.0. It first counts the number
of records per partition and then assigns indices starting from 0.
-Xiangrui
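For v0.9.1, a hedged sketch of the same two-pass idea using mapPartitionsWithIndex (count per partition, then add per-partition offsets):
val counts = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
                .collect().sortBy(_._1).map(_._2)
val offsets = counts.scanLeft(0L)(_ + _) // start index of each partition
val indexed = rdd.mapPartitionsWithIndex { (i, it) =>
  it.zipWithIndex.map { case (x, j) => (x, offsets(i) + j) }
}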
On Wed, Apr 23, 2014 at 9:56 AM, Chengi Liu chengi.liu...@gmail.com wrote:
What I suggested will not work if the number of records you want to drop is
larger than the data in the first partition. In my use case, I only drop the
first couple of lines, so I don't have this issue.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
PipedRDD is an RDD[String]. If you know how to parse each result line
into (key, value) pairs, then you can call reduce after.
piped.map(x => (key, value)).reduceByKey((v1, v2) => v)
-Xiangrui
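A hedged, concrete instance of that pattern, assuming each piped output line looks like "key<TAB>count":
val counts = piped.map { line =>
  val Array(k, v) = line.split("\t")
  (k, v.toLong)
}.reduceByKey(_ + _)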
On Wed, Apr 23, 2014 at 2:09 AM, zhxfl 291221...@qq.com wrote:
Hello, we know Hadoop-streaming is use
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent
build of the docs; if you spot any problems in those, let us know.
Matei
On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote:
The doc is for 0.9.1. You are running a later snapshot, which added
sparse
Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad.
SPARK_YARN_APP_JAR is going away entirely -
https://issues.apache.org/jira/browse/SPARK-1053
On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud
christophe.pre...@kelkoo.com wrote:
Hi Sandy,
Thanks for your reply !
I thought
Greetings Spark users/devs! I'm interested in using Spark to process
large volumes of data with a geospatial component, and I haven't been
able to find much information on Spark's ability to handle this kind
of operation. I don't need anything too complex; just distance between
two points,
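For the simple case mentioned (distance between two points), a hedged sketch of a haversine helper that can be used inside ordinary RDD transformations:
import scala.math._
// Great-circle distance in kilometers between two (lat, lon) points.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
          cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * 6371.0 * asin(sqrt(a))
}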
There are two benefits I get as of now:
1. Most of the time a lot of customers don't want the full power; they want
something dead simple with which they can use a DSL. They end up using Hive
for a lot of ETL just because it's SQL and they understand it. Pig is close
wraps up a lot of framework level
Hello Team,
I'm new to Spark and just came across Spark SQL, which appears to be
interesting, but I'm not sure how I can get it.
I know it's an alpha version, but I'm not sure if it's available to the
community yet.
Many thanks.
Raj.
Are all the features available in Pig working in Spork? For example, UDFs?
Thanks.
On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Thr are two benefits I get as of now
1. Most of the time a lot of customers dont want the full power but they
want something
UDF
GENERATE
and many, many more are not working :)
Several of them do work: joins, filters, group by, etc.
I am translating the ones we need; I would be happy to get help on others.
I will host a JIRA to track them if you are interested.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
After doing that, I ran my code once with a smaller example, and it worked.
But ever since then, I get the "No space left on device" message for the
same sample, even if I restart the master...
ERROR TaskSetManager: Task 29.0:20 failed 4 times; aborting job
org.apache.spark.SparkException: Job
It’s currently in the master branch, on https://github.com/apache/spark. You
can check that out from git, build it with sbt/sbt assembly, and then try it
out. We’re also going to post some release candidates soon that will be
pre-built.
Matei
On Apr 23, 2014, at 1:30 PM, diplomatic Guru
We are currently in the process of converting Pig and Java MapReduce jobs
to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was
checking whether we can leverage Spork without converting to Spark jobs.
Also, is there any way I can port my existing Java MR jobs to Spark?
I know this
Right now UDFs are not working. They are at the top of the list though; you
should be able to use them soon :)
Are there any other Pig features you use often, apart from the usual
suspects?
Existing Java MR jobs would be an easier move. Are these cascading jobs or
single MapReduce jobs? If single, then you should
Thanks a lot Arpit. It's really helpful.
On Fri, Apr 18, 2014 at 4:24 AM, Arpit Tak arpit.sparku...@gmail.com wrote:
Download Cloudera VM from here.
https://drive.google.com/file/d/0B7zn-Mmft-XcdTZPLXltUjJyeUE/edit?usp=sharing
Regards,
Arpit Tak
On Fri, Apr 18, 2014 at 1:20 PM,
I am getting this cryptic error running LinearRegressionWithSGD.
Data sample:
LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0])
14/04/23 15:15:34 INFO SparkContext: Starting job: first at
GeneralizedLinearAlgorithm.scala:121
14/04/23 15:15:34 INFO DAGScheduler: Got job 2 (first at
This sounds like a configuration issue: either you have not set the MASTER
correctly, or possibly another process is using up all of the cores.
Dave
From: ge ko [mailto:koenig@gmail.com]
Sent: Sunday, April 13, 2014 12:51 PM
To: user@spark.apache.org
Subject:
Hi,
I'm still going to start
Here are some out-of-the-box ideas: If the elements lie in a fairly small
range and/or you're willing to work with limited precision, you could use
counting sort. Moreover, you could iteratively find the median using
bisection, which would be associative and commutative. It's easy to think
of
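A hedged sketch of the bisection idea for a scalar median; the range endpoints lo0/hi0 and the fixed iteration count are assumptions:
import org.apache.spark.rdd.RDD
// Binary-search a threshold until about half the elements fall below it.
// Each step needs only a distributed count, which is associative and commutative.
def medianByBisection(data: RDD[Double], lo0: Double, hi0: Double): Double = {
  val half = (data.count() + 1) / 2
  var (lo, hi) = (lo0, hi0)
  for (_ <- 1 to 50) { // ~50 halvings exhaust double precision
    val mid = (lo + hi) / 2
    if (data.filter(_ < mid).count() < half) lo = mid else hi = mid
  }
  (lo + hi) / 2
}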
Whoops, I should have mentioned that it's a multivariate median (cf.
http://www.pnas.org/content/97/4/1423.full.pdf). It's easy to compute
when all the values are accessible at once. I'm not sure it's possible
with a combiner. So, I guess the question should be: Can I use
GraphX's Pregel without a
If you need access to all message values in vprog, there's nothing wrong
with building up an array in mergeMsg (option #1). This is what
org.apache.spark.graphx.lib.TriangleCount does, though with sets instead of
arrays. There will be a performance penalty because of the communication,
but it
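A minimal sketch of option #1; the message payload type (Double) is an assumption:
// Concatenate incoming messages so vprog sees every individual value.
def mergeMsg(a: Array[Double], b: Array[Double]): Array[Double] = a ++ b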
14/04/23 17:17:40 INFO DAGScheduler: Failed to run collect at
SparkListDocByTopic.scala:407
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
@Cheng Lian-2, Sourav Chandra: thanks very much.
You are right! The situation is just like what you said. So nice!
hi
I'm testing SimpleApp.scala in standalone mode with only one PC, so I have
one master and one local worker on the same PC.
With a rather small input file size (4.5K), I got a
java.lang.OutOfMemoryError: Java heap space error.
Here are my settings:
spark-env.sh:
export
By the way, the code runs OK in the Spark shell.
When I was testing Spark, I faced this issue; it is not related to memory
shortage but to incorrect configuration. Try passing your current jar to
the SparkContext with SparkConf's setJars function and try again.
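A hedged sketch of that suggestion; the master URL and jar path are assumptions:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setMaster("spark://ubuntu:7077")
  .setAppName("SimpleApp")
  .setJars(Seq("target/scala-2.10/simpleapp_2.10-1.0.jar")) // hypothetical path
val sc = new SparkContext(conf)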
On Thu, Apr 24, 2014 at 8:38 AM, wxhsdp
Hi all, some help!
RDD.first or RDD.take(1) gives the first item; is there a straightforward
way to access the last element in a similar way?
I couldn't find a tail/last method for RDD.
You can use the following code:
RDD.take(RDD.count())
On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi all, some help!
RDD.first or RDD.take(1) gives the first item; is there a straightforward
way to access the last element in a similar way?
I couldn't find a
Oh ya, Thanks Adnan.
On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:
You can use following code:
RDD.take(RDD.count())
On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All, Some help !
RDD.first or RDD.take(1) gives the first item,
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.
I only want to access the last element.
On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Oh ya, Thanks Adnan.
On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:
You can
This function will return an array; you can use the array's last function to
get the last element.
For example:
RDD.take(RDD.count()).last
On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.
I want
What I observe is that this way of computing is very inefficient: it returns
all the elements of the RDD to the driver, which takes a considerable amount
of time.
Only then does it pick the last element.
I have a file of size 3 GB on which I ran a lot of aggregate operations,
and they didn't take the time that this
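A hedged sketch of a cheaper alternative: take the last element of each partition, then the last of those (collect preserves partition order):
val lastOpt = rdd.mapPartitions { it =>
  if (it.hasNext) Iterator(it.reduceLeft((_, b) => b)) else Iterator.empty
}.collect().lastOption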