Need help about how hadoop works.

2014-04-23 Thread Carter
Hi, I am a beginner with Hadoop and Spark and want some help in understanding how Hadoop works. Suppose we have a cluster of 5 computers and install Spark on the cluster WITHOUT Hadoop, and then run this code on one computer: val doc = sc.textFile("/home/scalatest.txt", 5); doc.count Can the count task

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Without caching, an RDD will be evaluated multiple times if referenced multiple times by other RDDs. A silly example:

  val text = sc.textFile("input.log")
  val r1 = text.filter(_ startsWith "ERROR")
  val r2 = text.map(_ split " ")
  val r3 = (r1 ++ r2).collect()

Here the input file will be scanned twice

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Mayur Rustagi
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb question :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.com wrote: Without caching,

Re: Need help about how hadoop works.

2014-04-23 Thread Mayur Rustagi
As long as the path is available on all machines, you should be able to leverage distribution. HDFS is one way to make that happen, NFS is another, and simple replication is another. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi
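
To illustrate, a minimal sketch of the example from the original question with an explicit local-file URI (assuming /home/scalatest.txt exists at the same path on every worker):

  val doc = sc.textFile("file:///home/scalatest.txt", 5) // works only if the file is present on every node
  doc.count()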

Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Mayur Rustagi
Very abstract. EC2 is an unlikely culprit. What are you trying to do? Spark is typically not inconsistent like that, but huge intermediate data or reduce-size issues could be involved; it is hard to help without some more detail of what you are trying to achieve. Mayur Rustagi Ph: +1 (760) 203 3257

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Good question :) Although the RDD DAG is lazily evaluated, it's not exactly the same as a Scala lazy val. For a Scala lazy val, the evaluated value is automatically cached, while evaluated RDD elements are not cached unless you call .cache() explicitly, because materializing an RDD can often be expensive. Take

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
To experiment, try this in the Spark shell:

  val r0 = sc.makeRDD(1 to 3, 1)
  val r1 = r0.map { x => println(x); x }
  val r2 = r1.map(_ * 2)
  val r3 = r1.map(_ * 2 + 1)
  (r2 ++ r3).collect()

You’ll see elements in r1 are printed (thus evaluated) twice. By adding .cache() to r1, you’ll see those
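
For completeness, a sketch of the cached variant; only the .cache() call is added to the snippet above:

  val r0 = sc.makeRDD(1 to 3, 1)
  val r1 = r0.map { x => println(x); x }.cache() // materialized once, then reused by r2 and r3
  val r2 = r1.map(_ * 2)
  val r3 = r1.map(_ * 2 + 1)
  (r2 ++ r3).collect() // each element of r1 is now printed only once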

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread wxhsdp
I have a similar question. I'm testing in standalone mode on only one PC. I use ./sbin/start-master.sh to start a master and ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu:7077 to connect to the master. From the web UI, I can see the local worker registered

Accessing HDFS from Spark gives TokenCache error Can't get Master Kerberos principal for use as renewer

2014-04-23 Thread Spyros Gasteratos
Hello everyone, I'm a newbie in both Hadoop and Spark, so please forgive any obvious mistakes; I'm posting because my google-fu has failed me. I'm trying to run a test Spark script in order to connect Spark to Hadoop. The script is the following: from pyspark import SparkContext sc =

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Daniel Darabos
With the right program you can always exhaust any amount of memory :). There is no silver bullet. You have to figure out what is happening in your code that causes a high memory use and address that. I spent all of last week doing this for a simple program of my own. Lessons I learned that may or

Re: GraphX: .edges.distinct().count() is 10?

2014-04-23 Thread Daniel Darabos
This is caused by https://issues.apache.org/jira/browse/SPARK-1188. I think the fix will be in the next release. But until then, do: g.edges.map(_.copy()).distinct.count On Wed, Apr 23, 2014 at 2:26 AM, Ryan Compton compton.r...@gmail.com wrote: Try this:

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread randylu
I got it, thanks very much :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/two-calls-of-saveAsTextFile-have-different-results-on-the-same-RDD-tp4578p4655.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Andras Barjak
- Spark UI shows number of succeeded tasks is more than total number of tasks, eg: 3500/3000. There are no failed tasks. At this stage the computation keeps carrying on for a long time without returning an answer. No sign of resubmitted tasks in the command line logs either? You

about rdd.filter()

2014-04-23 Thread randylu
My code is like:

  rdd2 = rdd1.filter(_._2.length > 1)
  rdd2.collect()

It works well, but if I use a variable num instead of 1:

  var num = 1
  rdd2 = rdd1.filter(_._2.length > num)
  rdd2.collect()

it fails at rdd2.collect(). So strange! -- View this message in context:

SparkException: env SPARK_YARN_APP_JAR is not set

2014-04-23 Thread ????
I have a small program, which I can launch successfully by yarn client with yarn-standalone mode. The commands look like this: (javac -classpath .:jars/spark-assembly-0.9.1-hadoop2.2.0.jar LoadTest.java) (jar cvf loadtest.jar LoadTest.class)

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread Parviz Deyhim
You need to set SPARK_MEM or SPARK_EXECUTOR_MEMORY (for Spark 1.0) to the amount of memory your application needs to consume at each node. Try setting those variables (example: export SPARK_MEM=10g) or set it via SparkConf.set as suggested by jaeholee. On Tue, Apr 22, 2014 at 4:25 PM, jaeholee
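
A sketch of the SparkConf route mentioned above (the application name and the 10g figure are just examples; adjust to your nodes):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.executor.memory", "10g") // per-executor heap, analogous to export SPARK_MEM=10g
  val sc = new SparkContext(conf)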

Comparing RDD Items

2014-04-23 Thread Jared Rodriguez
Hi there, I am new to Spark and new to Scala, although I have lots of experience on the Java side. I am experimenting with Spark for a new project where it seems like it could be a good fit. As I go through the examples, there is one scenario that I am trying to figure out: comparing the

Re: Spark runs applications in an inconsistent way

2014-04-23 Thread Aureliano Buendia
Yes, things get more unstable with larger data. But that's the whole point of my question: why should Spark get unstable when data gets larger? When data gets larger, Spark should get *slower*, not more unstable. Lack of stability makes parameter tuning very difficult, time consuming and a

Re: about rdd.filter()

2014-04-23 Thread Sourav Chandra
This could happen if the variable is defined in such a way that it pulls its own class reference into the closure. Hence serialization tries to serialize the whole outer class reference, which is not serializable, and the whole thing fails. On Wed, Apr 23, 2014 at 3:15 PM, randylu randyl...@gmail.com
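
A common workaround, sketched below with a hypothetical enclosing class: copy the field into a local val before the closure, so only the value is captured rather than the non-serializable outer object.

  class Job { // hypothetical enclosing class, not Serializable
    var num = 1

    def run(rdd: org.apache.spark.rdd.RDD[(String, String)]): Array[(String, String)] = {
      val threshold = num                           // local copy, captured by value
      rdd.filter(_._2.length > threshold).collect() // the closure no longer references `this`
    }
  }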

Re: Pig on Spark

2014-04-23 Thread lalit1303
Hi, We got Spork working on Spark 0.9.0. Repository available at: https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix Please suggest your feedback. - Lalit Yadav la...@sigmoidanalytics.com -- View this message in context:

skip lines in spark

2014-04-23 Thread Chengi Liu
Hi, What is the easiest way to skip the first n lines in an RDD? I am not able to figure this one out. Thanks

Spark hangs when I call parallelize + count on an ArrayList<byte[]> having 40k elements

2014-04-23 Thread amit karmakar
Spark hangs after I perform the following operations:

  ArrayList<byte[]> bytesList = new ArrayList<byte[]>();
  /* add 40k entries to bytesList */
  JavaRDD<byte[]> rdd = sparkContext.parallelize(bytesList);
  System.out.println("Count=" + rdd.count());

If I add just one entry it works. It works if I

Re: skip lines in spark

2014-04-23 Thread Andre Bois-Crettez
Good question, I am wondering too how it is possible to add a line number to distributed data. I thought it was a job for mapPartitionsWithIndex, but it seems difficult. Something similar here: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html#a995 Maybe at the

error in mllib lr example code

2014-04-23 Thread Mohit Jaggi
sorry...added a subject now On Wed, Apr 23, 2014 at 9:32 AM, Mohit Jaggi mohitja...@gmail.com wrote: I am trying to run the example linear regression code from http://spark.apache.org/docs/latest/mllib-guide.html But I am getting the following error...am I missing an import? code

Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
If the first partition doesn't have enough records, then it may not drop enough lines. Try rddData.zipWithIndex().filter(_._2 >= 10L).map(_._1) It might trigger a job. Best, Xiangrui On Wed, Apr 23, 2014 at 9:46 AM, DB Tsai dbt...@stanford.edu wrote: Hi Chengi, If you just want to skip the first

Re: Spark hangs when I call parallelize + count on an ArrayList<byte[]> having 40k elements

2014-04-23 Thread Xiangrui Meng
How big is each entry, and how much memory do you have on each executor? You generated all the data on the driver, and sc.parallelize(bytesList) will send the entire dataset to a single executor. You may run into I/O or memory issues. If the entries are generated, you should create a simple RDD
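
A sketch of that idea in Scala (generateEntry is a hypothetical stand-in for however the byte arrays are actually produced):

  // Parallelize only the indices and build each entry on the executors,
  // so the 40k byte arrays never have to be shipped from the driver.
  def generateEntry(i: Int): Array[Byte] = Array.fill(1024)(i.toByte) // placeholder generator
  val rdd = sc.parallelize(0 until 40000, 100).map(generateEntry)
  println("Count=" + rdd.count())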

Re: skip lines in spark

2014-04-23 Thread Xiangrui Meng
Sorry, I didn't realize that zipWithIndex() is not in v0.9.1. It is in the master branch and will be included in v1.0. It first counts the number of records per partition and then assigns indices starting from 0. -Xiangrui On Wed, Apr 23, 2014 at 9:56 AM, Chengi Liu chengi.liu...@gmail.com wrote:

Re: skip lines in spark

2014-04-23 Thread DB Tsai
What I suggested will not work if the number of records you want to drop is more than the data in the first partition. In my use case, I only drop the first couple of lines, so I don't have this issue. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com
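
A sketch of that approach (assuming rddData from the earlier message, and that all n lines to drop sit in partition 0):

  val n = 10
  val withoutHeader = rddData.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(n) else iter // drop only from the first partition
  }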

Re: Hadoop—streaming

2014-04-23 Thread Xiangrui Meng
PipedRDD is an RDD[String]. If you know how to parse each result line into (key, value) pairs, then you can call reduceByKey after: piped.map(x => (key, value)).reduceByKey((v1, v2) => v) -Xiangrui On Wed, Apr 23, 2014 at 2:09 AM, zhxfl 291221...@qq.com wrote: Hello, we know Hadoop-streaming is use
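
For example, a hedged sketch of the pipe-then-reduce pattern (the script name and the tab-separated output format are assumptions):

  val piped = sc.textFile("input.txt").pipe("./my-streaming-script.sh") // RDD[String], one line per result
  val reduced = piped.map { line =>
    val Array(k, v) = line.split("\t", 2) // parse "key<TAB>value"; assumes every line contains a tab
    (k, v.toDouble)
  }.reduceByKey(_ + _)                    // summing here, just as an example reduce function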

Re: error in mllib lr example code

2014-04-23 Thread Matei Zaharia
See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more recent build of the docs; if you spot any problems in those, let us know. Matei On Apr 23, 2014, at 9:49 AM, Xiangrui Meng men...@gmail.com wrote: The doc is for 0.9.1. You are running a later snapshot, which added sparse

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-23 Thread Sandy Ryza
Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad. SPARK_YARN_APP_JAR is going away entirely - https://issues.apache.org/jira/browse/SPARK-1053 On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préaud christophe.pre...@kelkoo.com wrote: Hi Sandy, Thanks for your reply ! I thought

Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?

2014-04-23 Thread neveroutgunned
Greetings Spark users/devs! I'm interested in using Spark to process large volumes of data with a geospatial component, and I haven't been able to find much information on Spark's ability to handle this kind of operation. I don't need anything too complex; just distance between two points,

Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
There are two benefits I get as of now. 1. Most of the time a lot of customers don't want the full power, but they want something dead simple with which they can do DSL. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close; it wraps up a lot of framework-level

How do I access the SPARK SQL

2014-04-23 Thread diplomatic Guru
Hello Team, I'm new to Spark and just came across Spark SQL, which appears to be interesting, but I'm not sure how I could get it. I know it's an Alpha version, but I'm not sure if it's available for the community yet. Many thanks. Raj.

Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
Are all the features available in Pig working in Spork? For example, UDFs? Thanks. On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: There are two benefits I get as of now. 1. Most of the time a lot of customers don't want the full power but they want something

Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
UDF, Generate, and many many more are not working :) Several of them work: joins, filters, group by, etc. I am translating the ones we need and would be happy to get help on others. Will host a JIRA to track them if you are interested. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: ERROR TaskSchedulerImpl: Lost an executor

2014-04-23 Thread jaeholee
After doing that, I ran my code once with a smaller example, and it worked. But ever since then, I get the "No space left on device" message for the same sample, even if I restart the master... ERROR TaskSetManager: Task 29.0:20 failed 4 times; aborting job org.apache.spark.SparkException: Job

Re: How do I access the SPARK SQL

2014-04-23 Thread Matei Zaharia
It’s currently in the master branch, on https://github.com/apache/spark. You can check that out from git, build it with sbt/sbt assembly, and then try it out. We’re also going to post some release candidates soon that will be pre-built. Matei On Apr 23, 2014, at 1:30 PM, diplomatic Guru
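
For reference, a minimal sketch along the lines of the alpha Spark SQL guide of that time (class, file and table names are only examples):

  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD // implicit conversion from RDD[Person] to SchemaRDD

  val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
  people.registerAsTable("people")
  sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").collect().foreach(println)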

Re: Pig on Spark

2014-04-23 Thread suman bharadwaj
We are currently in the process of converting Pig and Java MapReduce jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was checking whether we can leverage Spork without converting to Spark jobs. Also, is there any way I can port my existing Java MR jobs to Spark? I know this

Re: Pig on Spark

2014-04-23 Thread Mayur Rustagi
Right now UDFs are not working. They're at the top of the list though; you should be able to soon :) Is there any other functionality of Pig you use often apart from the usual suspects? Existing Java MR jobs would be an easier move. Are these Cascading jobs or single MapReduce jobs? If single, then you should

Re: AmpCamp exercise in a local environment

2014-04-23 Thread Nabeel Memon
Thanks a lot Arpit. It's really helpful. On Fri, Apr 18, 2014 at 4:24 AM, Arpit Tak arpit.sparku...@gmail.com wrote: Download Cloudera VM from here. https://drive.google.com/file/d/0B7zn-Mmft-XcdTZPLXltUjJyeUE/edit?usp=sharing Regards, Arpit Tak On Fri, Apr 18, 2014 at 1:20 PM,

Failed to run count?

2014-04-23 Thread Ian Ferreira
I am getting this cryptic error running LinearRegressionWithSGD. Data sample: LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0]) 14/04/23 15:15:34 INFO SparkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121 14/04/23 15:15:34 INFO DAGScheduler: Got job 2 (first at

RE:

2014-04-23 Thread Buttler, David
This sounds like a configuration issue. Either you have not set the MASTER correctly, or possibly another process is using up all of the cores. Dave From: ge ko [mailto:koenig@gmail.com] Sent: Sunday, April 13, 2014 12:51 PM To: user@spark.apache.org Subject: Hi, I'm still going to start

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Tom Vacek
Here are some out-of-the-box ideas: If the elements lie in a fairly small range and/or you're willing to work with limited precision, you could use counting sort. Moreover, you could iteratively find the median using bisection, which would be associative and commutative. It's easy to think of
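
A sketch of the bisection idea for a scalar median, assuming Double values in a known range; each step needs only a count, which is associative and commutative (the helper name and iteration count are illustrative):

  // Hypothetical helper: approximate the median of rdd by bisecting on the value range.
  def medianByBisection(rdd: org.apache.spark.rdd.RDD[Double],
                        lo0: Double, hi0: Double, iters: Int = 40): Double = {
    val data = rdd.cache() // each bisection step rescans the data, so cache it
    val n = data.count()
    var (lo, hi) = (lo0, hi0)
    for (_ <- 1 to iters) {
      val mid = (lo + hi) / 2
      val below = data.filter(_ <= mid).count() // a plain distributed count per step
      if (below < (n + 1) / 2) lo = mid else hi = mid
    }
    (lo + hi) / 2
  }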

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
Whoops, I should have mentioned that it's a multivariate median (cf http://www.pnas.org/content/97/4/1423.full.pdf ). It's easy to compute when all the values are accessible at once. I'm not sure it's possible with a combiner. So, I guess the question should be: Can I use GraphX's Pregel without a

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ankur Dave
If you need access to all message values in vprog, there's nothing wrong with building up an array in mergeMsg (option #1). This is what org.apache.spark.graphx.lib.TriangleCount does, though with sets instead of arrays. There will be a performance penalty because of the communication, but it
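
A hedged sketch of option #1 with the GraphX Pregel API (hypothetical function name; vertex values are Doubles here):

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  // One Pregel round in which each vertex collects an Array of its in-neighbours' values,
  // built by concatenating message arrays in mergeMsg (TriangleCount does the same with sets).
  def gatherNeighbourValues(graph: Graph[Double, Int]): RDD[(VertexId, Array[Double])] = {
    val init = graph.mapVertices((id, v) => (v, Array.empty[Double]))
    val result = Pregel(init, Array.empty[Double] /* initial message */, 1 /* max iterations */)(
      (id, attr, msg) => (attr._1, msg),             // vprog: keep own value, store received values
      t => Iterator((t.dstId, Array(t.srcAttr._1))), // sendMsg: ship own value along each edge
      (a, b) => a ++ b)                              // mergeMsg: build up an array of all values
    result.vertices.map { case (id, (_, msgs)) => (id, msgs) }
  }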

Re: about rdd.filter()

2014-04-23 Thread randylu
14/04/23 17:17:40 INFO DAGScheduler: Failed to run collect at SparkListDocByTopic.scala:407
Exception in thread "main" java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at

Re: about rdd.filter()

2014-04-23 Thread randylu
@Cheng Lian-2, Sourav Chandra, thanks very much. You are right! The situation is just like what you said. So nice! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-rdd-filter-tp4657p4718.html Sent from the Apache Spark User List mailing list archive

how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
Hi, I'm testing SimpleApp.scala in standalone mode with only one PC, so I have one master and one local worker on the same PC, with a rather small input file size (4.5K). I got the java.lang.OutOfMemoryError: Java heap space error. Here are my settings: spark-env.sh: export

Re: how to set spark.executor.memory and heap size

2014-04-23 Thread wxhsdp
By the way, the code runs OK in the Spark shell. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-heap-size-tp4719p4720.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: how to set spark.executor.memory and heap size

2014-04-23 Thread Adnan Yaqoob
When I was testing Spark, I faced this issue. It is not related to memory shortage; it is because your configurations are not correct. Try to pass your current jar to the SparkContext with SparkConf's setJars function and try again. On Thu, Apr 24, 2014 at 8:38 AM, wxhsdp
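
A sketch of that suggestion (the master URL and jar path are examples):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://ubuntu:7077")
    .setAppName("SimpleApp")
    .setJars(Seq("target/scala-2.10/simple-app_2.10-1.0.jar")) // the application jar, shipped to executors
  val sc = new SparkContext(conf)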

Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Hi All, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.

Re: Access Last Element of RDD

2014-04-23 Thread Adnan Yaqoob
You can use the following code: RDD.take(RDD.count()) On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Oh ya, thanks Adnan. On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can use the following code: RDD.take(RDD.count()) On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All, some help! RDD.first or RDD.take(1) gives the first item,

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I only want to access the last element. On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Oh ya, thanks Adnan. On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can

Re: Access Last Element of RDD

2014-04-23 Thread Adnan Yaqoob
This function will return a Scala collection; you can use its last function to get the last element. For example: RDD.take(RDD.count()).last On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
What I observe is that this way of computing is very inefficient. It returns all the elements of the RDD to a local collection, which takes a considerable amount of time, and then it picks the last element. I have a file of size 3 GB on which I ran a lot of aggregate operations that didn't take the time that this
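
A cheaper alternative, sketched below under the assumption that partition order follows input order (as it does for text files): keep only the last element of each partition, then take the last of those small results on the driver.

  def lastElement[T: scala.reflect.ClassTag](rdd: org.apache.spark.rdd.RDD[T]): T =
    rdd.mapPartitions { iter =>
      if (iter.hasNext) Iterator(iter.reduceLeft((_, b) => b)) // last element of this partition
      else Iterator.empty
    }.collect().last                                           // at most one element per partition comes back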