Re: Access Last Element of RDD

2014-04-24 Thread Sourav Chandra
You can use rdd.takeOrdered(1)(reverseOrdering). reverseOrdering is your Ordering[T] instance where you define the ordering logic; this you have to pass into the method. On Thu, Apr 24, 2014 at 11:21 AM, Frank Austin Nothaft fnoth...@berkeley.edu wrote: If you do this, you could simplify to:

Re: Access Last Element of RDD

2014-04-24 Thread Sourav Chandra
Also, the same thing can be done using rdd.top(1)(reverseOrdering). On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra sourav.chan...@livestream.com wrote: You can use rdd.takeOrdered(1)(reverseOrdering). reverseOrdering is your Ordering[T] instance where you define the ordering logic. This you
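
A minimal sketch of both suggestions, run from a spark-shell where sc is already defined; the RDD and its values are made up for illustration, and the Ordering is passed explicitly as described above:

    val rdd = sc.parallelize(Seq(3, 7, 1, 9, 4))

    // takeOrdered(1) returns the single smallest element under the given Ordering,
    // so a reversed Ordering yields the largest (i.e. "last") element.
    val lastViaTakeOrdered = rdd.takeOrdered(1)(Ordering[Int].reverse).head  // 9

    // top(1) returns the single largest element under the implicit Ordering.
    val lastViaTop = rdd.top(1).head  // 9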

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
Thanks Guys ! On Thu, Apr 24, 2014 at 11:29 AM, Sourav Chandra sourav.chan...@livestream.com wrote: Also same thing can be done using rdd.top(1)(reverseOrdering) On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra sourav.chan...@livestream.com wrote: You can use

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thank you, i added setJars, but nothing changes

    val conf = new SparkConf()
      .setMaster("spark://127.0.0.1:7077")
      .setAppName("Simple App")
      .set("spark.executor.memory", "1g")
      .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val sc = new SparkContext(conf)

Re: Re: how to set spark.executor.memory and heap size

2014-04-24 Thread qinwei
try the complete path. qinwei  From: wxhsdp  Date: 2014-04-24 14:21  To: user  Subject: Re: how to set spark.executor.memory and heap size  thank you, i add setJars, but nothing changes  val conf = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Simple App")
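
A minimal sketch of what "the complete path" means here: an absolute path to the assembled jar rather than one relative to the working directory (the path below is an assumption, not taken from the thread):

    val conf = new SparkConf()
      .setMaster("spark://127.0.0.1:7077")
      .setAppName("Simple App")
      .set("spark.executor.memory", "1g")
      .setJars(Seq("/home/wxhsdp/simple-project/target/scala-2.10/simple-project_2.10-1.0.jar"))
    val sc = new SparkContext(conf)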

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thanks Mayur. So without Hadoop and any other distributed file systems, by running: val doc = sc.textFile("/home/scalatest.txt", 5) doc.count we can only get parallelization within the computer where the file is loaded, but not the parallelization within the computers in the cluster (Spark

Re: Need help about how hadoop works.

2014-04-24 Thread Prashant Sharma
Prashant Sharma On Thu, Apr 24, 2014 at 12:15 PM, Carter gyz...@hotmail.com wrote: Thanks Mayur. So without Hadoop and any other distributed file systems, by running: val doc = sc.textFile("/home/scalatest.txt", 5) doc.count we can only get parallelization within the computer where

Re: Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i tried, but no effect Qin Wei wrote: try the complete path. qinwei  From: wxhsdp  Date: 2014-04-24 14:21  To: user  Subject: Re: how to set spark.executor.memory and heap size  thank you, i add setJars, but nothing changes  val conf = new SparkConf()

Re: SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-24 Thread Christophe Préaud
Good to know, thanks for pointing this out to me! On 23/04/2014 19:55, Sandy Ryza wrote: Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad. SPARK_YARN_APP_JAR is going away entirely - https://issues.apache.org/jira/browse/SPARK-1053 On Wed, Apr 23, 2014 at 8:07 AM, Christophe

Re: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much for your help Prashant. Sorry, I still have another question about your answer: "however if the file (/home/scalatest.txt) is present on the same path on all systems it will be processed on all nodes." When presenting the file at the same path on all nodes, do we just simply copy

Re: Need help about how hadoop works.

2014-04-24 Thread Prashant Sharma
It is the same file, and the hadoop library that we use for splitting takes care of assigning the right split to each node. Prashant Sharma On Thu, Apr 24, 2014 at 1:36 PM, Carter gyz...@hotmail.com wrote: Thank you very much for your help Prashant. Sorry I still have another question about your

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i think maybe it's a problem with reading a local file: val logFile = "/home/wxhsdp/spark/example/standalone/README.md" val logData = sc.textFile(logFile).cache() if i replace the above code with val logData = sc.parallelize(Array(1,2,3,4)).cache() the job can complete successfully. can't i read a

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Adnan Yaqoob
You need to use the proper URL format: file://home/wxhsdp/spark/example/standalone/README.md On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp wxh...@gmail.com wrote: i think maybe it's a problem with reading a local file: val logFile = "/home/wxhsdp/spark/example/standalone/README.md" val logData =

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Adnan Yaqoob
Sorry, wrong format: file:///home/wxhsdp/spark/example/standalone/README.md. An extra / is needed before the path. On Thu, Apr 24, 2014 at 1:46 PM, Adnan Yaqoob nsyaq...@gmail.com wrote: You need to use the proper URL format: file://home/wxhsdp/spark/example/standalone/README.md On Thu, Apr 24,
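
A short sketch of the corrected form: file:// is the scheme, and the absolute path that follows starts with its own /, hence three slashes in total (the path is the one from the thread):

    val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
    val logData = sc.textFile(logFile).cache()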

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
thanks for your reply, adnan, i tried val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md" i think there need to be three slashes after "file:". it's just the same as val logFile = "/home/wxhsdp/spark/example/standalone/README.md" the error remains :(

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
Hi, You should be able to read it; file:// or file:/// is not even required for reading locally, just the path is enough. What error message are you getting on spark-shell while reading locally? Also read the same file from hdfs: put your README file there and read it, it works both ways..

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
hi arpit, on spark shell i can read the local file properly, but when i use sbt run, an error occurs. the sbt error message is at the beginning of the thread. Arpit Tak-2 wrote: Hi, You should be able to read it; file:// or file:/// is not even required for reading locally, just the path is enough..

RE: Need help about how hadoop works.

2014-04-24 Thread Carter
Thank you very much Prashant. Date: Thu, 24 Apr 2014 01:24:39 -0700 From: ml-node+s1001560n4739...@n3.nabble.com To: gyz...@hotmail.com Subject: Re: Need help about how hadoop works. It is the same file and hadoop library that we use for splitting takes care of assigning the right

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread Arpit Tak
Okk fine, try like this, i tried and it works.. specify the spark path also in the constructor... and also export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g" import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object SimpleApp { def main(args:
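
The example is cut off above; a hedged reconstruction of the kind of standalone app being described might look like the sketch below. The spark home and jar paths are assumptions, not values from the truncated message:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    object SimpleApp {
      def main(args: Array[String]) {
        // master, app name, spark home (assumed), jars to ship to the workers (assumed)
        val sc = new SparkContext(
          "spark://127.0.0.1:7077",
          "Simple App",
          "/usr/local/spark",
          Seq("target/scala-2.10/simple-project_2.10-1.0.jar"))
        val logData = sc.textFile("/home/wxhsdp/spark/example/standalone/README.md").cache()
        println("line count: " + logData.count())
        sc.stop()
      }
    }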

Re: error in mllib lr example code

2014-04-24 Thread Arpit Tak
Also try out these examples, all of them work: http://docs.sigmoidanalytics.com/index.php/MLlib if you spot any problems in those, let us know. Regards, arpit On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: See

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
Hi All, Finally i wrote the following code, which i felt does this optimally, if not in the most optimal way: using file pointers, seeking backwards from the end of the file to the byte after the last \n !! This is memory efficient, and i expect even the unix tail implementation to do something similar !! import
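
The code itself is truncated above; the following is a hedged sketch of the same idea in plain JVM I/O (not the poster's actual code): seek backwards from the end of the file until the previous \n and read whatever follows it as the last line.

    import java.io.RandomAccessFile

    def lastLine(path: String): String = {
      val raf = new RandomAccessFile(path, "r")
      try {
        var pos = raf.length - 1
        // step over a trailing newline, if the file ends with one
        if (pos >= 0) {
          raf.seek(pos)
          if (raf.read() == '\n') pos -= 1
        }
        // walk backwards until the previous newline (or the start of the file)
        while (pos >= 0) {
          raf.seek(pos)
          if (raf.read() == '\n') return raf.readLine()
          pos -= 1
        }
        raf.seek(0)
        raf.readLine()
      } finally {
        raf.close()
      }
    }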

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Adnan
Hi, Relatively new to spark and have tried running the SparkPi example on a standalone 12-core, three-machine cluster. What I'm failing to understand is that running this example with a single slice gives better performance as compared to using 12 slices. Same was the case when I was using

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
it seems that it's nothing about settings. i tried the take action and found it's ok, but an error occurs when i try count and collect: val a = sc.textFile("any file") a.take(n).foreach(println) // ok a.count() // failed a.collect() // failed val b = sc.parallelize(Array(1,2,3,4))

Re: Access Last Element of RDD

2014-04-24 Thread Cheng Lian
You may try this:

    val lastOption = sc.textFile(input).mapPartitions { iterator =>
      if (iterator.isEmpty) {
        iterator
      } else {
        Iterator
          .continually((iterator.next(), iterator.hasNext))
          .collect { case (value, false) => value }
          .take(1)
      }
    }.collect().lastOption

Re: Is Spark a good choice for geospatial/GIS applications? Is a community volunteer needed in this area?

2014-04-24 Thread neveroutgunned
Thanks for the info. It seems like the JTS library is exactly what I need (I'm not doing any raster processing at this point). So, once they successfully finish the Scala wrappers for JTS, I would theoretically be able to use Scala to write a Spark job that includes the JTS library, and then run

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Shubhabrata
Moreover, it seems all the workers are registered and have sufficient memory (2.7GB, whereas I have asked for 512 MB). The UI also shows the jobs are running on the slaves. But on the terminal it is still the same error: "Initial job has not accepted any resources; check your cluster UI to ensure that

reduceByKeyAndWindow - spark internals

2014-04-24 Thread Adrian Mocanu
If I have this code: val stream1 = doublesInputStream.window(Seconds(10), Seconds(2)) val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10)) Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10 second window? For example, in the first 10 secs stream1 will

Re: How do I access the SPARK SQL

2014-04-24 Thread Andrew Or
Did you build it with SPARK_HIVE=true? On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru diplomaticg...@gmail.com wrote: Hi Matei, I checked out the git repository and built it. However, I'm still getting the error below. It couldn't find those SQL packages. Please advise. package

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
You shouldn't need to set SPARK_HIVE=true unless you want to use the JavaHiveContext. You should be able to access org.apache.spark.sql.api.java.JavaSQLContext with the default build. How are you building your application? Michael On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or

Re: How do I access the SPARK SQL

2014-04-24 Thread Aaron Davidson
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL. Assuming you've downloaded Spark, just run 'mvn install' to publish Spark locally, and depend on Spark version 1.0.0-SNAPSHOT. On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru diplomaticg...@gmail.com wrote: It's a simple

Re: How do I access the SPARK SQL

2014-04-24 Thread Michael Armbrust
Oh, and you'll also need to add a dependency on spark-sql_2.10. On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust mich...@databricks.com wrote: Yeah, you'll need to run `sbt publish-local` to push the jars to your local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT. On
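
A minimal sketch of the resulting sbt dependencies, assuming the locally published snapshot mentioned above (the artifact names follow the spark-sql_2.10 naming in the reply; the version is whatever `sbt publish-local` produced):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT",
      "org.apache.spark" %% "spark-sql"  % "1.0.0-SNAPSHOT"
    )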

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote: Oh, and you'll also need to add a dependency on spark-sql_2.10. On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust

Re: IDE for sparkR

2014-04-24 Thread maxpar
Rstudio should be fine.

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
Thanks Cheng !! On Thu, Apr 24, 2014 at 5:43 PM, Cheng Lian lian.cs@gmail.com wrote: You may try this: val lastOption = sc.textFile(input).mapPartitions { iterator => if (iterator.isEmpty) { iterator } else { Iterator .continually((iterator.next(),

Spark mllib throwing error

2014-04-24 Thread John King
./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any ideas?

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread Matei Zaharia
Did you launch this using our EC2 scripts (http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set up the daemons? My guess is that their hostnames are not being resolved properly on all nodes, so executor processes can’t connect back to your driver app. This error

Re: SparkPi performance-3 cluster standalone mode

2014-04-24 Thread Matei Zaharia
The problem is that SparkPi uses Math.random(), which is a synchronized method, so it can’t scale to multiple cores. In fact it will be slower on multiple cores due to lock contention. Try another example and you’ll see better scaling. I think we’ll have to update SparkPi to create a new Random
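
A hedged sketch of the fix being hinted at: give each task its own java.util.Random instead of calling the globally synchronized Math.random() (the slice count and sample size below are arbitrary):

    import scala.util.Random

    val slices = 12
    val n = 100000 * slices
    val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
      val rand = new Random()  // one generator per partition, so no shared lock
      iter.map { _ =>
        val x = rand.nextDouble() * 2 - 1
        val y = rand.nextDouble() * 2 - 1
        if (x * x + y * y < 1) 1 else 0
      }
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)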

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King usedforprinting...@gmail.com wrote: ./spark-shell: line 153: 17654 Killed $FWDIR/bin/spark-class org.apache.spark.repl.Main $@ Any

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On Thu, Apr 24, 2014 at 11:38 AM, John King usedforprinting...@gmail.com wrote: I receive this error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File

Re: Spark mllib throwing error

2014-04-24 Thread John King
Last command was: val model = new NaiveBayes().run(points) On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng men...@gmail.com wrote: Could you share the command you used and more of the error message? Also, is it an MLlib specific problem? -Xiangrui On Thu, Apr 24, 2014 at 11:49 AM, John King

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Yes, I got it running for large RDD (~7 million lines) and mapping. Just received this error when trying to classify. On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng men...@gmail.com wrote: Is your Spark cluster running? Try to start with generating simple RDDs and counting. -Xiangrui On

Re: Deploying a python code on a spark EC2 cluster

2014-04-24 Thread John King
This happens to me when using the EC2 scripts for the recent v1.0.0rc2 release. The Master connects and then disconnects immediately, eventually saying "Master disconnected from cluster". On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Did you launch this using our EC2

Re: How do I access the SPARK SQL

2014-04-24 Thread diplomatic Guru
It worked!! Many thanks for your brilliant support. On 24 April 2014 18:20, diplomatic Guru diplomaticg...@gmail.com wrote: Many thanks for your prompt reply. I'll try your suggestions and will get back to you. On 24 April 2014 18:17, Michael Armbrust mich...@databricks.com wrote: Oh,

Re: error in mllib lr example code

2014-04-24 Thread Mohit Jaggi
Thanks Xiangrui, Matei and Arpit. It does work fine after adding Vector.dense. I have a follow-up question, which I will post in a new thread. On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak arpi...@sigmoidanalytics.com wrote: Also try out these examples, all of them work

spark mllib to jblas calls..and comparison with VW

2014-04-24 Thread Mohit Jaggi
Folks, I am wondering how mllib interacts with jblas and lapack. Does it make copies of data from my RDD format to jblas's format? Does jblas copy it again before passing to lapack native code? I also saw some comparisons with VW and it seems mllib is slower on a single node but scales better and

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread Xiangrui Meng
I tried locally with the example described in the latest guide: http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine. Do you mind sharing the code you used? -Xiangrui On Thu, Apr 24, 2014 at 1:57 PM, John King usedforprinting...@gmail.com wrote: Yes, I got it running for large

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
Do you mind sharing more code and error messages? The information you provided is too little to identify the problem. -Xiangrui On Thu, Apr 24, 2014 at 1:55 PM, John King usedforprinting...@gmail.com wrote: Last command was: val model = new NaiveBayes().run(points) On Thu, Apr 24, 2014 at

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or build from branch-1.0? Some background: I tried to build both on Amazon EC2, but the master kept disconnecting from the client and executors failed after connecting. So I tried to just use one

Re: Spark mllib throwing error

2014-04-24 Thread John King
In the other thread I had an issue with Python. In this issue, I tried switching to Scala. The code is: import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.linalg.SparseVector; import org.apache.spark.mllib.classification.NaiveBayes; import
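
The snippet above is truncated; a minimal sketch of 1.0-era MLlib NaiveBayes training in Scala, with an assumed input path and an assumed space-separated label/feature layout, might look like:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val points = sc.textFile("/path/to/train.txt").map { line =>
      val parts = line.split(' ').map(_.toDouble)
      // label first, then feature values (NaiveBayes requires non-negative features)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    println("training examples: " + points.count())  // the sanity check suggested later in the thread
    val model = NaiveBayes.train(points)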

Re: Trying to use pyspark mllib NaiveBayes

2014-04-24 Thread John King
Also, when will the official 1.0 be released? On Thu, Apr 24, 2014 at 7:04 PM, John King usedforprinting...@gmail.com wrote: I was able to run simple examples as well. Which version of Spark? Did you use the most recent commit or from branch-1.0? Some background: I tried to build both on

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
I don't see anything wrong with your code. Could you do points.count() to see how many training examples you have? Also, make sure you don't have negative feature values. The error message you sent did not say NaiveBayes went wrong, only that the Spark shell was killed. -Xiangrui On Thu, Apr 24, 2014

Re: Spark mllib throwing error

2014-04-24 Thread John King
It just displayed this error and stopped on its own. Do the lines of code mentioned in the error have anything to do with it? On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng men...@gmail.com wrote: I don't see anything wrong with your code. Could you do points.count() to see how many training

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
does anyone know the reason? i've googled a bit, and found some guys had the same problem, but with no replies...

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread wxhsdp
i noticed that the error occurs at org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183) at org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378) at

Re: compile spark 0.9.1 in hadoop 2.2 above exception

2014-04-24 Thread Patrick Wendell
Try running sbt/sbt clean and re-compiling. Any luck? On Thu, Apr 24, 2014 at 5:33 PM, martin.ou martin...@orchestrallinc.cn wrote: an exception occurred when compiling spark 0.9.1 using sbt; env: hadoop 2.3. 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly 2. found Exception:

Re: Finding bad data

2014-04-24 Thread Matei Zaharia
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how to do it. Look at the stderr file of the executor on that machine, and you’ll see lines like this: 14/04/24 19:17:24 INFO HadoopRDD: Input split: file:/Users/matei/workspace/apache-spark/README.md:0+2000 This says

parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson Lu
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping") this line is too slow. There are about 2 million elements in word_mapping. Is there a good style for writing a large collection to hdfs? import org.apache.spark._ import SparkContext._ import

Re: parallelize for a large Seq is extremely slow.

2014-04-24 Thread Matei Zaharia
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably faster. Matei On Apr 24, 2014, at 8:01 PM, Earthson Lu earthson...@gmail.com wrote:
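
A minimal sketch of the suggested setting in SparkConf form (the app name is arbitrary); this switches the serializer used for shipping the parallelized data:

    val conf = new SparkConf()
      .setAppName("word-mapping")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)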

Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-24 Thread Qin Wei
Hi All, I have a problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark. The basic flow is as below:

    RDD1 ==> (Item1, (User1, Score1))
             (Item2, (User2, Score2))

Re: parallelize for a large Seq is extremely slow.

2014-04-24 Thread Earthson
Kryo fails with the exception below: com.esotericsoftware.kryo.KryoException (com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1) com.esotericsoftware.kryo.io.Output.require(Output.java:138) com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
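
A hedged guess at a workaround for the buffer overflow above: the Kryo output buffer defaults to a small size in this era of Spark, so enlarging it often avoids this error (the 64 MB value below is an assumption, not taken from the thread):

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")  // default is only a few MB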

Re: Spark running slow for small hadoop files of 10 mb size

2014-04-24 Thread neeravsalaria
Thanks for the reply. It indeed increased the usage. There was another issue we found: we were broadcasting the hadoop configuration by writing a wrapper class over it, but found the proper way in the Spark code: sc.broadcast(new SerializableWritable(conf))
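
A minimal sketch of the pattern described above, assuming the 0.9/1.0-era org.apache.spark.SerializableWritable wrapper is accessible from application code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SerializableWritable

    val hadoopConf = new Configuration()
    val confBroadcast = sc.broadcast(new SerializableWritable(hadoopConf))
    // inside a task, confBroadcast.value.value yields the Configuration again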

Re: Spark mllib throwing error

2014-04-24 Thread Xiangrui Meng
I only see one risk: if your feature indices are not sorted, it might have undefined behavior. Other than that, I don't see anything suspicious. -Xiangrui On Thu, Apr 24, 2014 at 4:56 PM, John King usedforprinting...@gmail.com wrote: It just displayed this error and stopped on its own. Do the

Re: how to set spark.executor.memory and heap size

2014-04-24 Thread YouPeng Yang
Hi, I am also curious about this question. Is the textFile function supposed to read an hdfs file? In this case the file was taken from the local filesystem. Is there any way for the textFile function to recognize whether a path is on the local filesystem or on hdfs? Besides, the OOM
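
A short sketch of how the scheme in the path decides which filesystem textFile reads from (the paths and namenode address are made up):

    val localRdd   = sc.textFile("file:///home/user/data.txt")          // local filesystem
    val hdfsRdd    = sc.textFile("hdfs://namenode:9000/user/data.txt")  // HDFS
    // with no scheme, the path is resolved against the configured default filesystem
    val defaultRdd = sc.textFile("/user/data.txt")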