persist @ disk-only failing

2014-05-19 Thread Sai Prasanna
Hi all, when I set the persist level to DISK_ONLY, Spark still tries to use memory and caches there. Any reason? Do I need to override some parameter elsewhere? Thanks!
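A disk-only persist is requested like this in Scala (a minimal spark-shell sketch; `sc` and the input path are assumptions). Note that even with DISK_ONLY, Spark still needs some heap while computing and deserializing partitions, so memory use never drops to zero:

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc` (e.g. inside spark-shell)
// and a hypothetical input path.
val lines = sc.textFile("hdfs:///data/input.txt")

// Ask Spark to keep cached partitions on disk only.
// Persistence is lazy: it takes effect the first time an action runs.
lines.persist(StorageLevel.DISK_ONLY)

println(lines.count())  // first action materialises the on-disk cache
```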

Preferred RDD Size

2014-05-15 Thread Sai Prasanna
Hi, is there any lower bound on the size of an RDD below which Spark's in-memory framework is not optimally utilized? Say creating an RDD for a very small data set of some 64 MB is not as efficient as one of some 256 MB; the application could then be tuned accordingly. So is there a soft lower bound related to

File present but file not found exception

2014-05-15 Thread Sai Prasanna
Hi everyone, I think all are pretty busy; the response time in this group has increased slightly. Anyway, this is a pretty silly problem, but I could not get past it. I have a file in my local FS, but when I try to create an RDD out of it, the task fails and a file-not-found exception is thrown at

saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi, can we override the default file-replication factor when using saveAsTextFile() to HDFS? My default replication factor is 1, but the intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks!
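One way to do this (an untested sketch, assuming the standard HDFS client honors a per-job `dfs.replication` setting) is to set the property on the Hadoop configuration Spark uses for writes before saving; `sc`, `rdd`, and the output path are assumptions:

```scala
// Sketch: lower the replication factor for this context's HDFS writes.
// `sc` is an existing SparkContext and `rdd` an existing RDD[String]
// (e.g. inside spark-shell); the path is a placeholder.
sc.hadoopConfiguration.set("dfs.replication", "1")
rdd.saveAsTextFile("hdfs:///tmp/intermediate-output")
```

Since this mutates the context's shared Hadoop configuration, it affects all subsequent writes from the same SparkContext, not just one file.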

Re: File present but file not found exception

2014-05-12 Thread Sai Prasanna
, create an RDD out of it, and operate. * Is there any way out? Thanks in advance! On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi everyone, I think all are pretty busy; the response time in this group has increased slightly. Anyway, this is a pretty silly

Spark on Yarn - A small issue !

2014-05-12 Thread Sai Prasanna
Hi all, I wanted to launch Spark on YARN interactively, in yarn-client mode. With default settings in yarn-site.xml and spark-env.sh, I followed the given link http://spark.apache.org/docs/0.8.1/running-on-yarn.html. I get the pi value correct when I run without launching the shell. When I launch

Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-05-05 Thread Sai Prasanna
I executed the following commands to launch a Spark app in yarn-client mode. I have Hadoop 2.3.0, Spark 0.8.1, and Scala 2.9.3: SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly SPARK_YARN_MODE=true \ SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
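Laid out as a script, the build-and-launch sequence reads roughly like this (a sketch reconstructed from the poster's commands; the SPARK_YARN_MODE/MASTER variables follow the 0.8.x-era docs and may differ in other versions):

```shell
# Build the Spark assembly against the poster's Hadoop version.
SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly

# Launch the interactive shell in yarn-client mode (0.8.x-era variables;
# the assembly-jar path is the one from the poster's setup).
SPARK_YARN_MODE=true \
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar \
MASTER=yarn-client \
./spark-shell
```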

Setting spark.locality.wait.node parameter in interactive shell

2014-04-30 Thread Sai Prasanna
Hi, any suggestions on the following issue? I have replication factor 3 in my HDFS. I ran my experiments with 3 datanodes. Now I have just added another node with no data on it. When I ran again, Spark launched non-local tasks on it, and the time taken was more than what it took on the 3-node cluster.

Delayed Scheduling - Setting spark.locality.wait.node parameter in interactive shell

2014-04-29 Thread Sai Prasanna
Hi all, I have replication factor 3 in my HDFS. I ran my experiments with 3 datanodes. Now I have just added another node with no data on it. When I ran again, Spark launched non-local tasks on it, and the time taken was more than what it took on the 3-node cluster. Here delay scheduling fails, I think
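For the 0.8.x interactive shell, properties like spark.locality.wait were typically passed as JVM system properties through SPARK_JAVA_OPTS before launching (a sketch; the 10-second values are illustrative tunables, not recommendations). Raising them makes the scheduler wait longer for a node-local slot before falling back to a non-local one:

```shell
# Sketch for a 0.8.x-era spark-shell: raise the delay-scheduling waits
# (milliseconds) so tasks prefer waiting for node-local slots on the
# datanodes over running non-locally on the empty new node.
export SPARK_JAVA_OPTS="-Dspark.locality.wait=10000 -Dspark.locality.wait.node=10000"
./spark-shell
```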

Spark with Parquet

2014-04-28 Thread Sai Prasanna
Hi all, I want to store a CSV text file in Parquet format in HDFS and then do some processing on it in Spark. Somehow my search for a way to do this was futile. More help was available for Parquet with Impala. Any guidance here? Thanks!
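In later Spark versions (with Spark SQL, well after the 0.8/0.9 releases discussed in these threads) this became straightforward; a hedged sketch assuming such a version, with placeholder paths:

```scala
// Sketch assuming Spark 2.x+ with Spark SQL on the classpath;
// both HDFS paths are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Read the CSV, write it back out as Parquet, then process the Parquet copy.
val df = spark.read.option("header", "true").csv("hdfs:///data/table.csv")
df.write.parquet("hdfs:///data/table.parquet")

val parquetDf = spark.read.parquet("hdfs:///data/table.parquet")
println(parquetDf.count())
```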

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
wrote: This function will return a Scala List; you can use the List's last function to get the last element. For example: RDD.take(RDD.count()).last On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
On Thu, Apr 24, 2014 at 11:42 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Thanks guys! On Thu, Apr 24, 2014 at 11:29 AM, Sourav Chandra sourav.chan...@livestream.com wrote: Also, the same thing can be done using rdd.top(1)(reverseOrdering) On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra

Re: Access Last Element of RDD

2014-04-24 Thread Sai Prasanna
On Thu, Apr 24, 2014 at 7:38 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi all, finally I wrote the following code, which I feel does it optimally, if not in the most optimal way: using file pointers, seeking backwards to the byte after the last \n! This is memory efficient, and I hope even Unix

Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Hi all, some help! RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way? I couldn't find a tail/last method for RDD.
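One common trick in the replies below is to take the last element of each partition in parallel and then the last of those on the driver. The idea can be sketched with plain Scala collections standing in for partitions (hypothetical data, no Spark required; in Spark the first step would be a mapPartitions over the RDD):

```scala
// Simulate an RDD's partitions as local collections (made-up data).
val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6, 7, 8))

// Step 1 (the parallel step in Spark): last element of each non-empty
// partition -- each worker touches only its own partition.
val perPartitionLast: Seq[Int] = partitions.flatMap(_.lastOption)

// Step 2 (driver side): the overall last element, since partition order
// matches element order.
val lastElement: Option[Int] = perPartitionLast.lastOption

println(lastElement)  // Some(8)
```

This avoids pulling the whole data set to the driver the way take(count()) does.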

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Oh ya, thanks Adnan. On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can use the following code: RDD.take(RDD.count()) On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi all, some help! RDD.first or RDD.take(1) gives the first item

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD; I want to access only the last element. On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Oh ya, thanks Adnan. On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote: You can

Re: Access Last Element of RDD

2014-04-23 Thread Sai Prasanna
: RDD.take(RDD.count()).last On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD. I want to access only the last element. On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa

Efficient Aggregation over DB data

2014-04-22 Thread Sai Prasanna
Hi all, I want to access a particular column of a DB table stored in CSV format and perform some aggregate queries over it. I wrote the following query in Scala as a first step: var add = (x: String) => x.split("\\s+")(2).toInt; var result = List[Int](); input.split("\n").foreach(x => result ::= add(x))
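Cleaned up, the same column extraction and aggregation can be done without a mutable list; a self-contained sketch with made-up rows (in Spark, the same map-and-sum would run over the lines of the RDD instead of a local string):

```scala
// Hypothetical whitespace-separated rows; column index 2 holds the value
// we want to aggregate.
val input = "a b 10\nc d 20\ne f 30"

// Extract the third field of each row and sum -- no mutable List needed.
val columnSum = input.split("\n").map(_.split("\\s+")(2).toInt).sum

println(columnSum)  // 60
```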

Null Pointer Exception in Spark Application with Yarn Client Mode

2014-04-07 Thread Sai Prasanna
Hi all, I wanted to get Spark on YARN up and running. I did SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly. Then I ran SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar

SSH problem

2014-04-01 Thread Sai Prasanna
Hi all, I have a five-node Spark cluster: master, s1, s2, s3, s4. I have passwordless SSH to all slaves from the master and vice versa. But on one machine, s2, what happens is that after 2-3 minutes of my connection from master to slave, the write pipe is broken. So if I try to connect again from the master, I

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
and employing things like Kryo; some are solved by specifying the number of tasks to break a given operation into, etc. Ognen On 3/27/14, 10:21 AM, Sai Prasanna wrote: java.lang.OutOfMemoryError: GC overhead limit exceeded What is the problem? The same code I run, one instance

Re: GC overhead limit exceeded

2014-03-27 Thread Sai Prasanna
scheme are you using? I am guessing it is MEMORY_ONLY. With large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better. You can call unpersist on an RDD to remove it from the cache, though. On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: No, I am running on 0.8.1

Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
similar ??? -- Sai Prasanna. AN, II M.Tech (CS), SSSIHL. "Entire water in the ocean can never sink a ship unless it gets inside. All the pressures of life can never hurt you unless you let them in."

Re: Distributed running in Spark Interactive shell

2014-03-26 Thread Sai Prasanna
master URL; if the latter, then yes, you can observe the distributed tasks in the Spark UI -- Nan Zhu On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote: Is it possible to run across the cluster using the Spark interactive shell? To be more explicit, is the procedure similar to running

GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
-env.sh: export SPARK_DAEMON_MEMORY=8g export SPARK_WORKER_MEMORY=8g export SPARK_DAEMON_JAVA_OPTS="-Xms8g -Xmx8g" export SPARK_JAVA_OPTS="-Xms8g -Xmx8g" export HADOOP_HEAPSIZE=4000 Any suggestions?

Connect Exception Error in spark interactive shell...

2014-03-18 Thread Sai Prasanna
) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206) at org.apache.hadoop.ipc.Client.call(Client.java:1050) ... 53 more

Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
need to set a timeout somewhere ??? Thank you !!

Re: Spark shell exits after 1 min

2014-03-17 Thread Sai Prasanna
Solved... but I don't know what the difference is... just running ./spark-shell fixes it all, but I don't know why! On Mon, Mar 17, 2014 at 1:32 PM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi everyone! I installed Scala 2.9.3, Spark 0.8.1, Oracle Java 7... I launched the master and logged