Hi all,
When I set the persistence level to DISK_ONLY, Spark still tries to use memory and cache the data.
Any reason why?
Do I need to override some parameter elsewhere?
Thanks!
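For reference, a minimal sketch of how disk-only persistence is requested (the RDD name is a placeholder of mine, and note that the storage level only governs caching; task execution itself still uses some memory):

    import org.apache.spark.storage.StorageLevel

    // Hedged sketch: ask Spark to keep this RDD's cached partitions only on disk.
    // myRdd is assumed to be an already-created RDD.
    myRdd.persist(StorageLevel.DISK_ONLY)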
Hi,
Is there any lower bound on the size of an RDD for optimally utilizing Spark's in-memory framework?
Say creating an RDD for a very small data set of about 64 MB is not as efficient as one of about 256 MB; then the application can be tuned accordingly.
So is there a soft lower bound related to
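No threshold is given in the thread itself; as a hedged illustration only, one knob that can be tuned when experimenting with small inputs is the number of partitions requested when the RDD is created (the path and the value 4 below are placeholders):

    // Hedged sketch: request a minimum number of partitions for a small file.
    // In 0.8-era Spark the parameter is called minSplits (later releases call it minPartitions).
    val small = sc.textFile("hdfs:///data/small.txt", 4)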
Hi Everyone,
I think everyone is pretty busy; the response time in this group has increased slightly.
Anyway, this is a pretty silly problem, but I could not get past it.
I have a file in my local FS, but when I try to create an RDD out of it, the task fails and a file-not-found exception is thrown at
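For reference, a hedged sketch of reading a local file into an RDD; the path is a placeholder, and one common cause of this exception is that the file exists only on the driver machine, whereas a file:// path must be readable at the same location on every worker that runs tasks:

    // Hedged sketch: use an explicit file:// URI; the file must be present at this
    // path on every node, not just on the machine where the shell is started.
    val lines = sc.textFile("file:///home/user/data/input.txt")
    println(lines.count())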
Hi,
Can we override the default file replication factor when using saveAsTextFile() to write to HDFS?
My default replication factor is 1, but the intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way?
Thanks!
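A hedged sketch of one way this is commonly done, by setting dfs.replication on the Hadoop configuration the job writes with (the RDD name and paths are placeholders, and the availability of sc.hadoopConfiguration is an assumption on my part):

    // Hedged sketch: request a replication factor of 1 for files written by this job.
    sc.hadoopConfiguration.set("dfs.replication", "1")
    intermediateRdd.saveAsTextFile("hdfs:///tmp/intermediate-output")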
... create an RDD out of it and operate on it.
Is there any way out??
Thanks in advance!
On Fri, May 9, 2014 at 12:18 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi Everyone,
I think everyone is pretty busy; the response time in this group has increased slightly.
Anyway, this is a pretty silly
Hi All,
I wanted to launch Spark on YARN interactively, in yarn-client mode.
With the default settings of yarn-site.xml and spark-env.sh, I followed the given link:
http://spark.apache.org/docs/0.8.1/running-on-yarn.html
I get the correct pi value when I run without launching the shell.
When I launch
I executed the following commands to launch a Spark app in yarn-client mode. I have Hadoop 2.3.0, Spark 0.8.1, and Scala 2.9.3:
SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
SPARK_YARN_MODE=true \
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
Hi, any suggestions for the following issue??
I have a replication factor of 3 in my HDFS.
I ran my experiments with 3 datanodes. Now I just added another node with no data on it.
When I ran again, Spark launched non-local tasks on it, and the time taken is more than it took on the 3-node cluster.
Hi All,
I have a replication factor of 3 in my HDFS.
I ran my experiments with 3 datanodes. Now I just added another node with no data on it.
When I ran again, Spark launched non-local tasks on it, and the time taken is more than it took on the 3-node cluster.
I think delay scheduling fails here.
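As a hedged illustration of the knob involved: in 0.8-era Spark, delay scheduling waits spark.locality.wait milliseconds for a data-local slot before launching a non-local task, and the property is set before the SparkContext is created (the value and master URL below are placeholders, not something from this thread):

    import org.apache.spark.SparkContext

    // Hedged sketch: wait longer for data-local slots before falling back to
    // non-local tasks. 10000 ms is an arbitrary example value.
    System.setProperty("spark.locality.wait", "10000")
    val sc = new SparkContext("spark://master-host:7077", "LocalityTest")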
Hi All,
I want to store a CSV text file in Parquet format in HDFS and then do some processing on it in Spark.
Somehow my search for a way to do this was futile; more help was available for Parquet with Impala.
Any guidance here? Thanks!!
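Nothing in this thread shows the eventual solution; as a hedged sketch only, on a later Spark release that bundles Spark SQL the conversion can be done through DataFrames (the paths and header option are placeholders):

    // Hedged sketch, assuming a Spark version with DataFrame support rather than
    // the 0.8.x release discussed here. Paths are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CsvToParquet").getOrCreate()

    // Read the CSV from HDFS and write it back out as Parquet.
    val df = spark.read.option("header", "true").csv("hdfs:///data/table.csv")
    df.write.parquet("hdfs:///data/table_parquet")

    // Later jobs can read the columnar files directly.
    val parquetDf = spark.read.parquet("hdfs:///data/table_parquet")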
wrote:
This function will return a Scala List; you can use the List's last function to get the last element.
For example:
RDD.take(RDD.count()).last
On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna
ansaiprasa...@gmail.com wrote:
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD
On Thu, Apr 24, 2014 at 11:42 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Thanks Guys !
On Thu, Apr 24, 2014 at 11:29 AM, Sourav Chandra
sourav.chan...@livestream.com wrote:
Also, the same thing can be done using rdd.top(1)(reverseOrdering).
On Thu, Apr 24, 2014 at 11:28 AM, Sourav Chandra
.
On Thu, Apr 24, 2014 at 7:38 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All, finally I wrote the following code, which I feel performs optimally, if not the most optimally.
Using file pointers, it seeks backwards to the byte just after the last \n!!
This is memory efficient, and I hope even unix
Hi All, some help!
RDD.first or RDD.take(1) gives the first item; is there a straightforward way to access the last element in a similar way?
I couldn't find a tail/last method for RDD!!
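For reference, a hedged sketch of getting the last element without pulling the whole RDD back to the driver: reduce each partition to its final element first (the names and path are mine, and it assumes partition order is the order you care about):

    // Hedged sketch: keep only the last element of each partition, then take the
    // last of those on the driver. sc and the path are placeholders.
    val rdd = sc.textFile("hdfs:///path/to/input.txt")

    val lastPerPartition = rdd.mapPartitions { iter =>
      if (iter.hasNext) {
        var last = iter.next()
        while (iter.hasNext) last = iter.next()
        Iterator(last)
      } else Iterator.empty
    }

    // Only one element per partition is collected back to the driver.
    val lastLine = lastPerPartition.collect().lastOption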
Oh ya, Thanks Adnan.
On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:
You can use the following code:
RDD.take(RDD.count())
On Thu, Apr 24, 2014 at 9:51 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi All, Some help !
RDD.first or RDD.take(1) gives the first item
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.
I want only to access the last element.
On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Oh ya, Thanks Adnan.
On Thu, Apr 24, 2014 at 10:30 AM, Adnan Yaqoob nsyaq...@gmail.com wrote:
You can use the following code:
RDD.take(RDD.count()).last
On Thu, Apr 24, 2014 at 10:28 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Adnan, but RDD.take(RDD.count()) returns all the elements of the RDD.
I want only to access the last element.
On Thu, Apr 24, 2014 at 10:33 AM, Sai Prasanna
ansaiprasa
Hi All,
I want to access a particular column of a DB table stored in CSV format and perform some aggregate queries over it. I wrote the following in Scala as a first step.
var add = (x: String) => x.split("\\s+")(2).toInt
var result = List[Int]()
input.split("\n").foreach(x => result ::= add(x))
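As a hedged sketch only, the same idea expressed with RDD operations instead of a local split (the path, the field index, and the whitespace delimiter are carried over from the snippet above; sc is assumed to be an existing SparkContext):

    // Hedged sketch: read the file as an RDD, extract the third whitespace-separated
    // field of each line as an Int, and run simple aggregates over it.
    val lines = sc.textFile("hdfs:///path/to/table.csv")
    val column = lines.map(line => line.split("\\s+")(2).toInt)

    val total = column.reduce((a, b) => a + b)
    val maxValue = column.reduce((a, b) => math.max(a, b))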
Hi All,
I wanted to get Spark on YARN up and running.
I did: SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly
Then I ran:
SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
Hi All,
I have a five-node Spark cluster: master, s1, s2, s3, s4.
I have passwordless SSH from the master to all slaves and vice versa.
But on one machine, s2, what happens is that after 2-3 minutes of my connection from the master to the slave, the write pipe is broken. So if I try to connect again from the master I
and employing things like Kryo; some are solved by specifying the number of tasks to break a given operation into, etc.
Ognen
On 3/27/14, 10:21 AM, Sai Prasanna wrote:
java.lang.OutOfMemoryError: GC overhead limit exceeded
What is the problem? The same code, I run, one instance
scheme are you using? I am guessing it is MEMORY_ONLY. For large datasets, MEMORY_AND_DISK or MEMORY_AND_DISK_SER work better.
You can call unpersist on an RDD to remove it from the cache, though.
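A hedged sketch of what that advice looks like in code (the RDD name is a placeholder, and the unpersist call assumes the standard RDD API):

    import org.apache.spark.storage.StorageLevel

    // Hedged sketch: keep serialized partitions in memory and spill to disk when needed.
    cachedRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // ... run the jobs that reuse cachedRdd ...

    // Drop the cached copy once it is no longer needed.
    cachedRdd.unpersist()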
On Thu, Mar 27, 2014 at 11:57 AM, Sai Prasanna ansaiprasa...@gmail.com wrote:
No, I am running on 0.8.1.
similar ???
--
Sai Prasanna. AN
II M.Tech (CS), SSSIHL
Entire water in the ocean can never sink a ship, unless it gets inside. All the pressures of life can never hurt you, unless you let them in.
master URL
If the latter case, then also yes: you can observe the distributed tasks in the Spark UI.
--
Nan Zhu
On Wednesday, March 26, 2014 at 8:54 AM, Sai Prasanna wrote:
Is it possible to run across the cluster using the Spark interactive shell?
To be more explicit, is the procedure similar to running
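For reference, a hedged sketch of what pointing at the cluster amounts to underneath the shell: the standalone master's URL is what the SparkContext is built against, and the 0.8-era shell picks the same URL up from the MASTER environment variable (the host and port below are placeholders):

    import org.apache.spark.SparkContext

    // Hedged sketch: a context created against the standalone master distributes
    // its tasks across the cluster; the URL is shown on the master's web UI.
    val sc = new SparkContext("spark://master-host:7077", "InteractiveTest")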
In spark-env.sh:
export SPARK_DEAMON_MEMORY=8g
export SPARK_WORKER_MEMORY=8g
export SPARK_DEAMON_JAVA_OPTS="-Xms8g -Xmx8g"
export SPARK_JAVA_OPTS="-Xms8g -Xmx8g"
export HADOOP_HEAPSIZE=4000
Any suggestions ??
)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1206)
at org.apache.hadoop.ipc.Client.call(Client.java:1050)
... 53 more
Do I need to set a timeout somewhere?
Thank you!!
Solved... but I don't know what the difference is...
Just running ./spark-shell fixes it all... but I don't know why!!
On Mon, Mar 17, 2014 at 1:32 PM, Sai Prasanna ansaiprasa...@gmail.com wrote:
Hi everyone!!
I installed Scala 2.9.3, Spark 0.8.1, and Oracle Java 7...
I launched the master and logged