What is the concept of Block and BlockManager in Spark? How is a Block
related to a Partition of an RDD?
Hi All,
When I run the program shown below, I receive the error shown below.
I am running the current version of branch-0.9 from github. Note that
I do not receive the error when I replace 2 ** 29 with 2 ** X,
where X < 29. More interestingly, I do not receive the error when X =
30, and when X
Can someone help me with how to build Spark behind a proxy?
--
REGARDS
ASHUTOSH JAIN
IIT-BHU VARANASI
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3ccaaqhkj48japuzqc476es67c+rrfime87uprambdoofhcl0k...@mail.gmail.com%3E
On Tue, Mar 11, 2014 at 11:44 AM, hades dark hades.o...@gmail.com wrote:
Can someone help me with how to build Spark behind a proxy?
--
REGARDS
Hi all,
I'm trying to read a SequenceFile that represents a set of JPEG images
generated using this tool:
http://stuartsierra.com/2008/04/24/a-million-little-files . According to
the documentation: Each key is the name of a file (a Hadoop “Text”), the
value is the binary contents of the file (a
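The reading part itself seems straightforward; roughly something like this
(untested sketch, the HDFS path is a placeholder):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkContext

object JpegSequenceFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "jpeg-sequencefile-sketch")

    // Keys are file names (Text), values are the raw file bytes (BytesWritable).
    val images = sc.sequenceFile("hdfs:///path/to/images.seq",
      classOf[Text], classOf[BytesWritable])

    // Copy what we need out of the reused Writable objects right away.
    val sizes = images.map { case (name, bytes) => (name.toString, bytes.getLength) }
    sizes.take(5).foreach(println)
    sc.stop()
  }
}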
Hi,
I am new to Spark.
I would like to run jobs in Spark standalone cluster mode.
No cluster manager other than Spark's own is used.
(https://spark.apache.org/docs/0.9.0/spark-standalone.html)
I have tried word count from the Spark shell and from a standalone Scala app.
The code reads input from HDFS and
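For reference, the word count is essentially the standard example, roughly
like this (the master URL and HDFS path below are placeholders for my setup):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "wordcount")

    val counts = sc.textFile("hdfs:///user/me/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}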
Does "sbt show full-classpath" show spark-core on the classpath? I am
still pretty new to Scala, but it seems like you have val sparkCore
= "org.apache.spark" %% "spark-core" % V.spark %
"provided" -- I believe the "provided" part means it is expected to already be
on your classpath at runtime. The spark-shell script sets up
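For comparison, a typical build.sbt of that shape looks roughly like this
(version numbers below are just an example, not taken from the original thread):

// build.sbt (sbt's Scala DSL)
name := "my-spark-app"

scalaVersion := "2.10.3"

// "provided" keeps spark-core out of the assembled jar; the cluster (or the
// spark-shell / run scripts) is expected to supply it on the runtime classpath.
libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"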
Dear Sparkians,
We are working on a system to do relational modeling on top of Spark, all
done in pyspark. While we've been learning a lot about Spark internals so
far, we're currently running into memory issues and wondering how best to
profile to fix them. Here are our symptoms:
- We're
Hi,
I have some questions regarding usage patterns and debugging in Spark / Spark
Streaming.
1. What are some common design patterns for using broadcast variables? In my
application I created some, and also created a scheduled task which
periodically refreshes the variables. I want to know how
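Roughly, the pattern I have in mind looks like this (simplified, untested
sketch; a Broadcast is immutable, so "refreshing" means broadcasting a new
snapshot and swapping the reference):

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

object BroadcastRefreshSketch {
  // Holds the most recent broadcast; the scheduled task swaps in a new one.
  @volatile private var lookup: Broadcast[Map[String, Int]] = _

  def refresh(sc: SparkContext, latest: Map[String, Int]): Unit = {
    lookup = sc.broadcast(latest) // re-broadcast the new snapshot
  }

  def tag(keys: RDD[String]): RDD[(String, Int)] = {
    val bc = lookup // capture the current broadcast locally so the closure ships it
    keys.map(k => (k, bc.value.getOrElse(k, -1)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "broadcast-refresh-sketch")
    refresh(sc, Map("a" -> 1))
    println(tag(sc.parallelize(Seq("a", "b"))).collect().mkString(", "))
    refresh(sc, Map("a" -> 1, "b" -> 2)) // e.g. called from the scheduled task
    println(tag(sc.parallelize(Seq("a", "b"))).collect().mkString(", "))
    sc.stop()
  }
}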
Since it’s from Scala, it might mean you’re running with a different version of
Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while
Spark 0.9 uses Scala 2.10.
Matei
On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia)
arockia.r.jeya...@verizon.com wrote:
Hi,
Thanks, added you.
On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote:
Dear Spark team,
thanks for the great work and congrats on becoming an Apache top-level
project!
You could add us to your Powered By page, because we are using Spark (and
Shark) to perform
Hi Aaron,
When you say "Java heap space is 1.5G per worker, 24 or 32 cores across 46
nodes. It seems like we should have more than enough to do this
comfortably.", how are you configuring this?
-Sandy
On Tue, Mar 11, 2014 at 10:11 AM, Aaron Olson aaron.ol...@shopify.com wrote:
Dear Sparkians,
Hello,
I've been trying to run an iterative Spark job that spills 1+ GB to disk
per iteration on a system with limited disk space. I believe there would be
enough space if Spark cleaned up unused data from previous iterations,
but as it stands the number of iterations I can run is limited by
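A knob that might be relevant (my own guess, I have not verified it helps
here) is spark.cleaner.ttl, set before the SparkContext is created, e.g.:

import org.apache.spark.SparkContext

object CleanerTtlSketch {
  def main(args: Array[String]): Unit = {
    // Ask Spark to drop metadata and old shuffle/cached data after the given
    // number of seconds; it must exceed the lifetime of any RDD still reused.
    System.setProperty("spark.cleaner.ttl", "3600")

    val sc = new SparkContext("local[2]", "cleaner-ttl-sketch")
    // ... iterative job here ...
    sc.stop()
  }
}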
Ohh!
I thought you were unsubscribing :)
Kapil Malik | kma...@adobe.com | 33430 / 8800836581
-Original Message-
From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: 12 March 2014 00:51
To: user@spark.apache.org
Subject: Re: unsubscribe
To unsubscribe from this list, please
Are you aware that you get an executor (and the 1.5GB) per machine, not per
core?
On Tue, Mar 11, 2014 at 12:52 PM, Aaron Olson aaron.ol...@shopify.com wrote:
Hi Sandy,
We're configuring that with the JAVA_OPTS environment variable in
$SPARK_HOME/spark-worker-env.sh like this:
# JAVA OPTS
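For comparison, the more usual way to size this in 0.9 (sketch only, not our
actual config; the master URL is a placeholder) is the spark.executor.memory
property set before the SparkContext is created:

import org.apache.spark.SparkContext

object ExecutorMemorySketch {
  def main(args: Array[String]): Unit = {
    // One executor per worker machine shares this heap across all of its cores.
    System.setProperty("spark.executor.memory", "4g")

    val sc = new SparkContext("spark://master:7077", "executor-memory-sketch")
    // ... job ...
    sc.stop()
  }
}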
Your input data read as an RDD may be causing the OOM, so that's where you can
use a different memory configuration.
We are not getting any OOM exceptions, just Akka future timeouts in the
MapOutputTracker and unsuccessful fetches of shuffle outputs, which therefore
get refetched.
What is the industry practice
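As a sketch of the kind of tuning I mean (values purely illustrative, and I
have not verified that these settings fix the timeouts):

import org.apache.spark.SparkContext

object ShuffleTimeoutSketch {
  def main(args: Array[String]): Unit = {
    // Raise gradually while watching the logs.
    System.setProperty("spark.akka.askTimeout", "60") // seconds for actor asks (MapOutputTracker etc.)
    System.setProperty("spark.akka.frameSize", "50")  // max Akka message size, in MB

    val sc = new SparkContext("local[2]", "shuffle-timeout-sketch")
    // ... shuffle-heavy job ...
    sc.stop()
  }
}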
Hi,
I'm implementing a recommender based on the algorithm described in
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
basis for Spark's ALS implementation for data sets with implicit features.
The data set I'm working with is proprietary and I cannot share it,
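As a stand-in for the real data, a toy run against MLlib's implicit ALS looks
roughly like this (numbers are made up for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ImplicitAlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "implicit-als-sketch")

    // Toy implicit-feedback data: (user, product, observed strength / count).
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 5.0),
      Rating(1, 20, 1.0),
      Rating(2, 10, 3.0)))

    // rank, iterations, lambda (regularization), alpha (confidence scaling from the paper)
    val model = ALS.trainImplicit(ratings, 8, 10, 0.01, 1.0)
    println(model.predict(2, 20))
    sc.stop()
  }
}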
Hi Michael,
I can help check the current implementation. Would you please go to
https://spark-project.atlassian.net/browse/SPARK and create a ticket
about this issue with component MLlib? Thanks!
Best,
Xiangrui
On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman m...@allman.ms wrote:
Hi,
I'm
Ah! Thank you. That'll work for now.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-RDD-from-Java-in-memory-data-tp2486p2570.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
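A small sketch of creating an RDD from in-memory data with parallelize, which
I assume is what was being suggested:

import org.apache.spark.SparkContext

object InMemoryRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "in-memory-rdd-sketch")

    // Turn a driver-side collection into an RDD with two partitions.
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)

    println(rdd.map(_ * 2).collect().mkString(", "))
    sc.stop()
  }
}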
Hi Paul,
What do you mean by distributing the jars manually? If you register jars
that are local to the client with SparkContext.addJars, Spark should handle
distributing them to the workers. Are you taking advantage of this?
-Sandy
On Tue, Mar 11, 2014 at 3:09 PM, Paul Schooss
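A minimal sketch of what I mean (the jar path and master URL are placeholders):

import org.apache.spark.SparkContext

object AddJarSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "add-jar-sketch")

    // Register a jar that is local to the client; Spark serves it to the
    // executors so its classes are available to tasks.
    sc.addJar("/path/to/my-app-deps.jar")

    // ... run jobs that use classes from that jar ...
    sc.stop()
  }
}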
In a similar vein, it would be helpful to have an Iterable way to access the
data inside an RDD. The collect method takes everything in the RDD and puts
it in a list, but this blows up memory. Since everything I want is already
inside the RDD, it could be easy to iterate over the content without
https://github.com/apache/incubator-spark/pull/421
Works pretty well, but really needs to be enhanced to work with
AsyncRDDActions.
On Tue, Mar 11, 2014 at 4:50 PM, wallacemann wall...@bandpage.com wrote:
In a similar vein, it would be helpful to have an Iterable way to access
the
data
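For reference, assuming the iterator-style method from that pull request is
available in your build, usage looks roughly like this (untested sketch):

import org.apache.spark.SparkContext

object LocalIteratorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "local-iterator-sketch")
    val rdd = sc.parallelize(1 to 1000000, 8)

    // Stream results back one partition at a time instead of materializing
    // the whole RDD on the driver the way collect() does.
    var sum = 0L
    for (x <- rdd.toLocalIterator) {
      sum += x
    }
    println(sum)
    sc.stop()
  }
}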
I agree that we can’t keep adding these to the core API, partly because it will
get unwieldy to maintain and partly just because each storage system will bring
in lots of dependencies. We can simply have helper classes in different modules
for each storage system. There’s some discussion on
In my opinion, the BlockManager manages many types of Block; an RDD's partition,
a.k.a. an RDDBlock, is one of those types. Other types of Blocks are
ShuffleBlock, IndirectBlock (used if a task's return status is too large), etc.
So the BlockManager is a layer that is independent of the RDD concept.
On Mar 11, 2014
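To make it concrete, a small illustration (simplified): when you persist an
RDD, each cached partition ends up as one block in the BlockManager, keyed by
the usual rdd_<rddId>_<partitionIndex> naming.

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object BlockNamingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "block-naming-sketch")

    val rdd = sc.parallelize(1 to 100, 4).persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // materialize: each cached partition becomes one block

    // Each cached partition is stored by the BlockManager under an id of the
    // form rdd_<rddId>_<partitionIndex> (rdd_0_0 ... rdd_0_3 here), alongside
    // shuffle_* and broadcast_* blocks handled by the same layer.
    println("RDD " + rdd.id + " has " + rdd.partitions.length + " partitions cached as blocks")
    sc.stop()
  }
}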
You should probably be
asking the opposite question: why do you think it *should* be applied
immediately? Since the driver program hasn't requested any data back
(distinct generates a new RDD, it doesn't return any data), there's no
need to actually compute anything yet.
As the documentation
I think you misunderstood my question - I should have stated it better. I'm
not saying it should be applied immediately, but I'm trying to understand
how Spark achieves this lazy computation of transformations. Maybe this is
due to my ignorance of how Scala works, but when I see the code, I see that
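Roughly, is the idea something like this (my own stripped-down sketch, not
Spark's actual classes)? A transformation just returns a new object that
remembers its parent and the function to apply; only an action triggers compute.

// A stripped-down illustration of lazy transformations; not Spark's real classes.
abstract class TinyRDD[T] {
  def compute(): Iterator[T]

  // map is lazy: it only builds a new node that remembers the parent and f.
  def map[U](f: T => U): TinyRDD[U] = {
    val parent = this
    new TinyRDD[U] {
      def compute(): Iterator[U] = parent.compute().map(f)
    }
  }

  // collect is an action: only here does compute() actually run.
  def collect(): List[T] = compute().toList
}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    val source = new TinyRDD[Int] { def compute() = Iterator(1, 2, 3) }
    val doubled = source.map(_ * 2) // nothing has been computed yet
    println(doubled.collect())      // computation happens here: List(2, 4, 6)
  }
}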