Assuming you're using a new enough version of Spark, you should use
spark.executor.memory to set the memory for your executors, without
changing the driver memory. See the docs for your version of Spark.
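For example, with SparkConf; a minimal sketch where the master URL, app name, and the 4g value are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

// Sets the executor heap size without touching the driver's memory.
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("MyApp")                // placeholder app name
  .set("spark.executor.memory", "4g") // 4g is an example value
val sc = new SparkContext(conf)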
On Thu, Mar 27, 2014 at 10:48 PM, Tsai Li Ming mailingl...@ltsai.com wrote:
Hi,
My worker
Hi,
Thanks! I found out that I wasn’t setting SPARK_JAVA_OPTS correctly.
I took a look at the process table and saw that the
“org.apache.spark.executor.CoarseGrainedExecutorBackend” didn’t have the
-Dspark.local.dir set.
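For reference, the same property can also be set on the SparkConf, assuming your version propagates it to executors; a sketch with an example path:

import org.apache.spark.SparkConf

// /mnt/spark is an example scratch path; this property is what
// -Dspark.local.dir in SPARK_JAVA_OPTS would have set.
val conf = new SparkConf().set("spark.local.dir", "/mnt/spark")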
On 28 Mar, 2014, at 1:05 pm, Matei Zaharia
Hi David,
I am sorry but your question is not clear to me. Are you talking about
taking some value and sharing it across your cluster so that it is present
on all the nodes? You can look at Spark's broadcasting in that case. On the
other hand, if you want to take one item and create an RDD of 100
Hi,
I am a newbie with Spark.
I tried installing two virtual machines, one as a client and one as a
standalone-mode worker+master.
Everything seems to run and connect fine, but when I try to run a simple
script, I get weird errors.
Here is the traceback, notice my program is just a one-liner:
Have you tried setting the partitioning?
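Something along these lines (path and counts are placeholders):

// sc: an existing SparkContext. The second argument asks for at
// least 1024 input partitions.
val data = sc.textFile("hdfs:///path/to/file.csv", 1024)
// Or re-split an existing RDD with a shuffle.
val resized = data.coalesce(1024, shuffle = true)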
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Thu, Mar 27, 2014 at 10:04 AM, lannyripple lanny.rip...@gmail.com wrote:
Hi all,
I've got something which I think should be straightforward but
Hi,
I just ran a simple example to generate some data for the ALS
algorithm. My Spark version is 0.9, running in local mode; the memory of
my node is 108G.
But when I set conf.set("spark.akka.frameSize", "4096"), the following
problem occurred; when I do not set this, it runs well.
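One possible explanation, assuming the frame size is interpreted in MB and converted to a byte count internally, is that 4096 MB overflows a 32-bit value; a smaller setting is the usual fix. A sketch:

import org.apache.spark.SparkConf

// The frame size is in MB; keeping it well below 2048 keeps the
// converted byte count within a 32-bit Int.
val conf = new SparkConf().set("spark.akka.frameSize", "512")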
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hi all,
I notice that RDD.cartesian has a strange behavior with cached and
uncached data. More
I sorted it out.
Turns out that if the client uses Python 2.7 and the server is Python 2.6,
you get some weird errors, like this and others.
So you would probably want to avoid doing that...
If you are learning about Spark Streaming, as I am, you've probably used
netcat (nc) as mentioned in the Spark Streaming programming guide. I
wanted something a little more useful, so I modified the
ClickStreamGenerator code to make a very simple script that simply reads a
file off disk and passes
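For anyone curious, a minimal sketch of the same idea (hypothetical, not the actual modified ClickStreamGenerator): serve a file's lines over a socket so socketTextStream can consume them:

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

// Serve a file's lines on a socket so that Spark Streaming's
// socketTextStream("localhost", 9999) can consume them like nc would.
object FileStreamServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    val socket = server.accept()                      // wait for the receiver
    val out = new PrintWriter(socket.getOutputStream, true)
    for (line <- Source.fromFile(args(0)).getLines()) {
      out.println(line)
      Thread.sleep(100)                               // pace the stream a bit
    }
    out.close()
    socket.close()
    server.close()
  }
}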
I've played around with it. The CSV file looks like it gives 130
partitions. I'm assuming that's the standard 64MB split size for HDFS
files. I have increased the number of partitions and the number of tasks
for things like groupByKey and such. Usually I start blowing up on GC
overhead limit exceeded errors, or sometimes
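For reference, the knob here is the optional numPartitions argument on the shuffle operations; a sketch with placeholder path and key extraction:

// sc: an existing SparkContext. An explicit partition count on groupByKey
// spreads the shuffle into smaller tasks and eases GC pressure.
val pairs = sc.textFile("hdfs:///path/to/file.csv", 1024)
  .map(line => (line.split(",")(0), line))
val grouped = pairs.groupByKey(512)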
Classes are serialized and sent to all the workers as Akka messages.
As for singletons and case classes, I am not sure if they are
Java-serialized or Kryo-serialized by default.
But your own classes, if serialized by Kryo, will definitely be much more
efficient. There is a comparison that Matei did for all
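For context, enabling Kryo is a single config line (a sketch):

import org.apache.spark.SparkConf

// Switch serialization of shuffled/cached data from Java serialization to Kryo.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")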
As long as the amount of state being passed is relatively small, it's
probably easiest to send it back to the driver and to introduce it into RDD
transformations as the zero value of a fold.
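A minimal sketch of that pattern, with one caveat: RDD.fold applies its zero value once per partition and once more when merging, so a non-identity zero is safer folded in on the driver:

import org.apache.spark.rdd.RDD

// `carried` is the small state sent back to the driver from the previous
// step; it seeds a driver-side fold over collected per-partition sums.
def accumulate(carried: Int, next: RDD[Int]): Int = {
  val partials = next.mapPartitions(it => Iterator(it.sum)).collect()
  partials.foldLeft(carried)(_ + _)
}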
On Fri, Mar 28, 2014 at 7:12 AM, Adrian Mocanu amoc...@verticalscope.com wrote:
I'd like to resurrect
Ok. Based on Sonal's message I dived more into memory and partitioning and
got it to work.
For the CSV file I used 1024 partitions [textFile(path, 1024)] which cut
the partition size down to 8MB (based on standard HDFS 64MB splits). For
the key file I also adjusted partitions to use about 8MB.
I'd like to resurrect this thread since I don't have an answer yet.
From: Adrian Mocanu [mailto:amoc...@verticalscope.com]
Sent: March-27-14 10:04 AM
To: u...@spark.incubator.apache.org
Subject: function state lost when next RDD is processed
Is there a way to pass a custom function to Spark to
Hi Aureliano,
I followed this thread to create a custom saveAsObjectFile.
The following is the code.
new org.apache.spark.rdd.SequenceFileRDDFunctions[NullWritable,
BytesWritable](saveRDD.mapPartitions(iter =>
iter.grouped(10).map(_.toArray)).map(x => (NullWritable.get(), new
There is also this quote from the Tuning guide
(http://spark.incubator.apache.org/docs/latest/tuning.html): "Finally, if
you don't register your classes, Kryo will still work, but it will have to
store the full class name with each object, which is wasteful."
It implies that you don't really
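For reference, registration in this era of Spark goes through a KryoRegistrator; a sketch, where MyClass, MyRegistrator, and the package name are placeholders:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyClass(val field: String)  // placeholder class to register

// Registration lets Kryo write a small numeric ID instead of the
// full class name for each object.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

// Then point Spark at it:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")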
Thanks!
Ya that's what I'm doing so far, but I wanted to see if it's possible to keep
the tuples inside Spark for fault tolerance purposes.
-A
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: March-28-14 10:45 AM
To: user@spark.apache.org
Subject: Re: function state lost when next RDD
Thanks a lot Ognen!
It's not a fancy class that I wrote, and now I realize I neither extend
Serializable nor register with Kryo, and that's why it is not working.
Hi,
Thanks Nanzhu. I tried to implement your suggestion on the following
scenario. I have an RDD of, say, 24 elements, which I partitioned into two
groups of 12 elements each. There is a loss of order of the elements in
the partitions; elements are partitioned randomly. I need to preserve the
order, such that the first
The cleaner ttl was introduced as a brute force method to clean all old
data and metadata in the system, so that the system can run 24/7. The
cleaner ttl should be set to a large value, so that RDDs older than that
are not used. Though there are some cases where you may want to use an RDD
again
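Setting it is one config line; the value is in seconds (3600 here is just an example):

import org.apache.spark.SparkConf

// Data and metadata older than this window are cleaned; RDDs you still
// need past it must be regenerated rather than reused.
val conf = new SparkConf().set("spark.cleaner.ttl", "3600")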
I think you should sort each RDD
-Original Message-
From: yh18190 [mailto:yh18...@gmail.com]
Sent: March-28-14 4:44 PM
To: u...@spark.incubator.apache.org
Subject: Re: Splitting RDD and Grouping together to perform computation
Hi,
Thanks Nanzhu. I tried to implement your suggestion on
I say you need to remap so you have a key for each tuple that you can
sort on. Then call rdd.sortByKey(true), like this:
mystream.transform(rdd => rdd.sortByKey(true))
For this fn to be available you need to import
org.apache.spark.rdd.OrderedRDDFunctions
-Original Message-
From: yh18190
From the gist of it, it seems like you need to override the default
partitioner to control how your data is distributed among partitions. Take
a look at the different Partitioners available (Default, Range, Hash); if
none of these gets you the desired result, you might want to provide your
own, along the lines of the sketch below.
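A minimal custom Partitioner sketch (the class name and modulo routing rule are placeholders, assuming non-negative Int keys):

import org.apache.spark.Partitioner

class ModuloPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  // Placeholder routing rule: send each key to (key mod parts).
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % parts
}

// Usage: pairs.partitionBy(new ModuloPartitioner(2))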
On Fri,
Hi Andriana,
Thanks for the suggestion. Could you please modify the part of my code
where I need to do so? I apologise for the inconvenience; because I am new
to Spark, I couldn't apply it appropriately. I would be thankful to you.
Not sure how to change your code because you'd need to generate the keys where
you get the data. Sorry about that.
I can tell you where to put the code to remap and sort though.
import org.apache.spark.rdd.OrderedRDDFunctions
val res2 = reduced_hccg.map(_._2)
  .map(x => (newkey, x)).sortByKey(true)
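For what it's worth, a fuller sketch with placeholder data, generating an index key up front so the order can be recovered later:

import org.apache.spark.SparkContext._  // implicit conversion to OrderedRDDFunctions in this era

// sc: an existing SparkContext. Attach an index key while the original
// order is still known; sortByKey restores it after any reshuffling.
val values = Seq("a", "b", "c")
val keyed = sc.parallelize(values.zipWithIndex.map(_.swap))  // RDD[(Int, String)]
val inOrder = keyed.sortByKey(true).map(_._2)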
Hey guys,
I need to tag individual RDD lines with some values. This tag value would
change at every iteration. Is this possible with RDD (I suppose this is
sort of like a mutable RDD, but it's more)?
If not, what would be the best way to do something like this? Basically, we
need to keep mutable
Sung Hwan, strictly speaking, RDDs are immutable, so the canonical way to
get what you want is to transform to another RDD. But you might look at
MutablePair (
https://github.com/apache/spark/blob/60abc252545ec7a5d59957a32e764cd18f6c16b4/core/src/main/scala/org/apache/spark/util/MutablePair.scala)
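A sketch of the canonical immutable route, producing a freshly tagged RDD each iteration (data and loop bounds are placeholders):

// sc: an existing SparkContext.
val lines = sc.parallelize(Seq("x", "y"))   // placeholder data
var tagged = lines.map(line => (0, line))   // initial tag
for (iteration <- 1 to 10) {
  // Retag by building a new RDD; the old one is never mutated.
  tagged = tagged.map { case (_, line) => (iteration, line) }
}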
Weird, how exactly are you pulling out the sample? Do you have a small program
that reproduces this?
Matei
On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
What does your saveRDD contain? If you are using custom objects, they
should be serializable.
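That is, something along these lines for the element type (a generic sketch):

// Elements stored in the RDD must be serializable end to end.
class Record(val id: Int, val payload: String) extends Serializable
case class Event(id: Int, name: String)  // case classes are Serializable already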
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Sat, Mar 29, 2014 at 12:02 AM, pradeeps8 srinivasa.prad...@gmail.com wrote:
Hi Aureliano,
I
Are you referring to Spark Streaming?
Can you save the sum as an RDD and keep joining the two RDDs together?
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Mar 28, 2014 at 10:47 AM, Adrian Mocanu
Sung Hwan, yes, I'm saying exactly what you interpreted, including that if
you tried it, it would (mostly) work, and my uncertainty with respect to
guarantees on the semantics. Definitely there would be no fault tolerance
if the mutations depend on state that is not captured in the RDD lineage.
Thanks Patrick,
I was thinking about that... Upon analysis I realized it would be
something similar to the way HiveContext uses the CustomCatalog stuff.
I will review it again, along the lines of implementing SchemaRDD with
Cassandra. Thanks for the pointer.
Upon discussion with a couple of
That helps! Thank you.
On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
Hi David,
I am sorry but your question is not clear to me. Are you talking about
taking some value and sharing it across your cluster so that it is present
on all the nodes? You can look at