Found the issue: the splits in HBase were not uniform, so one task was
taking 90% of the time.
BTW, is there a way to save the details available on port 4040 (the Spark
web UI) after the job is finished?
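(The application UI on port 4040 goes away when the driver exits. Later Spark releases added event logging plus a history server that can replay the UI of a finished job; a sketch of that setup, with a placeholder log path, is:

```
# conf/spark-defaults.conf (event logging is available from Spark 1.0
# onward; hdfs:///spark-logs is a placeholder path)
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-logs
```

With these set, `./sbin/start-history-server.sh` serves the UIs of completed applications, by default on port 18080.)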
On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
It's tricky really since you may not
Hi,
I am looking for ways to share the SparkContext, meaning I need to be
able to perform multiple operations on the same SparkContext. Below is
the code of a simple app I am testing:

  def main(args: Array[String]) {
    println("Welcome to example application!")
    val sc = new SparkContext("local", "ExampleApp")  // placeholder master/app name
The fair scheduler merely reorders tasks... I think he is looking to run
multiple pieces of code on a single context, on demand from customers. If
the code order is decided, then the fair scheduler will ensure that all
tasks get equal cluster time :)
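For reference, the fair-scheduler setup under discussion looks roughly like this; the pool name "customerA" is a made-up example. Pools are defined in an allocations file:

```xml
<?xml version="1.0"?>
<!-- conf/fairscheduler.xml: a sketch; "customerA" is illustrative -->
<allocations>
  <pool name="customerA">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

This takes effect when `spark.scheduler.mode` is set to `FAIR`, and a thread submitting jobs on the shared context opts into a pool with `sc.setLocalProperty("spark.scheduler.pool", "customerA")`.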
Mayur Rustagi
Ph: +919632149971
Okay, you caught me on this.. I haven't used the Python API.
Let's try partitionBy on the RDD
(http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy)
and customize the partitioner from the default hash to a custom
function.
Please update the list if it works; it seems to be a
Hi Mayur,
Thanks for replying. Is it usually double the size of the data on disk?
I have observed this many times. The Storage section of the Spark UI is telling
me that 100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS
is only 47 GB.
Thanks and Regards,
Suraj Sheth
From: Mayur Rustagi
The problem is that Java objects can take more space than the underlying data,
but there are options in Spark to store data in serialized form to get around
this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html.
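Matei's point about object overhead can be illustrated outside the JVM as well. The sketch below is a plain-Python analogy (not Spark code): it compares the in-memory footprint of a list of boxed integers against the size of the same data in serialized form, which is roughly the trade-off a serialized storage level makes for an RDD.

```python
import pickle
import sys

# A plain-Python analogy of object overhead: each element of the list
# is a full boxed object, while the pickled form is a compact byte
# stream of just the data.
nums = list(range(10_000))

# Approximate in-memory footprint: the list's own overhead plus one
# boxed int object per element.
object_size = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)

# Footprint of the same data once serialized.
serialized_size = len(pickle.dumps(nums))

print(f"objects:    {object_size} bytes")
print(f"serialized: {serialized_size} bytes")
assert serialized_size < object_size
```

In Spark itself, the equivalent knob is persisting with a serialized storage level (e.g. `MEMORY_ONLY_SER`), optionally combined with a faster serializer such as Kryo, as described on the tuning page linked above.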
Matei
On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar
It seems you are already using partitionBy; you can simply plug in
your custom function instead of lambda x: x and it should use that to
partition. A range partitioner is available in Scala; I am not sure if it's
exposed directly in Python.
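To make the "plug in your custom function" idea concrete, here is a small sketch in plain Python. The helper `custom_partitioner` is hypothetical; in PySpark it would be passed to `rdd.partitionBy(numPartitions, partitionFunc=...)` in place of the default hash. The commented line shows that call; the executable part only demonstrates how keys get routed.

```python
# Hypothetical custom partition function: isolate known-hot keys in
# their own partition so one skewed key cannot dominate a stage.
def custom_partitioner(key, num_partitions=4):
    if str(key).startswith("hot_"):
        return 0  # dedicated partition for hot keys
    # Spread everything else over the remaining partitions; Python's %
    # keeps the result non-negative even for negative hashes.
    return 1 + (hash(key) % (num_partitions - 1))

# In PySpark (not executed here) this replaces the default hash:
#   rdd.partitionBy(4, lambda k: custom_partitioner(k, 4))

if __name__ == "__main__":
    for key in ["hot_user42", "alpha", "beta", "gamma"]:
        part = custom_partitioner(key)
        assert 0 <= part < 4
        print(f"{key} -> partition {part}")
```

Whether this helps depends on knowing the skewed keys up front; for range-style splits you would compute boundaries first and map each key to its bucket inside the function.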
Regards
Mayur
Mayur Rustagi
Ph: +919632149971
Thank you Mayur, I think that will help me a lot
Best,
Tao
2014-02-26 8:56 GMT+08:00 Mayur Rustagi <mayur.rust...@gmail.com>:
The type of shuffling is best explained by Matei in his Spark Internals talk:
http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203
Why don't you look at that, and then follow up if you have
I'm not able to run the GraphX examples from the Scala REPL. Can anyone
point me to the correct documentation that covers the configuration
and/or how to build GraphX for the REPL?
Thanks
Hi hyqgod,
This is probably a better question for the Spark user list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with Spark because