Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Found the issue: the splits in HBase were not uniform, so one task was taking 90% of the time. BTW, is there a way to save the details available on port 4040 after the job is finished? On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: It's tricky really since you may not …
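The skew described above is easy to see without Spark or HBase at all. Below is a minimal plain-Python sketch (hypothetical numbers, no Spark API involved) of why non-uniform region splits make one task dominate the job's runtime: the partition boundaries, not the total data size, decide how work is shared.

```python
# Illustrative sketch (plain Python, no Spark/HBase required): non-uniform
# split points concentrate most rows, and therefore most work, in one partition.

import bisect

def records_per_partition(keys, split_points):
    """Count how many keys fall into each partition defined by sorted split points."""
    counts = [0] * (len(split_points) + 1)
    for k in keys:
        counts[bisect.bisect_right(split_points, k)] += 1
    return counts

keys = list(range(1000))

# Uniform splits: every partition gets roughly the same work.
uniform = records_per_partition(keys, [250, 500, 750])

# Skewed splits: one region holds 90% of the rows, so one task does 90% of the work.
skewed = records_per_partition(keys, [30, 60, 100])

print(uniform)                   # [250, 250, 250, 250]
print(skewed)                    # [30, 30, 40, 900]
print(max(skewed) / len(keys))   # 0.9
```

With uniform boundaries each task processes a quarter of the rows; with the skewed boundaries the last task processes 90% of them, which matches the behaviour reported in the thread.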

Sharing SparkContext

2014-02-25 Thread abhinav chowdary
Hi, I am looking for ways to share the SparkContext, meaning I need to be able to perform multiple operations on the same Spark context. Below is the code of a simple app I am testing: def main(args: Array[String]) { println("Welcome to example application!") val sc = new …

Re: Sharing SparkContext

2014-02-25 Thread Mayur Rustagi
The fair scheduler merely reorders tasks… I think he is looking to run multiple pieces of code on a single context, on demand from customers. If the code order is decided, then the fair scheduler will ensure that all tasks get equal cluster time :) Mayur Rustagi Ph: +919632149971 …
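The pattern being discussed in this thread is one long-lived context shared by many callers, each submitting work on demand. A minimal plain-Python analogy is below; it does not use Spark, and the names (`SharedContext`, `run_job`) are hypothetical stand-ins, not Spark APIs.

```python
# Plain-Python analogy (no Spark required) of sharing one context:
# several callers submit independent jobs against the same long-lived object.

from concurrent.futures import ThreadPoolExecutor

class SharedContext:
    """Stands in for a single long-lived SparkContext reused across requests."""
    def __init__(self, app_name):
        self.app_name = app_name

    def run_job(self, data, func):
        # A real SparkContext would distribute this work; here we just map locally.
        return [func(x) for x in data]

ctx = SharedContext("example-application")

# Several "customers" submit jobs concurrently against the same context.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = pool.submit(ctx.run_job, range(5), lambda x: x * x)
    doubles = pool.submit(ctx.run_job, range(5), lambda x: x + x)
    print(squares.result())  # [0, 1, 4, 9, 16]
    print(doubles.result())  # [0, 2, 4, 6, 8]
```

In real Spark, concurrent jobs submitted to one context from multiple threads are where the fair scheduler mentioned above comes in: it interleaves the tasks of those jobs so no single job monopolizes the cluster.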

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
Okay, you caught me on this… I haven't used the Python API. Let's try http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy on the RDD: customize the partitioner by passing a custom function instead of the default hash. Please update the list if it works; it seems to be a …

RE: Size of RDD larger than Size of data on disk

2014-02-25 Thread Suraj Satishkumar Sheth
Hi Mayur, Thanks for replying. Is it usually double the size of the data on disk? I have observed this many times. The Storage section of Spark is telling me that 100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS is only 47 GB. Thanks and Regards, Suraj Sheth From: Mayur Rustagi …

Re: Size of RDD larger than Size of data on disk

2014-02-25 Thread Matei Zaharia
The problem is that Java objects can take more space than the underlying data, but there are options in Spark to store data in serialized form to get around this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html. Matei On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar …
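The object-overhead effect Matei describes is a JVM issue, but the same thing can be demonstrated in plain Python: boxed in-memory objects cost far more than the raw serialized bytes. This is only an analogy to Spark's serialized storage levels, and the exact ratio differs by runtime and data type.

```python
# Python analogy of the thread's point: per-object memory overhead vs. the
# compact serialized form (as with Spark's MEMORY_ONLY_SER storage level).

import pickle
import sys

values = list(range(100_000))

# In-memory cost: the list's pointer array plus one boxed int object per element.
object_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Serialized cost: one compact byte stream for the whole collection.
serialized_bytes = len(pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL))

# The boxed representation costs several times more than the serialized one,
# which is why a 47 GB dataset on disk can occupy ~97 GB as cached objects.
print(object_bytes > 2 * serialized_bytes)
```

The roughly 2x blow-up reported earlier in the thread (47 GB on disk vs. 97 GB cached) is consistent with this kind of per-object overhead, and serialized caching trades some CPU for that memory back.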

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
It seems you are already using partitionBy; you can simply plug in your custom function instead of lambda x: x and it should use that to partition. A range partitioner is available in Scala; I am not sure if it's exposed directly in Python. Regards, Mayur Mayur Rustagi Ph: +919632149971 …
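The idea in this thread, swapping the default hash for a custom partition function, can be sketched without Spark. The helper below (`partition_counts` is a hypothetical name, not a PySpark API) mimics partitionBy-style placement: each key goes to `partition_func(key) % num_partitions`.

```python
# Plain-Python sketch of partitionBy-style placement: the same keys can be
# badly skewed under one partition function and well spread under another.

def partition_counts(keys, num_partitions, partition_func):
    counts = [0] * num_partitions
    for k in keys:
        counts[partition_func(k) % num_partitions] += 1
    return counts

# Keys that all collide under the default hash: multiples of 4 with 4 partitions
# (for ints, hash(k) == k, so k % 4 is always 0).
keys = [i * 4 for i in range(100)]

hashed = partition_counts(keys, 4, hash)              # everything in one partition
custom = partition_counts(keys, 4, lambda k: k // 4)  # range-style: spread evenly

print(hashed)  # [100, 0, 0, 0]
print(custom)  # [25, 25, 25, 25]
```

This is the distinction being made above: a hash partitioner can place pathological key sets into one partition, while a custom (here range-like) function distributes them evenly.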

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
Thank you Mayur, I think that will help me a lot. Best, Tao 2014-02-26 8:56 GMT+08:00 Mayur Rustagi mayur.rust...@gmail.com: The type of shuffling is best explained by Matei in Spark Internals: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203 Why don't you look at that, and then if you have follow…

Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point me to the correct documentation that talks about the configuration and/or how to build GraphX for the REPL? Thanks

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod, This is probably a better question for the Spark user list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with Spark because …