Re: Nothing happens when executing on cluster

2014-02-25 Thread Anders Bennehag
I believe I solved my problem. The worker-node didn't know where to return the answers. I set SPARK_LOCAL_IP and the program runs as it should. On Mon, Feb 24, 2014 at 3:55 PM, Anders Bennehag wrote: > Hello there, > > I'm having some trouble with my spark-cluster consisting of > > master.censo
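
For anyone hitting the same thing, a minimal sketch of the setting involved; the address below is hypothetical, and the variable typically goes in conf/spark-env.sh on the driver machine:

    # conf/spark-env.sh on the driver machine (address is hypothetical)
    # Binds/advertises this IP so the workers know where to send results back.
    export SPARK_LOCAL_IP=192.168.1.10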

Re: HBase row count

2014-02-25 Thread Nick Pentreath
cache only caches the data on the first action (count) - the first time it still needs to read the data from the source. So the first time you call count it will take the same amount of time whether cache is enabled or not. The second time you call count on a cached RDD, you should see that it take
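
A minimal sketch of the behaviour Nick describes, assuming a hypothetical HDFS path; only the second count is served from the cache:

    import org.apache.spark.SparkContext

    val sc  = new SparkContext("local[2]", "cache-timing-sketch")
    val rdd = sc.textFile("hdfs:///some/path").cache()   // cache() is lazy: nothing is read yet

    val t1 = System.currentTimeMillis()
    rdd.count()                                          // 1st action: reads from the source and fills the cache
    val t2 = System.currentTimeMillis()
    rdd.count()                                          // 2nd action: reads from memory, should be much faster
    val t3 = System.currentTimeMillis()
    println(s"first count: ${t2 - t1} ms, second count: ${t3 - t2} ms")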

Re:

2014-02-25 Thread Eugen Cepoi
Yes it is doing it twice, try to cache the initial RDD. 2014-02-25 8:14 GMT+01:00 Soumitra Kumar : > I have a code which reads an HBase table, and counts number of rows > containing a field. > > def readFields(rdd : RDD[(ImmutableBytesWritable, Result)]) : > RDD[List[Array[Byte]]] = { >
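
Roughly what that suggestion looks like in code, as a sketch; the table name is made up, sc is an existing SparkContext, and readFields is the poster's own function from the snippet above:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table

    val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]).cache()   // cache before the two passes

    hbaseRdd.count()               // scans HBase once and populates the cache
    readFields(hbaseRdd).count()   // second pass reads the cached rows, no second scan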

Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
I am a newbie to Spark and I need to know how RDD partitioning can be controlled in the process of shuffling. I have googled for examples but haven't found many concrete ones, in contrast with the many good tutorials about Hadoop's shuffling and partitioner. Can anybody sho

Re:

2014-02-25 Thread Cheng Lian
RDD.count() is an action, which triggers a distributed job whether the RDD is cached or not. If the RDD is cached, there won't be a duplicated HBase scan. How do you want to improve the performance? Are you trying to reduce unnecessary distributed jobs, or improve the performance of the second

Re: HBase row count

2014-02-25 Thread Cheng Lian
RDD.cache() is a lazy operation: the method itself doesn't perform the caching, it just asks the Spark runtime to cache the content of the RDD when the first action is invoked. In your case, the first action is the first count() call, which conceptually does 3 things: 1. Performs the HBase

Re: HBase row count

2014-02-25 Thread Cheng Lian
BTW, unlike RDD.cache(), the reverse operation RDD.unpersist() is not lazy, which is somewhat confusing... On Tue, Feb 25, 2014 at 7:48 PM, Cheng Lian wrote: > RDD.cache() is a lazy operation, the method itself doesn't perform the > cache operation, it just asks Spark runtime to cache the conte

Re: HBase row count

2014-02-25 Thread Koert Kuipers
I find them both somewhat confusing actually. * RDD.cache is lazy, and mutates the RDD in place. * RDD.unpersist has a direct effect of unloading, and also mutates the RDD in place to disable future lazy caching. I have found that if I need to unload an RDD from memory, but still want it to be cache
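
A small sketch of the asymmetry being discussed here (the data is illustrative, sc is an existing SparkContext):

    val rdd = sc.parallelize(1 to 1000000).cache()   // lazy: only marks the RDD for caching
    rdd.count()                                      // first action actually materialises the cached blocks
    rdd.unpersist()                                  // eager: drops the blocks right away
    rdd.count()                                      // recomputed; the RDD is no longer marked for caching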

RE: Akka Connection refused - standalone cluster using spark-0.9.0

2014-02-25 Thread Li, Rui
Hi Pillis, I ran into the same problem here. Could you share more specifically how you solved the issue? I added an entry in /etc/hosts, but it doesn't help. From: Pillis W [mailto:pillis.w...@gmail.com] Sent: Sunday, February 09, 2014 4:49 AM To: u...@spark.incubator.apache.org Subject: Re: Akk

Size of RDD larger than Size of data on disk

2014-02-25 Thread Suraj Satishkumar Sheth
Hi All, I have a folder in HDFS containing files with a total size of 47 GB. I am loading this in Spark as RDD[String] and caching it. The total amount of RAM that Spark uses to cache it is around 97 GB. I want to know why Spark is taking up so much space for the RDD. Can we reduce the RDD size in Spark

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Thanks Nick. How do I figure out if the RDD fits in memory? On Tue, Feb 25, 2014 at 1:04 AM, Nick Pentreath wrote: > cache only caches the data on the first action (count) - the first time it > still needs to read the data from the source. So the first time you call > count it will take the sam

Re: HBase row count

2014-02-25 Thread Nick Pentreath
It's tricky really since you may not know upfront how much data is in there. You could possibly take a look at how much data is in the HBase tables to get an idea. It may take a bit of trial and error, like running out of memory trying to cache the dataset, and checking the Spark UI on port 4040 t

Re: Akka Connection refused - standalone cluster using spark-0.9.0

2014-02-25 Thread Akhil Das
Hi Rui, If you are getting a "Connection refused" exception, you can resolve it by checking that the master is running on the specific host: netstat -at | grep 7077. You will get something similar to: tcp 0 0 akhldz.master.io:7077 *:*

Spark in YARN HDP problem

2014-02-25 Thread aecc
Hi, I'm trying to run the Spark examples in YARN and I get the following error: appDiagnostics: Application application_1390483691679_0124 failed 2 times due to AM Container for appattempt_1390483691679_0124_02 exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoo

Kryo serialization does not compress

2014-02-25 Thread pradeeps8
Hi All, We are currently trying to benchmark the various cache options on RDDs with respect to speed and efficiency. The data that we are using is mostly filled with numbers (floating point). We have noticed that the memory consumption of the RDD for MEMORY_ONLY (519.1 MB) and MEMORY_ONLY_SER (51
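
For context, a rough sketch of how Kryo and a serialized storage level are usually wired together; the registrator class name is hypothetical and optional:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("kryo-cache-sketch")
      .setMaster("local[2]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")   // optional, hypothetical registrator

    val sc   = new SparkContext(conf)
    val data = sc.parallelize(Seq.fill(1000000)(math.random))
    data.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in memory; the size shows up in the Storage tab
    data.count()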

RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
Hi Matei If Spark crashes while writing the file, after recovery from the failure does it continue where it left off or will there be duplicates in the file? -A From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: February-24-14 4:20 PM To: u...@spark.incubator.apache.org Subject: Re: ETL o

Re: HBase row count

2014-02-25 Thread Soumitra Kumar
Found the issue: the splits in HBase were not uniform, so one job was taking 90% of the time. BTW, is there a way to save the details available on port 4040 after the job is finished? On Tue, Feb 25, 2014 at 7:26 AM, Nick Pentreath wrote: > It's tricky really since you may not know upfront how much da

Re: Size of RDD larger than Size of data on disk

2014-02-25 Thread Mayur Rustagi
Spark may take more RAM than required by the RDD; can you look at the storage section of the Spark UI & see how much space the RDD is taking in memory? It may still take more storage than on disk, as Java objects have some overhead. Consider enabling compression for the RDD. Mayur Rustagi Ph: +919632149971 h

Sharing SparkContext

2014-02-25 Thread abhinav chowdary
Hi, I am looking for ways to share the SparkContext, meaning I need to be able to perform multiple operations on the same Spark context. Below is the code of a simple app I am testing: def main(args: Array[String]) { println("Welcome to example application!") val sc = new SparkContext

Re: Sharing SparkContext

2014-02-25 Thread Mayur Rustagi
How do you want to pass the operations to the Spark context? Mayur Rustagi Ph: +919632149971 http://www.sigmoidanalytics.com https://twitter.com/mayur_rustagi On Tue, Feb 25, 2014 at 9:59 AM, abhinav chowdary < abhinav.chowd...@gmail.com> wrote: > Hi, >

Re: Sharing SparkContext

2014-02-25 Thread abhinav chowdary
Sorry for not being clear earlier. "How do you want to pass the operations to the spark context?" This is partly what I am looking for: how to access the active Spark context and possible ways to pass operations. Thanks On Tue, Feb 25, 2014 at 10:02 AM, Mayur Rustagi wrote: > how do you want to p

Re: Sharing SparkContext

2014-02-25 Thread Mayur Rustagi
So there is no way to share a context currently. 1. You can try the jobserver by Ooyala, but I haven't used it & frankly nobody has shared feedback on it. 2. If you can load that RDD into Shark then you get a SQL interface on that RDD + columnar storage. 3. You can try a crude method of starting a spark shell

Re: Sharing SparkContext

2014-02-25 Thread Ognen Duzlevski
Doesn't the fair scheduler solve this? Ognen On 2/25/14, 12:08 PM, abhinav chowdary wrote: Sorry for not being clear earlier how do you want to pass the operations to the spark context? this is partly what i am looking for . How to access the active spark context and possible ways to pass opera

Re: Sharing SparkContext

2014-02-25 Thread Mayur Rustagi
The fair scheduler merely reorders tasks... I think he is looking to run multiple pieces of code on a single context on demand from customers; if the code & order is decided then the fair scheduler will ensure that all tasks get equal cluster time :) Mayur Rustagi Ph: +919632149971 h

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
Okay, you caught me on this... I haven't used the Python API. Let's try http://www.cs.berkeley.edu/~pwendell/strataconf/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy on the RDD & change the partitioner from hash to a custom function. Please update the list if it works; it seems to be a commo

Re: ETL on pyspark

2014-02-25 Thread Matei Zaharia
It will only move a file to the final directory when it’s successfully finished writing it, so the file shouldn’t have any duplicates. Old attempts will just be deleted. Matei On Feb 25, 2014, at 9:19 AM, Adrian Mocanu wrote: > Hi Matei > If Spark crashes while writing the file, after recover

RE: Size of RDD larger than Size of data on disk

2014-02-25 Thread Suraj Satishkumar Sheth
Hi Mayur, Thanks for replying. Is it usually double the size of the data on disk? I have observed this many times. The storage section of Spark is telling me that 100% of the RDD is cached, using 97 GB of RAM, while the data in HDFS is only 47 GB. Thanks and Regards, Suraj Sheth From: Mayur Rustagi [mailto:ma

Re: Size of RDD larger than Size of data on disk

2014-02-25 Thread Matei Zaharia
The problem is that Java objects can take more space than the underlying data, but there are options in Spark to store data in serialized form to get around this. Take a look at https://spark.incubator.apache.org/docs/latest/tuning.html. Matei On Feb 25, 2014, at 12:01 PM, Suraj Satishkumar She
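
A minimal sketch of the serialized-storage option from the tuning guide, with a hypothetical path; setting spark.rdd.compress=true can shrink the serialized blocks further at some CPU cost:

    import org.apache.spark.storage.StorageLevel

    // sc is an existing SparkContext; the path is hypothetical.
    val lines = sc.textFile("hdfs:///data/some-folder")
    lines.persist(StorageLevel.MEMORY_ONLY_SER)   // store partitions as serialized byte buffers
    lines.count()                                 // first action materialises the cached blocks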

RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
Matei, one more follow-up: what if you write the stream data to your file by iterating through each RDD? I know foreach is not idempotent. Can this rewrite some tuples or RDDs twice? stream.foreachRDD(rdd => rdd.foreach(tuple => appendToFileTuple(tuple))) Thanks a lot! -A From: Matei Zaharia

Re: Sharing SparkContext

2014-02-25 Thread abhinav chowdary
Thank you Mayur, I will try the Ooyala job server to begin with. Is there a way to load an RDD created via SparkContext into Shark? The only reason I ask is that my RDD is being created from Cassandra (not Hadoop; we are trying to get Shark to work with Cassandra as well, having troubles with it when running in dist

Re: How to get well-distribute partition

2014-02-25 Thread Mayur Rustagi
It seems you are already using partitionBy; you can simply plug in your custom function instead of lambda x: x & it should use that to partition. A range partitioner is available in Scala; I am not sure if it's exposed directly in Python. Regards Mayur Mayur Rustagi Ph: +919632149971 h

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Mayur Rustagi
The type of shuffling is best explained by Matei in the Spark Internals talk: http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203 Why don't you look at that & then ask follow-up questions here? It would also be good to watch the whole talk, as it covers Spark job flows in a lot more detail. SCALA i
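
To give the original question something concrete, here is a bare-bones custom partitioner sketch in Scala; the keying rule is made up purely for illustration:

    import org.apache.spark.{Partitioner, SparkContext}
    import org.apache.spark.SparkContext._

    // Routes keys to partitions by their first letter instead of the default hash.
    class FirstLetterPartitioner(val parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = key match {
        case s: String if s.nonEmpty => math.abs(s.head.toLower.toInt) % parts
        case _                       => 0
      }
    }

    val sc    = new SparkContext("local[2]", "partitioner-sketch")
    val pairs = sc.parallelize(Seq("apple" -> 1, "banana" -> 2, "avocado" -> 3))
    // partitionBy (and shuffles like reduceByKey/groupByKey) accept a Partitioner
    val partitioned = pairs.partitionBy(new FirstLetterPartitioner(4))
    partitioned.glom().collect().foreach(p => println(p.mkString(", ")))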

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Tao Xiao
Thank you Mayur, I think that will help me a lot Best, Tao 2014-02-26 8:56 GMT+08:00 Mayur Rustagi : > Type of Shuffling is best explained by Matei in Spark Internals . > http://www.youtube.com/watch?v=49Hr5xZyTEA#t=2203 > Why dont you look at that & then if you have follow up questions ask he

Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point to the correct documentation that talks about the configuration and/or how to build GraphX for the REPL ? Thanks

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod, This is probably a better question for the spark user's list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets Page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with spark because the

Re: Sharing SparkContext

2014-02-25 Thread Ognen Duzlevski
On 2/25/14, 12:24 PM, Mayur Rustagi wrote: So there is no way to share context currently, 1. you can try jobserver by Ooyala but I havnt used it & frankly nobody has shared feedback on it. One of the major show stoppers for me is that when compiled with Hadoop 2.2.0 - Ooyala standalone serve

NullPointerException from 'Count' on DStream

2014-02-25 Thread anoldbrain
Dear all, I encountered NullPointerException running a simple program like below: > val sparkconf = new SparkConf() > .setMaster(master) > .setAppName("myapp") > // and other setups > > val ssc = new StreamingContext(sparkconf, Seconds(30)) > val flume = new FlumeInputDStream(ssc, f

Re: Sharing SparkContext

2014-02-25 Thread Ognen Duzlevski
In that case, I must have misunderstood the following (from http://spark.incubator.apache.org/docs/0.8.1/job-scheduling.html). Apologies. Ognen "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads
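
As a small sketch of what that passage describes, here are two jobs submitted to one SparkContext from separate threads via Scala futures; everything below is illustrative:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.SparkContext

    val sc  = new SparkContext("local[4]", "shared-context-sketch")
    val rdd = sc.parallelize(1 to 10000000).cache()

    // Two independent actions from different threads; with spark.scheduler.mode=FAIR
    // they share the cluster instead of queueing strictly one after the other.
    val jobA = Future { rdd.map(_ * 2).count() }
    val jobB = Future { rdd.filter(_ % 3 == 0).count() }

    println(Await.result(jobA, 10.minutes))
    println(Await.result(jobB, 10.minutes))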

Re: Need some tutorials and examples about customized partitioner

2014-02-25 Thread Matei Zaharia
Take a look at the “advanced Spark features” talk here too: http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/. Matei On Feb 25, 2014, at 6:22 PM, Tao Xiao wrote: > Thank you Mayur, I think that will help me a lot > > > Best, > Tao > > > 2014-02-26 8:56 GMT+08:00 Mayur Rustagi : > Ty

Implementing a custom Spark shell

2014-02-25 Thread Sampo Niskanen
Hi, I'd like to create a custom version of the Spark shell, which has automatically defined some other variables / RDDs (in addition to 'sc') specific to our application. Is this possible? I took a look at the code that the spark-shell invokes, and it seems quite complex. Can this be reused fro
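
One low-tech possibility, rather than rebuilding the shell itself, is to keep the stock spark-shell and preload definitions from a script with the REPL's :load command; the file, paths, and names below are made up:

    // init.scala -- run inside spark-shell with  :load init.scala  (where 'sc' already exists)
    val events = sc.textFile("hdfs:///app/events")          // hypothetical dataset
    val users  = sc.textFile("hdfs:///app/users").cache()   // hypothetical dataset
    def topLines(n: Int) = users.map(line => (line, 1)).reduceByKey(_ + _).take(n)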