Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
This was actually a bug in the log message itself: the Master would print its own IP and port instead of the registered worker's. It has been fixed in 0.9.1 and 1.0.0 (here's the patch: https://github.com/apache/spark/commit/c0795cf481d47425ec92f4fd0780e2e0b3fdda85 ). Sorry about the

Re: what is the difference between persist() and cache()?

2014-04-13 Thread Andrea Esposito
AFAIK cache() is just a shortcut for the persist() method with MEMORY_ONLY as the storage level. From the source code of RDD: /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */ def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY) /** Persist this RDD with the default
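
For reference, a minimal Scala sketch of the relationship (the local master URL, app name, and sample data are illustrative only, not from the thread):

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext("local[2]", "cache-vs-persist")
    val nums = sc.parallelize(1 to 1000)

    // cache() is shorthand for persist() at the default MEMORY_ONLY level
    val inMemory = nums.map(_ * 2).cache()

    // persist() also takes an explicit storage level when the default
    // is not what you want, e.g. serialized in-memory storage
    val serialized = nums.map(_ * 2).persist(StorageLevel.MEMORY_ONLY_SER)

    println(inMemory.count())
    println(serialized.count())
    sc.stop()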

Re: Spark - ready for prime time?

2014-04-13 Thread Jim Blomo
On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash and...@andrewash.com wrote: The biggest issue I've come across is that the cluster is somewhat unstable when under memory pressure. Meaning that if you attempt to persist an RDD that's too big for memory, even with MEMORY_AND_DISK, you'll often

Re: Spark - ready for prime time?

2014-04-13 Thread Andrew Ash
It's highly dependent on what the issue is with your particular job, but the ones I modify most commonly are spark.storage.memoryFraction, spark.shuffle.memoryFraction, and parallelism (a parameter on many RDD calls) -- increase it from the default level to get more, smaller tasks that are more likely to
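
A hedged sketch of how those knobs might be set (the fraction values, path, and partition counts below are made up for illustration; tune them for your own job). The SparkContext._ import is needed on Spark of this era for the pair-RDD operations:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    val conf = new SparkConf()
      .setAppName("memory-tuning")
      .setMaster("local[4]")                       // local master just for the sketch
      .set("spark.storage.memoryFraction", "0.4")  // heap fraction reserved for cached RDD blocks
      .set("spark.shuffle.memoryFraction", "0.3")  // heap fraction for shuffle aggregation buffers
    val sc = new SparkContext(conf)

    // Raise per-operation parallelism to get more, smaller tasks.
    val counts = sc.textFile("hdfs:///data/events", 400)   // hypothetical path, 400 input partitions
      .map(line => (line.split("\t")(0), 1))
      .reduceByKey(_ + _, 400)                              // 400 reduce partitions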

Re: Creating a SparkR standalone job

2014-04-13 Thread Shivaram Venkataraman
Thanks for attaching the code. If I understand your use case correctly, you want to call the sentiment analysis code from Spark Streaming, right? For that I think you can just use jvmr if it works, and I don't think you need SparkR. SparkR is mainly intended as an API for large-scale jobs which are written in

how to use a single filter instead of multiple filters

2014-04-13 Thread Joe L
Hi, I have multiple filters as shown in the attached screenshot. Should I use a single combined filter instead of them? Can these filters degrade Spark's performance? http://apache-spark-user-list.1001560.n3.nabble.com/file/n4185/Capture.png
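
The filters themselves are only in the attached screenshot, so here is an illustrative Scala sketch with made-up predicates. Chained filter() calls are narrow transformations that Spark pipelines within a single stage, so they do not add shuffles, but folding them into one predicate avoids the extra per-element closure calls:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "filter-demo")   // illustrative only
    val lines = sc.textFile("hdfs:///logs/events.txt")     // hypothetical input

    // Several filters chained one after another
    val chained = lines
      .filter(_.nonEmpty)
      .filter(_.startsWith("INFO"))
      .filter(_.contains("job"))

    // One equivalent filter with the predicates combined
    val combined = lines.filter { l =>
      l.nonEmpty && l.startsWith("INFO") && l.contains("job")
    }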

Re: function state lost when next RDD is processed

2014-04-13 Thread Chris Fregly
Or how about the updateStateByKey() operation? https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html The StatefulNetworkWordCount example demonstrates how to keep state across RDDs. On Mar 28, 2014, at 8:44 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Are you referring to
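
Along the lines of the StatefulNetworkWordCount example, a minimal sketch of updateStateByKey (the host, port, and checkpoint path are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Fold each batch's counts into the running total kept per key.
    val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
      Some(newValues.sum + runningCount.getOrElse(0))

    val conf = new SparkConf().setMaster("local[2]").setAppName("stateful-wc")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/spark-checkpoint")   // stateful operations need a checkpoint dir

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateFunc)

    counts.print()
    ssc.start()
    ssc.awaitTermination()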

Re: Master registers itself at startup?

2014-04-13 Thread Gerd Koenig
Many thanks for your explanation. So the only remaining issue is the TaskSchedulerImpl "Initial job has not accepted any resources" message that keeps me from getting started with Spark (or at least from running the examples successfully) ;) br, Gerd On 13 April 2014 10:17, Aaron Davidson ilike...@gmail.com

Re: Master registers itself at startup?

2014-04-13 Thread Aaron Davidson
This is usually due to a memory misconfiguration somewhere. Your job may be requesting 512MB for each executor, and your cluster may not be able to satisfy that (if you're only allowing 64MB executors, for instance). Try setting spark.executor.memory to the same value as SPARK_WORKER_MEMORY.
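
A sketch of what that alignment might look like (the 2g figure and master URL are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // In conf/spark-env.sh on each worker:
    //   export SPARK_WORKER_MEMORY=2g
    // Then keep the per-executor request within what a worker can offer:
    val conf = new SparkConf()
      .setAppName("resource-check")
      .setMaster("spark://master-host:7077")   // hypothetical standalone master
      .set("spark.executor.memory", "2g")      // should not exceed SPARK_WORKER_MEMORY
    val sc = new SparkContext(conf)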

Re: Master registers itself at startup?

2014-04-13 Thread YouPeng Yang
Hi, 512MB is the default amount of memory each executor requests, and your job may not actually need that much. You can create a SparkContext with sc = new SparkContext("local-cluster[2,1,512]", "test") // supposing you use the local-cluster mode. Here the 512 is the
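
A minimal sketch of that local-cluster master string, with the bracketed numbers spelled out:

    import org.apache.spark.SparkContext

    // local-cluster[2,1,512]: 2 workers, 1 core per worker, 512 MB of memory
    // per worker, so a 512 MB executor request can be satisfied locally.
    val sc = new SparkContext("local-cluster[2,1,512]", "test")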

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?
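
For reference, a minimal sketch of how the two calls are made (not from the thread; paths are hypothetical). cache() keeps the RDD's partitions in memory and retains the full lineage, so lost blocks can be recomputed, while checkpoint() materializes the RDD to the checkpoint directory on reliable storage and truncates the lineage:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "checkpoint-vs-cache")
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // hypothetical path

    val words = sc.textFile("hdfs:///data/input.txt")      // hypothetical input
      .flatMap(_.split(" "))

    words.cache()        // kept in memory, recomputable from lineage
    words.checkpoint()   // written to stable storage on the next action
    words.count()        // triggers both the caching and the checkpoint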

how to count maps without shuffling too much data?

2014-04-13 Thread Joe L
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-count-maps-without-shuffling-too-much-data-tp4194.html

moving SparkContext around

2014-04-13 Thread Schein, Sagi
A few questions about the resilience of the client side of Spark. What happens if the client process crashes? Can it reconstruct its state? Suppose I just want to serialize it and reload it later; is this possible? As a more advanced use case, is there a way to move a SparkContext between