Any advice for using a big spark.cleaner.delay value in Spark Streaming?

2014-04-27 Thread buremba
It seems the default value for spark.cleaner.delay is 3600 seconds, but I need to be able to count things on a daily, weekly, or even monthly basis. I suppose the aim of DStream batches and spark.cleaner.delay is to avoid space issues (running out of memory, etc.). I usually use HyperLogLog for counting
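A minimal sketch of what raising the cleaner delay might look like for a long counting window. The units (seconds) follow the thread's own claim, and the 25-hour value and app name are illustrative assumptions, not from the message:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumption: spark.cleaner.delay takes seconds, as the thread states.
    // Set it comfortably above the largest window (one day here) so windowed
    // state is not cleaned out from under the stream.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("daily-unique-counts")
      .set("spark.cleaner.delay", (25 * 3600).toString)

    val ssc = new StreamingContext(conf, Seconds(60)) // 60-second batches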

Re: Re: what is the best way to do cartesian

2014-04-27 Thread qinwei
Thanks a lot for your reply, but I have tried the built-in RDD.cartesian() method before; it didn't make it faster. qinwei
From: Alex Boisvert, Date: 2014-04-26 00:32, To: user, Subject: Re: what is the best way to do cartesian
You might want to try the built-in RDD.cartesian() method. On
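For reference, a self-contained example of the built-in method under discussion. Note that cartesian is inherently expensive: its output has |A| * |B| elements, so no implementation makes a large cross product cheap:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cartesian-demo"))

    val a = sc.parallelize(1 to 3)
    val b = sc.parallelize(Seq("x", "y"))

    // Every pairing of the two RDDs: 3 * 2 = 6 elements of type (Int, String)
    val pairs = a.cartesian(b)
    pairs.collect().foreach(println)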

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread qinwei
Thanks a lot for your reply, it gave me much inspiration. qinwei
From: Sean Owen, Date: 2014-04-25 14:10, To: user, Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark
So you are computing all-pairs similarity over 20M users? This is going to take
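The thread itself includes no code, but a naive all-pairs computation makes the cost concrete: a self-cartesian produces n(n-1)/2 unordered pairs, which is infeasible at n = 20M. A toy sketch with cosine similarity; all names and the feature-vector representation are illustrative assumptions:

    import org.apache.spark.rdd.RDD

    def cosine(x: Array[Double], y: Array[Double]): Double = {
      val dot  = x.zip(y).map { case (a, b) => a * b }.sum
      val norm = math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum)
      if (norm == 0.0) 0.0 else dot / norm
    }

    // users: (userId, feature vector). With 20M users the pair RDD has
    // roughly 2 * 10^14 elements, which is why the reply warns about runtime.
    def allPairsSimilarity(users: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
      users.cartesian(users)
        .filter { case ((i, _), (j, _)) => i < j } // keep each unordered pair once
        .map { case ((i, x), (j, y)) => ((i, j), cosine(x, y)) }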

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread Qin Wei
Thanks a lot for your reply, it gave me much inspiration. qinwei
From: Sean Owen-2 [via Apache Spark User List], Date: 2014-04-25 14:11, To: Qin Wei, Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark
So you are computing all-pairs

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (I killed it after 30+ hours).

help

2014-04-27 Thread Joe L
I am getting this error; please help me fix it: 14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory .): error=13,

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already initialized and ready. Given that there's a lot more

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Matei Zaharia
From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at 12:08 PM, Andrew Ash and...@andrewash.com wrote: That thread was mostly about benchmarking YARN vs standalone, and the

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
Much thanks for the perspective Matei. On Sun, Apr 27, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at

Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Roger Hoover
Hi, From the meetup talk about the 1.0 release, I saw that spark-submit will be the preferred way to launch apps going forward. How do you recommend launching such jobs in a development cycle? For example, how can I load an app that expects to be given to spark-submit into spark-shell?

Re: Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Matei Zaharia
Hi Roger, You should be able to use the --jars argument of spark-shell to add JARs onto the classpath and then work with those classes in the shell. (A recent patch, https://github.com/apache/spark/pull/542, made spark-shell use the same command-line arguments as spark-submit). But this is a
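A hedged sketch of the workflow Matei describes. The jar path, package, and class names are hypothetical stand-ins for your own app:

    // Launch the shell with the app jar on the classpath (command shown as a comment):
    //   ./bin/spark-shell --jars target/my-app-assembly.jar
    //
    // Then drive the app's logic by hand from the REPL, reusing the shell's
    // SparkContext (sc) instead of letting the app create its own:
    import com.example.MyJob // hypothetical class packaged in the jar

    val job = new MyJob(sc)  // assumes the app exposes a constructor taking a SparkContext
    job.run()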

Re: Strange lookup behavior. Possible bug?

2014-04-27 Thread Yadid Ayzenberg
Can someone please suggest how I can move forward with this? My Spark version is 0.9.1. The big challenge is that this issue is not recreated when running in local mode. What could be the difference? I would really appreciate any pointers, as currently the job just hangs. On 4/25/14,

Re: parallelize for a large Seq is extremely slow.

2014-04-27 Thread Earthson
It's my fault! I uploaded the wrong jar when I changed the number of partitions, and now it just works fine :) The size of word_mapping is 2444185. So will it take a very long time for large object serialization? I don't think two million is very large, because the cost locally for such a size is
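For context, a minimal sketch of the scenario: sc.parallelize ships the entire driver-side collection to the executors, so its cost scales with the serialized size of the Seq, and an explicit partition count spreads that data over more, smaller tasks. The element count mirrors the thread; the Kryo setting and partition count are assumptions, not something the poster mentioned:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("parallelize-large-seq")
      // Assumption: Kryo shrinks serialization cost when the element types suit it
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Stand-in for word_mapping: ~2.4M (String, Long) pairs, as in the thread
    val wordMapping: Seq[(String, Long)] = (0L until 2444185L).map(i => (s"w$i", i))

    val rdd = sc.parallelize(wordMapping, 200) // explicit partition count
    println(rdd.count())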

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
Hi Todd, As Patrick and you already pointed out, it's really dangerous to mutate the state of an RDD. However, when we implemented glmnet in Spark, we found that if we can reuse the residuals for each row in the RDD computed from the previous step, it speeds things up 4~5x. As a result, we add an extra column in the RDD for
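A sketch of the immutable alternative to mutating cached rows in place: carry the residual as a field and produce a fresh RDD each iteration. The DataPoint shape and the linear-prediction update rule are illustrative assumptions, not code from the thread:

    import org.apache.spark.rdd.RDD

    // Each row carries its residual from the previous iteration.
    case class DataPoint(features: Array[Double], label: Double, residual: Double)

    def updateResiduals(data: RDD[DataPoint], weights: Array[Double]): RDD[DataPoint] =
      data.map { p =>
        val prediction = p.features.zip(weights).map { case (x, w) => x * w }.sum
        // copy() builds a new object, so any cached parent RDD stays untouched
        p.copy(residual = p.label - prediction)
      }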