It seems the default value for spark.cleaner.delay is 3600 seconds, but I need
to be able to count things on a daily, weekly, or even monthly basis.
I suppose the aim of DStream batches and spark.cleaner.delay is to avoid
space issues (running out of memory, etc.). I usually use HyperLogLog for
counting.
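For readers unfamiliar with it, the counting idea mentioned above can be sketched in plain Python. This is an illustrative toy, not Spark's or any library's implementation; the `HyperLogLog` class name, register count, and hash choice are all assumptions:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch: approximate distinct count in O(2^p) memory."""

    def __init__(self, p=8):
        self.p = p
        self.m = 1 << p          # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```

Because the sketch is a fixed-size array of registers, daily/weekly/monthly counts can be kept by merging or resetting sketches per window instead of retaining raw events.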
Thanks a lot for your reply, but I have tried the built-in RDD.cartesian()
method before; it didn't make things faster.
qinwei
From: Alex Boisvert
Date: 2014-04-26 00:32
To: user
Subject: Re: what is the best way to do cartesian

You might want to try the built-in RDD.cartesian() method. On
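For reference, `RDD.cartesian(other)` pairs every element of one RDD with every element of another. In plain Python the same shape is `itertools.product`; the local example below is only an analogy for what the distributed call produces, not a Spark benchmark:

```python
from itertools import product

a = [1, 2, 3]
b = ["x", "y"]

# Local analogue of sc.parallelize(a).cartesian(sc.parallelize(b)).collect():
pairs = list(product(a, b))
print(len(pairs))  # 3 * 2 = 6 pairs
```

The output size is |a| * |b|, which is why a cartesian product over large inputs is inherently expensive regardless of how it is implemented.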
Thanks a lot for your reply, it gave me much inspiration.
qinwei
From: Sean Owen
Date: 2014-04-25 14:10
To: user
Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

So you are computing all-pairs similarity over 20M users?
This is going to take
Thanks a lot for your reply, it gave me much inspiration.
qinwei
From: Sean Owen-2 [via Apache Spark User List]
Date: 2014-04-25 14:11
To: Qin Wei
Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

So you are computing all-pairs
That doesn't work. I don't think it is just slow; it never ends (after 30+
hours, I killed it).
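One common way to avoid the full all-pairs cartesian product in item-based collaborative filtering is to compute similarities only for item pairs that actually co-occur in some user's history. A minimal local sketch (the data and names are hypothetical, and cosine similarity on binary rating vectors is one choice among several):

```python
from collections import defaultdict
from itertools import combinations
import math

# user -> set of items rated (toy data)
ratings = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c"},
}

cooccur = defaultdict(int)     # (item_i, item_j) -> co-rating count
item_count = defaultdict(int)  # item -> number of users who rated it

for items in ratings.values():
    for i in items:
        item_count[i] += 1
    # only pairs within one user's history are ever materialized
    for i, j in combinations(sorted(items), 2):
        cooccur[(i, j)] += 1

# cosine similarity on binary rating vectors
sims = {
    pair: c / math.sqrt(item_count[pair[0]] * item_count[pair[1]])
    for pair, c in cooccur.items()
}
print(sims)
```

The key point is that only co-occurring pairs are emitted, which for sparse rating data is far fewer than the N^2 pairs a cartesian product would produce.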
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html
Sent from the Apache Spark User List mailing list
I am getting this error, please help me to fix it
4/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor
app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run
program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in
directory .): error=13,
That thread was mostly about benchmarking YARN vs standalone, and the
results are what I'd expect -- spinning up a Spark cluster on demand
through YARN has higher startup latency than using a standalone cluster,
where the JVMs are already initialized and ready.
Given that there's a lot more
From my point of view, both are supported equally. The YARN support is newer
and that’s why there’s been a lot more action there in recent months.
Matei
On Apr 27, 2014, at 12:08 PM, Andrew Ash and...@andrewash.com wrote:
That thread was mostly about benchmarking YARN vs standalone, and the
Much thanks for the perspective Matei.
On Sun, Apr 27, 2014 at 10:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
From my point of view, both are supported equally. The YARN support is
newer and that’s why there’s been a lot more action there in recent months.
Matei
On Apr 27, 2014, at
Hi,
From the meetup talk about the 1.0 release, I saw that spark-submit will be
the preferred way to launch apps going forward.
How do you recommend launching such jobs in a development cycle? For
example, how can I load an app that's expecting to be given to spark-submit
into spark-shell?
Hi Roger,
You should be able to use the --jars argument of spark-shell to add JARs onto
the classpath and then work with those classes in the shell. (A recent patch,
https://github.com/apache/spark/pull/542, made spark-shell use the same
command-line arguments as spark-submit). But this is a
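As a concrete command-line sketch of the workflow described above (the JAR path and class name are hypothetical):

```shell
# Interactive development: load your app's classes into the shell
spark-shell --jars target/myapp.jar

# Later, submit the same JAR non-interactively
spark-submit --class com.example.MyApp target/myapp.jar
```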
Can someone please suggest how I can move forward with this?
My spark version is 0.9.1.
The big challenge is that this issue is not reproduced when running in
local mode. What could be the difference?
I would really appreciate any pointers, as currently the job just hangs.
On 4/25/14,
It's my fault! I uploaded the wrong jar when I changed the number of partitions,
and now it works fine :)
The size of word_mapping is 2444185.
So it will take a very long time for large-object serialization? I don't think
two million is very large, because the cost locally for such a size is
Hi Todd,
As Patrick and you already pointed out, it's really dangerous to mutate the
state of an RDD. However, when we implemented glmnet in Spark, we found that
if we can reuse the residuals for each row in the RDD computed from the
previous step, it can speed things up 4~5x.
As a result, we add an extra column in the RDD for
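The residual-reuse idea can be illustrated locally in plain Python with a toy coordinate-descent step for unpenalized least squares (the data and solver are hypothetical; the point is the incremental residual update, not the specific algorithm):

```python
import random

random.seed(0)
n, p = 50, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

beta = [0.0] * p
# Cache the residual once; update it incrementally instead of
# recomputing y - X*beta from scratch at every coordinate step.
residual = list(y)

for _ in range(20):
    for j in range(p):
        xj = [row[j] for row in X]
        num = sum(x * r for x, r in zip(xj, residual))
        den = sum(x * x for x in xj)
        new_bj = beta[j] + num / den
        delta = new_bj - beta[j]
        for i in range(n):
            residual[i] -= xj[i] * delta  # O(n) update, no full recompute
        beta[j] = new_bj

# Sanity check: the cached residual matches a full recomputation.
recomputed = [y[i] - sum(X[i][k] * beta[k] for k in range(p))
              for i in range(n)]
assert all(abs(a - b) < 1e-9 for a, b in zip(residual, recomputed))
```

This is the speedup being described: each coordinate step touches the cached residual in O(n) instead of recomputing the full prediction, which is why storing it alongside each row (rather than mutating the RDD) is attractive.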