The short answer:
count(), as the sum can be partially aggregated on the mappers.
The long answer:
http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html
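The steps above can be sketched in plain Python (this is an illustration of the idea, not the Spark API): with count(), each "mapper" reduces its partition to a single number and only those partial counts travel to the driver, whereas collect()-then-size ships every element first.

```python
# Plain-Python sketch of partial aggregation on the mappers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# count(): one integer per partition crosses the wire.
partial_counts = [len(p) for p in partitions]   # computed on the "mappers"
total = sum(partial_counts)                     # tiny merge on the "driver"

# collect()-then-size: every element crosses the wire first.
collected = [x for p in partitions for x in p]
assert total == len(collected) == 9
```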
—
FG
On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc
Note I’m assuming you were going for the size of your RDD, meaning in the
‘collect’ alternative, you would go for a size() right after the collect().
If you were simply trying to materialize your RDD, Sean’s answer is more
complete.
—
FG
On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc
In a nutshell : because it’s moving all of your data, compared to other
operations (e.g. reduce) that summarize it in one form or another before moving
it.
For the longer answer:
Sanity check: is it possible for `threshold_var` to be negative ?
—
FG
On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai thai.pe...@gmail.com wrote:
Hello,
I am seeing negative values for accumulators. Here's my implementation in a
standalone app in Spark 1.1.1rc:
implicit object
Oh, I’m sorry, I meant `aggregateByKey`.
https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions
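A plain-Python model of what aggregateByKey(zero)(seqOp, combOp) does on an RDD of (key, value) pairs (an illustration of the semantics, not the Spark implementation): the seqOp folds values into a per-key accumulator inside each partition, then the combOp merges per-key partials across partitions.

```python
# Model of PairRDDFunctions.aggregateByKey semantics.
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    partials = []
    for part in partitions:                  # per-partition pass (seqOp)
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        partials.append(acc)
    merged = {}                              # shuffle + merge (combOp)
    for acc in partials:
        for k, u in acc.items():
            merged[k] = comb_op(merged[k], u) if k in merged else u
    return merged

parts = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
assert aggregate_by_key(parts, 0,
                        lambda u, v: u + v,       # seqOp
                        lambda u1, u2: u1 + u2    # combOp
                        ) == {"a": 9, "b": 6}
```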
—
FG
On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi mohitja...@gmail.com wrote:
Francois,
RDD.aggregate() does not support aggregation by key. But, indeed, that is
Sorry, I answered too fast. Please disregard my last message: I did mean
aggregate.
You say: RDD.aggregate() does not support aggregation by key.
What would you need aggregation by key for, if you do not, at the beginning,
have an RDD of key-value pairs, and do not want to build one ?
Have you looked at the `aggregate` function in the RDD API ?
If your way of extracting the “key” (identifier) and “value” (payload) parts of
the RDD elements is uniform (a function), it’s unclear to me how this would be
more efficient than extracting key and value and then using combine,
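The point being made above can be sketched as follows (plain Python, assuming only that key/value extraction is a function): even the key-less aggregate(zero)(seqOp, combOp) can aggregate "by key" if the accumulator is itself a per-key map.

```python
# Sketch: by-key aggregation through a plain aggregate whose accumulator
# is a dict. `zero` is modeled as a factory so each partition gets a
# fresh accumulator.
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    partials = [reduce(seq_op, part, zero()) for part in partitions]
    return reduce(comb_op, partials, zero())

def seq_op(acc, elem):
    k, v = elem                      # key/value extraction as a function
    acc[k] = acc.get(k, 0) + v
    return acc

def comb_op(a, b):
    for k, v in b.items():
        a[k] = a.get(k, 0) + v
    return a

parts = [[("x", 1), ("y", 2)], [("x", 3)]]
assert aggregate(parts, dict, seq_op, comb_op) == {"x": 4, "y": 2}
```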
Why would you want to ? That would be equivalent to setting your batch interval
to the same duration you’re suggesting to specify for both window length and
slide interval, IIUC.
Here’s a nice explanation of windowing by TD :
https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
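The equivalence claimed above can be sketched numerically (plain Python, not the Spark Streaming API): when window length == slide interval == batch interval, every "window" contains exactly one batch, i.e. windowing adds nothing over the batch itself.

```python
# Sketch: batches arrive every `batch` seconds; a window of length
# `window` sliding every `slide` groups consecutive batches.
def windows(batches, batch, window, slide):
    per_window = window // batch   # batches covered by one window
    step = slide // batch          # batches between window starts
    return [batches[i:i + per_window]
            for i in range(0, len(batches) - per_window + 1, step)]

batches = [[1], [2], [3], [4]]
# window == slide == batch interval: each window is one batch.
assert windows(batches, 1, 1, 1) == [[[1]], [[2]], [[3]], [[4]]]
# window of 2 batches sliding by 1: genuinely overlapping windows.
assert windows(batches, 1, 2, 1) == [[[1], [2]], [[2], [3]], [[3], [4]]]
```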
Hi Mukesh,
If my understanding is correct, each Stream only has a single Receiver. So, if
you have each receiver consuming 9 partitions, you need 10 input DStreams to
create 10 concurrent receivers:
- You are launching up to 10 threads/topic per Receiver. Are you sure your
receivers can support 10 threads each ? (i.e. in the default configuration, do
they have 10 cores ?) If they have only 2 cores, that would explain why this
works with 20 partitions or fewer.
- If you have 90 partitions, why
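The back-of-envelope arithmetic behind the questions above (the partition and receiver counts come from the thread; the 2-core figure is an assumption) can be written out:

```python
# Sizing receivers against Kafka partitions.
partitions = 90
receivers = 10                     # one receiver per input DStream
threads_per_receiver = partitions // receivers
assert threads_per_receiver == 9   # each receiver consumes 9 partitions

# A receiver can only run as many consumer threads in parallel as it
# has cores, so (assumed) 2-core executors saturate at 2 threads each:
cores_per_receiver = 2
usable_threads = min(threads_per_receiver, cores_per_receiver)
assert usable_threads == 2         # 10 receivers x 2 = 20 partitions served
```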
IIUC, Receivers run on workers, colocated with other tasks.
The Driver, on the other hand, can either run on the querying machine (local
mode) or as a worker (cluster mode).
—
FG
On Fri, Dec 12, 2014 at 4:49 PM, twizansk twiza...@gmail.com wrote:
Thanks for the reply. I might be
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately
I’m not aware of any doc yet (did I miss something ?) but you can look at the
ReliableKafkaReceiver’s test suite:
—
FG
On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Hello
[sorry for the botched half-message]
Hi Mukesh,
There’s been some great work on Spark Streaming reliability lately.
https://www.youtube.com/watch?v=jcJq3ZalXD8
Look at the links from:
https://issues.apache.org/jira/browse/SPARK-3129
I’m not aware of any doc yet (did I miss
How about writing to a buffer ? Then you would flush the buffer to Kafka if and
only if the output operation reports successful completion. In the event of a
worker failure, that would not happen.
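A minimal sketch of the buffer-then-flush idea (hypothetical names, not a real Kafka client): records are staged locally and pushed to the sink only when the output operation reports success, so a worker failure before the commit publishes nothing.

```python
# Staged writer: nothing reaches the sink until commit() is called.
class BufferedSink:
    def __init__(self, sink):
        self.sink = sink        # e.g. a Kafka-producer wrapper (assumed)
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)     # staged, not yet visible

    def commit(self):
        for record in self.buffer:     # flush only on reported success
            self.sink.append(record)
        self.buffer.clear()

published = []
out = BufferedSink(published)
out.write("a")
out.write("b")
assert published == []                 # a crash here would publish nothing
out.commit()
assert published == ["a", "b"]
```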
—
FG
On Sun, Nov 30, 2014 at 2:28 PM, Josh J joshjd...@gmail.com wrote:
Is there a way to do
Hi Steve,
Are you talking about sequence alignment ?
—
FG
On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
The original problem is in biology but the following captures the CS
issues, Assume I have a large number of locks and a large number of keys.
There is a