Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread francois . garillot
The short answer: count(), as the sum can be partially aggregated on the mappers. The long answer: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/dont_call_collect_on_a_very_large_rdd.html — FG On Thu, Feb 26, 2015 at 2:28 PM, Emre Sevinc
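A minimal sketch of the two alternatives (the RDD `rdd` is assumed to already exist; names are illustrative):

    // count(): each partition reports only its size; the driver just sums a handful of longs.
    val viaCount: Long = rdd.count()

    // collect(): every element is first shipped to the driver and measured there,
    // which can easily exhaust driver memory on a large RDD.
    val viaCollect: Int = rdd.collect().length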

Re: Which one is faster / consumes less memory: collect() or count()?

2015-02-26 Thread francois . garillot
Note I’m assuming you were going for the size of your RDD, meaning in the ‘collect’ alternative, you would go for a size() right after the collect(). If you were simply trying to materialize your RDD, Sean’s answer is more complete. — FG On Thu, Feb 26, 2015 at 2:33 PM, Emre Sevinc
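For the materialize-only case, a sketch that forces evaluation without bringing any data back to the driver (again assuming a hypothetical RDD `rdd`):

    rdd.cache()
    rdd.count()            // any cheap action triggers evaluation; the result can be ignored
    // or: rdd.foreach(_ => ())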

Re: Why groupBy is slow?

2015-02-18 Thread francois . garillot
In a nutshell: because it’s moving all of your data, compared to other operations (e.g. reduce) that summarize it in one form or another before moving it. For the longer answer:
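A small word-count-style sketch of the difference (illustrative data; `sc` is an existing SparkContext):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // groupByKey shuffles every single (word, 1) record across the network ...
    val slow = pairs.groupByKey().mapValues(_.sum)

    // ... while reduceByKey pre-aggregates on the map side before moving anything.
    val fast = pairs.reduceByKey(_ + _)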

Re: Negative Accumulators

2015-01-30 Thread francois . garillot
Sanity check: could `threshold_var` be negative? — FG On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai thai.pe...@gmail.com wrote: Hello, I am seeing negative values for accumulators. Here's my implementation in a standalone app in Spark 1.1.1rc: implicit object

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Oh, I’m sorry, I meant `aggregateByKey`. https://spark.apache.org/docs/1.2.0/api/scala/#org.apache.spark.rdd.PairRDDFunctions — FG On Thu, Jan 29, 2015 at 7:58 PM, Mohit Jaggi mohitja...@gmail.com wrote: Francois, RDD.aggregate() does not support aggregation by key. But, indeed, that is
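A minimal aggregateByKey sketch (illustrative data; `sc` is an existing SparkContext):

    val kv = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    val sums = kv.aggregateByKey(0)(
      (acc, v) => acc + v,          // fold one value into the per-partition accumulator
      (acc1, acc2) => acc1 + acc2   // merge accumulators coming from different partitions
    )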

Re: RDD.combineBy without intermediate (k,v) pair allocation

2015-01-29 Thread francois . garillot
Sorry, I answered too fast. Please disregard my last message: I did mean aggregate. You say: RDD.aggregate() does not support aggregation by key. What would you need aggregation by key for if you do not start with an RDD of key-value pairs and do not want to build one?

Re: RDD.combineBy

2015-01-27 Thread francois . garillot
Have you looked at the `aggregate` function in the RDD API? If your way of extracting the “key” (identifier) and “value” (payload) parts of the RDD elements is uniform (a function), it’s unclear to me how this would be more efficient than extracting the key and value and then using combine,
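For comparison, a sketch of folding the identifier/payload extraction directly into `aggregate`, so no intermediate (k, v) pair RDD is allocated (the `id:payload` string format below is purely illustrative; `sc` is an existing SparkContext):

    val records = sc.parallelize(Seq("a:1", "b:2", "a:3"))

    val totals: Map[String, Int] = records.aggregate(Map.empty[String, Int])(
      (m, rec) => {                      // seqOp: extract key/value and fold one element in
        val Array(k, v) = rec.split(":")
        m + (k -> (m.getOrElse(k, 0) + v.toInt))
      },
      (m1, m2) =>                        // combOp: merge the partial maps from two partitions
        (m1.keySet ++ m2.keySet).map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
    )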

Re: about window length and slide interval

2015-01-14 Thread francois . garillot
Why would you want to? That would be equivalent to setting your batch interval to the same duration you’re suggesting for both window length and slide interval, IIUC. Here’s a nice explanation of windowing by TD: https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
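The equivalence, sketched (`sc` is an existing SparkContext; 10 seconds is an arbitrary illustrative duration):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // window length == slide interval == batch interval: each window holds
    // exactly one batch, so this is just the original stream again.
    val windowed = lines.window(Seconds(10), Seconds(10))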

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread francois . garillot
Hi Mukesh, If my understanding is correct, each Stream only has a single Receiver. So, if you have each receiver consuming 9 partitions, you need 10 input DStreams to create 10 concurrent receivers:
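Sketched with the receiver-based API of the time (`ssc` is an existing StreamingContext; the ZooKeeper quorum, group id and topic name are placeholders, and the `9` is the number of consumer threads per receiver):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaStreams = (1 to 10).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 9))
    }
    val unified = ssc.union(kafkaStreams)   // 10 concurrent receivers, one unified DStream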

Re: KafkaUtils not consuming all the data from all partitions

2015-01-07 Thread francois . garillot
- You are launching up to 10 threads/topic per Receiver. Are you sure your receivers can support 10 threads each? (i.e. in the default configuration, do they have 10 cores?) If they have 2 cores, that would explain why this works with 20 partitions or fewer. - If you have 90 partitions, why

Re: Spark Streaming in Production

2014-12-12 Thread francois . garillot
IIUC, Receivers run on workers, colocated with other tasks. The Driver, on the other hand, can either run on the submitting machine (client mode) or on a worker (cluster mode). — FG On Fri, Dec 12, 2014 at 4:49 PM, twizansk twiza...@gmail.com wrote: Thanks for the reply. I might be

Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
Hi Mukesh, There’s been some great work on Spark Streaming reliability lately. I’m not aware of any doc yet (did I miss something?) but you can look at the ReliableKafkaReceiver’s test suite: — FG On Wed, Dec 10, 2014 at 11:17 AM, Mukesh Jha me.mukesh@gmail.com wrote: Hello

Re: KafkaUtils explicit acks

2014-12-10 Thread francois . garillot
[sorry for the botched half-message] Hi Mukesh, There’s been some great work on Spark Streaming reliability lately. https://www.youtube.com/watch?v=jcJq3ZalXD8 Look at the links from: https://issues.apache.org/jira/browse/SPARK-3129 I’m not aware of any doc yet (did I miss
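For reference, the mechanism introduced by that work (SPARK-3129) is the receiver write-ahead log; to my understanding, in Spark 1.2 it is enabled via a configuration flag together with a checkpoint directory. A sketch, with placeholder names and paths:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("reliable-kafka")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")   // the write-ahead log lives under the checkpoint dir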

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread francois . garillot
How about writing to a buffer? Then you would flush the buffer to Kafka if and only if the output operation reports successful completion. In the event of a worker failure, that would not happen. — FG On Sun, Nov 30, 2014 at 2:28 PM, Josh J joshjd...@gmail.com wrote: Is there a way to do
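One way to sketch the buffering idea inside an output operation (`transformed` is the hypothetical result DStream, and `sendToKafka` is a hypothetical helper wrapping whichever Kafka producer client is in use):

    transformed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val buffer = records.toVector   // build the partition's output locally first
        // nothing has been sent yet: a failure before this point publishes nothing
        buffer.foreach(msg => sendToKafka("output-topic", msg))
      }
    }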

Re: A Spark Design Problem

2014-10-31 Thread francois . garillot
Hi Steve, Are you talking about sequence alignment? — FG On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com wrote: The original problem is in biology but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a