RE: Memory problems and missing heartbeats

2016-02-16 Thread Ignacio Blasco
Hi Ximo. Regarding #1, you can try increasing the number of partitions used for cogroup or reduce. AFAIK Spark needs enough memory to hold all the data processed by a given partition, so increasing the number of partitions reduces that load. Probably we need to
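
The effect of raising the partition count can be sketched in plain Python, without Spark. The modulo partitioner, the integer keys and the record count below are illustrative assumptions, not details from the thread; the sketch only shows that the largest partition a single task must hold shrinks as the number of partitions grows.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under modulo partitioning."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: 100,000 records with integer keys.
keys = range(100_000)

few = partition_sizes(keys, 8)
many = partition_sizes(keys, 64)

# The largest partition is what one task has to hold at a time.
print("largest partition with 8 partitions: ", max(few))
print("largest partition with 64 partitions:", max(many))
```

In Spark itself, operations such as reduceByKey and cogroup accept an optional numPartitions argument, which is probably the simplest place to apply this. Skewed keys would limit the benefit, since one hot key still lands in a single partition.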

Re: RDD[Future[T]] = Future[RDD[T]]

2015-07-26 Thread Ignacio Blasco
Maybe using mapPartitions and .sequence inside it? On 26/7/2015 10:22 p.m., Ayoub benali.ayoub.i...@gmail.com wrote: Hello, I am trying to convert the result I get after doing some async IO: val rdd: RDD[T] = // some rdd val result: RDD[Future[T]] = rdd.map(httpCall) Is there a way
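
A rough Python analogue of that suggestion, as a sketch only: for each partition, start every call first and then wait once for the whole batch, which is the role Future.sequence would play inside mapPartitions in Scala. Here http_call is a hypothetical stand-in for the async IO in the question, and plain lists stand in for RDD partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def http_call(x):
    """Hypothetical stand-in for the async IO call in the question."""
    return x * 2


def resolve_partition(records, max_workers=4):
    """Start one future per record, then block once for the whole partition."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(http_call, r) for r in records]
        # Collecting in submission order turns a list of futures
        # into a list of results, keeping the record order.
        return [f.result() for f in futures]


# Plain lists stand in for RDD partitions here (an assumption of this sketch).
partitions = [[1, 2, 3], [4, 5], []]
resolved = [resolve_partition(p) for p in partitions]
print(resolved)
```

Note that this approach blocks inside each task and yields resolved values per partition; it does not produce a single future over the whole dataset.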

Re: Java 8 vs Scala

2015-07-15 Thread Ignacio Blasco
The main advantage of using Scala vs. Java 8 is being able to use a console. 2015-07-15 9:27 GMT+02:00 诺铁 noty...@gmail.com: I think different teams have different answers to this question. My team uses Scala and is happy with it. On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
.name) (numbers.value != numbers2.other), how=inner) \ .select(numbers.name, numbers.value, numbers2.other) \ .collect() On Mon, Jun 22, 2015 at 12:53 PM, Ignacio Blasco elnopin...@gmail.com wrote: Sorry, I thought it was Scala/Spark. On 22/6

Re: SQL vs. DataFrame API

2015-06-23 Thread Ignacio Blasco
, 2015 at 9:16 AM, Ignacio Blasco elnopin...@gmail.com wrote: Does that issue happen only in the Python DSL? On 23/6/2015 5:05 p.m., Bob Corsaro rcors...@gmail.com wrote: Thanks! The solution: https://gist.github.com/dokipen/018a1deeab668efdf455 On Mon, Jun 22, 2015 at 4:33 PM Davies

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Sorry, I thought it was Scala/Spark. On 22/6/2015 9:49 p.m., Bob Corsaro rcors...@gmail.com wrote: That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a query here and not actually doing an equality operation. On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco elnopin
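
The point about == building a query rather than comparing values can be illustrated with a toy class. This is a minimal sketch of operator overloading in Python, not PySpark's actual Column implementation; the Column class here and the column name numbers2.name are assumptions made for the example.

```python
class Column:
    """Toy column expression; NOT PySpark's real Column class."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Returns a new expression instead of a bool.
        return Column(f"({self.expr} = {_expr(other)})")

    def __ne__(self, other):
        return Column(f"({self.expr} != {_expr(other)})")

    def __and__(self, other):
        return Column(f"({self.expr} AND {_expr(other)})")

    def __repr__(self):
        return f"Column<{self.expr}>"


def _expr(value):
    """Render a Column by its expression and anything else as a literal."""
    return value.expr if isinstance(value, Column) else repr(value)


# Hypothetical join condition, loosely modelled on the snippet in this thread.
cond = (Column("numbers.name") == Column("numbers2.name")) & (
    Column("numbers.value") != Column("numbers2.other")
)
print(cond)
```

Because Python lets a class redefine == and !=, a DSL of this kind has no need for the === and !== operators used on the Scala side.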

Re: SQL vs. DataFrame API

2015-06-22 Thread Ignacio Blasco
Probably you should use === instead of == and !== instead of !=. Can anyone explain why the DataFrame API doesn't work as I expect it to here? It seems like the column identifiers are getting confused. https://gist.github.com/dokipen/4b324a7365ae87b7b0e5

Re: Questions about Accumulators

2015-05-03 Thread Ignacio Blasco
Given the lazy nature of an RDD, if you use an accumulator inside a map() and then call count and saveAsTextFile on that RDD, the accumulator will be updated twice. IMHO, accumulators are a bit nondeterministic; you need to be careful about when you read them to avoid unexpected re-executions. On 3/5/2015 2:09 p.
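
The double update can be reproduced without Spark. The sketch below uses a toy LazyDataset class (an assumption of this example, not a Spark API) whose map() only records the function and whose actions re-run it, with a plain dict playing the role of the accumulator.

```python
class LazyDataset:
    """Toy stand-in for an RDD: map() is lazy, every action recomputes."""

    def __init__(self, data, fns=None):
        self._data = list(data)
        self._fns = list(fns or [])

    def map(self, fn):
        # Nothing runs here; the function is only recorded.
        return LazyDataset(self._data, self._fns + [fn])

    def _compute(self):
        out = []
        for x in self._data:
            for fn in self._fns:
                x = fn(x)
            out.append(x)
        return out

    def count(self):
        return len(self._compute())

    def collect(self):
        return self._compute()


# A plain dict plays the role of the accumulator.
counter = {"value": 0}


def tag(x):
    counter["value"] += 1
    return x


mapped = LazyDataset([1, 2, 3]).map(tag)

mapped.count()                    # first action runs the map
after_first = counter["value"]

mapped.collect()                  # second action runs the map again
after_second = counter["value"]

print(after_first, after_second)
```

In Spark, persisting the mapped RDD before running several actions typically avoids the recomputation, although task retries can still cause extra updates.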

Re: How to setup this false streaming problem

2015-04-29 Thread Ignacio Blasco
Hi Toni. Given that there is more than one measure per (user, hour), which measure do you want to keep? The sum, the mean, or the most recent measure? For the sum or the mean you don't need to care about the timing. And if you want to have the most recent, then you can include the timestamp in the
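
One way to read the "most recent" option, as a sketch: carry the timestamp alongside each measure and keep the pair with the largest timestamp per (user, hour). The record layout, field names and sample values below are assumptions for illustration; the original message is cut off before it says where the timestamp goes.

```python
def latest_per_key(records):
    """Keep the most recent measure for each (user, hour).

    records: iterable of (user, hour, timestamp, measure) tuples
    (a hypothetical layout chosen for this sketch).
    """
    latest = {}
    for user, hour, timestamp, measure in records:
        key = (user, hour)
        # Replace the stored pair only when this record is newer.
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, measure)
    return {key: measure for key, (_, measure) in latest.items()}


# Made-up sample data: two users, out-of-order arrival within an hour.
records = [
    ("ana", 10, 1005, 3.0),
    ("ana", 10, 1001, 7.5),
    ("ana", 11, 1100, 1.0),
    ("bob", 10, 1002, 4.2),
    ("bob", 10, 1009, 9.9),
]

result = latest_per_key(records)
print(result)
```

The same comparison could serve as the function passed to a per-key reduce, since picking the pair with the larger timestamp is associative and commutative.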