Hi Ximo. Regarding #1, you can try increasing the number of partitions
used for cogroup or reduce. AFAIK Spark needs enough memory to hold all
the data processed by a given partition in memory; by increasing the
number of partitions you reduce that load.
Maybe using mapPartitions and Future.sequence inside it?
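The mapPartitions idea can be sketched in plain Python, without Spark (`http_call`, the pool size, and the data are all made up for illustration). The function has the shape you could hand to rdd.mapPartitions: fire off all the calls for one partition, then block once for the whole batch, which is the moral equivalent of Future.sequence in Scala.

```python
# Sketch only: process a whole partition of async calls at once,
# rather than mapping each element to its own Future.
from concurrent.futures import ThreadPoolExecutor

def http_call(x):
    # Placeholder for a real HTTP request (assumption for this sketch).
    return x * 10

def process_partition(iterator):
    # Shape of a function you could pass to rdd.mapPartitions: all calls
    # for the partition run concurrently, and we gather them together.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(http_call, iterator))

results = process_partition(iter([1, 2, 3]))
```

`ThreadPoolExecutor.map` preserves input order, so the partition's output order is stable.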
On 26/7/2015 10:22 PM, Ayoub benali.ayoub.i...@gmail.com wrote:
Hello,
I am trying to convert the result I get after doing some async IO:
val rdd: RDD[T] = // some rdd
val result: RDD[Future[T]] = rdd.map(httpCall)
Is there a way
The main advantage of using Scala vs Java 8 is being able to use a console
2015-07-15 9:27 GMT+02:00 诺铁 noty...@gmail.com:
I think different teams get different answers to this question. My team
uses Scala, and is happy with it.
On Wed, Jul 15, 2015 at 1:31 PM, Tristan Blakers
numbers.join(numbers2,
             (numbers.name == numbers2.name) & (numbers.value != numbers2.other),
             how='inner') \
    .select(numbers.name, numbers.value, numbers2.other) \
    .collect()
On Mon, Jun 22, 2015 at 12:53 PM, Ignacio Blasco elnopin...@gmail.com
wrote:
Sorry, thought it was Scala/Spark.
On 22/6/2015 at 9:16 AM, Ignacio Blasco elnopin...@gmail.com wrote:
Does that issue happen only in the Python DSL?
On 23/6/2015 5:05 PM, Bob Corsaro rcors...@gmail.com wrote:
Thanks! The solution:
https://gist.github.com/dokipen/018a1deeab668efdf455
On Mon, Jun 22, 2015 at 4:33 PM Davies
Sorry, thought it was Scala/Spark.
On 22/6/2015 9:49 PM, Bob Corsaro rcors...@gmail.com wrote:
That's invalid syntax. I'm pretty sure pyspark is using a DSL to create a
query here and not actually doing an equality operation.
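As a toy illustration of that point (all names here are invented; this is not pyspark's actual Column class): a DSL can overload `==` so that the comparison builds a query expression instead of returning a boolean, which is why `==` is legal in the pyspark DataFrame DSL even though it looks like an equality test.

```python
# Toy DSL column: == produces an expression, not True/False.
class Col:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Build and return an expression string instead of comparing.
        return "({} = {})".format(self.name, other.name)

expr = Col("name") == Col("name2")
# expr is now the string "(name = name2)", not a boolean.
```

Scala's DSL can't overload `==` (it's final on AnyRef), which is why it introduces `===` instead.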
On Mon, Jun 22, 2015 at 3:43 PM Ignacio Blasco elnopin
Probably you should use === instead of == and !== instead of !=
Can anyone explain why the dataframe API doesn't work as I expect it to
here? It seems like the column identifiers are getting confused.
https://gist.github.com/dokipen/4b324a7365ae87b7b0e5
Given the lazy nature of an RDD, if you use an accumulator inside a map()
and then call count and saveAsTextFile over that RDD, the accumulator
updates will be applied twice. IMHO accumulators are a bit nondeterministic;
you need to be sure when to read them to avoid unexpected re-executions.
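The re-execution pitfall can be simulated with plain Python generators (no Spark here; the counter dict stands in for an accumulator, and each fresh consumption of the lazy pipeline plays the role of an action re-running the map stage):

```python
# Simulating lazy re-execution: a side effect inside the "map" step
# runs once per action over the same source data.
counter = {"updates": 0}

def mapped(data):
    for x in data:
        counter["updates"] += 1  # accumulator bump inside map()
        yield x * 2

data = [1, 2, 3]
count = sum(1 for _ in mapped(data))   # first action: like count()
output = list(mapped(data))            # second action: like saveAsTextFile()
# The map side effect has now run twice over 3 elements.
```

In Spark the usual remedies are to cache the mapped RDD before running multiple actions, or to read accumulators only after a single action you control.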
On 3/5/2015 2:09 PM
Hi Toni.
Given there is more than one measure per (user, hour), which measure do
you want to keep? The sum? The mean? The most recent one? For the sum or
the mean you don't need to care about the timing. And if you want
to have the most recent, then you can include the timestamp in the
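Those options can be sketched in plain Python (the record layout and values are assumptions for illustration; in Spark the same per-key shapes map naturally onto reduceByKey):

```python
# Records are assumed to be (user, hour, timestamp, value) tuples.
from collections import defaultdict

records = [
    ("alice", 9, 100, 2.0),
    ("alice", 9, 200, 4.0),
    ("bob",   9, 150, 1.0),
]

sums = defaultdict(float)
counts = defaultdict(int)
latest = {}

for user, hour, ts, value in records:
    key = (user, hour)
    sums[key] += value              # the sum: timing doesn't matter
    counts[key] += 1                # kept alongside the sum for the mean
    if key not in latest or ts > latest[key][0]:
        latest[key] = (ts, value)   # most recent: carry the timestamp along

mean_alice = sums[("alice", 9)] / counts[("alice", 9)]
```

Note that sum and count reduce associatively (good for reduceByKey), while "most recent" needs the timestamp carried in the value so the reduce can pick the later one.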