I'm assuming some things here, but hopefully I understand. So basically
you have a big table of data distributed across a bunch of executors, and
you want an efficient way to call a native method for each row.
It sounds similar to a dataframe writer to me, except that instead of
writing to disk or some other sink, you'd invoke your native method.
I want to calculate quantiles on two different columns. I know that I can
calculate them with two separate operations. However, for performance
reasons, I'd like to calculate both with one operation.
Is it possible to do this with the Dataset API? I'm assuming that it
isn't. But if it isn't, is there a recommended alternative?
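For what it's worth, if I remember correctly, newer Spark versions (2.2+) let `df.stat.approxQuantile` take a list of column names, so both quantiles come back from a single call. Outside Spark, the underlying idea is just one pass over the rows feeding two accumulators. A minimal plain-Python sketch (the nearest-rank convention below is one choice among several, and `quantiles_two_cols` is a made-up name for illustration):

```python
def quantiles_two_cols(rows, q):
    """One scan over the data collecting both columns, so the rows
    are read only once; then one quantile lookup per column."""
    xs, ys = [], []
    for x, y in rows:          # single pass over the rows
        xs.append(x)
        ys.append(y)

    def quantile(vals, q):
        vals = sorted(vals)
        # nearest-rank quantile: clamp the index into [0, n-1]
        k = max(0, min(len(vals) - 1, int(round(q * (len(vals) - 1)))))
        return vals[k]

    return quantile(xs, q), quantile(ys, q)

rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
print(quantiles_two_cols(rows, 0.5))  # (3, 30)
```

In Spark the same shape falls out of a single aggregation over the Dataset rather than two separate actions.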
That setting defines the total number of tasks that an executor can run in
parallel.
Each node is partitioned into executors, each with identical heap and
cores. So it can be a balancing act to optimally set these values,
particularly if the goal is to maximize CPU usage within memory and other
resource constraints.
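As a toy illustration of that balancing act (the node and executor sizes below are made-up numbers, not anything from this thread):

```python
# Assumed node and executor shapes -- purely illustrative numbers.
node_cores, node_mem_gb = 16, 64
executor_cores, executor_mem_gb = 4, 16   # identical heap and cores

# Whichever resource runs out first caps the executors per node.
executors_per_node = min(node_cores // executor_cores,
                         node_mem_gb // executor_mem_gb)

# With spark.task.cpus = 1, an executor runs one task per core.
tasks_per_executor = executor_cores
tasks_per_node = executors_per_node * tasks_per_executor

print(executors_per_node, tasks_per_executor, tasks_per_node)  # 4 4 16
```

Shrinking the executors adds parallelism per node but also shrinks each heap, which is where the balancing act comes in.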
That's helpful, thanks. I didn't see that thread earlier. But it sounds like
the best solution is to use singletons in the executors, which I'm already
doing. (BTW, the reason I consider that method kind of hackish is that it
makes the code a bit more difficult for others to follow.)
> If it's a singleton in the JVM, then this does just have
> one copy per machine. Of course an executor is tied to an app, so if
> you mean to hold this data across executors that won't help.
>
>
> On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin <aper...@timerazor.com>
> wrote:
> >
The user guide describes a broadcast as a way to move a large dataset to
each node:
"Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner."
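The point of that wording is the shipping cost: a variable captured in a task closure gets serialized with every task, while a broadcast ships only a small handle and each machine fetches and caches the value once. A rough plain-Python illustration of just the payload-size part (the "handle" below is a made-up stand-in, not Spark's actual wire format):

```python
import pickle

# A large read-only lookup table we'd like every task to see.
big_lookup = list(range(100_000))

# Closure capture: the whole table rides along with each serialized task.
per_task_payload = len(pickle.dumps(big_lookup))

# Broadcast-style: each task carries only a tiny identifier; the value
# itself is fetched and cached once per machine (simulated, not Spark).
broadcast_handle = "broadcast_0"
per_task_handle = len(pickle.dumps(broadcast_handle))

print(per_task_handle < per_task_payload)  # True
```

With thousands of tasks, that per-task difference is the whole argument for broadcasting large read-only data.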