Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-27 Thread Aaron Perrin
I'm assuming some things here, but hopefully I understand. So, basically you have a big table of data distributed across a bunch of executors. And, you want an efficient way to call a native method for each row. It sounds similar to a dataframe writer to me. Except, instead of writing to disk or

Multiple quantile calculations

2017-01-31 Thread Aaron Perrin
I want to calculate quantiles on two different columns. I know that I can calculate them with two separate operations. However, for performance reasons, I'd like to calculate both with one operation. Is it possible to do this with the Dataset API? I'm assuming that it isn't. But, if it isn't,

Re: Tuning spark.executor.cores

2017-01-09 Thread Aaron Perrin
That setting defines the total number of tasks that an executor can run in parallel. Each node is partitioned into executors, each with identical heap and cores. So, it can be a balancing act to optimally set these values, particularly if the goal is to maximize CPU usage with memory and other
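To make the balancing act concrete, here is a minimal arithmetic sketch of how executor size determines how many executors fit on a node and how many tasks each can run at once. The node sizes and the helper name `plan_executors` are hypothetical, and the sketch ignores memory overhead and any cores reserved for the OS or cluster manager. It assumes only that the number of parallel tasks per executor is its core count divided by `spark.task.cpus` (1 by default).

```python
def plan_executors(node_cores, node_mem_gb, executor_cores, executor_mem_gb, task_cpus=1):
    """Hypothetical helper: how a node divides into identical executors.

    Returns (executors_per_node, task_slots_per_executor, idle_cores, idle_mem_gb).
    """
    # The node can host only as many executors as BOTH resources allow.
    by_cores = node_cores // executor_cores
    by_mem = node_mem_gb // executor_mem_gb
    executors = min(by_cores, by_mem)

    # Tasks an executor can run in parallel (spark.task.cpus defaults to 1).
    slots = executor_cores // task_cpus

    # Whatever is left over on the node is wasted capacity.
    idle_cores = node_cores - executors * executor_cores
    idle_mem_gb = node_mem_gb - executors * executor_mem_gb
    return executors, slots, idle_cores, idle_mem_gb


# Assumed example node: 16 cores, 64 GB.
balanced = plan_executors(16, 64, executor_cores=5, executor_mem_gb=19)
memory_bound = plan_executors(16, 64, executor_cores=4, executor_mem_gb=24)

print(balanced)      # (3, 5, 1, 7)
print(memory_bound)  # (2, 4, 8, 16)
```

In the second configuration memory runs out before cores do, so half of the node's cores sit idle; that is the kind of imbalance the tuning is meant to avoid.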

Re: Possible to broadcast a function?

2016-06-30 Thread Aaron Perrin
That's helpful, thanks. I didn't see that thread earlier. But, it sounds like the best solution is to use singletons in the executors, which I'm already doing. (BTW - the reason why I consider that method kind of hack-ish is because it makes the code a bit more difficult for others to

Re: Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
eton in the JVM, then this does just have > one copy per machine. Of course an executor is tied to an app, so if > you mean to hold this data across executors that won't help. > > > On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin <aper...@timerazor.com> > wrote: > >

Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
The user guide describes a broadcast as a way to move a large dataset to each node: "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input