I want to share and brainstorm on an experiment before I try it all the way, and I hope Spark contributors can comment. To be clear, it is not my intent to use MLlib, where I only get partial control over the work being done, and I'm not seeing it scale well enough yet; I have fundamental questions about whether it can.
The reason we are not using MLlib and Breeze is the lack of control over the data and the performance. After computing the covariance matrix, there isn't much more we can do with it: many of the methods are private. For now, we need the max value and the corresponding pair of columns; later we may need more.
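To make that concrete, here is a rough sketch of the post-processing we want (this assumes MLlib's RowMatrix from a spark-shell session, where sc already exists; the max-pair scan is our own code, not an MLlib API):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy data: one observation per row.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.5)
))

// computeCovariance() is one of the few public entry points; it returns
// the covariance matrix as a local Matrix on the driver.
val cov = new RowMatrix(rows).computeCovariance()

// Past this point we are on our own: scan for the maximum off-diagonal
// entry and remember the column pair it belongs to.
var best = Double.MinValue
var pair = (0, 0)
for (i <- 0 until cov.numRows; j <- i + 1 until cov.numCols) {
  if (cov(i, j) > best) { best = cov(i, j); pair = (i, j) }
}
println(s"max covariance $best at columns ${pair._1}, ${pair._2}")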
I need help getting around these errors. I have a program that runs fine on smaller input sizes, but as the input grows, Spark has increasing difficulty staying efficient and running without errors. We have about 46 GB free on each node, and the workers and executors are configured to use it all.
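For reference, the memory settings look roughly like this (a sketch: the values and app name are illustrative, and spark.storage.memoryFraction is the Spark 1.x knob):

import org.apache.spark.SparkConf

// Illustrative values: leave some headroom under the ~46 GB free per
// node for the OS and off-heap overhead rather than giving it all to the heap.
val conf = new SparkConf()
  .setAppName("covariance-experiment")         // hypothetical app name
  .set("spark.executor.memory", "40g")         // JVM heap per executor
  .set("spark.storage.memoryFraction", "0.5")  // share of heap for cached RDDs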
Here is a partial comparison:
http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2
SciDB uses MPI with Intel hardware and libraries, and gets amazing performance at the cost of more work.
In case the link stops working, the paper is:
GenBase: A Complex Analytics Genomics Benchmark, Rebecca Taft et al., MIT-CSAIL-TR-2013-028.
I want to ask this, not because I can't read endless documentation and several tutorials, but because there seem to be many ways of doing things and I keep having issues: how do you run /your/ Spark app?
I had it working when I was only using YARN + Hadoop 1 (Cloudera), but then I had to get Spark and the setup stopped working.
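For comparison, here is the minimal shape of what we run now (a sketch; the class, jar, and app names are made up, and yarn-client is just the mode we happen to use):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal self-contained app, submitted in our setup with something like:
//   spark-submit --class SimpleApp --master yarn-client simple-app.jar
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp")
    val sc = new SparkContext(conf)
    // Trivial job just to prove the cluster is wired up correctly.
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")
    sc.stop()
  }
}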