Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks
tely > see why it should take longer to transfer the local gradient vectors > in that level, since they are dense in every level. Furthermore, the > driver is receiving the result of only 4 tasks, which is relatively > small. > > Mike > > > On 9/26/15, Evan R. Sparks

Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks
Mike, I believe the reason you're seeing near-identical performance on the gradient computations is twofold: 1) Gradient computations for GLM models are computationally pretty cheap from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call in the dense case, which is well known to
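The FLOPs/byte point can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the thread: it assumes a dense matrix-vector product with 8-byte doubles, counts one multiply and one add per matrix entry, and uses a hypothetical function name and problem size.

```python
def gemv_arithmetic_intensity(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """FLOPs per byte moved for a dense y = A * x (illustrative estimate only)."""
    # One multiply and one add for every entry of A.
    flops = 2 * rows * cols
    # A dominates the memory traffic; x is read once and y is written once.
    bytes_moved = bytes_per_value * (rows * cols + cols + rows)
    return flops / bytes_moved


# Hypothetical problem size: 10,000 examples with 1,000 dense features.
intensity = gemv_arithmetic_intensity(10_000, 1_000)
print(f"{intensity:.3f} FLOPs per byte")
```

Under these assumptions the ratio comes out at roughly a quarter of a FLOP per byte regardless of the matrix size, which is why such a kernel tends to be limited by memory bandwidth rather than by arithmetic.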

Re: Scan Sharing in Spark

2015-05-05 Thread Evan R. Sparks
Scan sharing can indeed be a useful optimization in Spark, because you amortize not only the time spent scanning over the data, but also time spent in task launch and scheduling overheads. Here's a trivial example in Scala. I'm not aware of a place in SparkSQL where this is used - I'd imagine that
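The Scala example mentioned in the message is cut off in this archive snippet. What follows is not that example but a minimal plain-Python sketch of the idea, with hypothetical names (shared_scan, readings): several aggregates are folded over the data in a single pass instead of one pass per aggregate.

```python
def shared_scan(records, aggregators):
    """Run several folds over `records` in one pass.

    `aggregators` is a list of (initial_state, step) pairs, where
    step(state, record) returns the updated state.
    """
    states = [initial for initial, _ in aggregators]
    for record in records:  # the data is scanned exactly once
        for i, (_, step) in enumerate(aggregators):
            states[i] = step(states[i], record)
    return states


readings = [3.0, 1.5, 4.0, 1.0, 5.5]
total, count, largest = shared_scan(
    readings,
    [
        (0.0, lambda acc, x: acc + x),  # running sum
        (0, lambda acc, x: acc + 1),    # running count
        (float("-inf"), max),           # running maximum
    ],
)
print(total, count, largest)
```

In a distributed setting the same shape would also share the per-task launch and scheduling cost, which is the second saving the message points to.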

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks
In general there's a tension between ordered data and the set-oriented data model underlying DataFrames. You can force a total ordering on the data, but it may come at a high cost with respect to performance. It would be good to get a sense of the use case you're trying to support, but one suggestion

Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Evan R. Sparks
ject's readme.md > > https://github.com/fommil/netlib-java/wiki/NVBLAS > > > > Best regards, Alexander > > -Original Message- > > From: Xiangrui Meng [mailto:men...@gmail.com] > > Sent: Monday, March 30, 2015 2:43 PM > > To: Sean Owen >

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
les in hdfs https://github.com/twitter/elephant-bird > > > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Thursday, March 26, 2015 2:34 PM > *To:* Stephen Boesch > *Cc:* Ulanov, Alexander; dev@spark.apache.org > *Subject:* Re: Storing large data for

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input, you couldn't pass it anything like an InputStream). I don't know if it has gotten any better. Parquet plays much more nicely a

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Evan R. Sparks
to make Open BLAS the default - is not always better and I think >> natives really need DevOps buy-in. It's not the right solution for >> everybody. >> On 26 Mar 2015 01:23, "Evan R. Sparks" wrote: >> >>> Yeah, much more reasonable - nice to know that

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
rch 25, 2015 2:31 PM > To: Sam Halliday > Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; > jfcanny > Subject: RE: Using CUDA within Spark / boosting linear algebra > > Hi again, > > I finally managed to use nvblas within Spark+netlib-java. It has > except

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
cblas from Atlas or Openblas because they link to their > implementation and not to Fortran blas. > > Best regards, Alexander > > -Original Message- > From: Ulanov, Alexander > Sent: Tuesday, March 24, 2015 6:57 PM > To: Sam Halliday > Cc: dev@spark.apache.org; Xi

Re: ideas for MLlib development

2015-03-03 Thread Evan R. Sparks
Hi Robert, There's some work to do LDA via Gibbs sampling in this JIRA: https://issues.apache.org/jira/browse/SPARK-1405 as well as this one: https://issues.apache.org/jira/browse/SPARK-5556 It may make sense to have a more general Gibbs sampling framework, but it might be good to have a few desi

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
netlib-java? >> >> CC'ed Sam, the author of netlib-java. >> >> Best, >> Xiangrui >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley >> wrote: >> > Better documentation for linking would be very helpful! Here's a JIRA: >>

Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks
Mx378T9J5r7kwKSPkY/edit?usp=sharing > > One thing still needs exploration: does BIDMat-cublas perform copying > to/from machine’s RAM? > > -Original Message- > From: Ulanov, Alexander > Sent: Tuesday, February 10, 2015 2:12 PM > To: Evan R. Sparks > Cc: Josep

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks
Josh - thanks for the detailed write-up - this seems a little funny to me. I agree that with the current code path more work is being done than needs to be (e.g. the features are re-scaled at every iteration, but the relatively costly process of fitting the StandardScaler should not be re-do

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Evan R. Sparks
Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of RDDs in this way but 10 is probably doable. That said - SparkSQL has an optimizer under the covers that can make clever decisions e.g. pushing the predica
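To illustrate what chaining joins produces, here is a small plain-Python sketch that mimics an inner join on lists of (key, value) pairs. It is an assumption-laden stand-in, not Spark code: the join helper and the sample data are hypothetical.

```python
from collections import defaultdict


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Emit one output pair per matching (left value, right value) combination.
    return [(key, (lv, rv)) for key, lv in left for rv in index.get(key, [])]


a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (2, "b2"), (3, "b3")]
c = [(1, "c1")]

# Equivalent in spirit to a.join(b).join(c): values nest one level per join.
joined = join(join(a, b), c)
print(joined)
```

Each additional join nests the values one level deeper and narrows the key set, so the order of a long chain can matter for the size of the intermediate results.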

Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks
ib-java) > interested to compare their libraries. > > > > Best regards, Alexander > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Friday, February 06, 2015 5:58 PM > > *To:* Ulanov, Alexander > *Cc:* Joseph Bradley; dev@spark.apache.org > *Sub

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
uppose that > netlib is using it. > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Friday, February 06, 2015 5:19 PM > *To:* Ulanov, Alexander > *Cc:* Joseph Bradley; dev@spark.apache.org > > *Subject:* Re: Using CUDA within Spark / boosting linear algeb

Spark SQL Window Functions

2015-02-08 Thread Evan R. Sparks
Currently there's no standard way of handling time series data in Spark. We were kicking around some ideas in the lab today and one thing that came up was SQL Window Functions as a way to support them and query over time series (do things like moving average, etc.) These don't seem to be implement
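As a rough picture of the kind of query meant here, the sketch below computes a trailing moving average over values that are assumed to be already sorted by time. It is plain Python with a hypothetical function name, not a Spark or SQL API.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; early entries average over what is available."""
    buf = deque(maxlen=window)
    running = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            running -= buf[0]  # the oldest value is about to fall out of the window
        buf.append(v)
        running += v
        out.append(running / len(buf))
    return out


averages = moving_average([1, 2, 3, 4, 5], window=3)
print(averages)
```

The sketch depends on the input order, which is exactly what a window function would have to establish with an explicit ordering over the time column.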

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
der > > From: Joseph Bradley [mailto:jos...@databricks.com] > Sent: Thursday, February 05, 2015 5:29 PM > To: Ulanov, Alexander > Cc: Evan R. Sparks; dev@spark.apache.org > Subject: Re: Using CUDA within Spark / boosting linear algebra > > Hi Alexander, > > Using GPUs wit

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
; another group in Berkeley. Could you elaborate on how these all might be > connected with Spark Mllib? If you take BIDMat for linear algebra why don’t > you take BIDMach for optimization and learning? > > > > Best regards, Alexander > > > > *From:* Evan R. Sparks [

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases. You might consider taking a look at the codepaths that BIDMat ( https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make th

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan R. Sparks
You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null". See, e.g. http://www.r-bloggers.com/r-na-vs-null/ On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin wrote: > Isn't that just "null" in SQL? > > On Wed, Jan 28, 2015 a

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Evan R. Sparks
I'm +1 on this, although a little worried about unknowingly introducing SparkSQL dependencies every time someone wants to use this. It would be great if the interface can be abstract and the implementation (in this case, SparkSQL backend) could be swapped out. One alternative suggestion on the nam

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Evan R. Sparks
Hmm... Scaler and Scalar are very close together both in terms of pronunciation and spelling - and I wouldn't want to create confusion between the two. Further - this operation (elementwise multiplication by a static vector) is general enough that maybe it should have a more general name? On Tue,

Re: Notes on writing complex spark applications

2014-11-24 Thread Evan R. Sparks
Nov 23, 2014 at 8:27 PM, Inkyu Lee wrote: > > Very helpful!! > > > > thank you very much! > > > > 2014-11-24 2:17 GMT+09:00 Sam Bessalah : > > > >> Thanks Evan, this is great. > >> On Nov 23, 2014 5:58 PM, "Evan R. Sparks" > wrote:

Notes on writing complex spark applications

2014-11-23 Thread Evan R. Sparks
Hi all, Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been working on a short document about writing high performance Spark applications based on our experience developing MLlib, GraphX, ml-matrix, pipelines, etc. It may be a useful document both for users and new Spark develope

Re: Gaussian Mixture Model clustering

2014-09-19 Thread Evan R. Sparks
Hey Meethu - what are you setting "K" to in the benchmarks you show? This can greatly affect the runtime. On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew wrote: > Hi all, > Please find attached the image of benchmark results. The table in the > previous mail got messed up. Thanks. > > > > On Fr

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Evan R. Sparks
There's some work on this going on in the AMP Lab. Create a ticket and we can update with our progress so that we don't duplicate effort. On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa wrote: > Hi RJ, > > Thank you for your comment. I am interested in to have other matrix > operations too. > I wil

Re: Is breeze thread safe in Spark?

2014-09-03 Thread Evan R. Sparks
Additionally, at the higher level, MLlib allocates separate Breeze Vectors/Matrices on a per-executor basis. The only place I can think of where data structures might be overwritten concurrently is in a .aggregate() call, and these calls happen sequentially. RJ - Do you have a JIRA reference for

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass o

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Evan R. Sparks
As Sean mentions, if you can change the data to the standard format, that's probably a good idea. If you'd rather read the data raw, you could write your own version of loadLibSVMFile: a loader function which is very similar to the existing one with a few characters removed
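A custom loader along these lines might look like the sketch below. It is plain Python rather than the MLlib source, the function name and the zero_based flag are hypothetical, and it assumes the usual LIBSVM text layout of "label index:value index:value ...".

```python
def parse_libsvm_line(line, zero_based=False):
    """Parse one LIBSVM-style line into (label, {feature_index: value}).

    The returned indices are always 0-based. With zero_based=False the file
    is assumed to use the standard 1-based indices, so 1 is subtracted.
    """
    fields = line.split()
    label = float(fields[0])
    offset = 0 if zero_based else 1
    features = {}
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index) - offset] = float(value)
    return label, features


standard = parse_libsvm_line("1 1:0.5 3:2.0")
raw = parse_libsvm_line("1 0:0.5 2:2.0", zero_based=True)
print(standard, raw)
```

The only difference between the two modes is the index offset, which fits the remark that a zero-based loader differs from the existing one by just a few characters.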

Re: Contributing to MLlib

2014-07-02 Thread Evan R. Sparks
Hi there, Generally we try to avoid duplicating logic if possible, particularly for algorithms that share a great deal of algorithmic similarity. See, for example, the way we implement Logistic regression vs. Linear regression vs. Linear SVM with different gradient functions all on top of SGD or L
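The design described here, one optimizer shared by several models that differ only in their gradient function, can be sketched as follows. This is an illustration in plain Python, not MLlib code: the function names are hypothetical, and the loop sweeps the examples in a fixed order rather than sampling them, to keep the result reproducible.

```python
import math


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def least_squares_gradient(w, x, y):
    """Linear regression: gradient of 0.5 * (w.x - y)^2."""
    error = dot(w, x) - y
    return [error * xi for xi in x]


def logistic_gradient(w, x, y):
    """Logistic regression with labels y in {0, 1}."""
    p = 1.0 / (1.0 + math.exp(-dot(w, x)))
    return [(p - y) * xi for xi in x]


def hinge_gradient(w, x, y):
    """Linear SVM (hinge loss) with labels y in {-1, +1}."""
    if y * dot(w, x) < 1:
        return [-y * xi for xi in x]
    return [0.0] * len(x)


def sgd(gradient, data, dim, step=0.1, epochs=200):
    """Shared optimizer: only the `gradient` argument changes per model."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            g = gradient(w, x, y)
            w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


w_linear = sgd(least_squares_gradient, [([1.0], 2.0), ([2.0], 4.0)], dim=1)
w_logistic = sgd(logistic_gradient, [([1.0], 1), ([-1.0], 0)], dim=1)
w_svm = sgd(hinge_gradient, [([1.0], 1), ([-1.0], -1)], dim=1)
print(w_linear, w_logistic, w_svm)
```

Adding a new model in this arrangement means writing one more gradient function, while the optimizer loop stays untouched.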

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks
While DBSCAN and others would be welcome contributions, I couldn't agree more with Sean. On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen wrote: > Nobody asked me, and this is a comment on a broader question, not this > one, but: > > In light of a number of recent items about adding more algorithms

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod, This is probably a better question for the Spark users list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets Page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with Spark because the

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-02-25 Thread Evan R. Sparks
Hi everyone, Sorry I'm late to the thread here, but I want to point out a few things. This is, of course, a most welcome contribution and it will be immediately useful to everything currently using the stochastic gradient optimizers! 1) I'm all for refactoring the optimization methods to make the