from:"Evan R. Sparks"

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks

tely > see why it should take longer to transfer the local gradient vectors > in that level, since they are dense in every level. Furthermore, the > driver is receiving the result of only 4 tasks, which is relatively > small. > > Mike > > > On 9/26/15, Evan R. Sparks

Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks

Mike, I believe the reason you're seeing near-identical performance on the gradient computations is twofold: 1) Gradient computations for GLM models are computationally pretty cheap from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call in the dense case, which is well known to

Re: Scan Sharing in Spark

2015-05-05 Thread Evan R. Sparks

Scan sharing can indeed be a useful optimization in Spark, because you amortize not only the time spent scanning over the data, but also the time spent in task launch and scheduling overheads. Here's a trivial example in Scala. I'm not aware of a place in SparkSQL where this is used - I'd imagine that
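The Scala example mentioned in the snippet above is cut off in this listing. As a stand-in, here is a minimal plain-Python sketch of the idea (not the original example, and not Spark code): two aggregates are computed either with one scan each or with a single shared scan. `CountingSource`, `separate_scans`, and `shared_scan` are hypothetical names used only for illustration.

```python
class CountingSource:
    """Hypothetical data source that counts how many full scans it serves."""

    def __init__(self, data):
        self.data = list(data)
        self.scans = 0

    def scan(self):
        self.scans += 1
        return iter(self.data)


def separate_scans(source):
    """Compute sum and max with one full scan per aggregate (two scans)."""
    total = sum(source.scan())
    maximum = max(source.scan())
    return total, maximum


def shared_scan(source):
    """Compute sum and max together in a single scan over the data."""
    total = 0.0
    maximum = float("-inf")
    for value in source.scan():
        total += value
        if value > maximum:
            maximum = value
    return total, maximum
```

Both functions return the same pair of aggregates; the shared version touches the data once, which is the cost the snippet describes amortizing.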

Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks

In general there's a tension between ordered data and the set-oriented data model underlying DataFrames. You can force a total ordering on the data, but it may come at a high cost with respect to performance. It would be good to get a sense of the use case you're trying to support, but one suggestion

Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Evan R. Sparks

ject's readme.md > > https://github.com/fommil/netlib-java/wiki/NVBLAS > > > > Best regards, Alexander > > -Original Message- > > From: Xiangrui Meng [mailto:men...@gmail.com] > > Sent: Monday, March 30, 2015 2:43 PM > > To: Sean Owen > &

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks

les in hdfs https://github.com/twitter/elephant-bird > > > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Thursday, March 26, 2015 2:34 PM > *To:* Stephen Boesch > *Cc:* Ulanov, Alexander; dev@spark.apache.org > *Subject:* Re: Storing large data for

Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks

On binary file formats - I looked at HDF5+Spark a couple of years ago and found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs needed filenames as input, you couldn't pass it anything like an InputStream). I don't know if it has gotten any better. Parquet plays much more nicely a

Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Evan R. Sparks

to make Open BLAS the default - is not always better and I think >> natives really need DevOps buy-in. It's not the right solution for >> everybody. >> On 26 Mar 2015 01:23, "Evan R. Sparks" wrote: >> >>> Yeah, much more reasonable - nice to know that

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks

rch 25, 2015 2:31 PM > To: Sam Halliday > Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks; > jfcanny > Subject: RE: Using CUDA within Spark / boosting linear algebra > > Hi again, > > I finally managed to use nvblas within Spark+netlib-java. It has > except

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks

cblas from Atlas or Openblas because they link to their > implementation and not to Fortran blas. > > Best regards, Alexander > > -Original Message- > From: Ulanov, Alexander > Sent: Tuesday, March 24, 2015 6:57 PM > To: Sam Halliday > Cc: dev@spark.apache.org; Xi

Re: ideas for MLlib development

2015-03-03 Thread Evan R. Sparks

Hi Robert, There's some work to do LDA via Gibbs sampling in this JIRA: https://issues.apache.org/jira/browse/SPARK-1405 as well as this one: https://issues.apache.org/jira/browse/SPARK-5556 It may make sense to have a more general Gibbs sampling framework, but it might be good to have a few desi

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks

netlib-java? >> >> CC'ed Sam, the author of netlib-java. >> >> Best, >> Xiangrui >> >> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley >> wrote: >> > Better documentation for linking would be very helpful! Here's a JIRA: >>

Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks

Mx378T9J5r7kwKSPkY/edit?usp=sharing > > One thing still needs exploration: does BIDMat-cublas perform copying > to/from machine’s RAM? > > -Original Message- > From: Ulanov, Alexander > Sent: Tuesday, February 10, 2015 2:12 PM > To: Evan R. Sparks > Cc: Josep

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks

Josh - thanks for the detailed write-up - this seems a little funny to me. I agree that with the current code path more work is being done than needs to be (e.g. the features are re-scaled at every iteration, but the relatively costly process of fitting the StandardScaler should not be re-do

Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Evan R. Sparks

Well, you can always join as many RDDs as you want by chaining them together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of RDDs in this way but 10 is probably doable. That said - SparkSQL has an optimizer under the covers that can make clever decisions e.g. pushing the predica
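To make the chaining pattern in the snippet above concrete, here is a small plain-Python sketch of an inner join over lists of (key, value) pairs. It is an illustration only, not Spark code; the `join` helper is a hypothetical stand-in whose result shape, `(key, (left_value, right_value))`, is chosen to mirror what a pair-RDD join returns.

```python
from collections import defaultdict


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in index.get(key, ())
    ]


a = [(1, "a1"), (2, "a2")]
b = [(1, "b1"), (2, "b2"), (3, "b3")]
c = [(1, "c1")]

# Chaining nests the values one level deeper per join.
chained = join(join(a, b), c)
```

Each additional join in the chain wraps the accumulated values in another tuple, which is one reason long chains get unwieldy to unpack by hand.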

Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks

ib-java) > interested to compare their libraries. > > > > Best regards, Alexander > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Friday, February 06, 2015 5:58 PM > > *To:* Ulanov, Alexander > *Cc:* Joseph Bradley; dev@spark.apache.org > *Sub

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks

uppose that > netlib is using it. > > > > *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com] > *Sent:* Friday, February 06, 2015 5:19 PM > *To:* Ulanov, Alexander > *Cc:* Joseph Bradley; dev@spark.apache.org > > *Subject:* Re: Using CUDA within Spark / boosting linear algeb

Spark SQL Window Functions

2015-02-08 Thread Evan R. Sparks

Currently there's no standard way of handling time series data in Spark. We were kicking around some ideas in the lab today and one thing that came up was SQL Window Functions as a way to support them and query over time series (do things like moving average, etc.) These don't seem to be implement
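As a rough picture of the moving-average query mentioned in the snippet above, here is a plain-Python sketch of a trailing-window average over time-ordered rows. It is only an analogy for what a SQL window such as `AVG(value) OVER (ORDER BY ts ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW)` computes; `moving_average` and the `(timestamp, value)` row layout are assumptions made for illustration.

```python
from collections import deque


def moving_average(rows, window):
    """Trailing moving average over (timestamp, value) rows.

    Rows are sorted by timestamp first; each output row averages the
    current value and up to window - 1 preceding values.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    buffer = deque(maxlen=window)
    running = 0.0
    result = []
    for timestamp, value in ordered:
        if len(buffer) == window:
            running -= buffer[0]  # value about to be evicted by append
        buffer.append(value)
        running += value
        result.append((timestamp, running / len(buffer)))
    return result
```

Note that the sort step is where the cost of imposing a total order on otherwise unordered data shows up.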

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks

der > > From: Joseph Bradley [mailto:jos...@databricks.com] > Sent: Thursday, February 05, 2015 5:29 PM > To: Ulanov, Alexander > Cc: Evan R. Sparks; dev@spark.apache.org > Subject: Re: Using CUDA within Spark / boosting linear algebra > > Hi Alexander, > > Using GPUs wit

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks

; another group in Berkeley. Could you elaborate on how these all might be > connected with Spark Mllib? If you take BIDMat for linear algebra why don’t > you take BIDMach for optimization and learning? > > > > Best regards, Alexander > > > > *From:* Evan R. Sparks [

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks

I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in many cases. You might consider taking a look at the codepaths that BIDMat ( https://github.com/BIDData/BIDMat) takes and comparing them to netlib-java/breeze. John Canny et al. have done a bunch of work optimizing to make th

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan R. Sparks

You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null". See, e.g. http://www.r-bloggers.com/r-na-vs-null/ On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin wrote: > Isn't that just "null" in SQL? > > On Wed, Jan 28, 2015 a

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Evan R. Sparks

I'm +1 on this, although a little worried about unknowingly introducing SparkSQL dependencies every time someone wants to use this. It would be great if the interface can be abstract and the implementation (in this case, SparkSQL backend) could be swapped out. One alternative suggestion on the nam

Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Evan R. Sparks

Hmm... Scaler and Scalar are very close together both in terms of pronunciation and spelling - and I wouldn't want to create confusion between the two. Further - this operation (elementwise multiplication by a static vector) is general enough that maybe it should have a more general name? On Tue,

Re: Notes on writing complex spark applications

2014-11-24 Thread Evan R. Sparks

Nov 23, 2014 at 8:27 PM, Inkyu Lee wrote: > > Very helpful!! > > > > thank you very much! > > > > 2014-11-24 2:17 GMT+09:00 Sam Bessalah : > > > >> Thanks Evan, this is great. > >> On Nov 23, 2014 5:58 PM, "Evan R. Sparks" > wrote: &

Notes on writing complex spark applications

2014-11-23 Thread Evan R. Sparks

Hi all, Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been working on a short document about writing high performance Spark applications based on our experience developing MLlib, GraphX, ml-matrix, pipelines, etc. It may be a useful document both for users and new Spark develope

Re: Gaussian Mixture Model clustering

2014-09-19 Thread Evan R. Sparks

Hey Meethu - what are you setting "K" to in the benchmarks you show? This can greatly affect the runtime. On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew wrote: > Hi all, > Please find attached the image of benchmark results. The table in the > previous mail got messed up. Thanks. > > > > On Fr

Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Evan R. Sparks

There's some work on this going on in the AMP Lab. Create a ticket and we can update with our progress so that we don't duplicate effort. On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa wrote: > Hi RJ, > > Thank you for your comment. I am interested in to have other matrix > operations too. > I wil

Re: Is breeze thread safe in Spark?

2014-09-03 Thread Evan R. Sparks

Additionally, at the higher level, MLlib allocates separate Breeze Vectors/Matrices on a per-executor basis. The only place I can think of where data structures might be overwritten concurrently is in a .aggregate() call, and these calls happen sequentially. RJ - Do you have a JIRA reference for

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks

If you're thinking along these lines, have a look at the DecisionTree implementation in MLlib. It uses the same idea and is optimized to prevent multiple passes over the data by computing several splits at each level of tree building. The tradeoff is increased model state and computation per pass o

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Evan R. Sparks

As Sean mentions, if you can change the data to the standard format, that's probably a good idea. If you'd rather read the data raw, you could write your own version of loadLibSVMFile: a loader function which is very similar to the existing one with a few characters removed
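A custom loader of the kind suggested above could look roughly like the following plain-Python sketch. It is not the MLlib implementation; `parse_libsvm_line` and its `index_base` parameter are hypothetical. It parses one LIBSVM-style line (`label index:value ...`) and shifts indices so that both one-based (the standard format) and zero-based files map to zero-based feature positions.

```python
def parse_libsvm_line(line, index_base=1):
    """Parse 'label idx:val idx:val ...' into (label, {position: value}).

    index_base is the smallest index used in the file: 1 for the
    standard format, 0 for zero-based data.
    """
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        position = int(index_text) - index_base
        if position < 0:
            raise ValueError(f"index {index_text} is below base {index_base}")
        features[position] = float(value_text)
    return label, features
```

For example, `parse_libsvm_line("1 1:0.5 3:2.0")` and `parse_libsvm_line("1 0:0.5 2:2.0", index_base=0)` describe the same sparse vector.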

Re: Contributing to MLlib

2014-07-02 Thread Evan R. Sparks

Hi there, Generally we try to avoid duplicating logic if possible, particularly for algorithms that share a great deal of algorithmic similarity. See, for example, the way we implement Logistic regression vs. Linear regression vs. Linear SVM with different gradient functions all on top of SGD or L

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks

While DBSCAN and others would be welcome contributions, I couldn't agree more with Sean. On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen wrote: > Nobody asked me, and this is a comment on a broader question, not this > one, but: > > In light of a number of recent items about adding more algorithms

Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks

Hi hyqgod, This is probably a better question for the spark user's list than the dev list (cc'ing user and bcc'ing dev on this reply). To answer your question, though: Amazon's Public Datasets Page is a nice place to start: http://aws.amazon.com/datasets/ - these work well with spark because the

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-02-25 Thread Evan R. Sparks

Hi everyone, Sorry I'm late to the thread here, but I want to point out a few things. This is, of course, a most welcome contribution and it will be immediately useful to everything currently using the stochastic gradient optimizers! 1) I'm all for refactoring the optimization methods to make the

35 matches
