tely
> see why it should take longer to transfer the local gradient vectors
> in that level, since they are dense in every level. Furthermore, the
> driver is receiving the result of only 4 tasks, which is relatively
> small.
>
> Mike
>
>
> On 9/26/15, Evan R. Sparks
Mike,
I believe the reason you're seeing near-identical performance on the
gradient computations is twofold:
1) Gradient computations for GLM models are computationally pretty cheap
from a FLOPs-per-byte-read perspective. They are essentially a BLAS "gemv"
call in the dense case, which is well known to
Scan sharing can indeed be a useful optimization in Spark, because you
amortize not only the time spent scanning over the data, but also the time
spent in task launch and scheduling overheads.
Here's a trivial example in scala. I'm not aware of a place in SparkSQL
where this is used - I'd imagine that
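The "trivial example in scala" mentioned above was cut off in this snippet. A minimal sketch of the idea (names and data are hypothetical; assumes a local SparkContext) is to fold several statistics in one aggregate() call rather than paying for one scan per action:

```scala
// Hypothetical sketch of scan sharing over an RDD: instead of one pass per
// statistic, compute both in a single aggregate() so the scan and the
// task-launch/scheduling overheads are paid only once.
import org.apache.spark.{SparkConf, SparkContext}

object ScanSharingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("scan-sharing").setMaster("local[*]"))
    val data = sc.parallelize(1 to 1000).map(_.toDouble)

    // Two scans: each action launches its own set of tasks.
    val sum = data.sum()
    val count = data.count()

    // One shared scan: fold both statistics in a single pass.
    val (sharedSum, sharedCount) = data.aggregate((0.0, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),  // per-partition fold
      (a, b) => (a._1 + b._1, a._2 + b._2)   // merge partial results
    )
    println(s"mean = ${sharedSum / sharedCount}")
    sc.stop()
  }
}
```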
In general there's a tension between ordered data and the set-oriented data
model underlying DataFrames. You can force a total ordering on the data,
but it may come at a high cost with respect to performance.
It would be good to get a sense of the use case you're trying to support,
but one suggestion
ject's readme.md
> > https://github.com/fommil/netlib-java/wiki/NVBLAS
> >
> > Best regards, Alexander
> > -Original Message-
> > From: Xiangrui Meng [mailto:men...@gmail.com]
> > Sent: Monday, March 30, 2015 2:43 PM
> > To: Sean Owen
> &
les in hdfs https://github.com/twitter/elephant-bird
>
>
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, March 26, 2015 2:34 PM
> *To:* Stephen Boesch
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Storing large data for
On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.
Parquet plays much more nicely a
to make Open BLAS the default - is not always better and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>> On 26 Mar 2015 01:23, "Evan R. Sparks" wrote:
>>
>>> Yeah, much more reasonable - nice to know that
rch 25, 2015 2:31 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks;
> jfcanny
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> except
cblas from Atlas or Openblas because they link to their
> implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xi
Hi Robert,
There's some work to do LDA via Gibbs sampling in this JIRA:
https://issues.apache.org/jira/browse/SPARK-1405 as well as this one:
https://issues.apache.org/jira/browse/SPARK-5556
It may make sense to have a more general Gibbs sampling framework, but it
might be good to have a few desi
netlib-java?
>>
>> CC'ed Sam, the author of netlib-java.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley
>> wrote:
>> > Better documentation for linking would be very helpful! Here's a JIRA:
>>
Mx378T9J5r7kwKSPkY/edit?usp=sharing
>
> One thing still needs exploration: does BIDMat-cublas perform copying
> to/from machine’s RAM?
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Tuesday, February 10, 2015 2:12 PM
> To: Evan R. Sparks
> Cc: Josep
Josh - thanks for the detailed write-up - this seems a little funny to me.
I agree that with the current code path there is more work being done than
needs to be (e.g. the features are re-scaled at every iteration, but the
relatively costly process of fitting the StandardScaler should not be
re-do
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I wouldn't join thousands of RDDs
this way, but 10 is probably doable.
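A small sketch of what that chaining looks like in practice (hypothetical data; assumes a SparkContext `sc` is in scope): each join nests the value tuple one level deeper, so the pattern match grows with the chain.

```scala
// Sketch: chaining joins across several pair RDDs keyed the same way.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

// a.join(b) yields (key, (aVal, bVal)); joining c nests the tuple further,
// giving RDD[(Int, ((String, String), String))].
val joined = a.join(b).join(c)
joined.collect().foreach { case (k, ((av, bv), cv)) =>
  println(s"$k: $av $bv $cv")
}
```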
That said - SparkSQL has an optimizer under the covers that can make clever
decisions e.g. pushing the predica
ib-java)
> interested to compare their libraries.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Friday, February 06, 2015 5:58 PM
>
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; dev@spark.apache.org
> *Sub
uppose that
> netlib is using it.
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Friday, February 06, 2015 5:19 PM
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; dev@spark.apache.org
>
> *Subject:* Re: Using CUDA within Spark / boosting linear algeb
Currently there's no standard way of handling time series data in Spark. We
were kicking around some ideas in the lab today and one thing that came up
was SQL Window Functions as a way to support them and query over time
series (do things like moving average, etc.)
These don't seem to be implement
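For reference, the moving-average idea described above can be sketched with a SQL-style window function, which landed in later Spark versions (after this thread); this assumes a hypothetical DataFrame `ts` with columns "time" and "value":

```scala
// Sketch: a moving average over a time series via a window function.
// Assumes a DataFrame `ts(time, value)`; column names are hypothetical.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val w = Window.orderBy("time").rowsBetween(-2, 0)  // current row + 2 preceding
val smoothed = ts.withColumn("movingAvg", avg("value").over(w))
```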
der
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs wit
; another group in Berkeley. Could you elaborate on how these all might be
> connected with Spark Mllib? If you take BIDMat for linear algebra why don’t
> you take BIDMach for optimization and learning?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Evan R. Sparks [
I'd expect that we can make GPU-accelerated BLAS faster than CPU BLAS in
many cases.
You might consider taking a look at the code paths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
to make th
You've got to be a little bit careful here. "NA" in systems like R or
pandas may have special meaning that is distinct from "null".
See, e.g. http://www.r-bloggers.com/r-na-vs-null/
On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin wrote:
> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 a
I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.
One alternative suggestion on the nam
Hmm... Scaler and Scalar are very close together both in terms of
pronunciation and spelling - and I wouldn't want to create confusion
between the two. Further - this operation (elementwise multiplication by a
static vector) is general enough that maybe it should have a more general
name?
On Tue,
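The operation under discussion (elementwise multiplication of each feature vector by a fixed scaling vector) can be sketched as follows; this is a hypothetical standalone helper, not the API that was eventually adopted:

```scala
// Hypothetical sketch: elementwise multiplication of a feature vector by a
// fixed scaling vector (the transformer being named in this thread).
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def elementwiseProduct(scalingVec: Vector)(v: Vector): Vector = {
  require(v.size == scalingVec.size, "vector sizes must match")
  val out = new Array[Double](v.size)
  var i = 0
  while (i < v.size) {
    out(i) = v(i) * scalingVec(i)
    i += 1
  }
  Vectors.dense(out)
}
```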
Nov 23, 2014 at 8:27 PM, Inkyu Lee wrote:
> > Very helpful!!
> >
> > thank you very much!
> >
> > 2014-11-24 2:17 GMT+09:00 Sam Bessalah :
> >
> >> Thanks Evan, this is great.
> >> On Nov 23, 2014 5:58 PM, "Evan R. Sparks"
> wrote:
&
Hi all,
Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
working on a short document about writing high performance Spark
applications based on our experience developing MLlib, GraphX, ml-matrix,
pipelines, etc. It may be a useful document both for users and new Spark
develope
Hey Meethu - what are you setting "K" to in the benchmarks you show? This
can greatly affect the runtime.
On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew
wrote:
> Hi all,
> Please find attached the image of benchmark results. The table in the
> previous mail got messed up. Thanks.
>
>
>
> On Fr
There's some work on this going on in the AMP Lab. Create a ticket and we
can update with our progress so that we don't duplicate effort.
On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa
wrote:
> Hi RJ,
>
> Thank you for your comment. I am interested in to have other matrix
> operations too.
> I wil
Additionally, at the higher level, MLlib allocates separate Breeze
Vectors/Matrices on a per-executor basis. The only place I can think of
where data structures might be over-written concurrently is in an
.aggregate() call, and these calls happen sequentially.
RJ - Do you have a JIRA reference for
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass o
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write
your own version of loadLibSVMFile - a loader function which is very
similar to the existing one with a few characters
removed
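A minimal sketch of such a custom loader (assuming a libSVM-like line format "label idx1:v1 idx2:v2 ..." with 1-based indices; the function name and signature are hypothetical, loosely modeled on MLUtils.loadLibSVMFile):

```scala
// Hypothetical minimal variant of loadLibSVMFile for a libSVM-like format.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def loadCustomFile(sc: SparkContext, path: String,
                   numFeatures: Int): RDD[LabeledPoint] = {
  sc.textFile(path).map(_.trim).filter(_.nonEmpty).map { line =>
    val parts = line.split("\\s+")
    val label = parts.head.toDouble
    val (indices, values) = parts.tail.map { item =>
      val Array(i, v) = item.split(":")
      (i.toInt - 1, v.toDouble)  // convert 1-based index to 0-based
    }.unzip
    LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
  }
}
```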
Hi there,
Generally we try to avoid duplicating logic if possible, particularly for
algorithms that share a great deal of algorithmic similarity. See, for
example, the way we implement Logistic regression vs. Linear regression vs.
Linear SVM with different gradient functions all on top of SGD or L
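The sharing pattern described here can be sketched as a single optimizer parameterized by a gradient; this is an illustrative local (non-distributed) toy, not MLlib's actual Gradient/GradientDescent classes:

```scala
// Sketch: one SGD loop shared across models that differ only in their
// loss gradient (least squares vs. logistic). Names are hypothetical.
import breeze.linalg.{DenseVector => BDV}

trait Gradient {
  // Gradient of the loss at `weights` for a single example (x, label).
  def compute(x: BDV[Double], label: Double, weights: BDV[Double]): BDV[Double]
}

object LeastSquaresGradient extends Gradient {
  def compute(x: BDV[Double], label: Double, w: BDV[Double]): BDV[Double] =
    x * ((w dot x) - label)
}

object LogisticGradient extends Gradient {
  def compute(x: BDV[Double], label: Double, w: BDV[Double]): BDV[Double] = {
    val margin = w dot x
    x * (1.0 / (1.0 + math.exp(-margin)) - label)
  }
}

// The same driver works for either gradient.
def sgd(data: Seq[(BDV[Double], Double)], grad: Gradient,
        stepSize: Double, iters: Int): BDV[Double] = {
  var w = BDV.zeros[Double](data.head._1.length)
  for (_ <- 1 to iters; (x, y) <- data)
    w = w - grad.compute(x, y, w) * stepSize
  w
}
```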
While DBSCAN and others would be welcome contributions, I couldn't agree
more with Sean.
On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen wrote:
> Nobody asked me, and this is a comment on a broader question, not this
> one, but:
>
> In light of a number of recent items about adding more algorithms
Hi hyqgod,
This is probably a better question for the Spark user list than the dev
list (cc'ing user and bcc'ing dev on this reply).
To answer your question, though:
Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
the
Hi everyone,
Sorry I'm late to the thread here, but I want to point out a few things.
This is, of course, a most welcome contribution and it will be immediately
useful to everything currently using the stochastic gradient optimizers!
1) I'm all for refactoring the optimization methods to make the