Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Evan R. Sparks
Off the top of my head, I'm not sure, but it looks like virtually all the
extra time between each stage is accounted for with T_{io} in your plot,
which I'm guessing is time spent communicating results over the network? Is
your driver running on the master or is it on a different node? If you look
at the code for treeAggregate, the last stage uses a .reduce() for the
final combination, which happens on the driver. In this case, the size of
the gradients is O(1GB) so if you've got to go over a slow link for the
last portion this could really make a difference.
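(For reference, a minimal sketch - not the MLlib source; the element type,
dimension handling, and depth are illustrative - of what this aggregation
looks like. Everything up to the last level runs as tasks on the executors;
the final combine of the partial sums effectively happens on the driver,
which is where a slow driver link would bite.)

import org.apache.spark.rdd.RDD

// Sum dense per-point gradients with a 4-level tree; the last combination
// of the partially aggregated vectors runs at the driver.
def aggregateGradients(grads: RDD[Array[Double]], dim: Int): Array[Double] =
  grads.treeAggregate(new Array[Double](dim))(
    seqOp = (acc, g) => { var i = 0; while (i < dim) { acc(i) += g(i); i += 1 }; acc },
    combOp = (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
    depth = 4)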

On Sat, Sep 26, 2015 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote:

> Hi Evan,
>
> (I just realized my initial email was a reply to the wrong thread; I'm
> very sorry about this).
>
> Thanks for your email, and your thoughts on the sampling. That the
> gradient computations are essentially the cost of a pass through each
> element of the partition makes sense, especially given the sparsity of
> the feature vectors.
>
> Would you have any idea why the communication time is so much larger
> in the final level of the aggregation, however? I can't immediately
> see why it should take longer to transfer the local gradient vectors
> in that level, since they are dense in every level. Furthermore, the
> driver is receiving the result of only 4 tasks, which is relatively
> small.
>
> Mike
>
>
> On 9/26/15, Evan R. Sparks  wrote:
> > Mike,
> >
> > I believe the reason you're seeing near identical performance on the
> > gradient computations is twofold:
> > 1) Gradient computations for GLM models are computationally pretty cheap
> > from a FLOPs/byte read perspective. They are essentially a BLAS "gemv"
> > call in the dense case, which is well known to be bound by memory
> > bandwidth on modern processors. So, you're basically paying the cost of a
> > scan of the points you've sampled to do the gradient computation.
> > 2) The default sampling mechanism used by the GradientDescent optimizer
> > in MLlib is implemented via RDD.sample, which does reservoir sampling on
> > each partition. This requires a full scan of each partition at every
> > iteration to collect the samples.
> >
> > So - you're going to pay the cost of a scan to do the sampling anyway,
> > and the gradient computation is essentially free at this point (and can
> > be pipelined, etc.).
> >
> > It is quite possible to improve #2 by coming up with a better sampling
> > algorithm. One easy algorithm would be to assume the data is already
> > randomly shuffled (or do that once) and then use the first
> > miniBatchFraction*partitionSize records on the first iteration, the
> > second set on the second iteration, and so on. You could prototype this
> > algorithm pretty easily by converting your data to an
> > RDD[Array[DenseVector]] and doing some bookkeeping at each iteration.
> >
> > That said - eventually the overheads of the platform catch up to you. As
> > a rule of thumb I estimate about 50ms/iteration as a floor for things
> > like task serialization and other platform overheads. You've got to
> > balance how much computation you want to do vs. the amount of time you
> > want to spend waiting for the platform.
> >
> > - Evan
> >
> > On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com> wrote:
> >
> >> Hello Devs,
> >>
> >> This email concerns some timing results for a treeAggregate in
> >> computing a (stochastic) gradient over an RDD of labelled points, as
> >> is currently done in the MLlib optimization routine for SGD.
> >>
> >> In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
> >> and the subgradients over all the instances in the downsampled RDD are
> >> aggregated to the driver as a dense vector. However, we have noticed
> >> some unusual behaviour when f < 1: it takes the same amount of time to
> >> compute the stochastic gradient for a stochastic minibatch as it does
> >> for a full batch (f = 1).
> >>
> >> Attached are two plots of the mean task timing metrics for each level
> >> in the aggregation, which has been performed with 4 levels (level 4 is
> >> the final level, in which the results are communicated to the driver).
> >> 16 nodes are used, and the RDD has 256 partitions. We run in (client)
> >> standalone mode. Here, the total time for the tasks is shown (\tau)
> >> alongside the execution time (not counting GC),
> >> serialization/deserialization time, the GC time, and the difference
> >> between tau and all other times, assumed to be variable
> >> IO/communication/waiting time.

Re: RDD API patterns

2015-09-26 Thread Evan R. Sparks
Mike,

I believe the reason you're seeing near identical performance on the
gradient computations is twofold:
1) Gradient computations for GLM models are computationally pretty cheap
from a FLOPs/byte read perspective. They are essentially a BLAS "gemv" call
in the dense case, which is well known to be bound by memory bandwidth on
modern processors. So, you're basically paying the cost of a scan of the
points you've sampled to do the gradient computation.
2) The default sampling mechanism used by the GradientDescent optimizer in
MLlib is implemented via RDD.sample, which does reservoir sampling on each
partition. This requires a full scan of each partition at every iteration
to collect the samples.

So - you're going to pay the cost of a scan to do the sampling anyway, and
the gradient computation is essentially free at this point (and can be
pipelined, etc.).
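(To make point 1 concrete, here is a sketch - the names are mine, not
MLlib's - of a per-example logistic gradient in breeze: one dot product and
one scaled copy of x, i.e. O(d) flops against O(d) bytes of features read,
which is exactly the regime where memory bandwidth rather than the ALUs
sets the speed limit.)

import breeze.linalg.DenseVector

// One pass over x for the margin, a second for the scaled copy: roughly
// two flops per 8-byte double read, hence bandwidth-bound.
def logisticGradient(w: DenseVector[Double],
                     x: DenseVector[Double],
                     label: Double): DenseVector[Double] = {
  val margin = w dot x
  val scale = 1.0 / (1.0 + math.exp(-margin)) - label
  x * scale
}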

It is quite possible to improve #2 by coming up with a better sampling
algorithm. One easy algorithm would be to assume the data is already
randomly shuffled (or do that once) and then use the first
miniBatchFraction*partitionSize records on the first iteration, the second
set on the second iteration, and so on. You could prototype this
algorithm pretty easily by converting your data to an
RDD[Array[DenseVector]] and doing some bookkeeping at each iteration.
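(A rough sketch of that bookkeeping, under the stated assumption that each
partition has been shuffled once and materialized as a single array - e.g.
via glom - so iteration k reads only its own contiguous slice rather than
re-scanning the whole partition the way RDD.sample must:)

import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// Take the k-th contiguous block of each partition, wrapping at the end.
def sliceForIteration(data: RDD[Array[DenseVector[Double]]],
                      miniBatchFraction: Double,
                      iteration: Int): RDD[Seq[DenseVector[Double]]] =
  data.map { part =>
    val batchSize = math.max(1, (miniBatchFraction * part.length).toInt)
    val start = ((iteration.toLong * batchSize) % part.length).toInt
    (0 until batchSize).map(i => part((start + i) % part.length))
  }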

That said - eventually the overheads of the platform catch up to you. As a
rule of thumb I estimate about 50ms/iteration as a floor for things like
task serialization and other platform overheads. You've got to balance how
much computation you want to do vs. the amount of time you want to spend
waiting for the platform.

- Evan

On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com> wrote:

> Hello Devs,
>
> This email concerns some timing results for a treeAggregate in
> computing a (stochastic) gradient over an RDD of labelled points, as
> is currently done in the MLlib optimization routine for SGD.
>
> In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
> and the subgradients over all the instances in the downsampled RDD are
> aggregated to the driver as a dense vector. However, we have noticed
> some unusual behaviour when f < 1: it takes the same amount of time to
> compute the stochastic gradient for a stochastic minibatch as it does
> for a full batch (f = 1).
>
> Attached are two plots of the mean task timing metrics for each level
> in the aggregation, which has been performed with 4 levels (level 4 is
> the final level, in which the results are communicated to the driver).
> 16 nodes are used, and the RDD has 256 partitions. We run in (client)
> standalone mode. Here, the total time for the tasks is shown (\tau)
> alongside the execution time (not counting GC),
> serialization/deserialization time, the GC time, and the difference
> between tau and all other times, assumed to be variable
> IO/communication/waiting time. The RDD in this case is a labelled
> point representation of the KDD Bridge to Algebra dataset, with 20M
> (sparse) instances and a problem dimension of 30M. The sparsity of the
> instances is very high---each individual instance vector may have only
> a hundred nonzeros. All metrics have been taken from the JSON Spark
> event logs.
>
> The plot gradient_f1.pdf shows the times for a gradient computation
> with f = 1, and gradient_f-3.pdf shows the same metrics with f = 1e-3.
> For other f values in {1e-1 1e-2 ... 1e-5}, the same effect is
> observed.
>
> What I would like to mention about these plots, and ask if anyone has
> experience with, is the following:
> 1. The times are essentially identical; I would have thought that
> downsampling the RDD before aggregating the subgradients would at
> least reduce the execution time required, if not the
> communication/serialization times.
> 2. The serialization time in level 4 is almost entirely from the
> result serialization to the driver, and not the task deserialization.
> In each level of the treeAggregation, however, the local (dense)
> gradients have to be communicated between compute nodes, so I am
> surprised that it takes so much longer to return the vectors to the
> driver.
>
> I initially wondered if the large IO overhead in the last stage had
> anything to do with client mode vs cluster mode, since, from what I
> understand, only a single core is allocated to the driver thread in
> client mode. However, when running tests in the two modes, I have
> previously seen no appreciable difference in the running time for
> other (admittedly smaller) problems. Furthermore, I am still very
> confused about why the execution time for each task is just as large
> for the downsampled RDD. It seems unlikely that sampling each
> partition would be as expensive as the gradient computations, even for
> sparse feature vectors.
>
> If anyone has experience working with the sampling in minibatch SGD or
> has tested the scalability of the treeAggregation operation for
> vectors, I'd really appreciate it.

Re: Scan Sharing in Spark

2015-05-05 Thread Evan R. Sparks
Scan sharing can indeed be a useful optimization in spark, because you
amortize not only the time spent scanning over the data, but also time
spent in task launch and scheduling overheads.

Here's a trivial example in scala. I'm not aware of a place in SparkSQL
where this is used - I'd imagine that most development effort is being
placed on single-query optimization right now.

// This function takes a sequence of functions of type A => B and returns a
// function of type A => Seq[B], where each item in the output sequence
// corresponds to one of the input functions applied to the argument.
def combineFunctions[A, B](fns: Seq[A => B]): A => Seq[B] = {
  def combf(a: A): Seq[B] = {
    fns.map(f => f(a))
  }
  combf
}

def plusOne(x: Int) = x + 1
def timesFive(x: Int) = x * 5

val sharedF = combineFunctions(Seq[Int => Int](plusOne, timesFive))

val data = sc.parallelize(Array(1,2,3,4,5,6,7))

//Apply this combine function to each of your data elements.
val res = data.map(sharedF)

res.take(5)

The result will look something like this.

res5: Array[Seq[Int]] = Array(List(2, 5), List(3, 10), List(4, 15), List(5,
20), List(6, 25))



On Tue, May 5, 2015 at 8:53 AM, Quang-Nhat HOANG-XUAN  wrote:

> Hi everyone,
>
> I have two Spark jobs inside a Spark Application, which read from the same
> input file.
> They are executed in 2 threads.
>
> Right now, I cache the input file into memory before executing these two
> jobs.
>
> Is there another way to share the same input with just one read?
> I know there is something called Multiple Query Optimization, but I don't
> know whether it is applicable to Spark (or SparkSQL).
>
> Thank you.
>
> Quang-Nhat
>


Re: Pandas' Shift in Dataframe

2015-04-29 Thread Evan R. Sparks
In general there's a tension between ordered data and the set-oriented data
model underlying DataFrames. You can force a total ordering on the data,
but it may come at a high cost with respect to performance.

It would be good to get a sense of the use case you're trying to support,
but one suggestion: I can imagine achieving a similar result by applying a
datetime.timedelta (in Python terms) to a time attribute (your "axis") and
then performing a join between the base table and this derived table to
merge the data back together. This type of join could then be optimized if
the use case is frequent enough to warrant it.
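(A sketch of that suggestion in the Scala DataFrame API - the column names
ts and value and the fixed numeric period are mine, purely for
illustration:)

import org.apache.spark.sql.DataFrame

// Re-key each row to the next period's timestamp, then join the derived
// table back to the base table so every row sees the previous row's value.
def shiftJoin(df: DataFrame, period: Long): DataFrame = {
  val shifted = df.select((df("ts") + period).as("ts_next"),
                          df("value").as("prev_value"))
  df.join(shifted, df("ts") === shifted("ts_next"))
    .select(df("ts"), df("value"), shifted("prev_value"))
}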

- Evan

On Wed, Apr 29, 2015 at 1:25 PM, Reynold Xin  wrote:

> In this case it's fine to discuss whether this would fit in Spark
> DataFrames' high level direction before putting it in JIRA. Otherwise we
> might end up creating a lot of tickets just for querying whether something
> might be a good idea.
>
> About this specific feature -- I'm not sure what it means in general given
> we don't have axis in Spark DataFrames. But I think it'd probably be good
> to be able to shift a column by one so we can support the end time / begin
> time case, although it'd require two passes over the data.
>
>
>
> On Wed, Apr 29, 2015 at 1:08 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
> > I can't comment on the direction of the DataFrame API (that's more for
> > Reynold or Michael I guess), but I just wanted to point out that the JIRA
> > would be the recommended way to create a central place for discussing a
> > feature add like that.
> >
> > Nick
> >
> > On Wed, Apr 29, 2015 at 3:43 PM Olivier Girardot <
> > o.girar...@lateral-thoughts.com> wrote:
> >
> > > Hi Nicholas,
> > > yes I've already checked, and I've just created the
> > > https://issues.apache.org/jira/browse/SPARK-7247
> > > I'm not even sure why this would be a good feature to add except the
> > > fact that some of the data scientists I'm working with are using it,
> > > and it would therefore be useful for me to translate Pandas code to
> > > Spark...
> > >
> > > Isn't the goal of Spark Dataframe to allow all the features of Pandas/R
> > > Dataframe using Spark?
> > >
> > > Regards,
> > >
> > > Olivier.
> > >
> > > Le mer. 29 avr. 2015 à 21:09, Nicholas Chammas <
> > nicholas.cham...@gmail.com>
> > > a écrit :
> > >
> > >> You can check JIRA for any existing plans. If there isn't any, then
> feel
> > >> free to create a JIRA and make the case there for why this would be a
> > good
> > >> feature to add.
> > >>
> > >> Nick
> > >>
> > >> On Wed, Apr 29, 2015 at 7:30 AM Olivier Girardot <
> > >> o.girar...@lateral-thoughts.com> wrote:
> > >>
> > >>> Hi,
> > >>> Is there any plan to add the "shift" method from Pandas to Spark
> > >>> Dataframe,
> > >>> not that I think it's an easy task...
> > >>>
> > >>> c.f.
> > >>>
> > >>>
> >
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html
> > >>>
> > >>> Regards,
> > >>>
> > >>> Olivier.
> > >>>
> > >>
> >
>


Re: Using CUDA within Spark / boosting linear algebra

2015-04-02 Thread Evan R. Sparks
Yeah, thanks Alex!

On Thu, Apr 2, 2015 at 5:05 PM, Xiangrui Meng  wrote:

> This is great! Thanks! -Xiangrui
>
> On Wed, Apr 1, 2015 at 12:11 PM, Ulanov, Alexander
>  wrote:
> > FYI, I've added instructions to Netlib-java wiki, Sam added the link to
> them from the project's readme.md
> > https://github.com/fommil/netlib-java/wiki/NVBLAS
> >
> > Best regards, Alexander
> > -Original Message-
> > From: Xiangrui Meng [mailto:men...@gmail.com]
> > Sent: Monday, March 30, 2015 2:43 PM
> > To: Sean Owen
> > Cc: Evan R. Sparks; Sam Halliday; dev@spark.apache.org; Ulanov,
> Alexander; jfcanny
> > Subject: Re: Using CUDA within Spark / boosting linear algebra
> >
> > Hi Alex,
> >
> > Since it is non-trivial to make nvblas work with netlib-java, it would
> be great if you can send the instructions to netlib-java as part of the
> README. Hopefully we don't need to modify netlib-java code to use nvblas.
> >
> > Best,
> > Xiangrui
> >
> > On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen  wrote:
> >> The license issue is with libgfortran, rather than OpenBLAS.
> >>
> >> (FWIW I am going through the motions to get OpenBLAS set up by default
> >> on CDH in the near future, and the hard part is just handling
> >> libgfortran.)
> >>
> >> On Thu, Mar 26, 2015 at 4:07 PM, Evan R. Sparks 
> wrote:
> >>> Alright Sam - you are the expert here. If the GPL issues are
> >>> unavoidable, that's fine - what is the exact bit of code that is GPL?
> >>>
> >>> The suggestion to use OpenBLAS is not to say it's the best option,
> >>> but that it's a *free, reasonable default* for many users - keep in
> >>> mind the most common deployment for Spark/MLlib is on 64-bit linux on
> EC2[1].
> >>> Additionally, for many of the problems we're targeting, this
> >>> reasonable default can provide a 1-2 orders of magnitude improvement
> >>> in performance over the f2jblas implementation that netlib-java falls
> back on.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For
> >> additional commands, e-mail: dev-h...@spark.apache.org
> >>
>


Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
Protobufs are great for serializing individual records - but parquet is
good for efficiently storing a whole bunch of these objects.

Matt Massie has a good (slightly dated) blog post on using
Spark+Parquet+Avro (and you can pretty much s/Avro/Protobuf/) describing
how they all work together here:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Your use case (storing dense features, presumably as a single column) is
pretty straightforward and the extra layers of indirection are maybe
overkill.

Lastly - you might consider using some of SparkSQL/DataFrame's built-in
features for persistence, which support lots of storage backends.
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources
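(For instance, a minimal sketch against the Spark 1.3 API - the case class
and path are hypothetical:)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

case class LabeledFeatures(label: Double, features: Array[Double])

// Write the feature vectors out as Parquet and read them back through the
// data sources API; the schema travels with the file.
def roundTrip(sqlContext: SQLContext, data: RDD[LabeledFeatures]): Unit = {
  import sqlContext.implicits._
  data.toDF().saveAsParquetFile("hdfs:///tmp/features.parquet")
  val back = sqlContext.parquetFile("hdfs:///tmp/features.parquet")
  back.printSchema()
}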

On Thu, Mar 26, 2015 at 2:51 PM, Ulanov, Alexander 
wrote:

>  Thanks, Evan. What do you think about Protobuf? Twitter has a library to
> manage protobuf files in hdfs https://github.com/twitter/elephant-bird
>
>
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, March 26, 2015 2:34 PM
> *To:* Stephen Boesch
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Storing large data for MLlib machine learning
>
>
>
> On binary file formats - I looked at HDF5+Spark a couple of years ago and
> found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
> needed filenames as input, you couldn't pass it anything like an
> InputStream). I don't know if it has gotten any better.
>
>
>
> Parquet plays much more nicely and there are lots of spark-related
> projects using it already. Keep in mind that it's column-oriented which
> might impact performance - but basically you're going to want your features
> in a byte array and deser should be pretty straightforward.
>
>
>
> On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:
>
> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
>
> > Hi,
> >
> > Could you suggest what would be a reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> > best practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easily loaded/serialized, randomly
> > accessible, with a small footprint (binary). I am considering Parquet,
> > hdf5, protocol buffer (protobuf), but I have little to no experience
> > with them, so any suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>
>
>


Re: Storing large data for MLlib machine learning

2015-03-26 Thread Evan R. Sparks
On binary file formats - I looked at HDF5+Spark a couple of years ago and
found it barely JVM-friendly and very Hadoop-unfriendly (e.g. the APIs
needed filenames as input, you couldn't pass it anything like an
InputStream). I don't know if it has gotten any better.

Parquet plays much more nicely and there are lots of spark-related projects
using it already. Keep in mind that it's column-oriented which might impact
performance - but basically you're going to want your features in a byte
array and deser should be pretty straightforward.

On Thu, Mar 26, 2015 at 2:26 PM, Stephen Boesch  wrote:

> There are some convenience methods you might consider including:
>
>MLUtils.loadLibSVMFile
>
> and   MLUtils.loadLabeledPoint
>
> 2015-03-26 14:16 GMT-07:00 Ulanov, Alexander :
>
> > Hi,
> >
> > Could you suggest what would be a reasonable file format to store
> > feature vector data for machine learning in Spark MLlib? Are there any
> > best practices for Spark?
> >
> > My data is dense feature vectors with labels. Some of the requirements
> > are that the format should be easily loaded/serialized, randomly
> > accessible, with a small footprint (binary). I am considering Parquet,
> > hdf5, protocol buffer (protobuf), but I have little to no experience
> > with them, so any suggestions would be really appreciated.
> >
> > Best regards, Alexander
> >
>


Re: Using CUDA within Spark / boosting linear algebra

2015-03-26 Thread Evan R. Sparks
Alright Sam - you are the expert here. If the GPL issues are unavoidable,
that's fine - what is the exact bit of code that is GPL?

The suggestion to use OpenBLAS is not to say it's the best option, but that
it's a *free, reasonable default* for many users - keep in mind the most
common deployment for Spark/MLlib is on 64-bit linux on EC2[1].
Additionally, for many of the problems we're targeting, this reasonable
default can provide a 1-2 orders of magnitude improvement in performance
over the f2jblas implementation that netlib-java falls back on.

The JVM issues are trickier, I agree - so it sounds like a good user guide
explaining the tradeoffs and configuration procedures as they relate to
Spark is a reasonable way forward.

[1] -
https://gigaom.com/2015/01/27/a-few-interesting-numbers-about-apache-spark/
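(One quick sanity check - not from this thread, but it uses netlib-java's
own API - is to print which BLAS implementation actually got loaded: a
native system library reports NativeSystemBLAS, while the pure-Java
fallback reports F2jBLAS.)

// Run in the spark-shell after setting LD_LIBRARY_PATH / LD_PRELOAD.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
println(com.github.fommil.netlib.LAPACK.getInstance().getClass.getName)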

On Thu, Mar 26, 2015 at 12:54 AM, Sam Halliday 
wrote:

> Btw, OpenBLAS requires GPL runtime binaries which are typically considered
> "system libraries" (and these fall under something similar to the Java
> classpath exception rule)... so it's basically impossible to distribute
> OpenBLAS the way you're suggesting, sorry. Indeed, there is work ongoing in
> Spark right now to clear up something of this nature.
>
> On a more technical level, I'd recommend watching my talk at ScalaX which
> explains in detail why high performance only comes from machine optimised
> binaries, which requires DevOps buy-in (and, I'd recommend using MKL anyway
> on the CPU, not OpenBLAS).
>
> On an even deeper level, using natives has consequences to JIT and GC
> which isn't suitable for everybody and we'd really like people to go into
> that with their eyes wide open.
> On 26 Mar 2015 07:43, "Sam Halliday"  wrote:
>
>> I'm not at all surprised ;-) I fully expect the GPU performance to get
>> better automatically as the hardware improves.
>>
>> Netlib natives still need to be shipped separately. I'd also oppose any
>> move to make OpenBLAS the default - it's not always better and I think
>> natives really need DevOps buy-in. It's not the right solution for
>> everybody.
>> On 26 Mar 2015 01:23, "Evan R. Sparks"  wrote:
>>
>>> Yeah, much more reasonable - nice to know that we can get full GPU
>>> performance from breeze/netlib-java - meaning there's no compelling
>>> performance reason to switch out our current linear algebra library (at
>>> least as far as this benchmark is concerned).
>>>
>>> Instead, it looks like a user guide for configuring Spark/MLlib to use
>>> the right BLAS library will get us most of the way there. Or, would it make
>>> sense to finally ship openblas compiled for some common platforms (64-bit
>>> linux, windows, mac) directly with Spark - hopefully eliminating the jblas
>>> warnings once and for all for most users? (Licensing is BSD) Or am I
>>> missing something?
>>>
>>> On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander <
>>> alexander.ula...@hp.com> wrote:
>>>
>>>> As everyone suggested, the results were too good to be true, so I
>>>> double-checked them. It turns out that nvblas did not do multiplication due to
>>>> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
>>>> previously posted results with nvblas are matrices copying only. The
>>>> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
>>>> handpicked other values that worked. As a result, netlib+nvblas is on par
>>>> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
>>>> configuration.
>>>>
>>>>
>>>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>>>
>>>>
>>>>
>>>> -Original Message-
>>>> From: Ulanov, Alexander
>>>> Sent: Wednesday, March 25, 2015 2:31 PM
>>>> To: Sam Halliday
>>>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R.
>>>> Sparks; jfcanny
>>>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>>>
>>>> Hi again,
>>>>
>>>> I finally managed to use nvblas within Spark+netlib-java. It has
>>>> exceptional performance for big matrices with Double, faster than
>>>> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
>>>> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
>>>> original nvblas presentation on GPU conf 2013 (slide 21):
>>>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
Yeah, much more reasonable - nice to know that we can get full GPU
performance from breeze/netlib-java - meaning there's no compelling
performance reason to switch out our current linear algebra library (at
least as far as this benchmark is concerned).

Instead, it looks like a user guide for configuring Spark/MLlib to use the
right BLAS library will get us most of the way there. Or, would it make
sense to finally ship openblas compiled for some common platforms (64-bit
linux, windows, mac) directly with Spark - hopefully eliminating the jblas
warnings once and for all for most users? (Licensing is BSD) Or am I
missing something?

On Wed, Mar 25, 2015 at 6:03 PM, Ulanov, Alexander 
wrote:

> As everyone suggested, the results were too good to be true, so I
> double-checked them. It turns out that nvblas did not do multiplication due to
> parameter NVBLAS_TILE_DIM from "nvblas.conf" and returned zero matrix. My
> previously posted results with nvblas are matrices copying only. The
> default NVBLAS_TILE_DIM==2048 is too big for my graphic card/matrix size. I
> handpicked other values that worked. As a result, netlib+nvblas is on par
> with BIDMat-cuda. As promised, I am going to post a how-to for nvblas
> configuration.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
>
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Wednesday, March 25, 2015 2:31 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks;
> jfcanny
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
> original nvblas presentation on GPU conf 2013 (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case, these tests are not for generalization of the performance of
> different libraries. I just want to pick the library that does dense
> matrix multiplication best for my task.
>
> P.S. My previous issue with nvblas was the following: it has Fortran blas
> functions, at the same time netlib-java uses C cblas functions. So, one
> needs cblas shared library to use nvblas through netlib-java. Fedora does
> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
> could not use cblas from Atlas or Openblas because they link to their
> implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -Original Message-----
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace the current blas function calls after executing LD_PRELOAD as
> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
> changes to netlib-java. It seems to work for a simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
> --driver-memory 4G
> In nvidia-smi I observe that Java is using the GPU:
>
> +-----------------------------------------------------------------+
> | Processes:                                           GPU Memory |
> |  GPU   PID  Type  Process name                            Usage |
> |=================================================================|
> |    0  8873     C  bash                                    39MiB |
> |    0  8910     C  /usr/lib/jvm/java-1.7.0/bin/java        39MiB |
> +-----------------------------------------------------------------+
>
> In the Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas supposedly used.
> However, matrix multiplication still executes on the CPU since I see 16% of
> CPU used and 0% of GPU used. I also checked different matrix sizes, from
> 100x100 to 12000x12000.
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>
> Best regards, Alexander

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Evan R. Sparks
Alex - great stuff, and the nvblas numbers are pretty remarkable (almost
too good... did you check the results for correctness? - also, is it
possible that the "unified memory model" of nvblas is somehow hiding pci
transfer time?)

This last bit (getting nvblas + netlib-java to play together) sounds like
it's non-trivial and took you a while to figure out! Would you mind posting
a gist or something of maybe the shell scripts/exports you used to make
this work - I can imagine it being highly useful for others in the future.

Thanks!
Evan

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander 
wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you will copy them
> to/from GPU, OpenBlas or MKL might be a better choice. This correlates with
> original nvblas presentation on GPU conf 2013 (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case, these tests are not for generalization of performance of
> different libraries. I just want to pick a library that does at best dense
> matrices multiplication for my task.
>
> P.S. My previous issue with nvblas was the following: it has Fortran blas
> functions, at the same time netlib-java uses C cblas functions. So, one
> needs cblas shared library to use nvblas through netlib-java. Fedora does
> not have cblas (but Debian and Ubuntu have), so I needed to compile it. I
> could not use cblas from Atlas or Openblas because they link to their
> implementation and not to Fortran blas.
>
> Best regards, Alexander
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace the current blas function calls after executing LD_PRELOAD as
> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
> changes to netlib-java. It seems to work for a simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
> --driver-memory 4G
> In nvidia-smi I observe that Java is using the GPU:
>
> +-----------------------------------------------------------------+
> | Processes:                                           GPU Memory |
> |  GPU   PID  Type  Process name                            Usage |
> |=================================================================|
> |    0  8873     C  bash                                    39MiB |
> |    0  8910     C  /usr/lib/jvm/java-1.7.0/bin/java        39MiB |
> +-----------------------------------------------------------------+
>
> In the Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas supposedly used.
> However, matrix multiplication still executes on the CPU since I see 16% of
> CPU used and 0% of GPU used. I also checked different matrix sizes, from
> 100x100 to 12000x12000.
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>
> Best regards, Alexander
>
>
>
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on
> various pieces of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
> support of Double in the current source code), did the test with BIDMat and
> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -Original Message-
>

Re: ideas for MLlib development

2015-03-03 Thread Evan R. Sparks
Hi Robert,

There's some work to do LDA via Gibbs sampling in this JIRA:
https://issues.apache.org/jira/browse/SPARK-1405 as well as this one:
https://issues.apache.org/jira/browse/SPARK-5556

It may make sense to have a more general Gibbs sampling framework, but it
might be good to have a few desired applications in mind (e.g. higher level
models that rely on Gibbs) to help API design, parallelization strategy,
etc.
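(Not a Gibbs sampler, but as a toy illustration of why Monte Carlo methods
parallelize so naturally in Spark - each task draws its samples
independently and only small summaries come back to the driver - here is a
spark-shell sketch estimating pi:)

val n = 10000000
val hits = sc.parallelize(1 to n).map { _ =>
  val (x, y) = (math.random, math.random)
  if (x * x + y * y < 1.0) 1L else 0L
}.reduce(_ + _)
println(s"pi is roughly ${4.0 * hits / n}")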

See the guide (
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingNewAlgorithmstoMLLib)
for information about contributing to MLlib.

- Evan

On Tue, Mar 3, 2015 at 5:51 PM, Robert Dodier 
wrote:

> Hi,
>
> I have some ideas for MLlib that I think might be of general interest
> so I'd like to see what people think and maybe find some collaborators.
>
> (1) Some form of Markov chain Monte Carlo such as Gibbs sampling
> or Metropolis-Hastings. Any kind of Monte Carlo method is readily
> parallelized so Spark seems like a natural platform for them.
> MCMC plays an important role in computational implementations
> of Bayesian inference.


> (2) A function to compute the calibration of a probabilistic classifier.
> The question this answers is, if the classifier outputs 0.x for some
> group of examples, is the actual proportion approximately 0.x ?
> This is useful to know if the classifier outputs are used to compute
> expected loss in some decision procedure.
>
> Of course (1) is much bigger than (2). Perhaps (2) is a one-person
> job but (1) will take a lot of teamwork. I am thinking that in the short
> term, we could at least make some progress on an outline or
> framework for (1).
>
> I am a newcomer to Scala and Spark but I have a lot of experience
> in statistical computing. I am thinking that maybe one or the other
> of these projects will be a good way for me to learn more about
> Spark and make a useful contribution. Thanks for your interest
> and I look forward to your comments.
>
> Robert Dodier
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Evan R. Sparks
I couldn't agree with you more, Sam. The GPU/Matrix guys typically don't
count their copy times, but claim that you should be doing *as much as
possible* on the GPU - so, maybe for some applications where you can
generate the data on the GPU this makes sense. But, in the context of Spark
we should be *very* careful about enumerating the applications we want GPU
support for and deciding whether it's appropriate to measure the overheads
of getting the data to the GPU.

On Thu, Feb 26, 2015 at 1:55 PM, Sam Halliday 
wrote:

> Btw, I wish people would stop cheating when comparing CPU and GPU timings
> for things like matrix multiply :-P
>
> Please always compare apples with apples and include the time it takes to
> set up the matrices, send it to the processing unit, doing the calculation
> AND copying it back to where you need to see the results.
>
> Ignoring this method will make you believe that your GPU is thousands of
> times faster than it really is. Again, jump to the end of my talk for
> graphs and more discussion, especially the bit about me being keen on
> funding to investigate APU hardware further ;-) (I believe it will solve
> the problem)
> On 26 Feb 2015 21:16, "Xiangrui Meng"  wrote:
>
>> Hey Alexander,
>>
>> I don't quite understand the part where netlib-cublas is about 20x
>> slower than netlib-openblas. What is the overhead of using a GPU BLAS
>> with netlib-java?
>>
>> CC'ed Sam, the author of netlib-java.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley 
>> wrote:
>> > Better documentation for linking would be very helpful!  Here's a JIRA:
>> > https://issues.apache.org/jira/browse/SPARK-6019
>> >
>> >
>> > On Wed, Feb 25, 2015 at 2:53 PM, Evan R. Sparks 
>> > wrote:
>> >
>> >> Thanks for compiling all the data and running these benchmarks, Alex.
>> The
>> >> big takeaways here can be seen with this chart:
>> >>
>> >>
>> https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive
>> >>
>> >> 1) A properly configured GPU matrix multiply implementation (e.g.
>> >> BIDMat+GPU) can provide substantial (but less than an order of
>> magnitude)
>> >> benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
>> >> netlib-java+openblas-compiled).
>> >> 2) A poorly tuned CPU implementation can be 1-2 orders of magnitude
>> worse
>> >> than a well-tuned CPU implementation, particularly for larger matrices.
>> >> (netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
>> >> basically agrees with the author's own benchmarks (
>> >> https://github.com/fommil/netlib-java)
>> >>
>> >> I think that most of our users are in a situation where using GPUs may
>> not
>> >> be practical - although we could consider having a good GPU backend
>> >> available as an option. However, *ALL* users of MLlib could benefit
>> >> (potentially tremendously) from using a well-tuned CPU-based BLAS
>> >> implementation. Perhaps we should consider updating the mllib guide
>> with a
>> >> more complete section for enabling high performance binaries on OSX and
>> >> Linux? Or better, figure out a way for the system to fetch these
>> >> automatically.
>> >>
>> >> - Evan
>> >>
>> >>
>> >>
>> >> On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander <
>> >> alexander.ula...@hp.com> wrote:
>> >>
>> >>> Just to summarize this thread, I was finally able to make all
>> performance
>> >>> comparisons that we discussed. It turns out that:
>> >>> BIDMat-cublas>>BIDMat
>> >>>
>> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
>> >>>
>> >>> Below is the link to the spreadsheet with full results.
>> >>>
>> >>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>> >>>
>> >>> One thing still needs exploration: does BIDMat-cublas perform copying
>> >>> to/from machine’s RAM?
>> >>>
>> >>> -Original Message-
>> >>> From: Ulanov, Alexander
>> >>> Sent: Tuesday, February 10, 2015 2:12 PM
>> >>> To: Evan R. Sparks

Re: Using CUDA within Spark / boosting linear algebra

2015-02-25 Thread Evan R. Sparks
Thanks for compiling all the data and running these benchmarks, Alex. The
big takeaways here can be seen with this chart:
https://docs.google.com/spreadsheets/d/1aRm2IADRfXQV7G2vrcVh4StF50uZHl6kmAJeaZZggr0/pubchart?oid=1899767119&format=interactive

1) A properly configured GPU matrix multiply implementation (e.g.
BIDMat+GPU) can provide substantial (but less than an order of magnitude)
benefit over a well-tuned CPU implementation (e.g. BIDMat+MKL or
netlib-java+openblas-compiled).
2) A poorly tuned CPU implementation can be 1-2 orders of magnitude worse
than a well-tuned CPU implementation, particularly for larger matrices.
(netlib-f2jblas or netlib-ref) This is not to pick on netlib - this
basically agrees with the author's own benchmarks (
https://github.com/fommil/netlib-java)

I think that most of our users are in a situation where using GPUs may not
be practical - although we could consider having a good GPU backend
available as an option. However, *ALL* users of MLlib could benefit
(potentially tremendously) from using a well-tuned CPU-based BLAS
implementation. Perhaps we should consider updating the mllib guide with a
more complete section for enabling high performance binaries on OSX and
Linux? Or better, figure out a way for the system to fetch these
automatically.

- Evan



On Thu, Feb 12, 2015 at 4:18 PM, Ulanov, Alexander 
wrote:

> Just to summarize this thread, I was finally able to make all performance
> comparisons that we discussed. It turns out that:
> BIDMat-cublas>>BIDMat
> MKL==netlib-mkl==netlib-openblas-compiled>netlib-openblas-yum-repo==netlib-cublas>netlib-blas>f2jblas
>
> Below is the link to the spreadsheet with full results.
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> One thing still needs exploration: does BIDMat-cublas perform copying
> to/from machine’s RAM?
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Tuesday, February 10, 2015 2:12 PM
> To: Evan R. Sparks
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Thanks, Evan! It seems that the ticket was marked as a duplicate though the
> original one discusses a slightly different topic. I was able to link netlib
> with MKL from BIDMat binaries. Indeed, MKL is statically linked inside a
> 60MB library.
>
> |A*B size                | BIDMat MKL  | Breeze+Netlib-MKL from BIDMat | Breeze+Netlib-OpenBlas (native system) | Breeze+Netlib-f2jblas |
> +------------------------+-------------+-------------------------------+----------------------------------------+-----------------------+
> |100x100*100x100         | 0,00205596  | 0,000381                      | 0,03810324                             | 0,002556              |
> |1000x1000*1000x1000     | 0,018320947 | 0,038316857                   | 0,51803557                             | 1,638475459           |
> |10000x10000*10000x10000 | 23,78046632 | 32,94546697                   | 445,0935211                            | 1569,233228           |
>
> It turns out that pre-compiled MKL is faster than precompiled OpenBlas on
> my machine. Probably, I’ll add two more columns with locally compiled
> openblas and cuda.
>
> Alexander
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Monday, February 09, 2015 6:06 PM
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Great - perhaps we can move this discussion off-list and onto a JIRA
> ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)
>
> It seems like this is going to be somewhat exploratory for a while (and
> there's probably only a handful of us who really care about fast linear
> algebra!)
>
> - Evan
>
> On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> Hi Evan,
>
> Thank you for explanation and useful link. I am going to build OpenBLAS,
> link it with Netlib-java and perform benchmark again.
>
> Do I understand correctly that BIDMat binaries contain statically linked
> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
> because Intel sells this library. Nevertheless, it seems that in my case
> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>
> Though, it might be interesting to link Netlib-java with Intel MKL, as you
> suggested. I wonder, are John Canny (BIDMat) and Sam Halliday (Netlib-java)
> interested to compare their libraries.
>
> Best regards, Alexander
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Friday, February 06, 2015 5:58 PM
>
> To: Ulanov, Alexander
> Cc: Joseph Bradley; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks
Josh - thanks for the detailed write-up - this seems a little funny to me.
I agree that with the current code path there is more work being done than
needs to be: the features are re-scaled at every iteration, but the
relatively costly process of fitting the StandardScaler should not be
re-done at each iteration. Instead, at each iteration, all points can be
re-scaled according to the pre-computed standard deviations in the
StandardScalerModel, and then an intercept appended.
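(To sketch the pattern at issue - the toy scaling and squared-loss gradient
are mine, not the MLlib code: without the persist(), every pass of the
iterative optimizer re-executes the scaling map over the input; with it,
the scaling is computed once.)

import org.apache.spark.rdd.RDD

def fit(input: RDD[(Double, Array[Double])], iters: Int): Array[Double] = {
  val stdev = 2.0 // stand-in for a pre-fitted StandardScalerModel
  val data = input.map { case (y, x) => (y, x.map(_ / stdev)) }
  data.persist() // without this, the map above re-runs every iteration
  var w = new Array[Double](input.first()._2.length)
  for (_ <- 1 to iters) {
    // squared-loss gradient summed over all points
    val grad = data.map { case (y, x) =>
      val pred = (x, w).zipped.map(_ * _).sum
      x.map(_ * (pred - y))
    }.reduce((a, b) => (a, b).zipped.map(_ + _))
    w = (w, grad).zipped.map((wi, gi) => wi - 1e-3 * gi)
  }
  w
}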

Just to be clear - you're currently calling .persist() before you pass data
to LogisticRegressionWithLBFGS?

Also - can you give some parameters about the problem/cluster size you're
solving this on? How much memory per node? How big are n and d, what is the
sparsity (if any), and how many iterations are you running for? Is 0:45 the
per-iteration time or the total time for some number of iterations?

A useful test might be to call GeneralizedLinearAlgorithm with
useFeatureScaling set to false (and maybe also addIntercept set to false) on
persisted data, and see if you see the same performance wins. If that's the
case we've
isolated the issue and can start profiling to see where all the time is
going.

It would be great if you can open a JIRA.

Thanks!



On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins  wrote:

> Cross-posting as I got no response on the users mailing list last
> week. Any response would be appreciated :)
>
> Josh
>
>
> -- Forwarded message --
> From: Josh Devins 
> Date: 9 February 2015 at 15:59
> Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
> To: "u...@spark.apache.org" 
>
>
> I've been looking into a performance problem when using
> LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
> Here's an outline of what I've figured out so far and it would be
> great to get some confirmation of the problem, some input on how
> wide-spread this problem might be and any ideas on a nice way to fix
> this.
>
> Context:
> - I will reference `branch-1.1` as we are currently on v1.1.1 however
> this appears to still be a problem on `master`
> - The cluster is run on YARN, on bare-metal hardware (no VMs)
> - I've not filed a Jira issue yet but can do so
> - This problem affects all algorithms based on
> GeneralizedLinearAlgorithm (GLA) that use feature scaling (and less so
> when not, but still a problem) (e.g. LogisticRegressionWithLBFGS)
>
> Problem Outline:
> - Starting at GLA line 177
> (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177
> ),
> a feature scaler is created using the `input` RDD
> - Refer next to line 186 which then maps over the `input` RDD and
> produces a new `data` RDD
> (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186
> )
> - If you are using feature scaling or adding intercepts, the user
> `input` RDD has been mapped over *after* the user has persisted it
> (hopefully) and *before* going into the (iterative) optimizer on line
> 204 (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
> )
> - Since the RDD `data` that is iterated over in the optimizer is
> unpersisted, when we are running the cost function in the optimizer
> (e.g. LBFGS --
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198
> ),
> the map phase will actually first go back and rerun the feature
> scaling (map tasks on `input`) and then map with the cost function
> (two maps pipelined into one stage)
> - As a result, parts of the StandardScaler will actually be run again
> (perhaps only because the variable is `lazy`?) and this can be costly,
> see line 84 (
> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala#L84
> )
> - For small datasets and/or few iterations, this is not really a
> problem, however we found that by adding a `data.persist()` right
> before running the optimizer, we went from map iterations in the
> optimizer that went from 5:30 down to 0:45
>
> I had a very tough time coming up with a nice way to describe my
> debugging sessions in an email so I hope this gets the main points
> across. Happy to clarify anything if necessary (also by live
> debugging/Skype/phone if that's helpful).
>
> Thanks,
>
> Josh
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark SQL value proposition in batch pipelines

2015-02-12 Thread Evan R. Sparks
Well, you can always join as many RDDs as you want by chaining them
together, e.g. a.join(b).join(c)... - I probably wouldn't join thousands of
RDDs in this way but 10 is probably doable.
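(For instance, a toy chained join over pair RDDs in the spark-shell,
hand-coded join order and all:)

val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (2, "b2")))
val c = sc.parallelize(Seq((1, "c1")))
// type: RDD[(Int, ((String, String), String))] - nesting grows per join
val joined = a.join(b).join(c)
joined.collect().foreach(println)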

That said - SparkSQL has an optimizer under the covers that can make clever
decisions e.g. pushing the predicates in the WHERE clause down to the base
data (even to external data sources if you have them), ordering joins, and
choosing between join implementations (like using broadcast joins instead
of the default shuffle-based hash join in RDD.join). These decisions can
make your queries run orders of magnitude faster than they would if you
implemented them using basic RDD transformations. The best part is at this
stage, I'd expect the optimizer will continue to improve - meaning many of
your queries will get faster with each new release.

I'm sure the SparkSQL devs can enumerate many other benefits - but as soon
as you're working with multiple tables and doing fairly textbook SQL stuff
- you likely want the engine figuring this stuff out for you rather than
hand coding it yourself. That said - with Spark, you can always drop back
to plain old RDDs and use map/reduce/filter/cogroup, etc. when you need to.

On Thu, Feb 12, 2015 at 8:56 AM, vha14  wrote:

> My team is building a batch data processing pipeline using Spark API and
> trying to understand if Spark SQL can help us. Below are what we found so
> far:
>
> - SQL's declarative style may be more readable in some cases (e.g. joining
> of more than two RDDs), although some devs prefer the fluent style
> regardless.
> - Cogrouping of more than 4 RDDs is not supported, and it's not clear if
> Spark SQL supports joining an arbitrary number of RDDs.
> - It seems that Spark SQL's features such as optimization based on
> predicate
> pushdown and dynamic schema inference are less applicable in a batch
> environment.
>
> Your inputs/suggestions are most welcome!
>
> Thanks,
> Vu Ha
> CTO, Semantic Scholar
> http://www.quora.com/What-is-Semantic-Scholar-and-how-will-it-work
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-value-proposition-in-batch-pipelines-tp10607.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Using CUDA within Spark / boosting linear algebra

2015-02-09 Thread Evan R. Sparks
Great - perhaps we can move this discussion off-list and onto a JIRA
ticket? (Here's one: https://issues.apache.org/jira/browse/SPARK-5705)

It seems like this is going to be somewhat exploratory for a while (and
there's probably only a handful of us who really care about fast linear
algebra!)

- Evan

On Mon, Feb 9, 2015 at 4:48 PM, Ulanov, Alexander 
wrote:

>  Hi Evan,
>
>
>
> Thank you for explanation and useful link. I am going to build OpenBLAS,
> link it with Netlib-java and perform benchmark again.
>
>
>
> Do I understand correctly that BIDMat binaries contain statically linked
> Intel MKL BLAS? It might be the reason why I am able to run BIDMat not
> having MKL BLAS installed on my server. If it is true, I wonder if it is OK
> because Intel sells this library. Nevertheless, it seems that in my case
> precompiled MKL BLAS performs better than precompiled OpenBLAS given that
> BIDMat and Netlib-java are supposed to be on par with JNI overheads.
>
>
>
> Though, it might be interesting to link Netlib-java with Intel MKL, as you
> suggested. I wonder, are John Canny (BIDMat) and Sam Halliday (Netlib-java)
> interested to compare their libraries.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Friday, February 06, 2015 5:58 PM
>
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; dev@spark.apache.org
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> I would build OpenBLAS yourself, since good BLAS performance comes from
> getting cache sizes, etc. set up correctly for your particular hardware -
> this is often a very tricky process (see, e.g. ATLAS), but we found that on
> relatively modern Xeon chips, OpenBLAS builds quickly and yields
> performance competitive with MKL.
>
>
>
> To make sure the right library is getting used, you have to make sure it's
> first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so
> will do the trick here.
>
>
>
> For some examples of getting netlib-java setup on an ec2 node and some
> example benchmarking code we ran a while back, see:
> https://github.com/shivaram/matrix-bench
>
>
>
> In particular - build-openblas-ec2.sh shows you how to build the library
> and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
> the path setup and get that library picked up by netlib-java.
>
>
>
> In this way - you could probably get cuBLAS set up to be used by
> netlib-java as well.
>
>
>
> - Evan
>
>
>
> On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander 
> wrote:
>
>  Evan, could you elaborate on how to force BIDMat and netlib-java to
> load the right blas? For netlib, there are a few JVM flags, such
> as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
> can force it to use the Java implementation. I am not sure I understand how
> to force the use of a specific blas (as opposed to a specific wrapper for blas).
>
>
>
> Btw. I have installed openblas (yum install openblas), so I suppose that
> netlib is using it.
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Friday, February 06, 2015 5:19 PM
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; dev@spark.apache.org
>
>
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> Getting breeze to pick up the right blas library is critical for
> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> It might make sense to force BIDMat to use the same underlying BLAS library
> as well.
>
>
>
> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
> wrote:
>
> Hi Evan, Joseph
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
> than netlib-java+breeze:
>
> |A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> +------------------------+-------------+-----------------------------------------------+----------------------------+
> |100x100*100x100         | 0,00205596  | 0,03810324                                    | 0,002556                   |
> |1000x1000*1000x1000     | 0,018320947 | 0,51803557                                    | 1,638475459                |
> |10000x10000*10000x10000 | 23,78046632 | 445,0935211                                   | 1569,233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will make tests with Cuda. I need to install new Cuda version for
> this purpose.
>
> Do you have any ideas why breeze-netlib with native blas is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
I would build OpenBLAS yourself, since good BLAS performance comes from
getting cache sizes, etc. set up correctly for your particular hardware -
this is often a very tricky process (see, e.g. ATLAS), but we found that on
relatively modern Xeon chips, OpenBLAS builds quickly and yields
performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so
will do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some
example benchmarking code we ran a while back, see:
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library
and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
the path setup and get that library picked up by netlib-java.
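
As a quick sanity check, netlib-java can tell you which backend it actually
loaded - something along these lines (BLAS.getInstance() is netlib-java's
entry point; the printed class name will be NativeSystemBLAS, NativeRefBLAS,
or F2jBLAS):

    import com.github.fommil.netlib.BLAS

    // Prints the BLAS implementation netlib-java resolved at runtime.
    // NativeSystemBLAS means it found your system libblas (e.g. OpenBLAS);
    // F2jBLAS means it fell back to the pure-Java implementation.
    object BlasCheck {
      def main(args: Array[String]): Unit =
        println(BLAS.getInstance().getClass.getName)
    }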

In this way - you could probably get cuBLAS set up to be used by
netlib-java as well.

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander 
wrote:

>  Evan, could you elaborate on how to force BIDMat and netlib-java to
> load the right blas? For netlib, there are a few JVM flags, such
> as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
> can force it to use the Java implementation. I am not sure how to force the
> use of a specific blas (not a specific wrapper for blas).
>
>
>
> Btw. I have installed openblas (yum install openblas), so I suppose that
> netlib is using it.
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Friday, February 06, 2015 5:19 PM
> *To:* Ulanov, Alexander
> *Cc:* Joseph Bradley; dev@spark.apache.org
>
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> Getting breeze to pick up the right blas library is critical for
> performance. I recommend using OpenBLAS (or MKL, if you already have it).
> It might make sense to force BIDMat to use the same underlying BLAS library
> as well.
>
>
>
> On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
> wrote:
>
> Hi Evan, Joseph
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
> than netlib-java+breeze:
>
> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> |-------------------------|-------------|-----------------------------------------------|----------------------------|
> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will make tests with CUDA. I need to install a new CUDA version for
> this purpose.
>
> Do you have any ideas why breeze-netlib with native blas is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org
>
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting.  Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance.  If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice).  Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander 
> wrote:
> Thank you for the explanation! I've watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and f

Spark SQL Window Functions

2015-02-08 Thread Evan R. Sparks
Currently there's no standard way of handling time series data in Spark. We
were kicking around some ideas in the lab today and one thing that came up
was SQL Window Functions as a way to support them and query over time
series (do things like moving averages, etc.).

These don't seem to be implemented in Spark SQL yet, but there's some
discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587)
asking for them, and there have also been a couple of pull requests -
https://github.com/apache/spark/pull/3703 and
https://github.com/apache/spark/pull/2953.

Is any work currently underway here?
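
For concreteness, the kind of query this would enable (a hypothetical sketch
- the OVER clause below is standard SQL window syntax, which Spark SQL does
not support yet; the table and column names are made up, and a SQLContext is
assumed to be in scope as sqlContext):

    // Hypothetical once window functions land: a 7-row moving average.
    val movingAvg = sqlContext.sql("""
      SELECT ts, value,
             AVG(value) OVER (ORDER BY ts
                              ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg7
      FROM readings
    """)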


Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
Getting breeze to pick up the right blas library is critical for
performance. I recommend using OpenBLAS (or MKL, if you already have it).
It might make sense to force BIDMat to use the same underlying BLAS library
as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
wrote:

> Hi Evan, Joseph
>
> I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
> than netlib-java+breeze:
>
> | A*B size                | BIDMat MKL  | Breeze+Netlib-java native_system_linux_x86-64 | Breeze+Netlib-java f2jblas |
> |-------------------------|-------------|-----------------------------------------------|----------------------------|
> | 100x100*100x100         | 0.00205596  | 0.03810324                                    | 0.002556                   |
> | 1000x1000*1000x1000     | 0.018320947 | 0.51803557                                    | 1.638475459                |
> | 10000x10000*10000x10000 | 23.78046632 | 445.0935211                                   | 1569.233228                |
>
> Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
> Linux, Scala 2.11.
>
> Later I will make tests with CUDA. I need to install a new CUDA version for
> this purpose.
>
> Do you have any ideas why breeze-netlib with native blas is so much slower
> than BIDMat MKL?
>
> Best regards, Alexander
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Thursday, February 05, 2015 5:29 PM
> To: Ulanov, Alexander
> Cc: Evan R. Sparks; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> Hi Alexander,
>
> Using GPUs with Spark would be very exciting.  Small comment: Concerning
> your question earlier about keeping data stored on the GPU rather than
> having to move it between main memory and GPU memory on each iteration, I
> would guess this would be critical to getting good performance.  If you
> could do multiple local iterations before aggregating results, then the
> cost of data movement to the GPU could be amortized (and I believe that is
> done in practice).  Having Spark be aware of the GPU and using it as
> another part of memory sounds like a much bigger undertaking.
>
> Joseph
>
> On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander 
> wrote:
> Thank you for the explanation! I've watched the BIDMach presentation by John
> Canny and I am really inspired by his talk and comparisons with Spark MLlib.
>
> I am very interested to find out what will be better within Spark: BIDMat
> or netlib-java with CPU or GPU natives. Could you suggest a fair way to
> benchmark them? Currently I do benchmarks on artificial neural networks in
> batch mode. While it is not a “pure” test of linear algebra, it involves
> some other things that are essential to machine learning.
>
> From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
> Sent: Thursday, February 05, 2015 1:29 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> I'd be surprised if BIDMat+OpenBLAS was significantly faster than
> netlib-java+OpenBLAS, but if it is much faster it's probably due to data
> layout and fewer levels of indirection - it's definitely a worthwhile
> experiment to run. The main speedups I've seen from using it come from
> highly optimized GPU code for linear algebra. I know that in the past Canny
> has gone as far as to write custom GPU kernels for performance-critical
> regions of code.[1]
>
> BIDMach is highly optimized for single node performance or performance on
> small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
> batched in that way) the performance tends to fall off. Canny argues for
> hardware/software codesign and as such prefers machine configurations that
> are quite different than what we find in most commodity cluster nodes -
> e.g. 10 disk channels and 4 GPUs.
>
> In contrast, MLlib was designed for horizontal scalability on commodity
> clusters and works best on very big datasets - order of terabytes.
>
> For the most part, these projects developed concurrently to address
> slightly different use cases. That said, there may be bits of BIDMach we
> could repurpose for MLlib - keep in mind we need to be careful about
> maintaining cross-language compatibility for our Java and Python-users,
> though.
>
> - Evan
>
> [1] - http://arxiv.org/abs/1409.5402
> [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf
>
> On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander wrote:
> Hi Evan,
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know
> what makes them faster than netlib-java?
>
> The same group has BIDMach library that implements machine learning. For
> some examples they use Caffe convolutional neural network library owned by
> another group in Berkeley. Could you elaborate on how these a

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data
layout and fewer levels of indirection - it's definitely a worthwhile
experiment to run. The main speedups I've seen from using it come from
highly optimized GPU code for linear algebra. I know that in the past Canny
has gone as far as to write custom GPU kernels for performance-critical
regions of code.[1]

BIDMach is highly optimized for single node performance or performance on
small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
batched in that way) the performance tends to fall off. Canny argues for
hardware/software codesign and as such prefers machine configurations that
are quite different than what we find in most commodity cluster nodes -
e.g. 10 disk channels and 4 GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address
slightly different use cases. That said, there may be bits of BIDMach we
could repurpose for MLlib - keep in mind we need to be careful about
maintaining cross-language compatibility for our Java and Python-users,
though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
wrote:

>  Hi Evan,
>
>
>
> Thank you for the suggestion! BIDMat seems to have terrific speed. Do you know
> what makes them faster than netlib-java?
>
>
>
> The same group has BIDMach library that implements machine learning. For
> some examples they use Caffe convolutional neural network library owned by
> another group in Berkeley. Could you elaborate on how these all might be
> connected with Spark Mllib? If you take BIDMat for linear algebra why don’t
> you take BIDMach for optimization and learning?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
> *Sent:* Thursday, February 05, 2015 12:09 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>
>
>
> I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
> many cases.
>
>
>
> You might consider taking a look at the codepaths that BIDMat (
> https://github.com/BIDData/BIDMat) takes and comparing them to
> netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
> to make this work really fast from Scala. I've run it on my laptop and
> compared to MKL and in certain cases it's 10x faster at matrix multiply.
> There are a lot of layers of indirection here and you really want to avoid
> data copying as much as possible.
>
>
>
> We could also consider swapping out BIDMat for Breeze, but that would be a
> big project and if we can figure out how to get breeze+cublas to comparable
> performance that would be a big win.
>
>
>
> On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
>
> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use Scala Breeze library that is bundled with
> Spark. For matrix operations, it employs Netlib-java that has a Java
> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
> binaries if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations or general matrix multiplication (GEMM). This is
> confirmed by GEMM test on Netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training of artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance more.
>
> GPU is supposed to work fast with linear algebra and there is Nvidia CUDA
> implementation of BLAS, called cublas. I have one Linux server with Nvidia
> GPU and I was able to do the following. I linked cublas (instead of
> cpu-based blas) with Netlib-java wrapper and put it into Spark, so
> Breeze/Netlib is using it. Then I did some performance measurements with
> regards to artificial neural network batch learning in Spark MLlib that
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
> becomes slower for bigger matrices. It is worth mentioning that it was not
> a test for ONLY multiplication since there are other operations i

Re: Using CUDA within Spark / boosting linear algebra

2015-02-05 Thread Evan R. Sparks
I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in
many cases.

You might consider taking a look at the codepaths that BIDMat (
https://github.com/BIDData/BIDMat) takes and comparing them to
netlib-java/breeze. John Canny et al. have done a bunch of work optimizing
to make this work really fast from Scala. I've run it on my laptop and
compared to MKL and in certain cases it's 10x faster at matrix multiply.
There are a lot of layers of indirection here and you really want to avoid
data copying as much as possible.
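
For reference, the harness behind numbers like these can be tiny - e.g. a
minimal breeze sketch (whichever BLAS netlib-java resolves at runtime is
what actually executes the multiply):

    import breeze.linalg.DenseMatrix

    // Time one n x n * n x n multiply; the work is a single dgemm call.
    def timeGemm(n: Int): Double = {
      val a = DenseMatrix.rand(n, n)
      val b = DenseMatrix.rand(n, n)
      val start = System.nanoTime()
      val c: DenseMatrix[Double] = a * b
      (System.nanoTime() - start) / 1e9 // seconds
    }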

We could also consider swapping out BIDMat for Breeze, but that would be a
big project and if we can figure out how to get breeze+cublas to comparable
performance that would be a big win.

On Thu, Feb 5, 2015 at 11:55 AM, Ulanov, Alexander 
wrote:

> Dear Spark developers,
>
> I am exploring how to make linear algebra operations faster within Spark.
> One way of doing this is to use Scala Breeze library that is bundled with
> Spark. For matrix operations, it employs Netlib-java that has a Java
> wrapper for BLAS (basic linear algebra subprograms) and LAPACK native
> binaries if they are available on the worker node. It also has its own
> optimized Java implementation of BLAS. It is worth mentioning that native
> binaries provide better performance only for BLAS level 3, i.e.
> matrix-matrix operations or general matrix multiplication (GEMM). This is
> confirmed by GEMM test on Netlib-java page
> https://github.com/fommil/netlib-java. I also confirmed it with my
> experiments with training of artificial neural network
> https://github.com/apache/spark/pull/1290#issuecomment-70313952. However,
> I would like to boost performance more.
>
> GPU is supposed to work fast with linear algebra and there is Nvidia CUDA
> implementation of BLAS, called cublas. I have one Linux server with Nvidia
> GPU and I was able to do the following. I linked cublas (instead of
> cpu-based blas) with Netlib-java wrapper and put it into Spark, so
> Breeze/Netlib is using it. Then I did some performance measurements with
> regards to artificial neural network batch learning in Spark MLlib that
> involves matrix-matrix multiplications. It turns out that for matrices of
> size less than ~1000x780 GPU cublas has the same speed as CPU blas. Cublas
> becomes slower for bigger matrices. It is worth mentioning that it was not
> a test for ONLY multiplication since there are other operations involved.
> One of the reasons for slowdown might be the overhead of copying the
> matrices from computer memory to graphic card memory and back.
>
> So, a few questions:
> 1) Do these results with CUDA make sense?
> 2) If the problem is with copy overhead, are there any libraries that
> allow forcing intermediate results to stay in graphics card memory, thus
> removing the overhead?
> 3) Any other options to speed-up linear algebra in Spark?
>
> Thank you, Alexander
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan R. Sparks
You've got to be a little bit careful here. "NA" in systems like R or
pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:

> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
> wrote:
>
> > I believe that most DataFrame implementations out there, like Pandas,
> > support the idea of missing values / NA, and some support the idea of
> > Not Meaningful as well.
> >
> > Does Row support anything like that?  That is important for certain
> > applications.  I thought that Row worked by being a mutable object,
> > but haven't looked into the details in a while.
> >
> > -Evan
> >
> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
> wrote:
> > > It shouldn't change the data source api at all because data sources
> > create
> > > RDD[Row], and that gets converted into a DataFrame automatically
> > (previously
> > > to SchemaRDD).
> > >
> > >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> > >
> > > One thing that will break the data source API in 1.3 is the location of
> > > types. Types were previously defined in sql.catalyst.types, and now
> > moved to
> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> > APIs
> > > have first class classes/objects defined in sql directly.
> > >
> > >
> > >
> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
> > wrote:
> > >>
> > >> Hey guys,
> > >>
> > >> How does this impact the data sources API?  I was planning on using
> > >> this for a project.
> > >>
> > >> +1 that many things from spark-sql / DataFrame is universally
> > >> desirable and useful.
> > >>
> > >> By the way, one thing that prevents the columnar compression stuff in
> > >> Spark SQL from being more useful is, at least from previous talks with
> > >> Reynold and Michael et al., that the format was not designed for
> > >> persistence.
> > >>
> > >> I have a new project that aims to change that.  It is a
> > >> zero-serialisation, high performance binary vector library, designed
> > >> from the outset to be a persistent storage friendly.  May be one day
> > >> it can replace the Spark SQL columnar compression.
> > >>
> > >> Michael told me this would be a lot of work, and recreates parts of
> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> > >>
> > >> -Evan
> > >>
> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> > wrote:
> > >> > Alright I have merged the patch (
> > >> > https://github.com/apache/spark/pull/4173
> > >> > ) since I don't see any strong opinions against it (as a matter of
> > fact
> > >> > most were for it). We can still change it if somebody lays out a
> > strong
> > >> > argument.
> > >> >
> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> > >> > 
> > >> > wrote:
> > >> >
> > >> >> The type alias means your methods can specify either type and they
> > will
> > >> >> work. It's just another name for the same type. But Scaladocs and
> > such
> > >> >> will
> > >> >> show DataFrame as the type.
> > >> >>
> > >> >> Matei
> > >> >>
> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> > >> >> dirceu.semigh...@gmail.com> wrote:
> > >> >> >
> > >> >> > Reynold,
> > >> >> > But with type alias we will have the same problem, right?
> > >> >> > If the methods don't receive schemardd anymore, we will have to
> > >> >> > change
> > >> >> > our code to migrate from schema to dataframe. Unless we have an
> > >> >> > implicit
> > >> >> > conversion between DataFrame and SchemaRDD
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> > >> >> >
> > >> >> >> Dirceu,
> > >> >> >>
> > >> >> >> That is not possible because one cannot overload return types.
> > >> >> >>
> > >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> > some
> > >> >> type,
> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> > >> >> >>
> > >> >> >> In 1.3, we will create a type alias for DataFrame called
> SchemaRDD
> > >> >> >> to
> > >> >> not
> > >> >> >> break source compatibility for Scala.
> > >> >> >>
> > >> >> >>
> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> > >> >> >> dirceu.semigh...@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> > removed
> > >> >> >>> in
> > >> >> the
> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> > >> >> DataFrame?
> > >> >> >>> With this, we don't impact in existing code for the next few
> > >> >> >>> releases.
> > >> >> >>>
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta  >:
> > >> >> >>>
> > >> >>  I want to address the issue that Matei raised about the heavy
> > >> >>  lifting
> > >> >>  required for full SQL support. It is amazing that even after
> > 30
> > >> >> years
> 

Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Evan R. Sparks
I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a "Frame" is.
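
(For what it's worth, the backwards-compatibility shim discussed downthread
amounts to a one-line type alias - roughly the sketch below; the placement
and deprecation message are guesses, not the actual patch:

    import org.apache.spark.sql.DataFrame

    // In the real codebase this would presumably live in the
    // org.apache.spark.sql package object, so existing imports keep resolving.
    object CompatShim {
      @deprecated("use DataFrame instead", "1.3.0")
      type SchemaRDD = DataFrame
    }

Source compatibility comes for free this way; binary compatibility does not,
since type aliases don't exist at the bytecode level.)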



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia 
wrote:

> (Actually when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia 
> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers  wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> i agree. this to me also implies it belongs in spark core, not sql
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> michaelma...@yahoo.com.invalid> wrote:
> >>
> >>> And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area
> >>> Spark Meetup YouTube contained a wealth of background information on
> this
> >>> idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> 
> >>> From: Patrick Wendell 
> >>> To: Reynold Xin 
> >>> Cc: "dev@spark.apache.org" 
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>>
> >>> One thing potentially not clear from this e-mail, there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin 
> wrote:
>  Hi,
> 
>  We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to
>  get the community's opinion.
> 
>  The context is that SchemaRDD is becoming a common data format used
> for
>  bringing data into Spark from external systems, and used for various
>  components of Spark, e.g. MLlib's new pipeline API. We also expect
> more
> >>> and
>  more users to be programming directly against SchemaRDD API rather
> than
> >>> the
>  core RDD API. SchemaRDD, through its less commonly used DSL originally
>  designed for writing test cases, always has the data-frame like API.
> In
>  1.3, we are redesigning the API to make the API usable for end users.
> 
> 
>  There are two motivations for the renaming:
> 
>  1. DataFrame seems to be a more self-evident name than SchemaRDD.
> 
>  2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
> (even
>  though it would contain some RDD functions like map, flatMap, etc),
> and
>  calling it Schema*RDD* while it is not an RDD is highly confusing.
> >>> Instead.
>  DataFrame.rdd will return the underlying RDD for all RDD methods.
> 
> 
>  My understanding is that very few users program directly against the
>  SchemaRDD API at the moment, because they are not well documented.
> >>> However,
>  To maintain backward compatibility, we can create a type alias
> DataFrame
>  that is still named SchemaRDD. This will maintain source compatibility
> >>> for
>  Scala. That said, we will have to update all existing materials to use
>  DataFrame rather than SchemaRDD.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>> -
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org
> >>>
> >>>
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Any interest in 'weighting' VectorTransformer which does component-wise scaling?

2015-01-27 Thread Evan R. Sparks
Hmm... Scaler and Scalar are very close together both in terms of
pronunciation and spelling - and I wouldn't want to create confusion
between the two. Further - this operation (elementwise multiplication by a
static vector) is general enough that maybe it should have a more general
name?
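
Whatever the name, the implementation is small - here is a sketch of the
transformer being discussed (the class name is just a placeholder, not a
proposal):

    import org.apache.spark.mllib.feature.VectorTransformer
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

    // Elementwise (Hadamard) product with a fixed weight vector.
    // Sparse inputs stay sparse, since zeros map to zeros.
    class ElementwiseScaler(val scalingVec: Vector) extends VectorTransformer {
      override def transform(vector: Vector): Vector = vector match {
        case dv: DenseVector =>
          val values = dv.values.clone()
          var i = 0
          while (i < values.length) { values(i) *= scalingVec(i); i += 1 }
          Vectors.dense(values)
        case sv: SparseVector =>
          val values = sv.values.clone()
          var i = 0
          while (i < values.length) { values(i) *= scalingVec(sv.indices(i)); i += 1 }
          Vectors.sparse(sv.size, sv.indices, values)
        case v => throw new IllegalArgumentException(s"Unsupported type ${v.getClass}")
      }
    }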

On Tue, Jan 27, 2015 at 7:54 AM, Xiangrui Meng  wrote:

> I would call it Scaler. You might want to add it to the spark.ml pipeline
> api. Please check the spark.ml.HashingTF implementation. Note that this
> should handle sparse vectors efficiently.
>
> Hadamard and FFTs are quite useful. If you are interested, make sure that
> we call an FFT library that is license-compatible with Apache.
>
> -Xiangrui
> On Jan 24, 2015 8:27 AM, "Octavian Geagla"  wrote:
>
> > Hello,
> >
> > I found it useful to implement the Hadamard Product as
> > a VectorTransformer.  It can be applied to scale (by a constant) a
> certain
> > dimension (column) of the data set.
> >
> > Since I've already implemented it and am using it, I thought I'd see if
> > there's interest in this feature going in as Experimental.  I'm not sold
> on
> > the name 'Weighter', either.
> >
> > Here's the current branch with the work (docs, impl, tests).
> > 
> >
> > The implementation was heavily inspired by those of StandardScalerModel
> and
> > Normalizer.
> >
> > Thanks
> > Octavian
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-tp10265.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: Notes on writing complex spark applications

2014-11-24 Thread Evan R. Sparks
Thanks Patrick,

You raise a good point - for this to be useful it's imperative that it is
updated with new versions of spark.

My thought with putting it on the wiki was that it's lower friction for
community members to edit, but it likely won't have the same level of
quality control as the existing documentation.

At a higher level - some of these tips are best practices for writing
applications that depend on Spark. I'm wondering if a new document is in
order for things like "this is how you set up a project skeleton to link
against spark," and "this is how you handle external libraries," - etc.? I
know that in the past I've run into stumbling blocks on things like getting
classpaths correct, trying to link against a different version of akka, and
so on that would be useful to have in such a document, in addition to some
of the application architecture suggestions we propose in *this* document.

- Evan

On Sun, Nov 23, 2014 at 9:02 PM, Patrick Wendell  wrote:

> Hey Evan,
>
> It might be nice to merge this into existing documentation. In
> particular, a lot of this could serve to update the current tuning
> section and programming guides.
>
> It could also work to paste this wholesale as a reference for Spark
> users, but in that case it's less likely to get updated when other
> things change, or be found by users reading through the spark docs.
>
> - Patrick
>
> On Sun, Nov 23, 2014 at 8:27 PM, Inkyu Lee  wrote:
> > Very helpful!!
> >
> > thank you very much!
> >
> > 2014-11-24 2:17 GMT+09:00 Sam Bessalah :
> >
> >> Thanks Evan, this is great.
> >> On Nov 23, 2014 5:58 PM, "Evan R. Sparks" 
> wrote:
> >>
> >> > Hi all,
> >> >
> >> > Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
> >> > working on a short document about writing high performance Spark
> >> > applications based on our experience developing MLlib, GraphX,
> ml-matrix,
> >> > pipelines, etc. It may be a useful document both for users and new
> Spark
> >> > developers - perhaps it should go on the wiki?
> >> >
> >> > The document itself is here:
> >> >
> >> >
> >>
> https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
> >> > and I've created SPARK-4565
> >> > <https://issues.apache.org/jira/browse/SPARK-4565> to track this.
> >> >
> >> > - Evan
> >> >
> >>
>


Notes on writing complex spark applications

2014-11-23 Thread Evan R. Sparks
Hi all,

Shivaram Venkataraman, Joseph Gonzalez, Tomer Kaftan, and I have been
working on a short document about writing high performance Spark
applications based on our experience developing MLlib, GraphX, ml-matrix,
pipelines, etc. It may be a useful document both for users and new Spark
developers - perhaps it should go on the wiki?

The document itself is here:
https://docs.google.com/document/d/1gEIawzRsOwksV_bq4je3ofnd-7Xu-u409mdW-RXTDnQ/edit?usp=sharing
and I've created SPARK-4565
 to track this.

- Evan


Re: Gaussian Mixture Model clustering

2014-09-19 Thread Evan R. Sparks
Hey Meethu - what are you setting "K" to in the benchmarks you show? This
can greatly affect the runtime.

On Thu, Sep 18, 2014 at 10:38 PM, Meethu Mathew 
wrote:

>  Hi all,
> Please find attached the image of benchmark results. The table in the
> previous mail got messed up. Thanks.
>
>
>
> On Friday 19 September 2014 10:55 AM, Meethu Mathew wrote:
>
> Hi all,
>
> We have come up with an initial distributed implementation of Gaussian
> Mixture Model in pyspark where the parameters are estimated using the
> Expectation-Maximization algorithm. Our current implementation considers a
> diagonal covariance matrix for each component.
> We did an initial benchmark study on a 2-node Spark standalone cluster
> setup where each node has 8 cores and 8 GB RAM; the Spark version used
> is 1.0.0. We also evaluated the python version of k-means available in Spark
> on the same datasets. Below are the results from this benchmark study.
> The reported stats are averages from 10 runs. Tests were done on multiple
> datasets with varying numbers of features and instances.
>
>
> | Instances   | Dimensions | GMM avg time/iteration | GMM time, 100 iterations | KMeans avg time/iteration | KMeans time, 100 iterations |
> |-------------|------------|------------------------|--------------------------|---------------------------|-----------------------------|
> | 0.7 million | 13         | 7s                     | 12min                    | 13s                       | 26min                       |
> | 1.8 million | 11         | 17s                    | 29min                    | 33s                       | 53min                       |
> | 10 million  | 16         | 1.6min                 | 2.7hr                    | 1.2min                    | 2hr                         |
>
>
> We are interested in contributing this implementation as a patch to
> SPARK. Does MLLib accept python implementations? If not, can we
> contribute to the pyspark component?
> I have created a JIRA for the same:
> https://issues.apache.org/jira/browse/SPARK-3588 . How do I get the
> ticket assigned to myself?
>
> Please review and suggest how to take this forward.
>
>
>
>
>
> --
>
> Regards,
>
>
>
> *Meethu Mathew*
>
> *Engineer*
>
> *Flytxt*
>
> Skype: meethu.mathew7
>
>  F:  +91 471.2700202
>
> www.flytxt.com
>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>


Re: [mllib] Add multiplying large scale matrices

2014-09-05 Thread Evan R. Sparks
There's some work on this going on in the AMP Lab. Create a ticket and we
can update with our progress so that we don't duplicate effort.


On Fri, Sep 5, 2014 at 8:18 AM, Yu Ishikawa 
wrote:

> Hi RJ,
>
> Thank you for your comment. I am interested in having other matrix
> operations too.
> I will create a JIRA issue in the first place.
>
> thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Add-multiplying-large-scale-matrices-tp8291p8293.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Is breeze thread safe in Spark?

2014-09-03 Thread Evan R. Sparks
Additionally, at the higher level, MLlib allocates separate Breeze
Vectors/Matrices on a per-executor basis. The only place I can think of
where data structures might be over-written concurrently is in a
.aggregate() call, and these calls happen sequentially.
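
Schematically, the pattern in question looks like this (a simplified sketch,
not the actual MLlib code - the per-example "gradient" below is a toy
stand-in):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.rdd.RDD

    // Each task deserializes its own copy of the zero value, so the breeze
    // vectors are never shared across threads; combOp then merges the
    // per-partition results sequentially on the driver.
    def sumGradients(data: RDD[(Double, BDV[Double])], dim: Int): (BDV[Double], Double) =
      data.aggregate((BDV.zeros[Double](dim), 0.0))(
        seqOp = { case ((grad, loss), (label, features)) =>
          grad += features * label // toy per-example "gradient", in place
          (grad, loss + math.abs(label))
        },
        combOp = { case ((g1, l1), (g2, l2)) => (g1 += g2, l1 + l2) }
      )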

RJ - Do you have a JIRA reference for that bug?

Thanks!


On Wed, Sep 3, 2014 at 11:50 AM, David Hall  wrote:

> In general, in Breeze we allocate separate work arrays for each call to
> lapack, so it should be fine. In general concurrent modification isn't
> thread safe of course, but things that "ought" to be thread safe really
> should be.
>
>
> On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling  wrote:
>
> > No, it's not in all cases. Since Breeze uses lapack under the hood,
> > changes to memory from different threads are bad.
> >
> > There's actually a potential bug in the KMeans code where it uses +=
> > instead of +.
> >
> >
> > On Wed, Sep 3, 2014 at 1:26 PM, Ulanov, Alexander <
> alexander.ula...@hp.com
> > >
> > wrote:
> >
> > > Hi,
> > >
> > > Is the breeze library thread safe when called from Spark mllib code when
> > > native libs for blas and lapack are used? Might it be an issue when
> > running
> > > Spark locally?
> > >
> > > Best regards, Alexander
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> > >
> >
> >
> > --
> > em rnowl...@gmail.com
> > c 954.496.2314
> >
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Evan R. Sparks
If you're thinking along these lines, have a look at the DecisionTree
implementation in MLlib. It uses the same idea and is optimized to prevent
multiple passes over the data by computing several splits at each level of
tree building. The tradeoff is increased model state and computation per
pass over the data, but fewer total passes and hopefully lower
communication overheads than, say, shuffling data around that belongs to
one cluster or another. Something like that could work here as well.

I'm not super-familiar with hierarchical K-Means so perhaps there's a more
efficient way to implement it, though.
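
For concreteness, the top-down recursion described downthread might look
roughly like this (an illustrative sketch only - no attention to caching,
lineage depth, or choosing k per level):

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.Vector

    // Fit k-means, split the data by assigned centroid, recurse on each split.
    def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int,
                           models: ArrayBuffer[KMeansModel]): Unit = {
      if (depth == 0 || data.count() < k) return
      val model = KMeans.train(data, k, 20) // 20 iterations per level
      models += model
      for (c <- 0 until k) {
        hierarchicalKMeans(data.filter(model.predict(_) == c), k, depth - 1, models)
      }
    }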


On Tue, Jul 8, 2014 at 2:06 PM, Hector Yee  wrote:

> No, I was thinking more top-down:
>
> assuming a distributed kmeans system already exists, recursively apply
> the kmeans algorithm to data already partitioned by the previous level of
> kmeans.
>
> I haven't been much of a fan of bottom up approaches like HAC mainly
> because they assume there is already a distance metric for items to items.
> This makes it hard to cluster new content. The distances between sibling
> clusters is also hard to compute (if you have thrown away the similarity
> matrix): do you count paths to the same parent node if you are computing
> distances between items in two adjacent nodes, for example? It is also a bit
> harder to distribute the computation for bottom up approaches as you have
> to already find the nearest neighbor to an item to begin the process.
>
>
> On Tue, Jul 8, 2014 at 1:59 PM, RJ Nowling  wrote:
>
> > The scikit-learn implementation may be of interest:
> >
> >
> >
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Ward.html#sklearn.cluster.Ward
> >
> > It's a bottom up approach.  The pair of clusters for merging are
> > chosen to minimize variance.
> >
> > Their code is under a BSD license so it can be used as a template.
> >
> > Is something like that you were thinking Hector?
> >
> > On Tue, Jul 8, 2014 at 4:50 PM, Dmitriy Lyubimov 
> > wrote:
> > > Sure. The more interesting problem here is choosing k at each level. Kernel
> > > methods seem to be most promising.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee 
> wrote:
> > >
> > >> No idea, never looked it up. Always just implemented it as doing
> k-means
> > >> again on each cluster.
> > >>
> > >> FWIW standard k-means with euclidean distance has problems too with
> some
> > >> dimensionality reduction methods. Swapping out the distance metric
> with
> > >> negative dot or cosine may help.
> > >>
> > >> Other more useful clustering would be hierarchical SVD. The reason
> why I
> > >> like hierarchical clustering is that it makes for faster inference
> especially
> > >> over billions of users.
> > >>
> > >>
> > >> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> > >> wrote:
> > >>
> > >> > Hector, could you share the references for hierarchical K-means?
> > thanks.
> > >> >
> > >> >
> > >> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee 
> > wrote:
> > >> >
> > >> > > I would say for big data applications the most useful would be
> > >> > hierarchical
> > >> > > k-means with backtracking and the ability to support k nearest
> > >> > centroids.
> > >> > >
> > >> > >
> > >> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> > >> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > MLlib currently has one clustering algorithm implementation,
> > KMeans.
> > >> > > > It would benefit from having implementations of other clustering
> > >> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > >> > > > Clustering, and Affinity Propagation.
> > >> > > >
> > >> > > > I recently submitted a PR [1] for a MiniBatch KMeans
> > implementation,
> > >> > > > and I saw an email on this list about interest in implementing
> > Fuzzy
> > >> > > > C-Means.
> > >> > > >
> > >> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it
> became
> > >> > > > apparent that before I implement more clustering algorithms, it
> > would
> > >> > > > be useful to hammer out a framework to reduce code duplication
> and
> > >> > > > implement a consistent API.
> > >> > > >
> > >> > > > I'd like to gauge the interest and goals of the MLlib community:
> > >> > > >
> > >> > > > 1. Are you interested in having more clustering algorithms
> > available?
> > >> > > >
> > >> > > > 2. Is the community interested in specifying a common framework?
> > >> > > >
> > >> > > > Thanks!
> > >> > > > RJ
> > >> > > >
> > >> > > > [1] - https://github.com/apache/spark/pull/1248
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > em rnowl...@gmail.com
> > >> > > > c 954.496.2314
> > >> > > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > > Yee Yang Li Hector 
> > >> > > *google.com/+HectorYee *
> > >> > >
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Yee Yang Li Hector 
> > >> *google.com/+HectorYee *
> > >>
> >
> >
> >

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Evan R. Sparks
As Sean mentions, if you can change the data to the standard format, that's
probably a good idea. If you'd rather read the data raw, you could write your
own version of loadLibSVMFile - a loader function very similar to the
existing one with a few characters removed:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L81

You will also likely need to change the logic where it determines the
number of features (currently line 95)
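
Something along these lines (an untested sketch - the function name is made
up, and unlike MLUtils.loadLibSVMFile it does no input validation):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Zero-based variant of MLUtils.loadLibSVMFile: no "- 1" shift on the
    // indices, and numFeatures becomes max index + 1.
    def loadZeroBasedLibSVM(sc: SparkContext, path: String): RDD[LabeledPoint] = {
      val parsed = sc.textFile(path).map(_.trim)
        .filter(line => !(line.isEmpty || line.startsWith("#")))
        .map { line =>
          val items = line.split(' ')
          val label = items.head.toDouble
          val (indices, values) = items.tail.filter(_.nonEmpty).map { item =>
            val kv = item.split(':')
            (kv(0).toInt, kv(1).toDouble) // index used as-is: already zero-based
          }.unzip
          (label, indices.toArray, values.toArray)
        }
      val numFeatures = parsed.map { case (_, idx, _) =>
        if (idx.isEmpty) 0 else idx.max + 1
      }.reduce(math.max)
      parsed.map { case (label, indices, values) =>
        LabeledPoint(label, Vectors.sparse(numFeatures, indices, values))
      }
    }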


On Tue, Jul 8, 2014 at 12:22 AM, Sean Owen  wrote:

> On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) <
> zhengbing...@huawei.com> wrote:
>
> >
> > 1)  I download the imdb data from
> > http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test
> > LBFGS
> > 2)  I find the imdb data are zero-based-index data
> >
>
> Since the method is for parsing the LIBSVM format, and its indices are
> always 1-indexed IIUC, I don't think it would make sense to read 0-indexed
> indices. It sounds like that input is not properly formatted, unless anyone
> knows to the contrary?
>


Re: Contributing to MLlib

2014-07-02 Thread Evan R. Sparks
Hi there,

Generally we try to avoid duplicating logic if possible, particularly for
algorithms that share a great deal of algorithmic similarity. See, for
example, the way we implement Logistic regression vs. Linear regression vs.
Linear SVM with different gradient functions all on top of SGD or L-BFGS.
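
(The shared abstraction, lightly simplified from mllib.optimization, is just:

    import org.apache.spark.mllib.linalg.Vector

    // Each GLM supplies only this; the SGD / L-BFGS loops stay generic.
    abstract class Gradient extends Serializable {
      /** Gradient and loss at one example, given the current weights. */
      def compute(data: Vector, label: Double, weights: Vector): (Vector, Double)
    }

LogisticGradient, LeastSquaresGradient, and HingeGradient are the concrete
instances behind those three models.)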

Based on my (brief) look at the FCM algorithm, it appears that the main
difference is the ability to assign a weight vector associating the degree
of relationship of a given point to some centroid. My guess is that you can
figure out a way to inherit much of the K-Means logic in an algorithm for
FCM.

Regardless, if you'd like to add an algorithm, please create a JIRA ticket
for it and then send a pull request which references that JIRA where we can
discuss the specifics of that implementation and whether it is of broad
enough interest to warrant inclusion in the library.

- Evan


On Wed, Jul 2, 2014 at 11:02 AM, salexln  wrote:

> guys??? anyone???
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-tp7125p7155.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>


Re: Any plans for new clustering algorithms?

2014-04-21 Thread Evan R. Sparks
While DBSCAN and others would be welcome contributions, I couldn't agree
more with Sean.




On Mon, Apr 21, 2014 at 8:58 AM, Sean Owen  wrote:

> Nobody asked me, and this is a comment on a broader question, not this
> one, but:
>
> In light of a number of recent items about adding more algorithms,
> I'll say that I personally think an explosion of algorithms should
> come after the MLlib "core" is more fully baked. I'm thinking of
> finishing out the changes to vectors and matrices, for example. Things
> are going to change significantly in the short term as people use the
> algorithms and see how well the abstractions do or don't work. I've
> seen another similar project suffer mightily from too many algorithms
> too early, so maybe I'm just paranoid.
>
> Anyway, long-term, I think lots of good algorithms is a right and
> proper goal for MLlib, myself. Consistent approaches, representations
> and APIs will make or break MLlib much more than having or not having
> a particular algorithm. With the plumbing in place, writing the algo
> is the fun easy part.
> --
> Sean Owen | Director, Data Science | London
>
>
> On Mon, Apr 21, 2014 at 4:39 PM, Aliaksei Litouka
>  wrote:
> > Hi, Spark developers.
> > Are there any plans for implementing new clustering algorithms in MLLib?
> As
> > far as I understand, current version of Spark ships with only one
> > clustering algorithm - K-Means. I want to contribute to Spark and I'm
> > thinking of adding more clustering algorithms - maybe
> > DBSCAN.
> > I can start working on it. Does anyone want to join me?
>


Re: [HELP] ask for some information about public data set

2014-02-25 Thread Evan R. Sparks
Hi hyqgod,

This is probably a better question for the spark user's list than the dev
list (cc'ing user and bcc'ing dev on this reply).

To answer your question, though:

Amazon's Public Datasets Page is a nice place to start:
http://aws.amazon.com/datasets/ - these work well with spark because
they're often stored on s3 (which spark can read from natively) and it's
very easy to spin up a spark cluster on EC2 to begin experimenting with the
data.

There's also a pretty good list of (mostly big) datasets that google has
released over the years here:
http://svonava.com/post/62186512058/datasets-released-by-google

- Evan

On Tue, Feb 25, 2014 at 6:33 PM, 黄远强  wrote:

> Hi all:
> I am a freshman in Spark community. i dream of being a expert in the field
> of big data.  But i have no idea where to start after i have gone through
> the published  documents in Spark website and examples in  Spark source
> code.  I want to know if there are some public data set in the internet
> that can be utilized  to learn Spark and test my some new ideas base on
> Spark.
>   Thanks a lot.
>
>
> ---
> Best regards
> hyqgod


Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-02-25 Thread Evan R. Sparks
Hi everyone,

Sorry I'm late to the thread here, but I want to point out a few things.
This is, of course, a most welcome contribution and it will be immediately
useful to everything currently using the stochastic gradient optimizers!

1) I'm all for refactoring the optimization methods to make them a little
more general - perhaps there should be a "FirstOrderUpdater" that
subclasses Updater, takes things like step size as parameters, but still
has a "compute" method. While the updater APIs are public, I'd be
surprised if anyone is using them directly; instead, I'd expect most people
to be using the APIs that rely on them (namely the SVM, Logistic, and
Linear regression classes). It should *definitely* be possible to keep the
loss function we're minimizing separate from the optimization method.
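
For reference, today's contract is roughly the following (simplified from
mllib.optimization.Updater - the stepSize/iter arguments are what bake a
first-order step schedule into the interface; the refactoring names above
are sketches, not existing classes):

    import org.apache.spark.mllib.linalg.Vector

    // Returns the new weights and the regularization penalty. A refactor
    // might keep this signature on a "FirstOrderUpdater" for SGD while a
    // plainer regularization-only interface serves L-BFGS.
    abstract class Updater extends Serializable {
      def compute(weightsOld: Vector, gradient: Vector,
                  stepSize: Double, iter: Int, regParam: Double): (Vector, Double)
    }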

2) That said - LBFGS *does* rely on being able to take gradients of the
loss function, and is a gradient based method.

3) On that note, in your branch, the access pattern (and code pattern) for
L-BFGS is basically identical to the code for minibatchSGD - I may be
missing something, but we really should try to factor out the parts that
are the same and avoid duplicating this logic. I *think* coming up with an
LBFGSUpdater with an appropriate compute method is all we need
(particularly since we're keeping track of the loss history), but I might
be wrong here.

4) In general, I think we should think about incurring technical debt
through duplicated (in functionality) code in the codebase (e.g. yet
another Vector sum/multiply class) and code written in languages other than
Scala in MLlib - the fortran/C++ implementations of L-BFGS aren't doing
anything magical, and as long as we can get similar performance it will be
much easier to maintain if everything is in Scala (with some critical bits
in other languages - but I don't think this falls into that case).
Additionally, we should think about whether we really need these additional
dependencies. While I'm sure Mallet is great, I'm a little worried that
adding it as a dependency for one or two functions we could pretty easily
reimplement might be a little heavy and present problems in the future.
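
(For what it's worth, a pure-Scala path already exists via breeze's optimize
package - a minimal usage sketch on a toy quadratic:

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // Minimize ||x - 3||^2; any objective only needs to return
    // (loss, gradient) from calculate().
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val diff = x - 3.0
        (diff dot diff, diff * 2.0)
      }
    }
    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
    val xOpt = lbfgs.minimize(f, DenseVector.zeros[Double](5))

Whether its convergence matches the Fortran implementations is exactly the
kind of thing the comparisons discussed in this thread should tell us.)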

Anyway, you should submit a PR and we can work on it!

- Evan




On Tue, Feb 25, 2014 at 5:52 PM, Debasish Das wrote:

> Hi DB,
>
> Could you please point me to your spark PR ?
>
> Thanks.
> Deb
>
>
> On Tue, Feb 25, 2014 at 5:03 PM, DB Tsai  wrote:
>
> > Hi Deb, Xiangrui
> >
> > I just moved the LBFGS code to maven central, and cleaned up the code
> > a little bit.
> >
> > https://github.com/AlpineNow/incubator-spark/commits/dbtsai-LBFGS
> >
> > After looking at Mallet, the api is pretty simple, and it's probably
> > can be easily tested
> > based on my PR.
> >
> > It will be tricky to just benchmark the time of optimizers by
> > excluding the parallel gradientSum
> > and lossSum computation, and I don't have a good approach yet. Let's
> > compare the accuracy for the time being.
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > Machine Learning Engineer
> > Alpine Data Labs
> > --
> > Web: http://alpinenow.com/
> >
> >
> > On Tue, Feb 25, 2014 at 12:07 PM, Debasish Das  >
> > wrote:
> > > Hi DB,
> > >
> > > I am considering building on your PR and adding Mallet as the dependency
> so
> > > that we can run some basic comparison tests on large-scale sparse
> > datasets
> > > that I have.
> > >
> > > In the meantime, let's discuss if there are other optimization packages
> > > that we should try.
> > >
> > > My wishlist has bounded bfgs as well and I will add it to the PR.
> > >
> > > About the PR getting merged to mllib, we can plan that later.
> > >
> > > Thanks.
> > > Deb
> > >
> > >
> > >
> > > On Tue, Feb 25, 2014 at 11:36 AM, DB Tsai 
> wrote:
> > >
> > >> I found a comparison between the Mallet and Fortran versions. The result
> > >> is close but not the same.
> > >>
> > >>
> >
> http://t3827.ai-mallet-development.aitalk.info/help-with-l-bfgs-t3827.html
> > >>
> > >> Here is LBFGS-B
> > >> Cost: 0.6902411220175793
> > >> Gradient: -5.453609E-007, -2.858372E-008, -1.369706E-007
> > >> Theta: -0.014186210102171406, -0.303521206706629,
> -0.018132348904129902
> > >>
> > >> And Mallet LBFGS (Tollerance .001)
> > >> Cost: 0.6902412268833071
> > >> Gradient: 0.000117, -4.615523E-005, 0.000114
> > >> Theta: -0.013914961040040107, -0.30419883021414335,
> > -0.016838481937958744
> > >>
> > >> So this shows me that Mallet is close, but plain ol' Gradient Descent
> > >> and LBFGS-B are really close.
> > >> I see that Mallet also has a "LineOptimizer" and "Evaluator" that I
> > >> have yet to explore...
> > >>
> > >> Sincerely,
> > >>
> > >> DB Tsai
> > >> Machine Learning Engineer
> > >> Alpine Data Labs
> > >> --
> > >> Web: http://alpinenow.com/
> > >>
> > >>
> > >> On Tue, Feb 25, 2014 at 11:16 AM, DB Tsai 
> wrote:
> > >> > Hi Deb,
> > >> >
> > >> > On Tue, Feb 25, 2014 at 7:07 AM, Debasish Da