Re: renaming SchemaRDD -> DataFrame

2015-01-27 Thread Dmitriy Lyubimov
It has been pretty evident for some time that that's what it is, hasn't it?

Yes, that's a better name IMO.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin  wrote:

> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more and
> more users to be programming directly against SchemaRDD API rather than the
> core RDD API. SchemaRDD, through its less commonly used DSL originally
> designed for writing test cases, has always had a data-frame-like API. In
> 1.3, we are redesigning the API to make it usable for end users.
>
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc), and
> calling it Schema*RDD* while it is not an RDD is highly confusing. Instead,
> DataFrame.rdd will return the underlying RDD for all RDD methods.
>
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because it is not well documented. However,
> to maintain backward compatibility, we can create a type alias named
> SchemaRDD for DataFrame. This will maintain source compatibility for
> Scala. That said, we will have to update all existing materials to use
> DataFrame rather than SchemaRDD.
>
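For illustration, a minimal sketch of what such a source-compatibility alias
could look like in Scala; the package-object placement and deprecation message
are assumptions, not the actual Spark 1.3 code:

  package org.apache.spark

  package object sql {
    // Old SchemaRDD-based user code keeps compiling; it now refers to DataFrame.
    @deprecated("use DataFrame", "1.3.0")
    type SchemaRDD = DataFrame
  }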


Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Dmitriy Lyubimov
Alexander,

does using netlib imply that one cannot switch between CPU and GPU blas
alternatives at will at the same time? the choice is always determined by
linking alternatives to libblas.so, right?

On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander 
wrote:

> Hi again,
>
> I finally managed to use nvblas within Spark+netlib-java. It has
> exceptional performance for big matrices with Double, faster than
> BIDMat-cuda with Float. But for smaller matrices, if you have to copy them
> to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with
> the original nvblas presentation at GPU conf 2013 (slide 21):
> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>
> My results:
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Just in case: these tests are not meant to generalize the performance of
> different libraries. I just want to pick the library that does dense
> matrix multiplication best for my task.
>
> P.S. My previous issue with nvblas was the following: it exposes Fortran blas
> functions, while netlib-java uses C cblas functions. So, one needs a cblas
> shared library to use nvblas through netlib-java. Fedora does not have cblas
> (but Debian and Ubuntu do), so I needed to compile it. I could not use the
> cblas from ATLAS or OpenBLAS because they link to their own implementation
> and not to the Fortran blas.
>
> Best regards, Alexander
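A quick way to check which flavor of symbols a BLAS library actually exports,
and hence whether a cblas shim is needed; the library paths below are examples,
not the exact ones from this setup:

  # Fortran-style BLAS symbols, which is what nvblas provides
  nm -D /usr/local/cuda-6.5/lib64/libnvblas.so | grep -i dgemm
  # C-style CBLAS symbols, which is what netlib-java's native stub calls
  nm -D /usr/lib64/libcblas.so.3 | grep cblas_dgemm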
>
> -Original Message-
> From: Ulanov, Alexander
> Sent: Tuesday, March 24, 2015 6:57 PM
> To: Sam Halliday
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
> Hi,
>
> I am trying to use nvblas with netlib-java from Spark. nvblas functions
> should replace current blas functions calls after executing LD_PRELOAD as
> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
> changes to netlib-java. It seems to work for simple Java example, but I
> cannot make it work with Spark. I run the following:
> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
> --driver-memory 4G
> In nvidia-smi I observe that Java is set to use the GPU:
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                       GPU Memory |
> |  GPU       PID  Type  Process name                               Usage      |
> |=============================================================================|
> |    0      8873     C  bash                                            39MiB |
> |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
> +-----------------------------------------------------------------------------+
>
> In Spark shell I do matrix multiplication and see the following:
> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
> So I am sure that netlib-native is loaded and cblas supposedly used.
> However, matrix multiplication executes on the CPU, since I see 16% of CPU
> used and 0% of GPU used. I also checked different matrix sizes, from
> 100x100 to 12000x12000
>
> Could you suggest why LD_PRELOAD might not affect the Spark shell?
>
> Best regards, Alexander
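For reference, a sketch of the kind of check one can run in spark-shell to see
whether the system BLAS (and thus nvblas via LD_PRELOAD) is actually being hit;
the matrix size and the use of netlib-java's BLAS singleton are illustrative
assumptions, not the exact test from the message above:

  import com.github.fommil.netlib.BLAS

  val n = 2048
  val a = Array.fill(n * n)(scala.util.Random.nextDouble())
  val b = Array.fill(n * n)(scala.util.Random.nextDouble())
  val c = new Array[Double](n * n)

  val blas = BLAS.getInstance()
  // should report a native implementation, e.g. NativeSystemBLAS, not F2jBLAS
  println(s"BLAS implementation: ${blas.getClass.getName}")

  val t0 = System.nanoTime()
  blas.dgemm("N", "N", n, n, n, 1.0, a, n, b, n, 0.0, c, n)
  println(s"dgemm took ${(System.nanoTime() - t0) / 1e9} s")

Watching nvidia-smi while this runs shows whether the work lands on the GPU.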
>
>
>
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Monday, March 09, 2015 6:01 PM
> To: Ulanov, Alexander
> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
> Subject: RE: Using CUDA within Spark / boosting linear algebra
>
>
> Thanks so much for following up on this!
>
> Hmm, I wonder if we should have a concerted effort to chart performance on
> various pieces of hardware...
> On 9 Mar 2015 21:08, "Ulanov, Alexander" <alexander.ula...@hp.com> wrote:
> Hi Everyone, I've updated the benchmark as Xiangrui suggested. Added the
> comment that BIDMat 0.9.7 uses Float matrices in GPU (although I see the
> support of Double in the current source code), did the test with BIDMat and
> CPU Double matrices. BIDMat MKL is indeed on par with netlib MKL.
>
>
> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>
> Best regards, Alexander
>
> -Original Message-
> From: Sam Halliday [mailto:sam.halli...@gmail.com]
> Sent: Tuesday, March 03, 2015 1:54 PM
> To: Xiangrui Meng; Joseph Bradley
> Cc: Evan R. Sparks; Ulanov, Alexander; dev@spark.apache.org
> Subject: Re: Using CUDA within Spark / boosting linear algebra
>
> BTW, is anybody on this list going to the London Meetup in a few weeks?
>
>
> https://skillsmatter.com/meetups/6987-apache-spark-living-the-post-mapreduce-world#community
>
> Would be nice to meet other people working on the guts of Spark! :-)
>
>
> Xiangrui Meng <men...@gmail.com> writes:
>
> > Hey Alexander,
> >
> > I don't quite understand the part wh

Re: Using CUDA within Spark / boosting linear algebra

2015-03-25 Thread Dmitriy Lyubimov
Sam,

would it be easier to hack netlib-java to allow multiple (configurable)
library contexts, and so enable 3rd-party configurations and optimizers to
make their own choices until then?

On Wed, Mar 25, 2015 at 3:07 PM, Sam Halliday 
wrote:

> Yeah, MultiBLAS... it is dynamic.
>
> Except, I haven't written it yet :-P
> On 25 Mar 2015 22:06, "Ulanov, Alexander"  wrote:
>
>>  Netlib knows nothing about GPU (or CPU), it just uses cblas symbols
>> from the provided libblas.so.3 library at the runtime. So, you can switch
>> at the runtime by providing another library. Sam, please suggest if there
>> is another way.
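To make that concrete, two common ways to swap the libblas implementation at
runtime without touching netlib-java; the update-alternatives name follows the
Debian/Ubuntu convention, and the library paths are assumptions:

  # Debian/Ubuntu: repoint the system-wide libblas.so.3 alternative
  sudo update-alternatives --config libblas.so.3

  # Any distro: preload the desired library for one process only
  env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
  env LD_PRELOAD=/opt/OpenBLAS/lib/libopenblas.so ./spark-shell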
>>
>>
>>
>> *From:* Dmitriy Lyubimov [mailto:dlie...@gmail.com]
>> *Sent:* Wednesday, March 25, 2015 2:55 PM
>> *To:* Ulanov, Alexander
>> *Cc:* Sam Halliday; dev@spark.apache.org; Xiangrui Meng; Joseph Bradley;
>> Evan R. Sparks; jfcanny
>> *Subject:* Re: Using CUDA within Spark / boosting linear algebra
>>
>>
>>
>> Alexander,
>>
>>
>>
>> does using netlib imply that one cannot switch between CPU and GPU blas
>> alternatives at will at the same time? the choice is always determined by
>> linking alternatives to libblas.so, right?
>>
>>
>>
>> On Wed, Mar 25, 2015 at 2:31 PM, Ulanov, Alexander <
>> alexander.ula...@hp.com> wrote:
>>
>> Hi again,
>>
>> I finally managed to use nvblas within Spark+netlib-java. It has
>> exceptional performance for big matrices with Double, faster than
>> BIDMat-cuda with Float. But for smaller matrices, if you have to copy them
>> to/from the GPU, OpenBLAS or MKL might be a better choice. This correlates with
>> the original nvblas presentation at GPU conf 2013 (slide 21):
>> http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3108-New-Features-CUDA%206%20-GPU-Acceleration.pdf
>>
>> My results:
>>
>> https://docs.google.com/spreadsheets/d/1lWdVSuSragOobb0A_oeouQgHUMx378T9J5r7kwKSPkY/edit?usp=sharing
>>
>> Just in case: these tests are not meant to generalize the performance of
>> different libraries. I just want to pick the library that does dense
>> matrix multiplication best for my task.
>>
>> P.S. My previous issue with nvblas was the following: it exposes Fortran blas
>> functions, while netlib-java uses C cblas functions. So, one needs a cblas
>> shared library to use nvblas through netlib-java. Fedora does not have cblas
>> (but Debian and Ubuntu do), so I needed to compile it. I could not use the
>> cblas from ATLAS or OpenBLAS because they link to their own implementation
>> and not to the Fortran blas.
>>
>> Best regards, Alexander
>>
>> -Original Message-
>> From: Ulanov, Alexander
>>
>> Sent: Tuesday, March 24, 2015 6:57 PM
>> To: Sam Halliday
>> Cc: dev@spark.apache.org; Xiangrui Meng; Joseph Bradley; Evan R. Sparks
>> Subject: RE: Using CUDA within Spark / boosting linear algebra
>>
>> Hi,
>>
>> I am trying to use nvblas with netlib-java from Spark. nvblas functions
>> should replace current blas functions calls after executing LD_PRELOAD as
>> suggested in http://docs.nvidia.com/cuda/nvblas/#Usage without any
>> changes to netlib-java. It seems to work for simple Java example, but I
>> cannot make it work with Spark. I run the following:
>> export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64
>> env LD_PRELOAD=/usr/local/cuda-6.5/lib64/libnvblas.so ./spark-shell
>> --driver-memory 4G
>> In nvidia-smi I observe that Java is set to use the GPU:
>>
>> +-----------------------------------------------------------------------------+
>> | Processes:                                                       GPU Memory |
>> |  GPU       PID  Type  Process name                               Usage      |
>> |=============================================================================|
>> |    0      8873     C  bash                                            39MiB |
>> |    0      8910     C  /usr/lib/jvm/java-1.7.0/bin/java                39MiB |
>> +-----------------------------------------------------------------------------+
>>
>> In Spark shell I do matrix multiplication and see the following:
>> 15/03/25 06:48:01 INFO JniLoader: successfully loaded
>> /tmp/jniloader8192964377009965483netlib-native_system-linux-x86_64.so
>> So I am sure that netlib-native is loaded and cblas supposedly used.
>> However, matrix multiplication executes on the CPU, since I see 16% of CPU
>> used and 0% of GPU used. I also checked different matrix sizes, from
>> 100x100 to 12000x12000
>>
>> Could you suggest might the L

Double hbase dependency in Spark 0.9.1

2014-04-17 Thread Dmitriy Lyubimov
Not sure if I am seeing double.

SparkBuild.scala for 0.9.1 has a double hbase declaration:

  "org.apache.hbase" %  "hbase"   % "0.94.6"
excludeAll(excludeNetty, excludeAsm),
  "org.apache.hbase" % "hbase" % HBASE_VERSION excludeAll(excludeNetty,
excludeAsm),


as a result I am not getting the right version of hbase here. Perhaps the
old declaration crept in during a merge at some point?
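Presumably the fix is to drop the hard-coded 0.94.6 line and keep only the
declaration driven by HBASE_VERSION; a sketch of the intended single entry,
assuming nothing else depends on the pinned version:

  "org.apache.hbase" % "hbase" % HBASE_VERSION excludeAll(excludeNetty,
    excludeAsm),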

-d


Re: Kryo not default?

2014-05-13 Thread Dmitriy Lyubimov
On Mon, May 12, 2014 at 2:47 PM, Anand Avati  wrote:

> Hi,
> Can someone share the reason why Kryo serializer is not the default?

why should it be?

On top of that, the only way to serialize a closure to the backend (even
now) is java serialization (which means java serialization is required of
all closure attributes).


> Is
> there anything to be careful about (because of which it is not enabled by
> default)?
>

Yes. It kind of stems from the above. There are still a number of api calls
that use closure attributes to serialize data to the backend (see fold(), for
example), which means that even if you enable kryo, some apis still require
java serialization of an attribute.

I fixed parallelize(), collect() and something else that I don't remember in
that regard, but I think even up till now there are still a number of apis
lingering whose data parameters wouldn't work with kryo.
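For context, a minimal sketch of enabling Kryo in this era of Spark; the
configuration key and class names are the standard ones, but treat the exact
setup as an illustrative assumption:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("kryo-example")
    .setMaster("local[2]")
    // Data serialization (shuffle, caching) switches to Kryo...
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // ...but closures are still serialized with Java serialization, so anything
  // captured by a closure must remain java.io.Serializable.
  val sc = new SparkContext(conf)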


> Thanks!
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Hector, could you share the references for hierarchical K-means? thanks.


On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:

> I would say for bigdata applications the most useful would be hierarchical
> k-means with back tracking and the ability to support k nearest centroids.
>
>
> On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling  wrote:
>
> > Hi all,
> >
> > MLlib currently has one clustering algorithm implementation, KMeans.
> > It would benefit from having implementations of other clustering
> > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > Clustering, and Affinity Propagation.
> >
> > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > and I saw an email on this list about interest in implementing Fuzzy
> > C-Means.
> >
> > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > apparent that before I implement more clustering algorithms, it would
> > be useful to hammer out a framework to reduce code duplication and
> > implement a consistent API.
> >
> > I'd like to gauge the interest and goals of the MLlib community:
> >
> > 1. Are you interested in having more clustering algorithms available?
> >
> > 2. Is the community interested in specifying a common framework?
> >
> > Thanks!
> > RJ
> >
> > [1] - https://github.com/apache/spark/pull/1248
> >
> >
> > --
> > em rnowl...@gmail.com
> > c 954.496.2314
> >
>
>
>
> --
> Yee Yang Li Hector 
> *google.com/+HectorYee *
>


Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-08 Thread Dmitriy Lyubimov
Sure. The more interesting problem here is choosing k at each level. Kernel
methods seem to be the most promising.


On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee  wrote:

> No idea, never looked it up. Always just implemented it as doing k-means
> again on each cluster.
>
> FWIW standard k-means with euclidean distance has problems too with some
> dimensionality reduction methods. Swapping out the distance metric with
> negative dot or cosine may help.
>
> Another useful clustering approach would be hierarchical SVD. The reason I
> like hierarchical clustering is that it makes for faster inference, especially
> over billions of users.
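A rough sketch of the "k-means again on each cluster" idea using MLlib's
KMeans, purely for illustration; the depth and size cutoffs and the recursion
shape are assumptions, not Hector's actual implementation:

  import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // Run k-means, then recursively re-cluster each resulting subset.
  def hierarchicalKMeans(data: RDD[Vector], k: Int, depth: Int,
                         minSize: Long = 1000L): Seq[KMeansModel] = {
    if (depth <= 0 || data.count() < minSize) Seq.empty
    else {
      val model = KMeans.train(data, k, 20 /* maxIterations */)
      model +: (0 until k).flatMap { cluster =>
        // caching `data` beforehand would avoid recomputing it at each level
        val subset = data.filter(v => model.predict(v) == cluster)
        hierarchicalKMeans(subset, k, depth - 1, minSize)
      }
    }
  }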
>
>
> On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov 
> wrote:
>
> > Hector, could you share the references for hierarchical K-means? thanks.
> >
> >
> > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee  wrote:
> >
> > > I would say for bigdata applications the most useful would be
> > hierarchical
> > > k-means with back tracking and the ability to support k nearest
> > centroids.
> > >
> > >
> > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling 
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > > It would benefit from having implementations of other clustering
> > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > > Clustering, and Affinity Propagation.
> > > >
> > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > > and I saw an email on this list about interest in implementing Fuzzy
> > > > C-Means.
> > > >
> > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > > apparent that before I implement more clustering algorithms, it would
> > > > be useful to hammer out a framework to reduce code duplication and
> > > > implement a consistent API.
> > > >
> > > > I'd like to gauge the interest and goals of the MLlib community:
> > > >
> > > > 1. Are you interested in having more clustering algorithms available?
> > > >
> > > > 2. Is the community interested in specifying a common framework?
> > > >
> > > > Thanks!
> > > > RJ
> > > >
> > > > [1] - https://github.com/apache/spark/pull/1248
> > > >
> > > >
> > > > --
> > > > em rnowl...@gmail.com
> > > > c 954.496.2314
> > > >
> > >
> > >
> > >
> > > --
> > > Yee Yang Li Hector <http://google.com/+HectorYee>
> > > *google.com/+HectorYee <http://google.com/+HectorYee>*
> > >
> >
>
>
>
> --
> Yee Yang Li Hector <http://google.com/+HectorYee>
> *google.com/+HectorYee <http://google.com/+HectorYee>*
>


"log" overloaded in SparkContext/ Spark 1.0.x

2014-08-04 Thread Dmitriy Lyubimov
it would seem that code like

import o.a.spark.SparkContext._
import math._



val a = log(b)

does not seem to compile anymore with Spark 1.0.x since SparkContext._ also
exposes a `log` function. Which happens a lot to a guy like me.

obvious workaround is to use something like

import o.a.spark.SparkContext.{log => sparkLog,  _}

but wouldn't it be easier just to avoid such an expected clash in the first
place?

thank you.
-d


Unit test best practice for Spark-derived projects

2014-08-05 Thread Dmitriy Lyubimov
Hello,

I've been switching Mahout from Spark 0.9 to Spark 1.0.x [1] and noticed
that tests now run much slower compared to 0.9 with CPU running idle most
of the time. I had to conclude that most of that time is spent on tearing
down/resetting the Spark context, which apparently now takes significantly
longer in local mode than before.

Q1 -- Is there a way to mitigate long session startup times with a local
context?

Q2 -- Our unit tests are basically mixing in a rip-off of
LocalSparkContext, and we are using local[3]. Looking into the 1.0.x code, I
noticed that a lot of Spark unit test code has switched to
SharedSparkContext (i.e. no context reset between individual tests). Is
that now the recommended practice for writing Spark-based unit tests?
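For reference, a minimal sketch of a SharedSparkContext-style ScalaTest mixin
(one context per suite instead of per test); the trait name mirrors Spark's,
but the body is an illustrative assumption, not Spark's actual test code:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.scalatest.{BeforeAndAfterAll, Suite}

  trait SharedSparkContext extends BeforeAndAfterAll { self: Suite =>
    @transient protected var sc: SparkContext = _

    override def beforeAll(): Unit = {
      super.beforeAll()
      // one context for the whole suite avoids repeated startup/teardown cost
      sc = new SparkContext(
        new SparkConf().setMaster("local[3]").setAppName(suiteName))
    }

    override def afterAll(): Unit = {
      if (sc != null) sc.stop()
      sc = null
      super.afterAll()
    }
  }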

Q3 -- Any other reasons that i may have missed for degraded test
performance?


[1] https://github.com/apache/mahout/pull/40

thank you in advance.
-Dmitriy


Re: Unit test best practice for Spark-derived projects

2014-08-07 Thread Dmitriy Lyubimov
Thanks.

Let me check this hypothesis (I have a DHCP connection on a private net and
consequently am not sure if there's a reverse DNS entry for it).


On Thu, Aug 7, 2014 at 10:29 AM, Madhu  wrote:

> How long does it take to get a spark context?
> I found that if you don't have a network connection (reverse DNS lookup most
> likely), it can take up to 30 seconds to start up locally. I think a hosts
> file entry is sufficient.
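For example, a hosts file entry that maps the machine's hostname to a local
address, so the reverse lookup resolves immediately; the hostname below is a
placeholder:

  # /etc/hosts
  127.0.0.1   localhost
  127.0.1.1   my-dev-box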
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Unit-test-best-practice-for-Spark-derived-projects-tp7704p7731.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>