I think it would be great to port our kMeans implementation to Spark. It should be done using Dmitriy's DSL, similar to what I'm trying in https://issues.apache.org/jira/browse/MAHOUT-1464
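
Roughly, one Lloyd iteration in that DSL could look like the sketch below. This is only my sketch, not the MAHOUT-1464 code: all names are made up, and I'm assuming the drmBroadcast/mapBlock/%*%/colSums primitives as described in the ScalaSparkBindings manual (package names may differ, and an implicit distributed context is assumed to be in scope).

// Sketch only: one Lloyd (k-means) iteration over a DRM of data points.
import org.apache.mahout.math.{DenseMatrix, Matrix}
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.sparkbindings.drm._
import org.apache.mahout.sparkbindings.drm.RLikeDrmOps._

def lloydIteration(drmData: DrmLike[Int], centroids: Matrix, k: Int): Matrix = {
  // assumes an implicit Mahout distributed context in scope
  val bcCentroids = drmBroadcast(centroids)

  // n x k indicator matrix: c(i, j) = 1 iff row i is nearest to centroid j.
  val drmC = drmData.mapBlock(ncol = k) { case (keys, block) =>
    val c = bcCentroids.value
    val indicators = new DenseMatrix(block.nrow, k)
    for (r <- 0 until block.nrow) {
      val row = block(r, ::)
      val nearest = (0 until k).minBy(j => (row - c(j, ::)).norm(2))
      indicators(r, nearest) = 1.0
    }
    keys -> indicators
  }

  // Cluster sums arrive as the small k x m product C' A; cluster sizes are
  // the column sums of C. Both reductions run on the cluster.
  val sums = (drmC.t %*% drmData).collect
  val counts = drmC.colSums()
  for (j <- 0 until k if counts(j) > 0)
    sums(j, ::) := sums(j, ::) / counts(j)
  sums // updated centroids
}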

On 03/19/2014 07:56 AM, chalitha udara Perera wrote:
Hi Dmitriy,

I agree with you that I need to be more specific on this matter. Here I was referring to suggestions (b) and (c) given by Suneel on the Mahout 1.0 goals [1].

For example, here is one thing I experienced while using Mahout clustering. I have used both simple k-means and spectral k-means: for simple k-means the input is a sequence file containing the tf-idf vectors of the documents, while for spectral k-means it is a CSV file defining the similarity matrix. It would be much easier for users if spectral k-means also took the tf-idf vectors and created the similarity matrix internally; I think that would improve usability.
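
To make that concrete, something along these lines could sit in front of spectral k-means. This is only a sketch: drmTfIdf is a placeholder name, I'm assuming the mapBlock/%*% primitives of the new Scala DSL, and a dense n x n product would of course need thresholding or nearest-neighbor search in practice to keep the affinity matrix sparse.

// Sketch: derive a cosine-similarity matrix from tf-idf row vectors so
// that spectral k-means could accept the vectors directly.
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.sparkbindings.drm._
import org.apache.mahout.sparkbindings.drm.RLikeDrmOps._

val drmNormed = drmTfIdf.mapBlock() { case (keys, block) =>
  // normalize each row to unit length
  for (r <- 0 until block.nrow) {
    val n = block(r, ::).norm(2)
    if (n > 0) block(r, ::) := block(r, ::) / n
  }
  keys -> block
}
// with unit rows, A A' holds the pairwise cosine similarities
val drmSimilarity = drmNormed %*% drmNormed.t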

Also, most of these algorithms are designed to run via the command line. I know that currently a lot of programmers just use the run(String[]) method from their code. I am not saying it is impossible to use Mahout clustering algorithms as required, but it takes some effort: most of the time you need to dive into the code internals to use them properly, and most people are not going to do that. Please provide your valuable insight on this.
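
For reference, the programmatic route today is essentially the CLI in disguise, something like the sketch below (written from memory; the paths are placeholders and the flag set should be double-checked against "mahout kmeans --help"):

// Sketch of driving k-means from code through the Hadoop Tool interface.
import org.apache.hadoop.util.ToolRunner
import org.apache.mahout.clustering.kmeans.KMeansDriver

val exitCode = ToolRunner.run(new KMeansDriver(), Array(
  "-i", "/data/tfidf-vectors",    // sequence file of tf-idf vectors
  "-c", "/data/initial-clusters", // seed clusters directory
  "-o", "/data/kmeans-out",
  "-k", "20",                     // sample 20 random seeds into -c
  "-x", "10",                     // max iterations
  "-cl"                           // also assign points after convergence
))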

I am also really interested in the new direction Mahout is heading with Spark, given that interest in Spark will only grow in the near future. If you think implementing some of the clustering algorithms on Spark, for example simple k-means, is more important for the next release, I would be happy to work on that.

Regards,
Chalitha

[1]
http://mail-archives.apache.org/mod_mbox/mahout-dev/201402.mbox/%3c1393554632.3930.yahoomail...@web160202.mail.bf1.yahoo.com%3E



On Wed, Mar 19, 2014 at 11:39 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

I think you need to be a little bit more specific as to what you are
proposing exactly. I think "uniform clustering api" needs a bit of
elaboration. I generally cannot say that I have experienced any pain calling
clustering algorithms, say in R, as well-documented functions. In Mahout,
just doing the same was the primary pain; but assuming one can call them with
ease, and even interactively, I can't say I have experienced any major
inconvenience with just doing that.

I guess one can see that the notions of clusters and clustering output can be
abstracted away, but I don't have enough experience to tell whether it is
a good idea to try to cover _any_ possible clustering methodology.


On Tue, Mar 18, 2014 at 10:50 PM, chalitha udara Perera <chalithaud...@gmail.com> wrote:

Hi everyone,

I greatly appreciate your interest in this issue. I have gone through the ScalaSparkBindings document [1]. In this project, my initial idea was to provide a high-level API for end-user programmers, so that they have the flexibility to plug in different types of algorithms without worrying about the underlying details of the different input and output formats. I also consider proper test coverage for all clustering algorithms a must for the 1.0 release.
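
As a strawman for what that high-level API could look like (purely a sketch of mine, not a settled design; the DrmLike type is from the Scala bindings, though the exact package may differ):

// Strawman only: every clustering algorithm consumes a DRM of feature
// vectors and returns row -> cluster-id assignments, hiding today's
// per-algorithm input formats (sequence file vs. csv similarity matrix).
import org.apache.mahout.sparkbindings.drm.DrmLike

trait ClusteringAlgorithm {
  // Cluster the rows of data into k clusters; returns an n x 1 DRM of
  // cluster ids keyed like data.
  def cluster(data: DrmLike[Int], k: Int): DrmLike[Int]
}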

I would like to get your opinion regarding this, and a little more detail on the current requirements for clustering would help me to improve the proposal.

Thanks,
Chalitha



On Mon, Mar 17, 2014 at 11:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Yes, there's interest.
Note that we are trying to unify linear algebra primitives and optimization
on Spark as well. All new linear algebra and interaction with the Spark
context should probably go through this layer. This is an ongoing thing, but
some stuff is working [1].

[1] MAHOUT-1346 https://issues.apache.org/jira/browse/MAHOUT-1346
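
For a taste of what [1] already allows, a minimal sketch (assuming an implicit Spark/Mahout distributed context is in scope; package names as in the current sparkbindings code):

// Distribute an in-core matrix, compute A'A on the cluster, collect the
// small result back to the driver.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.sparkbindings.drm._
import org.apache.mahout.sparkbindings.drm.RLikeDrmOps._

val inCoreA = dense((1, 2), (3, 4), (5, 6))
val drmA = drmParallelize(inCoreA, numPartitions = 2)
val inCoreAtA = (drmA.t %*% drmA).collect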


On Mon, Mar 17, 2014 at 10:37 AM, chalitha udara Perera <chalithaud...@gmail.com> wrote:

Hi All,

Going through the mail thread on Mahout 1.0 goals, I found that the main focus of Mahout is now on code refactoring and integration with Spark rather than on implementing new algorithms. Recently I used Mahout to implement a document clustering module for a Content Management System.

To be honest, we had some problems with the lack of uniformity among the different clustering algorithms. For example, simple k-means takes as input a sequence file with document TF-IDF vectors, while spectral k-means takes a CSV file that defines the similarity matrix.

I think if we can provide a uniform clustering API, as mentioned in the 1.0 goals, it would be very useful for end-user developers.

I would like to proceed with this idea as my GSoC 2014 project. Please let me know if you are interested in this project.
--
J.M Chalitha Udara Perera

Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka





--
J.M Chalitha Udara Perera

Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka





