Re: kMeans

2008-03-27 Thread Marko Novakovic
OK, thanks for the information.

I wrote an application for this topic. I want to know whether it is
acceptable for Google Summer of Code.

--- Ted Dunning <[EMAIL PROTECTED]> wrote:

> 
> Kmeans can be used to cluster web-sites if you use a cosine measure of
> similarity based on content.
> 
> You can also use the first few eigenvectors of the linkage graph to do
> spectral clustering (this will essentially be a strongly connected
> component analysis).
> 
> Using browse logs can also give you clusters if you look at common
> viewing of pages during particular sessions.  This should mostly
> replicate the linkage graph analysis.
> 
> 
> On 3/26/08 4:02 PM, "Marko Novakovic" <[EMAIL PROTECTED]> wrote:
> 
> > Is it a good idea to apply with a project that integrates the kMeans
> > algorithm for clustering web pages?
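To make the cosine measure concrete: a minimal Java sketch, assuming each
page has already been reduced to a sparse term -> weight map (the class and
method names are illustrative, not Mahout API).

    import java.util.Map;

    public class CosineSimilarity {
        // Cosine similarity between two sparse term-weight vectors.
        public static double cosine(Map<String, Double> a, Map<String, Double> b) {
            double dot = 0.0;
            for (Map.Entry<String, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other; // only shared terms contribute
                }
            }
            double norms = norm(a) * norm(b);
            return norms == 0.0 ? 0.0 : dot / norms;
        }

        private static double norm(Map<String, Double> v) {
            double sum = 0.0;
            for (double x : v.values()) {
                sum += x * x;
            }
            return Math.sqrt(sum);
        }
    }

Pages whose cosine is near 1 share most of their weighted terms, which is
exactly what a content-based kMeans run would group together.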


  



Re: kMeans

2008-03-27 Thread Marko Novakovic
Try looking at the 2007 issues of IEEE Computer; there is a lot of material
there about indexing results. You might find some good references on index
clustering.

--- Khalil Honsali <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> Are there any relevant papers/work about index clustering (not
> search-results clustering)? I wonder whether queries would be affected
> if the index were clustered and distributed somehow?
> 
> K. Honsali
> 
> On 27/03/2008, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >
> > Kmeans can be used to cluster web-sites if you use a cosine measure
> > of similarity based on content.
> >
> > You can also use the first few eigenvectors of the linkage graph to
> > do spectral clustering (this will essentially be a strongly connected
> > component analysis).
> >
> > Using browse logs can also give you clusters if you look at common
> > viewing of pages during particular sessions.  This should mostly
> > replicate the linkage graph analysis.
> >
> >
> >
> > On 3/26/08 4:02 PM, "Marko Novakovic" <[EMAIL PROTECTED]> wrote:
> >
> > > Is it a good idea to apply with a project that integrates the
> > > kMeans algorithm for clustering web pages?



  



Re: [?? Probable Spam] Re: kMeans

2008-03-27 Thread Dawid Weiss


Carrot2 is for clustering web search results -- it's not exactly the same thing.

D.

shunkai.fu wrote:

There is one project called Carrot2 focusing on this topic already.

-----Original Message-----
From: Marko Novakovic [mailto:[EMAIL PROTECTED]
Sent: March 27, 2008 7:03
To: mahout-dev@lucene.apache.org
Subject: kMeans

Is it a good idea to apply with a project that integrates the kMeans
algorithm for clustering web pages?


 







Re: kMeans

2008-03-27 Thread Marko Novakovic
Is it an acceptable solution for Google Summer of Code?

--- Dawid Weiss <[EMAIL PROTECTED]> wrote:

> 
> Carrot2 is for clustering web search results -- it's not exactly the
> same thing.
> 
> D.
> 
> shunkai.fu wrote:
> > There is one project called Carrot2 focusing on this topic already.
> > 
> > -----Original Message-----
> > From: Marko Novakovic [mailto:[EMAIL PROTECTED]
> > Sent: March 27, 2008 7:03
> > To: mahout-dev@lucene.apache.org
> > Subject: kMeans
> > 
> > Is it a good idea to apply with a project that integrates the kMeans
> > algorithm for clustering web pages?



  



Re: kMeans

2008-03-27 Thread Dawid Weiss


Hi Marko,


Is it an acceptable solution for Google Summer of Code?


I don't think it's an acceptable project for Mahout -- Mahout's goals are in 
large-scale data set processing, supported by Map-Reduce. Clustering search 
results is usually in-memory, on-line clustering with few information sources 
(titles, snippets) and the resulting high noise.


That said, what I envisage could be done is to work on data structures that 
could _support_ sensible on-line faceting/clustering of search results, 
similar to what Google supposedly does behind the scenes to reorder search 
results (similar-concept clustering). Building semantic relationships between 
terms or detecting frequently recurring phrases with significantly different 
meanings is definitely interesting and challenging (if not done naively), 
especially at large scale.


Dawid
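As one hedged illustration of the phrase-detection idea (my sketch, nothing
Dawid proposes concretely): the counting step behind finding frequently
recurring phrases can be as simple as tallying adjacent word pairs, which is
exactly the shape of job Map-Reduce handles well at scale.

    import java.util.HashMap;
    import java.util.Map;

    public class BigramCounter {
        // Toy bigram counter: high-count pairs are candidate recurring phrases.
        public static Map<String, Integer> countBigrams(Iterable<String> documents) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String doc : documents) {
                String[] tokens = doc.toLowerCase().split("\\W+");
                for (int i = 0; i + 1 < tokens.length; i++) {
                    String bigram = tokens[i] + " " + tokens[i + 1];
                    Integer c = counts.get(bigram);
                    counts.put(bigram, c == null ? 1 : c + 1);
                }
            }
            return counts;
        }
    }

Distinguishing phrases with significantly different meanings would then need
context statistics on top of the raw counts, which is where it stops being
naive.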


Re : Re : Re : GSoC Evolutionary Algorithm Idea

2008-03-27 Thread deneche abdelhakim
> GA is a special case of evolutionary algorithms in general.
> 
> If we ignore cross-over for the moment, hadoop is ideal for EA in
> general. The natural implementation would have each map input record
> represent a single member of the population.  The mapper would mutate
> this member and evaluate the fitness, outputting all records with a
> random key from a small set (depends on how many reducers we want) and
> the combiner and reducer would sieve out the top N members for each key
> value.  Multiple passes of this would be the way to run the algorithm.
> If the key set has cardinality 1, then this reduces to ordinary
> mutation and selection; if it is larger, then the selection of the top
> N becomes a bit approximate, but should not cause any significant
> problems.  A second pass could be used to reduce to an exact selection
> if needed.  Depending on how the combiner works, using a single reduce
> might be very fast.
> 
> Crossover requires that pairs of items be brought together, roughly at
> random, and might require an extra map-reduce.

This idea is interesting, especially since I am used to Artificial Immune 
Systems, which don't need the crossover operator. But it's more complicated 
than what I proposed, and as I am new to Mahout and Hadoop I think I should 
start with the first idea.

> My own preference with these algorithms is to avoid cross-over and
> focus on meta-mutation where some of the state in the records specifies
> how the mutation should proceed.  This can have dramatic positive
> effects in accelerating convergence by providing somewhat of a
> trade-off against convergence guarantees.  Since everybody gives up the
> convergence guarantees anyway, this is a nice knob to have.  I describe
> one approach that can be very effective in an old paper that can be
> found here: http://arxiv.org/abs/0803.3838 .  I can provide a sample
> implementation in C, but the paper is probably easier to read.

I couldn't access the paper. After registering at arXiv.org, it asks me for 
the ownership password of the paper! Here is what it says:

"We do not allow people other than the authors of an article to claim ownership 
of an article before it has been publicly announced..."

Is the paper available somewhere else?

Abdel Hakim Deneche
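A plain-Java simulation of one pass of the scheme quoted above -- my own
sketch, not Hadoop code; the population members, mutation, and fitness are
throwaway placeholders -- just to make the map/shuffle/reduce shape visible:

    import java.util.*;

    public class EvolutionPass {
        static final int NUM_KEYS = 4;   // stands in for the number of reducers
        static final int TOP_N = 10;     // survivors kept per key
        static final Random RAND = new Random();

        public static List<double[]> onePass(List<double[]> population) {
            // "Map": mutate and score each member, emit under a random key.
            Map<Integer, List<double[]>> shuffled = new HashMap<Integer, List<double[]>>();
            for (double[] member : population) {
                double[] mutant = mutate(member);
                int key = RAND.nextInt(NUM_KEYS);
                if (!shuffled.containsKey(key)) {
                    shuffled.put(key, new ArrayList<double[]>());
                }
                shuffled.get(key).add(mutant);
            }
            // "Reduce": for each key, sieve out the top N by fitness.
            List<double[]> next = new ArrayList<double[]>();
            for (List<double[]> group : shuffled.values()) {
                Collections.sort(group, new Comparator<double[]>() {
                    public int compare(double[] a, double[] b) {
                        return Double.compare(fitness(b), fitness(a)); // descending
                    }
                });
                next.addAll(group.subList(0, Math.min(TOP_N, group.size())));
            }
            return next; // feed back in for the next pass
        }

        static double[] mutate(double[] m) {
            double[] out = m.clone();
            out[RAND.nextInt(out.length)] += RAND.nextGaussian() * 0.1; // placeholder
            return out;
        }

        static double fitness(double[] m) {
            double sum = 0.0; // placeholder objective: maximize -||m||^2
            for (double x : m) {
                sum += x * x;
            }
            return -sum;
        }
    }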






  

Re: kMeans

2008-03-27 Thread Karl Wettin

Marko Novakovic wrote:

Is it a good idea to apply with a project that integrates the kMeans
algorithm for clustering web pages?


(Your question is better suited to the users list than the dev list.)

It depends on your needs, so you need to be more specific about how you 
plan to use the results in order to get a good answer.


But choosing an algorithm to extract clusters is only half of your 
problem. You need to transform the web pages into instance data accepted 
by the clusterer. How much effort do you want to put into that?



karl
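For concreteness, the "other half" Karl mentions -- turning a page into
instance data -- can start as simply as a term-frequency map (a naive
sketch; a real pipeline would strip markup and stopwords first):

    import java.util.HashMap;
    import java.util.Map;

    public class PageVectorizer {
        // Naive page-to-instance transform: bag-of-words term frequencies.
        public static Map<String, Double> toInstance(String pageText) {
            Map<String, Double> tf = new HashMap<String, Double>();
            for (String token : pageText.toLowerCase().split("\\W+")) {
                if (token.length() == 0) {
                    continue;
                }
                Double c = tf.get(token);
                tf.put(token, c == null ? 1.0 : c + 1.0);
            }
            return tf;
        }
    }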



Re: kMeans

2008-03-27 Thread Karl Wettin
At the same time as I sent my reply, I received all the other replies 
that I had not yet read :)


GSoC Evolutionary Algorithm Proposal

2008-03-27 Thread deneche abdelhakim
I've written my proposal, and because I can no longer change it after I submit 
it to GSoC, I am posting it here first; if anyone has suggestions, you are 
welcome. I will wait until Saturday morning to post it to the GSoC.

**
Application for Summer of Code 2008 - Mahout Project

Deneche Abdel Hakim

Codename Mahout.GA


I. Synopsis

I will add a genetic algorithm (GA) for binary classification on large datasets 
to the Mahout project. To save time I will use an existing framework for 
genetic algorithms, WatchMaker [WatchMaker], which has an Apache Software 
License. I will also add a parallelized measure that indicates the quality of 
classification rules on a given dataset; this measure will be available 
independently of the GA. And if I have enough time I will make the GA more 
generic and apply it to a different problem (multiclass classification).


II. Project

A GA works by evolving a population of individuals toward a desired goal. To 
get a satisfying solution, the GA needs to run thousands of iterations with 
hundreds of individuals. For each iteration and individual the fitness is 
calculated; it indicates how close that individual is to the desired solution. 
The main advantage of GAs is their ability to find solutions to problems given 
only a fitness measure (and, of course, sufficient CPU power); this is 
particularly helpful when the problem is complex and no mathematical solution 
is available.

My primary goal is to implement the GA described in [GA]. It uses a fitness 
function that is easy to implement and can benefit from the Map-Reduce strategy 
to exploit distributed computing (when the training dataset is very large). It 
will be available as a ready-to-use tool (Mahout.GA) that discovers binary 
classification rules for any given dataset. Concretely, the main program will 
launch the GA using WatchMaker; each time the GA needs to evaluate the fitness 
of the population, it calls a specific class given by us, and this class will 
configure and launch a Hadoop Job on a distributed cluster.
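A rough sketch of that bridge class, under my reading of the proposal (the
FitnessEvaluator interface is WatchMaker's; the Hadoop job launch is replaced
by a placeholder so the sketch compiles and runs, and candidates are
simplified to Strings):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.uncommons.watchmaker.framework.FitnessEvaluator;

    public class HadoopFitnessEvaluator implements FitnessEvaluator<String> {
        private Map<String, Double> cachedScores;

        public double getFitness(String candidate, List<? extends String> population) {
            // Score the whole population in one distributed pass, then
            // serve the per-candidate lookups from the cached result.
            if (cachedScores == null || !cachedScores.containsKey(candidate)) {
                cachedScores = runFitnessJob(population);
            }
            return cachedScores.get(candidate);
        }

        public boolean isNatural() {
            return true; // higher fitness is better
        }

        private Map<String, Double> runFitnessJob(List<? extends String> population) {
            // Placeholder for the real thing: write the population to HDFS,
            // configure and submit the Map-Reduce job, wait for completion,
            // read the scores back. Here we fake a score so the sketch runs.
            Map<String, Double> scores = new HashMap<String, Double>();
            for (String candidate : population) {
                scores.put(candidate, (double) candidate.length());
            }
            return scores;
        }
    }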

My secondary goal is to make Mahout.GA problem-independent, thus allowing us to 
use it for different problems such as multiclass classification, optimization, 
or clustering. This will be done by implementing a ready-to-use generic fitness 
function for WatchMaker that internally calls Hadoop. As a proof of concept I 
will use it for multiclass classification (if I don't run out of time, of 
course!).


III. Profit for Mahout

1. The GA will be integrated with Mahout as a ready-to-use rule discovery tool 
for binary classification;
2. Explore the integration of existing frameworks with Mahout, for example how 
to design the program in a way that the framework libraries will not be needed 
on the slave nodes (technically it's feasible, but I still need to learn how to 
do it);
3. The parallelized fitness function can be used independently of Mahout.GA. 
It's a good measure of the quality of binary classification rules;
4. Simplify the process of using Mahout.GA for other problems. The user will 
still need to design the solution representation and to implement a fitness 
function, but all the Hadoop stuff should be hidden or at least made simpler;
5. Apply the generalized Mahout.GA to multiclass classification and write a 
corresponding tutorial that explains how to use Mahout.GA to solve new problems.


IV. Success Criteria

Main goals
  1. Implement the parallelized fitness function described in [GA] and validate 
its results on a small dataset;
  2. Implement Mahout.GA for binary classification rule discovery. A simpler 
(non-parallelized) version of this algorithm should also be implemented to 
validate the results of Mahout.GA.

Secondary goals
  1. Allow the parallelized fitness function to be used independently of 
Mahout.GA;
  2. Use Mahout.GA on a different problem (multiclass classification) and write 
a corresponding tutorial.


V. Roadmap

[April 14: accepted students known]
  1. Familiarize myself with Hadoop.
Modify one of the examples of Hadoop to simulate an iterative process. For 
each iteration, a new Job is executed with different parameters, and its 
results are imported back by the program.
  2. Implement the GA without parallelism.
    a. Start by implementing the tutorial example that comes with WatchMaker;
    b. Implement my own Individual and Fitness function classes;
    c. Validate the algorithm using a small dataset, and find the parameters 
that will give acceptable results.
  3. Prepare whatever I may need in the development period.
[May 26: coding starts]
  4. Implement the parallelized fitness function.
    a. Use Hadoop Map-Reduce to implement it [2 weeks];
    b. Validate it on a small dataset [1 week].
  5. Implement Mahout.GA.
    a. Write an intermediary component between WatchMaker and the parallelized 
fitness function. This component takes a population, configures and launches a 
Job, waits for its end

Index clustering (was: kMeans)

2008-03-27 Thread Karl Wettin

Khalil Honsali wrote:

Hello,


Hi Khalil,


Are there any relevant papers/work about index clustering (not search-results
clustering)? I wonder whether queries would be affected if the index were
clustered and distributed somehow?


LUCENE-1025 is a hierarchical clusterer that I later refactored to persist 
the tree in a BDB, so I could build a cluster of a complete index that could 
come up with "more like this" suggestions in an instant. It was sort of slow, 
but the results were not too bad. I never compared it with anything else, 
though. It never became more than a proof of concept.


I'm looking at reimplementing this for Mahout, but I have a hard time 
figuring out whether building the tree is something one wants to do (or even 
can do) using map reduce. The more I think of it, the more I want to solve it 
with a grid.




karl
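For context, the core of such tree building is plain agglomerative
clustering. A toy centroid-linkage sketch (mine, not LUCENE-1025 code) is
below; its O(n^3) merge loop hints at why a whole-index proof of concept was
slow, and why a map-reduce formulation is not obvious:

    import java.util.ArrayList;
    import java.util.List;

    public class AgglomerativeSketch {
        static class Node {
            final Node left, right;    // null for leaves
            final double[] centroid;
            Node(double[] point) { left = null; right = null; centroid = point; }
            Node(Node l, Node r) {
                left = l;
                right = r;
                centroid = new double[l.centroid.length];
                for (int i = 0; i < centroid.length; i++) {
                    centroid[i] = (l.centroid[i] + r.centroid[i]) / 2.0;
                }
            }
        }

        static Node buildTree(List<double[]> points) {
            List<Node> nodes = new ArrayList<Node>();
            for (double[] p : points) {
                nodes.add(new Node(p));
            }
            while (nodes.size() > 1) {
                int bi = 0, bj = 1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < nodes.size(); i++) {       // find the closest pair
                    for (int j = i + 1; j < nodes.size(); j++) {
                        double d = distance(nodes.get(i).centroid, nodes.get(j).centroid);
                        if (d < best) { best = d; bi = i; bj = j; }
                    }
                }
                Node merged = new Node(nodes.get(bi), nodes.get(bj));
                nodes.remove(bj);                              // higher index first
                nodes.remove(bi);
                nodes.add(merged);
            }
            return nodes.get(0);                               // root of the cluster tree
        }

        static double distance(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) {
                s += (a[i] - b[i]) * (a[i] - b[i]);
            }
            return Math.sqrt(s);
        }
    }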


Re: some question about the nips paper

2008-03-27 Thread Niranjan Balasubramanian

On Mar 26, 2008, at 12:06 AM, Hao Zheng wrote:

it's log(P). I just don't know how log(P) is obtained.



I am doing some guesswork here, but I think the log(P) factor arises from 
combining the data that is passed back from the various processors. Imagine 
a tree of processors with depth log(P); if at each level the communication 
amounts to some function of n and m, f(m,n), then the total communication 
cost would be

f(m,n) log(P).

~ Niranjan


On 3/26/08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> On Mar 25, 2008, at 4:11 PM, Isabel Drost wrote:
>
> > > 2. Sect. 4.1, too.
> > > "reduce phase can minimize communication by combining data as it's
> > > passed back; this accounts for the logP factor". Could you help me
> > > figure out how logP is calculated?
> >
> > Anyone else who can help out here?
>
> Isn't this just log(P) where P is the number of cores?  The version of
> the paper I have says log(P) not logP, so maybe there is a typo?
>
> From earlier in 4.1: "We assume that the dimension of the inputs is n
> (i.e., x ∈ R^n), that we have m training examples, and that there are
> P cores."
>
>  -Grant







Re: Regarding Google Summer of Code Lucene Mahout Project

2008-03-27 Thread Jason Rennie
On Tue, Mar 25, 2008 at 3:49 PM, Isabel Drost <[EMAIL PROTECTED]>
wrote:

> The paper looks interesting: the modifications to naive bayes presented
> in the paper seem to lead to a classifier that is comparable to SVM
> performance for text classification while having far better performance.


The text transforms aren't particularly interesting---they're essentially
just a TFIDF pre-processing step.  This botches the multinomial model a bit,
but works well and has some theoretical motivation (in sec. 4).

I thought the discovery of the "skewed data bias" was quite interesting. I 
was never completely satisfied with our solution, Complement Naive Bayes, 
but it did seem to solve the problem and yield improved performance for 
data where the number of examples per class varied widely.

> Seems like you need an algorithm that outputs comparable scores for
> each document and is neither under- nor overconfident. I remember
> vaguely that the vanilla NB had some problems in this respect.


Complement NB gets rid of some of this problem, though Logistic Regression
or Softmax (the multiclass variant) is probably a generally better
solution.  'course LR/Softmax requires optimization whereas (C)NB requires
little more than counting and basic math ops...  easier to implement...

Jason

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
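For anyone curious what "essentially just a TFIDF pre-processing step"
amounts to, here is a hedged Java sketch following the transforms in the
Rennie et al. paper (log term frequency, IDF weighting, then L2 length
normalization); dense arrays indexed by term id and df[w] >= 1 are assumed
for brevity:

    public class TfidfTransform {
        public static double[] transform(double[] termCounts, double[] df, int numDocs) {
            double[] d = new double[termCounts.length];
            double norm = 0.0;
            for (int w = 0; w < d.length; w++) {
                double tf = Math.log(1.0 + termCounts[w]);        // dampen raw counts
                double idf = Math.log((double) numDocs / df[w]);  // rare terms weigh more
                d[w] = tf * idf;
                norm += d[w] * d[w];
            }
            norm = Math.sqrt(norm);
            if (norm > 0.0) {
                for (int w = 0; w < d.length; w++) {
                    d[w] /= norm;                                 // L2 length normalization
                }
            }
            return d;
        }
    }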


Re: some question about the nips paper

2008-03-27 Thread Hao Zheng
Let it be a tree of depth log(P); then the total communication cost will be
f(m,n)*(1/2+1/4+1/8+...)*P = f(m,n)*P, not log(P).
Did I misread what you mean?

2008/3/27 Niranjan Balasubramanian <[EMAIL PROTECTED]>:
> On Mar 26, 2008, at 12:06 AM, Hao Zheng wrote:
>
> > it's log(P). I just don't know how log(P) is obtained.
>
> I am doing some guesswork here, but I think the log(P) factor arises
> from combining the data that is passed back from the various
> processors. Imagine a tree of processors with depth log(P); if at each
> level the communication amounts to some function of n and m, f(m,n),
> then the total communication cost would be
>
> f(m,n) log(P).
>
> ~ Niranjan


Re: some question about the nips paper

2008-03-27 Thread Niranjan Balasubramanian
The cost is f(m,n) at each level in the tree (not at each node).
Thus, you have f(m,n) times the number of levels, which gives f(m,n) log(P).

I might be way off here, but this is my guess.
~ Niranjan



On Mar 27, 2008, at 12:08 PM, Hao Zheng wrote:

> Let it be a tree of depth log(P); then the total communication cost
> will be f(m,n)*(1/2+1/4+1/8+...)*P = f(m,n)*P, not log(P).
> Did I misread what you mean?
>
> 2008/3/27 Niranjan Balasubramanian <[EMAIL PROTECTED]>:
> > On Mar 26, 2008, at 12:06 AM, Hao Zheng wrote:
> >
> > > it's log(P). I just don't know how log(P) is obtained.
> >
> > I am doing some guesswork here, but I think the log(P) factor arises
> > from combining the data that is passed back from the various
> > processors. Imagine a tree of processors with depth log(P); if at
> > each level the communication amounts to some function of n and m,
> > f(m,n), then the total communication cost would be
> >
> > f(m,n) log(P).
> >
> > ~ Niranjan




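To spell out the counting (an added illustration; the paper only states the
factor): the partial statistic each node passes up has a fixed size
proportional to f(m,n), so one combine step costs the same wherever it
happens in the tree, and the critical path from leaves to root has log_2(P)
such steps:

    T_{comm} = \sum_{\ell=1}^{\log_2 P} f(m,n) = f(m,n) \log_2 P

Summing over all P-1 internal nodes instead gives the f(m,n)*P figure from
the earlier message -- that is the total traffic in the cluster -- but those
combines run in parallel, so the elapsed time follows the tree depth, not
the node count.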










Re: Re : Re : Re : GSoC Evolutionary Algorithm Idea

2008-03-27 Thread Ted Dunning

Arxiv can be very strange to use.

I have sent a PDF directly to Deneche and will send a copy to anyone who
needs one.



On 3/27/08 5:55 AM, "deneche abdelhakim" <[EMAIL PROTECTED]> wrote:

>> GA is a special case of evolutionary algorithms in general.
>> 
>> If we ignore cross-over for the moment, hadoop is ideal for EA in
>> general. The natural implementation would have each map input record
>> represent a single member of the population.  The mapper would mutate
>> this member and evaluate the fitness, outputting all records with a
>> random key from a small set (depends on how many reducers we want) and
>> the combiner and reducer would sieve out the top N members for each
>> key value.  Multiple passes of this would be the way to run the
>> algorithm.  If the key set has cardinality 1, then this reduces to
>> ordinary mutation and selection; if it is larger, then the selection
>> of the top N becomes a bit approximate, but should not cause any
>> significant problems.  A second pass could be used to reduce to an
>> exact selection if needed.  Depending on how the combiner works, using
>> a single reduce might be very fast.
>> 
>> Crossover requires that pairs of items be brought together, roughly at
>> random, and might require an extra map-reduce.
> 
> This idea is interesting, especially since I am used to Artificial
> Immune Systems, which don't need the crossover operator. But it's more
> complicated than what I proposed, and as I am new to Mahout and Hadoop
> I think I should start with the first idea.
> 
>> My own preference with these algorithms is to avoid cross-over and
>> focus on meta-mutation where some of the state in the records
>> specifies how the mutation should proceed.  This can have dramatic
>> positive effects in accelerating convergence by providing somewhat of
>> a trade-off against convergence guarantees.  Since everybody gives up
>> the convergence guarantees anyway, this is a nice knob to have.  I
>> describe one approach that can be very effective in an old paper that
>> can be found here: http://arxiv.org/abs/0803.3838 .  I can provide a
>> sample implementation in C, but the paper is probably easier to read.
> 
> I couldn't access the paper. After registering at arXiv.org, it asks me
> for the ownership password of the paper! Here is what it says:
> 
> "We do not allow people other than the authors of an article to claim
> ownership of an article before it has been publicly announced..."
> 
> Is the paper available somewhere else?
> 
> Abdel Hakim Deneche



Re: some question about the nips paper

2008-03-27 Thread Hao Zheng
If the cost is at each level, you are right. Maybe this is the exact
answer. But I am still not clear on how to implement it. Could anyone
provide a more detailed description?

2008/3/28 Niranjan Balasubramanian <[EMAIL PROTECTED]>:
> The cost is f(m,n) at each level in the tree (not at each node).
> Thus, you have f(m,n) times the number of levels, which gives
> f(m,n) log(P).
>
> I might be way off here, but this is my guess.
> ~ Niranjan
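Since the question asks for implementation detail, here is a toy simulation
(mine, not the paper's code) of the combining tree: P partial statistics are
merged pairwise in rounds, each round being one tree level, and the number
of sequential rounds comes out to log2(P).

    import java.util.ArrayList;
    import java.util.List;

    public class TreeReduce {
        public static double reduce(List<Double> partials) {
            int rounds = 0;
            List<Double> level = new ArrayList<Double>(partials);
            while (level.size() > 1) {
                List<Double> next = new ArrayList<Double>();
                for (int i = 0; i < level.size(); i += 2) {
                    double merged = (i + 1 < level.size())
                            ? level.get(i) + level.get(i + 1)  // one f(m,n)-sized combine
                            : level.get(i);                    // odd one out passes through
                    next.add(merged);
                }
                level = next;
                rounds++;                                      // one level of the tree
            }
            System.out.println("P=" + partials.size() + " -> " + rounds + " rounds (~log2 P)");
            return level.get(0);
        }
    }

For P = 8 partials this prints 3 rounds: the per-round cost is one combine of
size f(m,n), so elapsed communication time scales as f(m,n) log(P) even
though total traffic is O(P) f(m,n).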


More information for Mentors

2008-03-27 Thread Isabel Drost

Hello,

I just found the following URL on the GSoC page of OpenMoko:

http://www.gnome.org/~federico/docs/summer-of-code-mentoring-howto/index.html

I think it contains some important information on being a mentor that 
might not be available from the Google FAQ.

Isabel

-- 
You are the only person to ever get this message.
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  

