Re: some question about the nips paper

2008-03-25 Thread Isabel Drost
On Wednesday 26 March 2008, Hao Zheng wrote:
> For the 4th question, maybe we can still gain some speedups on
> multi-machine clusters. But I suspect that we should also explicitly
> consider the communication cost, which is non-trivial in such a setting.
> What do you think?

I agree with you. I could imagine that it is not as easy as on a multi-core 
machine to get a correct result. There would be variables like the amount of 
information sent over the network, the latency of the network, the bandwidth 
and the like... 
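As a rough first-order model (my own back-of-the-envelope note, not something 
from the paper): sending a message of b bytes costs about

    t(b) = latency + b / bandwidth

summed over all messages an algorithm exchanges per iteration - so both the 
number of messages and the total volume sent matter when estimating 
multi-machine speedups.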

Isabel




Re: Google Summer of Code

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Marko Novakovic wrote:
> I understood the kMeans algorithm from the paper related to this project.
> Where could I find the code?

Just have a look into the SVN. The URLs to access it are on the website.


> Does any algorithm exist that is not yet implemented?

I think you can always have a look into our JIRA to find out which parts are 
currently being worked on.

Isabel




Re: Hadoop summit video capture?

2008-03-25 Thread Isabel Drost
On Wednesday 26 March 2008, Jeff Eastman wrote:
> I personally got a lot of positive feedback and interest in Mahout, so
> expect your inbox to explode in the next couple of days.

Sounds great. I was already happy that we received quite some traffic after we 
announced that we would take part in GSoC.

Isabel



RE: Hadoop summit video capture?

2008-03-25 Thread Jeff Eastman
I don't know if there was a live version, but the entire summit was recorded
on video so it will be available. BTW, it was an overwhelming success and
the speakers are all well worth waiting for. I personally got a lot of
positive feedback and interest in Mahout, so expect your inbox to explode in
the next couple of days.

Jeff

> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 25, 2008 8:03 PM
> To: [EMAIL PROTECTED]
> Subject: Hadoop summit video capture?
> 
> Hi,
> 
> Wasn't there going to be a live stream from the Hadoop summit?  I couldn't
> find any references on the event site/page, and searches on veoh, youtube
> and google video yielded nothing.
> 
> Is an archived version of the video (going to be) available?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 





Re: some question about the nips paper

2008-03-25 Thread Hao Zheng
Isabel, thanks for your answer.
For the 4th question, maybe we can still gain some speedups on
multi-machine clusters. But I suspect that we should also explicitly
consider the communication cost, which is non-trivial in such a setting.
What do you think?

On 3/26/08, Isabel Drost <[EMAIL PROTECTED]> wrote:
> On Tuesday 25 March 2008, Hao Zheng wrote:
>  > 1. Sect. 4.1 Algorithm Time Complexity Analysis.
>  > The paper assumes m >> n, that is, the number of training instances is
>  > much larger than the number of features. Its datasets do have very few
>  > features, but this may not be true for many tasks, e.g. text
>  > classification, where feature dimensionality will reach 10^4-10^5.
>  > Will the analysis still hold then?
>
>
> What I could directly read from the paper in the very same section: the
>  analysis will not hold in this case for those algorithms that require
>  matrix inversions or eigen decompositions, as long as these operations
>  are not executed in parallel. The authors did not implement parallel
>  versions of these operations; the reason they state is that in their
>  datasets m >> n.
>
>  The authors state themselves that there is extensive research on
>  parallelising eigen decomposition and matrix inversion as well - so if we
>  assume that we do have a matrix package that can do these operations in a
>  distributed way, IMHO the analysis in the paper should still hold even for
>  algorithms that require these steps.
>
>
>
>  > 2. Sect. 4.1, too.
>  > "reduce phase can minimize communication by combining data as it's
>  > passed back; this accounts for the logP factor" - could you help me
>  > figure out how logP is calculated?
>
>
> Anyone else who can help out here?
>
>
>
>  > 3. Sect 5.4 Results and Discussion
>  > "SVM gets about 13.6% speed up on average over 16 cores" - is it 13.6%
>  > or a factor of 13.6? From figure 2, it seems it should be a factor of 13.6.
>
>
> The axes on the graphs do not have clear titles, but I would agree that it
>  should be a factor of 13.6 as well.
>
>
>
>  > 4. Sect 5.4, too.
>  > "Finally, the above are runs on multiprocessor machines." No matter
>  > whether multiprocessor or multicore, it runs on a single machine which
>  > has shared memory.
>
>
> The main motivation for the paper was the rise of multi-core machines, which
>  call for parallel algorithms even though one might not have a cluster available.
>
>
>
>  > But actually, M/R is for multi-machine setups, which will involve much
>  > more cost on inter-machine communication. So the results of the paper
>  > may be questionable?
>
>
> I think you should not expect to get the exact same speedups on multi-machine
>  clusters. Still I think one can expect faster computation for large datasets
>  even in this setting. What do others think?
>
>
>
>  Isabel

Re: some question about the nips paper

2008-03-25 Thread Hao Zheng
It's log(P). I just don't know how log(P) is obtained.

On 3/26/08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
>  On Mar 25, 2008, at 4:11 PM, Isabel Drost wrote:
>  >
>  >
>  >> 2. Sect. 4.1, too.
>  >> "reduce phase can minimize communication by combining data as it's
>  >> passed back; this accounts for the logP factor" - could you help me
>  >> figure out how logP is calculated?
>  >
>  > Anyone else who can help out here?
>
>
> Isn't this just log(P) where P is the number of cores?  The version of
>  the paper I have says log(P) not logP, so maybe there is a typo?
>
>   From earlier in 4.1:
>  "We assume that the dimension of the inputs is n (i.e., x ∈ R^n), that we
>  have m training examples, and that there are P cores"
>
>
>
>  -Grant
>
>
>


Re: some question about the nips paper

2008-03-25 Thread Grant Ingersoll


On Mar 25, 2008, at 4:11 PM, Isabel Drost wrote:

> > 2. Sect. 4.1, too.
> > "reduce phase can minimize communication by combining data as it's
> > passed back; this accounts for the logP factor" - could you help me
> > figure out how logP is calculated?
>
> Anyone else who can help out here?


Isn't this just log(P) where P is the number of cores?  The version of  
the paper I have says log(P) not logP, so maybe there is a typo?


From earlier in 4.1:
"We assume that the dimension of the inputs is n (i.e., x ∈ R^n), that we
have m training examples, and that there are P cores"
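One way to see where the log(P) comes from: if the partial results from the P
cores are combined pairwise in a tree rather than collected one by one, the
reduce finishes in ceil(log2(P)) combining rounds. A minimal Java sketch of
that idea (my own illustration, not code from the paper):

    // Pairwise tree combine: each round halves the number of live values,
    // so P partial results are merged in ceil(log2(P)) rounds.
    public final class TreeReduce {

        public static double treeSum(double[] partials) {
            double[] vals = partials.clone();
            int active = vals.length;
            int rounds = 0;
            while (active > 1) {
                int survivors = (active + 1) / 2;
                for (int i = 0; i < active / 2; i++) {
                    vals[i] += vals[survivors + i]; // merge one pair
                }
                active = survivors;
                rounds++;
            }
            System.out.println("rounds = " + rounds); // ceil(log2(P))
            return vals[0];
        }

        public static void main(String[] args) {
            // P = 8 partial sums -> 3 rounds (log2(8) = 3), sum = 36
            System.out.println(treeSum(new double[] {1, 2, 3, 4, 5, 6, 7, 8}));
        }
    }

Collecting the P partial results sequentially would cost O(P); the tree
brings it down to O(log P), which would explain the factor charged to the
reduce phase.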


-Grant




Re: Re : GSoC Evolutionary Algorithm Idea

2008-03-25 Thread Grant Ingersoll




> My work was to add new algorithms to the framework...
>
> >> I plan to use an existing open source framework for the genetic
> >> algorithm, the framework should take care of all the GA stuff,
> >
> > Which framework would that be?
>
> Many possibilities exist; I think I will start with WatchMaker, as it
> seems simple and well documented... but I still need to try it (one never
> knows what could happen!)

I see this is ASL, so that seems workable, but I'd be interested in
hearing more about how it integrates w/ what we are doing?


-Grant


Re: Google Summer of Code

2008-03-25 Thread Marko Novakovic
I understood the kMeans algorithm from the paper related to
this project.
Where could I find the code?
I could implement any algorithm, it needn't be
kMeans. Which algorithm would be most interesting for you
to have implemented? Does any algorithm exist that is not
yet implemented?

The mentor from my faculty told me that clustering
was the most interesting part for him. I will consult with
him tomorrow to understand how clustering would be deployed
in his search engine.

Does Apache have any conventions for writing student
abstracts and descriptions in the application? And what are
the criteria for student selection?

Best Regards

--- Isabel Drost <[EMAIL PROTECTED]>
wrote:

> On Tuesday 25 March 2008, Marko Novakovic wrote:
> > I attached a beta version of the presentation.
> > I must consult with the mentor from my college to examine what
> > exactly the role of clustering is in this system.
> 
> Hmm, one of the slides talks about using the clustering algorithm to
> identify new topics. I guess I still do not get the full picture.
> 
> Did you happen to have a chance to look at the k-Means code in the
> repository yet?
> 
> Isabel


  



Re : GSoC Evolutionary Algorithm Idea

2008-03-25 Thread deneche abdelhakim
>> I'm a PhD student in AI and adaptive systems; I have been working on
>> evolutionary algorithms for the last 4 years. 
>
>Which problems have you applied your algorithms to?

I used an Artificial Immune System (AIS) for the learning phase of a pattern 
recognition system. I also used it for a Multiple Sequence Alignment problem 
(bioinformatics).

>> I implemented my own Artificial Immune System in Matlab and as a Java
>> extension to Yale,
>
>Did you submit the extension back to Yale? Can one try it out after 
>downloading the standard distribution or is it available for download 
>somewhere else?

I built a feature selection operator with an AIS at its core; it was for Yale 
3.0... The operator was working but needed more tuning, and for lack of time I 
didn't publish it anywhere. The source code is still available.

>> I also worked with a C++ framework for multi-objective optimization.
>
>Which framework was that?

ParadisEO

My work was to add new algorithms to the framework...

>> I plan to use an existing open source framework for the genetic
>> algorithm, the framework should take care of all the GA stuff,
>
>Which framework would that be?

Many possibilities exist; I think I will start with WatchMaker, as it seems 
simple and well documented... but I still need to try it (one never knows what 
could happen!)

>> What do you think about it ?
>
> How about writing a proposal for your project idea?

I am writing it; I just wanted to know if the idea would help the Mahout 
project... and I have already installed and tried Hadoop (not that easy on 
Windows!)


Abdel Hakim Deneche
Mentouri University of Constantine




  

Re: some question about the nips paper

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Hao Zheng wrote:
> 1. Sect. 4.1 Algorithm Time Complexity Analysis.
> The paper assumes m >> n, that is, the number of training instances is
> much larger than the number of features. Its datasets do have very few
> features, but this may not be true for many tasks, e.g. text
> classification, where feature dimensionality will reach 10^4-10^5.
> Will the analysis still hold then?

What I could directly read from the paper in the very same section: the 
analysis will not hold in this case for those algorithms that require matrix 
inversions or eigen decompositions, as long as these operations are not 
executed in parallel. The authors did not implement parallel versions of 
these operations; the reason they state is that in their datasets m >> n.

The authors state themselves that there is extensive research on parallelising 
eigen decomposition and matrix inversion as well - so if we assume that we do 
have a matrix package that can do these operations in a distributed way, IMHO 
the analysis in the paper should still hold even for algorithms that require 
these steps.
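As a rule of thumb (my own note, not a claim from the paper): a dense matrix 
inversion or eigen decomposition costs on the order of n^3 operations, so once 
n reaches the 10^4-10^5 range typical for text classification, that step 
dominates everything else unless it is parallelised as well.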


> 2. Sect. 4.1, too.
> "reduce phase can minimize communication by combining data as it's
> passed back; this accounts for the logP factor" - could you help me
> figure out how logP is calculated?

Anyone else who can help out here?


> 3. Sect 5.4 Results and Discussion
> "SVM gets about 13.6% speed up on average over 16 cores" - is it 13.6%
> or a factor of 13.6? From figure 2, it seems it should be a factor of 13.6.

The axes on the graphs do not have clear titles, but I would agree that it 
should be a factor of 13.6 as well.
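If it is a factor, the number is also plausible: a 13.6x speedup on 16 cores 
corresponds to a parallel efficiency of 13.6 / 16 ≈ 85%, whereas a 13.6% 
speedup would be a factor of only 1.136 and would barely be visible in the 
figure. (Just my arithmetic, not a number from the paper.)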


> 4. Sect 5.4, too.
> "Finally, the above are runs on multiprocessor machines." No matter
> whether multiprocessor or multicore, it runs on a single machine which
> has shared memory.

The main motivation for the paper was the rise of multi-core machines, which 
call for parallel algorithms even though one might not have a cluster available.


> But actually, M/R is for multi-machine setups, which will involve much more
> cost on inter-machine communication. So the results of the paper may be
> questionable?

I think you should not expect to get the exact same speedups on multi-machine 
clusters. Still, I think one can expect faster computation for large datasets 
even in this setting. What do others think?



Isabel




Re: Google Summer of Code

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Marko Novakovic wrote:
> I attached a beta version of the presentation.
> I must consult with the mentor from my college to examine what exactly
> the role of clustering is in this system.

Hmm, one of the slides talks about using the clustering algorithm to identify 
new topics. I guess I still do not get the full picture.

Did you happen to have a chance to look at the k-Means code in the repository 
yet?

Isabel



Re: Regarding Google Summer of Code Lucene Mahout Project

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, Robin Anil wrote:
> You may be interested in reading the paper which talks more about it Here.

The paper looks interesting: the modifications to naive Bayes presented in the 
paper seem to lead to a classifier whose accuracy is comparable to an SVM for 
text classification while being computationally much cheaper.


> the feature selection module is overloaded for each of them.

Sounds reasonable to me. I would guess the feature selection module is 
independent of the classifier?


> > So what you are hoping for is a system that can crawl and answer queries
> > at the same time, integrating more and more information as it becomes
> > available, right?
>
> No, because the queries aren't fixed. If you disregard the TREC queries, say
> a person is sitting there asking for an opinion about a target. He may type
> "Nokia 6600" or "My left hand". Now, I would have to go through the DB and
> find everything which talks about Nokia and the other, and do post
> processing if it's not yet processed.

I see - you want to do the sentiment classification step at query time and 
therefore you need it to be efficient. This implies that you need to store 
each text unit (say each blog posting) either as clear text or as some 
general feature vector (depending on whether your features are query 
dependent or not) and do the classification at query time.


> Another reason is that ranking the results becomes a problem. How do I say
> which among the 1000 results gives the better opinion? The doc that talks
> more about the target, or the one which has more opinions about the target?
> Neither; we need to rank them based on the output of the classification
> algorithms.

Seems like you need an algorithm that outputs comparable scores for each 
document and is neither under- nor overconfident. I vaguely remember that 
vanilla NB has some problems in this respect.

Isabel




Re: GSoC Evolutionary Algorithm Idea

2008-03-25 Thread Isabel Drost
On Tuesday 25 March 2008, deneche abdelhakim wrote:
> Im a PhD student on AI and adaptive systems, I have been working on
> evolutionary algorithms for the last 4 years. 

Which problems have you applied your algorithms to?


> I implemented my own Aritifial Immune System with Matlab and as a Java
> extension to Yale,

Did you submit the extension back to Yale? Can one try it out after 
downloading the standard distribution or is it available for download 
somewhere else?


> I also worked with a C++ framework for multi-objective optimization.

Which framework was that?


> its a Genetic Algorithm for binary classification. The fitness function
> (that iterates over all the training dataset) can benefit from the
> Map-Reduce model of Hadoop.

Sounds good to me.


> I plan to use some an existing open source framework for the genetic
> algorithm, the framework should take care of all tha GA stuff,

Which framework would that be?


> What do you think about it ?

How about writing a proposal for your project idea?

Isabel

-- 
BOFH excuse #294:PCMCIA slave driver
  |\  _,,,---,,_   Web:   
  /,`.-'`'-.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  


signature.asc
Description: This is a digitally signed message part.


Re: Regarding Google Summer of Code Lucene Mahout Project

2008-03-25 Thread Matthew Riley
Maybe you can get some feedback from Jason Rennie (one of the authors of the
paper you linked) on your implementation - I seem to remember seeing some
comments from him on this mailing list about a week ago.

Matt

On Tue, Mar 25, 2008 at 8:55 AM, Robin Anil <[EMAIL PROTECTED]> wrote:

> Hi Isabel,
>
> On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> >
> > On Monday 24 March 2008, Robin Anil wrote:
> >
> > > The Complement-Naive-Bayes-Classifier (coded up for this project) then
> > > runs on the retrieved documents to do post processing.
> >
> > The ideas presented in the slides look pretty interesting to me. Could
> > you please provide some pointers to information on the Complement Naive
> > Bayes Classifier? What were the reasons you chose this classifier?
> >
> Before going into Complement Naive Bayes, there are certain things to note
> about text classification. Given a good amount of data, as is the case with
> textual data, Naive Bayes surprisingly performs better than most other
> supervised learners. The reason, as I see it, is that Naive Bayes class
> margins are so bluntly defined that the chance of overfitting is rare. This
> is also the reason why, given the proper features, Naive Bayes doesn't
> measure up to other methods. So you may say Naive Bayes is a good
> classifier for textual data. Now Complement Naive Bayes does the reverse:
> instead of calculating which class fits the document best, it calculates
> which complement class least fits the document. It also removes the bias
> problem due to the prior probability term in the NB equation. You may be
> interested in reading the paper which talks more about it Here. My
> BaseClassifier implementation reproduces the work there. But for the
> different classifiers (SpamDetection, Subjectivity, Polarity), all of them
> inherit the base classifier while the feature selection module is
> overloaded for each of them.
>
> As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
> are binary classifiers, where CNB is exactly the same as NB (just a -ve
> sign difference). But other things like normalization made a lot of
> difference in removing false positives and biased classes.
>
> >
> > > If it's possible to have the classifier run along with Lucene and
> > > spit out sentences and add them to a field in real-time, it would
> > > essentially enable this system to be online and allow for real-time
> > > queries.
> >
> > So what you are hoping for is a system that can crawl and answer queries
> > at the same time, integrating more and more information as it becomes
> > available, right?
> >
> Yes and no.
> Yes, because the system needs to go through the index, get documents and
> process the sentences and get all opinions, not necessarily the target.
> No, because the queries aren't fixed. If you disregard the TREC queries,
> say a person is sitting there asking for an opinion about a target. He may
> type "Nokia 6600" or "My left hand". Now, I would have to go through the
> DB and find everything which talks about Nokia and the other, and do post
> processing if it's not yet processed. Another reason is that ranking the
> results becomes a problem. How do I say which among the 1000 results gives
> the better opinion? The doc that talks more about the target, or the one
> which has more opinions about the target? Neither; we need to rank them
> based on the output of the classification algorithms.
>
> This is where I see the use of Mahout. Say we have the core Lucene
> architecture modded with Mahout. If I can give the results of a Mahout
> classifier to Lucene for the ranking function, based on subjectivity,
> polarity etc., not only will it become easy to implement good IR systems
> for research, it can give rise to some real funky use cases for complex
> production IR systems.
> >
> > > I would gladly answer any queries except results
> >
> > Hmm, so for this competition there is no sample dataset available to
> > test the performance of the algorithms against? Sounds like there is no
> > way to determine which of two competing solutions is better except
> > making two submissions...
> >
> Well, throughout the year competing researchers give one or two queries
> and hand-made results, which are compiled and tested against each other.
>
> > Isabel
>
>
>
> --
> Robin Anil
> 4th Year Dual Degree Student
> Department of Computer Science & Engineering
> IIT Kharagpur
>

Re: Regarding Google Summer of Code Lucene Mahout Project

2008-03-25 Thread Robin Anil
Hi Isabel,

On Tue, Mar 25, 2008 at 2:52 AM, Isabel Drost <[EMAIL PROTECTED]>
wrote:
>
> On Monday 24 March 2008, Robin Anil wrote:
>
> > The Complement-Naive-Bayes-Classifier (coded up for this project) then
> > runs on the retrieved documents to do post processing.
>
> The ideas presented in the slides look pretty interesting to me. Could you
> please provide some pointers to information on the Complement Naive Bayes
> Classifier? What were the reasons you chose this classifier?
>
Before going into Complement Naive Bayes, there are certain things to note
about text classification. Given a good amount of data, as is the case with
textual data, Naive Bayes surprisingly performs better than most other
supervised learners. The reason, as I see it, is that Naive Bayes class
margins are so bluntly defined that the chance of overfitting is rare. This is
also the reason why, given the proper features, Naive Bayes doesn't measure up
to other methods. So you may say Naive Bayes is a good classifier for textual
data. Now Complement Naive Bayes does the reverse: instead of calculating
which class fits the document best, it calculates which complement class least
fits the document. It also removes the bias problem due to the prior
probability term in the NB equation. You may be interested in reading the
paper which talks more about it Here. My BaseClassifier implementation
reproduces the work there. But for the different classifiers (SpamDetection,
Subjectivity, Polarity), all of them inherit the base classifier while the
feature selection module is overloaded for each of them.

As you can see, all of them except Polarity (classes are Pos, Neg, Neutral)
are binary classifiers, where CNB is exactly the same as NB (just a -ve sign
difference). But other things like normalization made a lot of difference in
removing false positives and biased classes.
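To make the "least-fitting complement" idea concrete, here is a minimal sketch
of CNB scoring; the ComplementNB class, its parameter layout and the
precomputed complement log-probabilities are my own illustration (loosely
after Rennie et al.), not the actual BaseClassifier code:

    import java.util.Map;

    public final class ComplementNB {

        /**
         * @param termFreqs      term -> frequency in the document to classify
         * @param complementLogP one map per class c: term -> smoothed
         *                       log P(term | every class except c)
         * @return index of the predicted class
         */
        public static int classify(Map<String, Integer> termFreqs,
                                   Map<String, Double>[] complementLogP) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < complementLogP.length; c++) {
                double complementFit = 0.0;
                for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                    Double logP = complementLogP[c].get(e.getKey());
                    if (logP != null) {
                        complementFit += e.getValue() * logP;
                    }
                }
                // A high complement fit means the document looks like
                // "everything but c", so we negate it: the class whose
                // complement fits worst wins. There is no prior term here,
                // which is the bias removal mentioned above.
                if (-complementFit > bestScore) {
                    bestScore = -complementFit;
                    best = c;
                }
            }
            return best;
        }
    }

For two classes, the complement of one class is just the other class, which is
why the binary classifiers behave like plain NB up to a sign, as described
above.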

>
>
> > If it's possible to have the classifier run along with Lucene and
> > spit out sentences and add them to a field in real-time, it would
> > essentially enable this system to be online and allow for real-time
> > queries.
>
> So what you are hoping for is a system that can crawl and answer queries
> at the same time, integrating more and more information as it becomes
> available, right?
>
Yes and no.
Yes, because the system needs to go through the index, get documents and
process the sentences and get all opinions, not necessarily the target.
No, because the queries aren't fixed. If you disregard the TREC queries, say a
person is sitting there asking for an opinion about a target. He may type
"Nokia 6600" or "My left hand". Now, I would have to go through the DB and
find everything which talks about Nokia and the other, and do post processing
if it's not yet processed. Another reason is that ranking the results becomes
a problem. How do I say which among the 1000 results gives the better opinion?
The doc that talks more about the target, or the one which has more opinions
about the target? Neither; we need to rank them based on the output of the
classification algorithms.

This is where I see the use of Mahout. Say we have the core Lucene
architecture modded with Mahout. If I can give the results of a Mahout
classifier to Lucene for the ranking function, based on subjectivity, polarity
etc., not only will it become easy to implement good IR systems for research,
it can give rise to some real funky use cases for complex production IR
systems.
>
> > I would gladly answer any queries except results
>
> Hmm, so for this competition there is no sample dataset available to test
> the performance of the algorithms against? Sounds like there is no way to
> determine which of two competing solutions is better except making two
> submissions...
>
Well, throughout the year competing researchers give one or two queries and
hand-made results, which are compiled and tested against each other.

> Isabel
>



-- 
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur




GSoC Evolutionary Algorithm Idea

2008-03-25 Thread deneche abdelhakim
Hi

I'm a PhD student in AI and adaptive systems; I have been working on 
evolutionary algorithms for the last 4 years. I implemented my own Artificial 
Immune System in Matlab and as a Java extension to Yale, and I also worked 
with a C++ framework for multi-objective optimization.

My project is to build a classification genetic algorithm in Mahout.

I've already done some research and found the following paper:

Discovering Comprehensible Classification Rules with a Genetic Algorithm

It's a genetic algorithm for binary classification. The fitness function 
(which iterates over the whole training dataset) can benefit from the 
Map-Reduce model of Hadoop.

I plan to use an existing open source framework for the genetic algorithm; 
the framework should take care of all the GA stuff, and I will be left with:
. the representation of individuals, as described in the article
. the fitness function that uses Hadoop (see the sketch below)
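A minimal sketch of how that fitness evaluation decomposes into map and reduce
steps; the Example and Rule types and fitness-as-accuracy are my own
placeholder assumptions, not code from the article or from Hadoop's API:

    public final class FitnessSketch {

        /** One labelled training instance. */
        static final class Example {
            final double[] features;
            final int label;
            Example(double[] features, int label) {
                this.features = features;
                this.label = label;
            }
        }

        /** A candidate individual: a classification rule to evaluate. */
        interface Rule {
            int predict(double[] features);
        }

        // "Map" step: each worker scores one shard of the training data
        // and emits a partial (correct, total) count for the rule.
        static long[] mapShard(Rule rule, Example[] shard) {
            long correct = 0;
            for (Example e : shard) {
                if (rule.predict(e.features) == e.label) {
                    correct++;
                }
            }
            return new long[] {correct, shard.length};
        }

        // "Reduce" step: sum the partial counts into a single fitness,
        // here plain classification accuracy.
        static double reduceFitness(long[][] partials) {
            long correct = 0, total = 0;
            for (long[] p : partials) {
                correct += p[0];
                total += p[1];
            }
            return total == 0 ? 0.0 : (double) correct / total;
        }
    }

Since the shards are scored independently and the counts just add up, the
expensive part parallelises over however many mappers the training set is
split into.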

This algorithm can also be adapted to work with more than two classes... but 
that's another story.

What do you think about it?


Abdel Hakim Deneche
Mentouri University of Constantine, Algeria




  

some question about the nips paper

2008-03-25 Thread Hao Zheng
Hi all devs,

I have read through the dev mailing list and have a rough idea of the
progress of Mahout. I have read the Google paper and the NIPS paper.
As for the NIPS paper "Map-Reduce for ML on Multicore", I have some
questions.

1. Sect. 4.1 Algorithm Time Complexity Analysis.
The paper assumes m >> n, that is, the number of training instances is
much larger than the number of features. Its datasets do have very few
features, but this may not be true for many tasks, e.g. text
classification, where feature dimensionality will reach 10^4-10^5.
Will the analysis still hold then?

2. Sect. 4.1, too.
"reduce phase can minimize communication by combining data as it's
passed back; this accounts for the logP factor" - could you help me
figure out how logP is calculated?

3. Sect 5.4 Results and Discussion
"SVM gets about 13.6% speed up on average over 16 cores" - is it 13.6%
or a factor of 13.6? From figure 2, it seems it should be a factor of 13.6.

4. Sect 5.4, too.
"Finally, the above are runs on multiprocessor machines." No matter
whether multiprocessor or multicore, it runs on a single machine which
has shared memory. But actually, M/R is for multi-machine setups, which
will involve much more cost on inter-machine communication. So the
results of the paper may be questionable?

Maybe some of the questions are a complete misinterpretation. Please
help me to get a full understanding of the paper. Thanks.