Re: Mahout on EC2
Hi Robin

> We definitely want to move from the current NB implementation to a more
> Matrix-like implementation. So for the time being you will have to create
> the dataset like I said above.

OK. It would be worth moving the text processing part to a separate package, as it might be useful for other Classifier implementations, and have a number of utility classes to generate a Matrix from a collection of text. Each Classifier could then rely on the Matrix API to get the input. I'd be happy to help with that if I can.

Will any other Classifier implementations be added soon? I remember there were a number of proposals for the GSoC.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
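The utility Julien proposes (generating a Matrix from a collection of text, together with the lexicon mapping terms to column indices) could be sketched roughly as below. This is only an illustration: it uses a plain `int[][]` rather than Mahout's actual Matrix API, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a text-to-matrix utility: rows are documents, columns are terms,
 * cells are raw term counts. A real version would return a Mahout Matrix and
 * persist the lexicon (term -> column index) alongside it.
 */
public class TextMatrixBuilder {

    public static int[][] termDocMatrix(List<String> docs, Map<String, Integer> lexicon) {
        // first pass: assign a column index to every distinct term
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!lexicon.containsKey(token)) {
                    lexicon.put(token, lexicon.size());
                }
            }
        }
        // second pass: fill in the counts per document
        List<int[]> rows = new ArrayList<int[]>();
        for (String doc : docs) {
            int[] row = new int[lexicon.size()];
            for (String token : doc.toLowerCase().split("\\s+")) {
                row[lexicon.get(token)]++;
            }
            rows.add(row);
        }
        return rows.toArray(new int[rows.size()][]);
    }
}
```

A tf-idf or other weighting scheme would then be a separate transform applied to this count matrix, which is what lets every classifier share the same input path.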
Re: Mahout on EC2
Ok, in this case it's the main program that has a Swing GUI; the Map-Reduce jobs have no GUIs at all. But yeah, it's always good to separate the GUI code from the logic.

--- On Sun 21.9.08, Ted Dunning <[EMAIL PROTECTED]> wrote:

> From: Ted Dunning <[EMAIL PROTECTED]>
> Subject: Re: Mahout on EC2
> To: mahout-dev@lucene.apache.org
> Date: Sunday 21 September 2008, 23:08
>
> For the master machine that launches the map-reduce computation, you can
> tunnel an X display from somewhere else to display Swing applications.
>
> You will also need to do the separation for the reason that Sean says...
> you will be running on many machines.
Re: Mahout on EC2
For the master machine that launches the map-reduce computation, you can tunnel an X display from somewhere else to display Swing applications.

You will also need to do the separation for the reason that Sean says... you will be running on many machines.

On Sat, Sep 20, 2008 at 2:34 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> I think you can run a program that uses Swing - unless I am wrong this no
> longer results in an error when running on a 'headless' machine - for
> example a box without X11.
>
> But no, I don't think there is any way to interact with it, especially
> considering you might be running on many machines at once.

--
ted
Re: Mahout on EC2
I think you can run a program that uses Swing - unless I am wrong this no longer results in an error when running on a 'headless' machine - for example a box without X11.

But no, I don't think there is any way to interact with it, especially considering you might be running on many machines at once.

But the same is true of the console - you won't be able to interact with the program that way either.

It does sound good, in any event, to separate out Swing client code from the core logic.

On 9/20/08, deneche abdelhakim <[EMAIL PROTECTED]> wrote:

> I have a question about EC2: can you run Java Swing programs and see the
> GUI? The TSP example has a Swing GUI, or should we make a console version
> of the example?
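The headless behaviour Sean mentions can be detected at runtime: `java.awt.GraphicsEnvironment.isHeadless()` is the standard JDK call. The mode-picking helper below is just an illustration of how an example might fall back to a console version; the class name is invented.

```java
import java.awt.GraphicsEnvironment;

/** Pick a UI mode based on whether the JVM can open windows. */
public class UiModeChooser {

    /** Decision logic takes an explicit flag so it can be exercised in tests. */
    public static String chooseMode(boolean headless) {
        return headless ? "console" : "swing";
    }

    public static void main(String[] args) {
        // On a box without X11 (e.g. a plain EC2 instance) isHeadless()
        // typically returns true, so this would report "console".
        System.out.println("UI mode: " + chooseMode(GraphicsEnvironment.isHeadless()));
    }
}
```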
Re: Mahout on EC2
Sounds cool :)

I'll do the TSP part, but it may take some time because I'm a bit busy (PhD's administrative stuff).

There are many available large TSP benchmarks, and it seems that there is a common file format for them, TSPLIB (http://www.informatik.uni-heidelberg.de/groups/comopt/software/TSPLIB95/DOC.PS). So the TSP example should be modified to load those benchmark files.

I have a question about EC2: can you run Java Swing programs and see the GUI? The TSP example has a Swing GUI, or should we make a console version of the example?

--- On Fri 19.9.08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> We have some simple examples, but I think it would be cool to show how to
> do something a bit more complex, like maybe classify web pages according
> to DMOZ or to cluster on stuff, or maybe put in a large traveling
> salesman problem using the GA stuff Deneche did.
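Loading a TSPLIB benchmark is mostly a matter of scanning to the NODE_COORD_SECTION and reading one `<index> <x> <y>` triple per line. A minimal sketch for the common EUC_2D instances (the class name is invented; a full reader would also honour DIMENSION and the other EDGE_WEIGHT_TYPEs the spec defines):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Minimal TSPLIB (EUC_2D) reader: parses the NODE_COORD_SECTION into (x, y) points. */
public class TspLibReader {

    public static double[][] readCoords(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<double[]> points = new ArrayList<double[]>();
        boolean inCoords = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.equals("NODE_COORD_SECTION")) { inCoords = true; continue; }
            if (line.equals("EOF") || line.length() == 0) { inCoords = false; continue; }
            if (inCoords) {
                // each coordinate line: <index> <x> <y>
                String[] parts = line.split("\\s+");
                points.add(new double[] {
                    Double.parseDouble(parts[1]), Double.parseDouble(parts[2]) });
            }
        }
        return points.toArray(new double[points.size()][]);
    }

    /** Euclidean distance rounded to the nearest integer, as TSPLIB's EUC_2D specifies. */
    public static int euc2d(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return (int) Math.round(Math.sqrt(dx * dx + dy * dy));
    }
}
```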
Re: Mahout on EC2
+1. More inline.

On Sep 19, 2008, at 12:42 PM, Julien Nioche wrote:

> I could describe the whole process on a Wiki page but that would be quite
> long (especially if we need to go through all the details of Nutch),

I think you could just say something like "Go get Nutch and point it at X". The Nutch getting started isn't too hard.

> maybe I could simply generate a textual representation of the matrix and
> put it in a place where people could download it?

If that's feasible. I don't think there would be distribution issues, right? You're just putting up a matrix, not the actual content, but IANAL.

> That could be the starting point of the use case. There would also be a
> lexicon file containing the mapping between the attribute labels and
> their index.
>
> There could be all sorts of possible experiments from there e.g. trying
> to see which attributes are the most discriminant etc...
>
> Does that make sense?

I think this would be great.

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Mahout on EC2
Any document you give to the Mahout NB is assumed to be a list of features with weight = number of times it occurs. So if you happen to give a weight of 5 to, say, the title, just repeat the title five times in the document you generate for Mahout NB. Then you can draw comparable results to the schemes you use.

We definitely want to move from the current NB implementation to a more Matrix-like implementation. So for the time being you will have to create the dataset like I said above.

There is no cross-fold validation being done right now. A dataset creator will have to be written for generating each of the n folds from the train data.

--
Robin Anil
Blog: http://techdigger.wordpress.com

On Fri, Sep 19, 2008 at 10:47 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> It will be interesting to compare that with my field-based representation
> of the documents but that would require being able to use the NB
> implementation with the Matrix API. I had the impression that the Matrix
> objects were not used in any of the classifiers / clusterers, is that the
> case?
>
> BTW is there any way we can do Cross Validation with Mahout?
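The repetition trick Robin describes (NB treats a document as a bag of features whose weight is its occurrence count, so an integer field weight can be emulated by repeating the field's tokens) could look like this. The class and parameter names are invented for illustration, not a Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Emulates an integer field weight for the Mahout NB input format by
 * repeating the field's tokens: a title with weight 5 is emitted 5 times.
 */
public class NbDocBuilder {

    public static String weightedDoc(String title, int titleWeight, String body) {
        List<String> tokens = new ArrayList<String>();
        // the title's tokens are counted titleWeight times...
        for (int i = 0; i < titleWeight; i++) {
            tokens.addAll(Arrays.asList(title.toLowerCase().split("\\s+")));
        }
        // ...while the body's tokens are counted once
        tokens.addAll(Arrays.asList(body.toLowerCase().split("\\s+")));
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```

This only expresses integer weights; fractional tf-idf weights would still need the Matrix-based implementation discussed above.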
Re: Mahout on EC2
Hi Ted,

No, I've been using the FreeGenerator in Nutch so no linkDB has been built. I suppose the text from anchors could make good features though. Were you thinking about using the actual links as features?

J.

2008/9/19 Ted Dunning <[EMAIL PROTECTED]>

> Julien,
>
> That sounds great.
>
> Do you record linking information as well?

--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Mahout on EC2
Hi Robin,

I had a quick look at the NB implementation in the meantime. We could certainly give it a try and compare the results of the weighting schemes. It will be interesting to compare that with my field-based representation of the documents, but that would require being able to use the NB implementation with the Matrix API. I had the impression that the Matrix objects were not used in any of the classifiers / clusterers, is that the case?

BTW is there any way we can do Cross Validation with Mahout?

J.

2008/9/19 Robin Anil <[EMAIL PROTECTED]>

> Hi Julien, It would be great if you can test it on the NB/CNB classifier
> implementation in Mahout. Could you create a dump of the files in the
> directory format (docs of each category reside in their own directory)
> used by the Mahout NB implementation?

--
DigitalPebble Ltd
http://www.digitalpebble.com
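Cross-validation was not supported at the time; Robin's answer above is that a dataset creator would have to generate the n folds from the train data. A minimal sketch of such a splitter (the class name is invented; shuffle once, then deal documents round-robin into folds):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Shuffle the documents once, then deal them round-robin into n folds. */
public class FoldSplitter {

    public static List<List<String>> split(List<String> docs, int n, long seed) {
        List<String> shuffled = new ArrayList<String>(docs);
        Collections.shuffle(shuffled, new Random(seed));   // fixed seed -> reproducible folds
        List<List<String>> folds = new ArrayList<List<String>>();
        for (int i = 0; i < n; i++) {
            folds.add(new ArrayList<String>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % n).add(shuffled.get(i));         // document i goes to fold i % n
        }
        return folds;
    }
}
```

For round k of an n-fold run, fold k is the test set and the remaining n-1 folds are concatenated into the train set.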
Re: Mahout on EC2
Julien,

That sounds great.

Do you record linking information as well?

On Fri, Sep 19, 2008 at 9:42 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> I am currently working on the classification of pages according to DMOZ
> :-) I have been planning to give Mahout a serious try but never managed
> to do it, so that could be a good opportunity to do that.

--
ted
Re: Mahout on EC2
Hi Julien,

It would be great if you can test it on the NB/CNB classifier implementation in Mahout. Could you create a dump of the files in the directory format (docs of each category reside in their own directory) used by the Mahout NB implementation? There is no need of a separate mapping table between lexicon and features, as the implementation takes care of features in text format. Maybe with a good test-train split you can compare it across various weighting techniques.

--
Robin Anil
Blog: http://techdigger.wordpress.com

On Fri, Sep 19, 2008 at 10:12 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> We are using our own API to convert the information for each document
> into a vector with a choice of which weighting scheme to use (tf-idf,
> frequency, etc...). The weighting takes the fields into account, i.e. if
> using tf-idf the weight of a given term takes into account its frequency
> in this specific field (say title).
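Producing the directory layout Robin asks for (one subdirectory per category, one file per document) takes only a few lines. The class and file-naming scheme below are illustrative, not Mahout's own tooling:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

/** Write each document into <root>/<category>/<docId>.txt, one file per document. */
public class CategoryDumper {

    public static File dump(File root, String category, String docId, String text)
            throws IOException {
        File dir = new File(root, category);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("could not create " + dir);
        }
        File out = new File(dir, docId + ".txt");
        Writer w = new FileWriter(out);
        try {
            w.write(text);          // plain text features, no lexicon needed
        } finally {
            w.close();
        }
        return out;
    }
}
```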
Re: Mahout on EC2
Hi,

I am currently working on the classification of pages according to DMOZ :-) I have been planning to give Mahout a serious try but never managed to do it, so that could be a good opportunity to do that.

We have downloaded and parsed the latest DMOZ snapshot. Everything is currently stored in a DB; we have the following fields for each document:
- URL
- category (level 1 from DMOZ)
- content
- title
- description (taken from the HTML meta tags)
- keywords (taken from the HTML meta tags)
- status (unavailable|fetched)

We are using our own API to convert the information for each document into a vector with a choice of which weighting scheme to use (tf-idf, frequency, etc...). The weighting takes the fields into account, i.e. if using tf-idf the weight of a given term takes into account its frequency in this specific field (say title).

I could describe the whole process on a Wiki page but that would be quite long (especially if we need to go through all the details of Nutch); maybe I could simply generate a textual representation of the matrix and put it in a place where people could download it? That could be the starting point of the use case. There would also be a lexicon file containing the mapping between the attribute labels and their index.

There could be all sorts of possible experiments from there, e.g. trying to see which attributes are the most discriminant etc...

Does that make sense?

Julien

2008/9/19 Grant Ingersoll <[EMAIL PROTECTED]>

> Amazon has generously donated some credits, so I plan on putting Mahout
> up and doing some testing. Was wondering if people had suggestions on
> things they would like to see from Mahout. For starters, I'm going to
> put up a public image containing 0.1 when it's ready, but I'd also like
> to wiki up some examples. I.e. go here, get this data, put it in this
> format and then do X. We have some simple examples, but I think it would
> be cool to show how to do something a bit more complex, like maybe
> classify web pages according to DMOZ or to cluster on stuff, or maybe
> put in a large traveling salesman problem using the GA stuff Deneche did.
>
> Thoughts? Anyone else interested in setting up some use cases?
>
> -Grant
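The per-field weighting Julien describes (a term's weight reflecting its frequency in a specific field such as the title) can be sketched roughly as follows. The class and field names are invented for illustration, not his actual API; a tf-idf variant would additionally multiply each count by log(N / df(term)).

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of a field-aware term vector: each field's tokens are counted
 * separately and scaled by a per-field weight before being merged into one
 * vector, so a term in the title can count more than the same term in the body.
 */
public class FieldWeightedVector {

    public static Map<String, Double> vectorize(Map<String, String> fields,
                                                Map<String, Double> fieldWeights) {
        Map<String, Double> vector = new HashMap<String, Double>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double weight = fieldWeights.containsKey(field.getKey())
                    ? fieldWeights.get(field.getKey()) : 1.0;   // unlisted fields weigh 1
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                Double old = vector.get(token);
                vector.put(token, (old == null ? 0.0 : old) + weight);  // tf scaled by field weight
            }
        }
        return vector;
    }
}
```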