Re: Mahout on EC2
Hi Robin

> We definitely want to move from the current NB implementation to a more
> Matrix-like implementation. So for the time being you will have to create
> the dataset like I said above.

OK. It would be worth moving the text processing part to a separate package, as it might be useful for other Classifier implementations, and have a number of utility classes to generate a Matrix from a collection of text. Each Classifier could then rely on the Matrix API to get the input. I'd be happy to help with that if I can.

Will any other Classifier implementations be added soon? I remember there were a number of proposals for the GSoC.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
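The utility Julien proposes (generating a Matrix from a collection of text, together with the lexicon mapping terms to column indices) could be sketched roughly as below. This is only an illustration: it uses a plain `int[][]` rather than Mahout's actual Matrix API, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of a text-to-matrix utility: rows are documents, columns are terms,
 * cells are raw term counts. A real version would return a Mahout Matrix and
 * persist the lexicon (term -> column index) alongside it.
 */
public class TextMatrixBuilder {

    public static int[][] termDocMatrix(List<String> docs, Map<String, Integer> lexicon) {
        // first pass: assign a column index to every distinct term
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!lexicon.containsKey(token)) {
                    lexicon.put(token, lexicon.size());
                }
            }
        }
        // second pass: fill in the counts per document
        List<int[]> rows = new ArrayList<int[]>();
        for (String doc : docs) {
            int[] row = new int[lexicon.size()];
            for (String token : doc.toLowerCase().split("\\s+")) {
                row[lexicon.get(token)]++;
            }
            rows.add(row);
        }
        return rows.toArray(new int[rows.size()][]);
    }
}
```

A tf-idf or other weighting scheme would then be a separate transform applied to this count matrix, which is what lets every classifier share the same input path.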
Re: Mahout on EC2
Ok, in this case it's the main program that has a Swing GUI; the Map-Reduce jobs have no GUIs at all. But yeah, it's always good to separate the GUI code from the logic.

--- On Sun 21.9.08, Ted Dunning <[EMAIL PROTECTED]> wrote:

> From: Ted Dunning <[EMAIL PROTECTED]>
> Subject: Re: Mahout on EC2
> To: mahout-dev@lucene.apache.org
> Date: Sunday 21 September 2008, 23:08
>
> For the master machine that launches the map-reduce computation, you can
> tunnel an X display from somewhere else to display Swing applications.
>
> You will also need to do the separation for the reason that Sean says...
> you will be running on many machines.
Re: Mahout on EC2
For the master machine that launches the map-reduce computation, you can tunnel an X display from somewhere else to display Swing applications.

You will also need to do the separation for the reason that Sean says... you will be running on many machines.

On Sat, Sep 20, 2008 at 2:34 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> I think you can run a program that uses Swing - unless I am wrong this no
> longer results in an error when running on a 'headless' machine - for
> example a box without X11.
>
> But no, I don't think there is any way to interact with it, especially
> considering you might be running on many machines at once.

--
ted
Re: Mahout on EC2
I think you can run a program that uses Swing - unless I am wrong this no longer results in an error when running on a 'headless' machine - for example a box without X11.

But no, I don't think there is any way to interact with it, especially considering you might be running on many machines at once.

But the same is true of the console - you won't be able to interact with the program that way either.

It does sound good, in any event, to separate out Swing client code from the core logic.

On 9/20/08, deneche abdelhakim <[EMAIL PROTECTED]> wrote:

> I have a question about EC2: can you run Java Swing programs and see the
> GUI? The TSP example has a Swing GUI, or should we make a console version
> of the example?
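The headless behaviour Sean mentions can be detected at runtime: `java.awt.GraphicsEnvironment.isHeadless()` is the standard JDK call. The mode-picking helper below is just an illustration of how an example might fall back to a console version; the class name is invented.

```java
import java.awt.GraphicsEnvironment;

/** Pick a UI mode based on whether the JVM can open windows. */
public class UiModeChooser {

    /** Decision logic takes an explicit flag so it can be exercised in tests. */
    public static String chooseMode(boolean headless) {
        return headless ? "console" : "swing";
    }

    public static void main(String[] args) {
        // On a box without X11 (e.g. a plain EC2 instance) isHeadless()
        // typically returns true, so this would report "console".
        System.out.println("UI mode: " + chooseMode(GraphicsEnvironment.isHeadless()));
    }
}
```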
Re: Mahout on EC2
Sounds cool :)

I'll do the TSP part, but it may take some time because I'm a bit busy (PhD's administrative stuff).

There are many available large TSP benchmarks, and it seems that there is a common file format for them, TSPLIB (http://www.informatik.uni-heidelberg.de/groups/comopt/software/TSPLIB95/DOC.PS). So the TSP example should be modified to load those benchmark files.

I have a question about EC2: can you run Java Swing programs and see the GUI? The TSP example has a Swing GUI, or should we make a console version of the example?

--- On Fri 19.9.08, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> We have some simple examples, but I think it would be cool to show how to
> do something a bit more complex, like maybe classify web pages according
> to DMOZ or to cluster on stuff, or maybe put in a large traveling
> salesman problem using the GA stuff Deneche did.
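Loading a TSPLIB benchmark is mostly a matter of scanning to the NODE_COORD_SECTION and reading one `<index> <x> <y>` triple per line. A minimal sketch for the common EUC_2D instances (the class name is invented; a full reader would also honour DIMENSION and the other EDGE_WEIGHT_TYPEs the spec defines):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Minimal TSPLIB (EUC_2D) reader: parses the NODE_COORD_SECTION into (x, y) points. */
public class TspLibReader {

    public static double[][] readCoords(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<double[]> points = new ArrayList<double[]>();
        boolean inCoords = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.equals("NODE_COORD_SECTION")) { inCoords = true; continue; }
            if (line.equals("EOF") || line.length() == 0) { inCoords = false; continue; }
            if (inCoords) {
                // each coordinate line: <index> <x> <y>
                String[] parts = line.split("\\s+");
                points.add(new double[] {
                    Double.parseDouble(parts[1]), Double.parseDouble(parts[2]) });
            }
        }
        return points.toArray(new double[points.size()][]);
    }

    /** Euclidean distance rounded to the nearest integer, as TSPLIB's EUC_2D specifies. */
    public static int euc2d(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return (int) Math.round(Math.sqrt(dx * dx + dy * dy));
    }
}
```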
Re: Mahout on EC2
+1. More inline.

On Sep 19, 2008, at 12:42 PM, Julien Nioche wrote:

> I could describe the whole process on a Wiki page but that would be quite
> long (especially if we need to go through all the details of Nutch),

I think you could just say something like "Go get Nutch and point it at X". The Nutch getting started isn't too hard.

> maybe I could simply generate a textual representation of the matrix and
> put it in a place where people could download it?

If that's feasible. I don't think there would be distribution issues, right? You're just putting up a matrix, not the actual content, but IANAL.

> That could be the starting point of the use case. There would also be a
> lexicon file containing the mapping between the attribute labels and
> their index.
>
> There could be all sorts of possible experiments from there e.g. trying
> to see which attributes are the most discriminant etc...
>
> Does that make sense?

I think this would be great.

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Mahout on EC2
Any document you give to the Mahout NB is assumed to be a list of features with weight = number of times it occurs. So if you happen to give a weight of 5 to, say, the title, just repeat the title five times in the document you generate for Mahout NB. Then you can draw comparable results to the schemes you use.

We definitely want to move from the current NB implementation to a more Matrix-like implementation. So for the time being you will have to create the dataset like I said above.

There is no cross-fold validation being done right now. A dataset creator will have to be written for generating each of the n folds from the train data.

--
Robin Anil
Blog: http://techdigger.wordpress.com

On Fri, Sep 19, 2008 at 10:47 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> It will be interesting to compare that with my field-based representation
> of the documents but that would require being able to use the NB
> implementation with the Matrix API. I had the impression that the Matrix
> objects were not used in any of the classifiers / clusterers, is that the
> case?
>
> BTW is there any way we can do Cross Validation with Mahout?
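The repetition trick Robin describes (NB treats a document as a bag of features whose weight is its occurrence count, so an integer field weight can be emulated by repeating the field's tokens) could look like this. The class and parameter names are invented for illustration, not a Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Emulates an integer field weight for the Mahout NB input format by
 * repeating the field's tokens: a title with weight 5 is emitted 5 times.
 */
public class NbDocBuilder {

    public static String weightedDoc(String title, int titleWeight, String body) {
        List<String> tokens = new ArrayList<String>();
        // the title's tokens are counted titleWeight times...
        for (int i = 0; i < titleWeight; i++) {
            tokens.addAll(Arrays.asList(title.toLowerCase().split("\\s+")));
        }
        // ...while the body's tokens are counted once
        tokens.addAll(Arrays.asList(body.toLowerCase().split("\\s+")));
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t);
        }
        return sb.toString();
    }
}
```

This only expresses integer weights; fractional tf-idf weights would still need the Matrix-based implementation discussed above.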
Re: Mahout on EC2
Hi Ted,

No, I've been using the FreeGenerator in Nutch so no linkDB has been built. I suppose the text from anchors could make good features though. Were you thinking about using the actual links as features?

J.

2008/9/19 Ted Dunning <[EMAIL PROTECTED]>

> Julien,
>
> That sounds great.
>
> Do you record linking information as well?

--
DigitalPebble Ltd
http://www.digitalpebble.com
Re: Mahout on EC2
Hi Robin,

I had a quick look at the NB implementation in the meantime. We could certainly give it a try and compare the results of the weighting schemes. It will be interesting to compare that with my field-based representation of the documents, but that would require being able to use the NB implementation with the Matrix API. I had the impression that the Matrix objects were not used in any of the classifiers / clusterers, is that the case?

BTW is there any way we can do Cross Validation with Mahout?

J.

2008/9/19 Robin Anil <[EMAIL PROTECTED]>

> Hi Julien, It would be great if you can test it on the NB/CNB classifier
> implementation in Mahout. Could you create a dump of the files in the
> directory format (docs of each category reside in their own directory)
> used by the Mahout NB implementation?

--
DigitalPebble Ltd
http://www.digitalpebble.com
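Cross-validation was not supported at the time; Robin's answer above is that a dataset creator would have to generate the n folds from the train data. A minimal sketch of such a splitter (the class name is invented; shuffle once, then deal documents round-robin into folds):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Shuffle the documents once, then deal them round-robin into n folds. */
public class FoldSplitter {

    public static List<List<String>> split(List<String> docs, int n, long seed) {
        List<String> shuffled = new ArrayList<String>(docs);
        Collections.shuffle(shuffled, new Random(seed));   // fixed seed -> reproducible folds
        List<List<String>> folds = new ArrayList<List<String>>();
        for (int i = 0; i < n; i++) {
            folds.add(new ArrayList<String>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % n).add(shuffled.get(i));         // document i goes to fold i % n
        }
        return folds;
    }
}
```

For round k of an n-fold run, fold k is the test set and the remaining n-1 folds are concatenated into the train set.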
Re: Mahout on EC2
Julien,

That sounds great.

Do you record linking information as well?

On Fri, Sep 19, 2008 at 9:42 AM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> I am currently working on the classification of pages according to DMOZ
> :-) I have been planning to give Mahout a serious try but never managed
> to do it, so that could be a good opportunity to do that.

--
ted
Re: Mahout on EC2
Hi Julien,

It would be great if you can test it on the NB/CNB classifier implementation in Mahout. Could you create a dump of the files in the directory format (docs of each category reside in their own directory) used by the Mahout NB implementation? There is no need of a separate mapping table between lexicon and features, as the implementation takes care of features in text format. Maybe with a good test-train split you can compare it across various weighting techniques.

--
Robin Anil
Blog: http://techdigger.wordpress.com

On Fri, Sep 19, 2008 at 10:12 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> We are using our own API to convert the information for each document
> into a vector with a choice of which weighting scheme to use (tf-idf,
> frequency, etc...). The weighting takes the fields into account, i.e. if
> using tf-idf the weight of a given term takes into account its frequency
> in this specific field (say title).
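Producing the directory layout Robin asks for (one subdirectory per category, one file per document) takes only a few lines. The class and file-naming scheme below are illustrative, not Mahout's own tooling:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

/** Write each document into <root>/<category>/<docId>.txt, one file per document. */
public class CategoryDumper {

    public static File dump(File root, String category, String docId, String text)
            throws IOException {
        File dir = new File(root, category);
        if (!dir.exists() && !dir.mkdirs()) {
            throw new IOException("could not create " + dir);
        }
        File out = new File(dir, docId + ".txt");
        Writer w = new FileWriter(out);
        try {
            w.write(text);          // plain text features, no lexicon needed
        } finally {
            w.close();
        }
        return out;
    }
}
```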
Re: Mahout on EC2
Hi,

I am currently working on the classification of pages according to DMOZ :-) I have been planning to give Mahout a serious try but never managed to do it, so that could be a good opportunity to do that.

We have downloaded and parsed the latest DMOZ snapshot. Everything is currently stored in a DB; we have the following fields for each document:
- URL
- category (level 1 from DMOZ)
- content
- title
- description (taken from the HTML meta tags)
- keywords (taken from the HTML meta tags)
- status (unavailable|fetched)

We are using our own API to convert the information for each document into a vector with a choice of which weighting scheme to use (tf-idf, frequency, etc...). The weighting takes the fields into account, i.e. if using tf-idf the weight of a given term takes into account its frequency in this specific field (say title).

I could describe the whole process on a Wiki page but that would be quite long (especially if we need to go through all the details of Nutch); maybe I could simply generate a textual representation of the matrix and put it in a place where people could download it? That could be the starting point of the use case. There would also be a lexicon file containing the mapping between the attribute labels and their index.

There could be all sorts of possible experiments from there, e.g. trying to see which attributes are the most discriminant etc...

Does that make sense?

Julien

2008/9/19 Grant Ingersoll <[EMAIL PROTECTED]>

> Amazon has generously donated some credits, so I plan on putting Mahout
> up and doing some testing. Was wondering if people had suggestions on
> things they would like to see from Mahout. For starters, I'm going to
> put up a public image containing 0.1 when it's ready, but I'd also like
> to wiki up some examples. I.e. go here, get this data, put it in this
> format and then do X. We have some simple examples, but I think it would
> be cool to show how to do something a bit more complex, like maybe
> classify web pages according to DMOZ or to cluster on stuff, or maybe
> put in a large traveling salesman problem using the GA stuff Deneche did.
>
> Thoughts? Anyone else interested in setting up some use cases?
>
> -Grant
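The per-field weighting Julien describes (a term's weight reflecting its frequency in a specific field such as the title) can be sketched roughly as follows. The class and field names are invented for illustration, not his actual API; a tf-idf variant would additionally multiply each count by log(N / df(term)).

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of a field-aware term vector: each field's tokens are counted
 * separately and scaled by a per-field weight before being merged into one
 * vector, so a term in the title can count more than the same term in the body.
 */
public class FieldWeightedVector {

    public static Map<String, Double> vectorize(Map<String, String> fields,
                                                Map<String, Double> fieldWeights) {
        Map<String, Double> vector = new HashMap<String, Double>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double weight = fieldWeights.containsKey(field.getKey())
                    ? fieldWeights.get(field.getKey()) : 1.0;   // unlisted fields weigh 1
            for (String token : field.getValue().toLowerCase().split("\\s+")) {
                Double old = vector.get(token);
                vector.put(token, (old == null ? 0.0 : old) + weight);  // tf scaled by field weight
            }
        }
        return vector;
    }
}
```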