Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Jake Mannix
I actually want to try and see how much runs on Amazon EMR (0.18.3*), as that
would be good to document.  I like running on 0.20 better, and I certainly
think we should recommend people use it, but there are some jobs which simply
won't run on 0.18; it would be good to document which ones those are.

  -jake

On Wed, Feb 10, 2010 at 10:51 AM, Ted Dunning  wrote:

> +1 from me even though I am still on 19 at work.
>
> On Wed, Feb 10, 2010 at 3:53 AM, Isabel Drost  wrote:
>
> > On Wed Sean Owen  wrote:
> >
> > > I'd say we recommend 0.20, since that's what we develop against and
> > > it's the current stable release, and everything we have works on it.
> > >
> > > We can also say it should work on 0.19 and 0.18, but we don't
> > > guarantee or support that. (Slightly different than my last suggestion
> > > -- we don't actually know how it all goes on 0.19)
> >
> > Sounds good to me.
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Ted Dunning
+1 from me even though I am still on 19 at work.

On Wed, Feb 10, 2010 at 3:53 AM, Isabel Drost  wrote:

> On Wed Sean Owen  wrote:
>
> > I'd say we recommend 0.20, since that's what we develop against and
> > it's the current stable release, and everything we have works on it.
> >
> > We can also say it should work on 0.19 and 0.18, but we don't
> > guarantee or support that. (Slightly different than my last suggestion
> > -- we don't actually know how it all goes on 0.19)
>
> Sounds good to me.




-- 
Ted Dunning, CTO
DeepDyve


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Benson Margulies
We could have a profile for that.

On Wed, Feb 10, 2010 at 11:17 AM, Drew Farris  wrote:
> On Wed, Feb 10, 2010 at 6:40 AM, Sean Owen  wrote:
>>
>> We can also say it should work on 0.19 and 0.18, but we don't
>> guarantee or support that. (Slightly different than my last suggestion
>> -- we don't actually know how it all goes on 0.19)
>>
>
> +1 -- we can't really know how it will work unless we build against
> the 0.19 jars and run the unit tests.
>


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Drew Farris
On Wed, Feb 10, 2010 at 6:40 AM, Sean Owen  wrote:
>
> We can also say it should work on 0.19 and 0.18, but we don't
> guarantee or support that. (Slightly different than my last suggestion
> -- we don't actually know how it all goes on 0.19)
>

+1 -- we can't really know how it will work unless we build against
the 0.19 jars and run the unit tests.


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Isabel Drost
On Wed Sean Owen  wrote:

> I'd say we recommend 0.20, since that's what we develop against and
> it's the current stable release, and everything we have works on it.
> 
> We can also say it should work on 0.19 and 0.18, but we don't
> guarantee or support that. (Slightly different than my last suggestion
> -- we don't actually know how it all goes on 0.19)

Sounds good to me.

Isabel


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Sean Owen
I'd say we recommend 0.20, since that's what we develop against and
it's the current stable release, and everything we have works on it.

We can also say it should work on 0.19 and 0.18, but we don't
guarantee or support that. (Slightly different than my last suggestion
-- we don't actually know how it all goes on 0.19)

On Wed, Feb 10, 2010 at 11:36 AM, Isabel Drost  wrote:
> +1
>
> Assuming that the majority of the algorithms may work on e.g. 0.19, we
> could tell users something along the lines of "works with Hadoop 0.19,
> except $algorithms_for_20, may work with 0.18, not guarantee given".
>
> Isabel
>


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Isabel Drost
On Wed, 10 Feb 2010 11:10:41 +
Sean  wrote:

> For simplicity, I'd document that Mahout works on 0.19 and 0.20, and
> may work on 0.18

+1

Assuming that the majority of the algorithms may work on e.g. 0.19, we
could tell users something along the lines of "works with Hadoop 0.19,
except $algorithms_for_20, may work with 0.18, no guarantee given".

Isabel


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Robin Anil
fpm is based purely on the 0.20.x API and works perfectly fine on it



On Wed, Feb 10, 2010 at 4:40 PM, Sean  wrote:

> For simplicity, I'd document that Mahout works on 0.19 and 0.20, and
> may work on 0.18. That's more what people need to know, rather than
> confuse the issue with talk of old/new APIs, since even I am confused
> about what's going on. The two are blending together, while one is
> deprecated, and it causes problems.
>
> In the one case here, there are two implementations, covering all bases.
>


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Sean
For simplicity, I'd document that Mahout works on 0.19 and 0.20, and
may work on 0.18. That's more what people need to know, rather than
confuse the issue with talk of old/new APIs, since even I am confused
about what's going on. The two are blending together, while one is
deprecated, and it causes problems.

In the one case here, there are two implementations, covering all bases.


Re: Mahout 0.3 Plan and other changes

2010-02-10 Thread Isabel Drost
On Thu deneche abdelhakim  wrote:
> although I maintain two versions of Decision Forests, one with the old
> api and with the new one, the differences between the two APIs are so
> important that I can't just keep working on the two versions. Thus all
> the new stuff is being committed using the new API and as far as I can
> say it seems to work great.

If I understand you correctly, there is code in Mahout that still works
with the old API but also bits and pieces that depend on the new API.

Do we have some documentation we can include in the release that tells
users for which algorithms/implementations they need to make sure they
are running a Hadoop version that provides the new API?

Isabel



Re: Mahout 0.3 Plan and other changes

2010-02-05 Thread Robin Anil
I am committing the first level of changes so that Drew can work on it. I have
updated the patch on the issue as a reference. Ted, please take a look when
you get time. The names will change correspondingly.

What I have right now is

4 main entry points:
DocumentProcessor - does SequenceFile => StringTuple (later replaced by
StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => TF vector
PartialVectorMerger - merges partial vectors based on their doc id; does
optional normalizing (used by both DictionaryVectorizer (no normalizing) and
TfidfConverter (optional normalizing))
TfidfConverter - converts a TF vector to a TF-IDF vector with optional
normalizing
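
For anyone new to the terms above, here is a minimal plain-Java sketch of the
arithmetic the TF => TF-IDF step plus the optional p-norm performs. This is not
the Mahout class itself, and idf = log(numDocs / docFreq) is just one common
formulation; the actual weighting used may differ.

import java.util.HashMap;
import java.util.Map;

// Sketch only: tf * idf weighting followed by optional p-norm normalization,
// roughly what the TF-IDF conversion plus --norm does. Not Mahout code.
public class TfIdfSketch {
  public static Map<String, Double> tfidf(Map<String, Integer> tf,
                                          Map<String, Integer> df,
                                          int numDocs, double normPower) {
    Map<String, Double> weights = new HashMap<String, Double>();
    for (Map.Entry<String, Integer> e : tf.entrySet()) {
      int docFreq = df.containsKey(e.getKey()) ? df.get(e.getKey()) : 1;
      double idf = Math.log((double) numDocs / docFreq);   // one common idf choice
      weights.put(e.getKey(), e.getValue() * idf);
    }
    if (normPower > 0) {                                   // e.g. --norm 2
      double norm = 0.0;
      for (double w : weights.values()) {
        norm += Math.pow(Math.abs(w), normPower);
      }
      norm = Math.pow(norm, 1.0 / normPower);
      if (norm > 0) {
        for (Map.Entry<String, Double> e : weights.entrySet()) {
          e.setValue(e.getValue() / norm);                 // unit length under the p-norm
        }
      }
    }
    return weights;
  }
}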

An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o
reuters-vectors -w (tfidf|tf) --norm 2 (works only with tfidf for now)

Robin


On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning  wrote:

> Drew has an early code drop that should be posted shortly.  He has a
> generic
> AvroWritable that can serialize anything with an appropriate schema.  That
> changes your names and philosophy a bit.
>
> Regarding n-grams, I think that will be best combined with a non-dictionary
> based vectorizer because of the large implied vocabulary that would
> otherwise result.  Also, in many cases vectorization and n-gram generation
> is best done in the learning algorithm itself to avoid moving massive
> amounts of data.  As such, vectorization will probably need to be a library
> rather than a map-reduce program.
>
>
> On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil  wrote:
>
> > Lets break it down into milestones. See if you agree on the
> following(even
> > ClassNames ?)
> >
> > On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning 
> > wrote:
> >
> > > These are good questions.  I see the best course as answering these
> kinds
> > > of
> > > questions in phases.
> > >
> > > First, the only thing that is working right now is the current text =>
> > > vector stuff.  We should continue to refine this with alternative forms
> > of
> > > vectorization (random indexing, stochastic projection as well as the
> > > current
> > > dictionary approach).
> > >
> > > The input all these vectorization job is StucturedDocumentWritable
> format
> > which you and Drew will work on(Avro based)
> >
> > To create the StructuredDocumentWritable format we have to write
> Mapreduces
> > which will convert
> > a) SequenceFile => SingleField token array using Analyzer
> > I am going with simple Document
> > => StucturedDocumentWritable(encapsulating StringTuple)  in   M1.
> > Change it to StucturedDocumentWritable( in M2
> > b) Lucene Repo  => StucturedDocumentWritable   M2
> > c) Structured XML =>  StucturedDocumentWritable  M2
> > d) Other Formats/DataSources(RDBMS)  => StucturedDocumentWritable
> > M3
> >
> > Jobs using StructuredDocumentWritable
> > a) DictionaryVectorizer -> Makes VectorWritable M1
> > b) nGram Generator -> Makes ngrams ->
> >  1) Appends to the dictionary -> Creates Partial Vectors ->
> Merges
> > with vectors from Dictionary Vectorizer to create ngram based vectors
> > M1
> >  2) Appends to  other vectorizers(random indexing, stochastic)
> M1?
> > or M2
> > c) Random Indexing Job -> Makes VectorWritable  M1? or M2
> > d) StochasticProjection Job -> Makes Vector writable  M1? or M2
> >
> >
> > How does this sound ? Feel free to edit/reorder them
> >
> >
> >
> > A second step is to be able to store and represent more general documents
> > > similar to what is possible with Lucene.  This is critically important
> > for
> > > some of the things that I want to do where I need to store and
> segregate
> > > title, publisher, authors, abstracts and body text (and many other
> > > characteristics ... we probably have >100 of them).  It is also
> > critically
> > > important if we want to embrace the dualism between recommendation and
> > > search.  Representing documents can be done without discarding the
> > simpler
> > > approach we have now and it can be done in advance of good
> vectorization
> > of
> > > these complex documents.
> > >
> > > A third step is to define advanced vectorization for complex documents.
> >  As
> > > an interim step, we can simply vectorize using the dictionary and
> > > alternative vectorizers that we have now, but applied to a single field
> > of
> > > the document.  Shortly, though, we should be able to define cross
> > > occurrence
> > > features for a multi-field vectorization.
> > >
> > > The only dependencies here are that the third step depends on the first
> > and
> > > second.
> > >
> > > You have been working on the Dictionary vectorizer.  I did a bit of
> work
> > on
> > > stochastic projection with some cooccurrence.
> > >
> > In parallel Drew and I have been working on building an Avro document

Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Ted Dunning
Drew has an early code drop that should be posted shortly.  He has a generic
AvroWritable that can serialize anything with an appropriate schema.  That
changes your names and philosophy a bit.

Regarding n-grams, I think that will be best combined with a non-dictionary
based vectorizer because of the large implied vocabulary that would
otherwise result.  Also, in many cases vectorization and n-gram generation
is best done in the learning algorithm itself to avoid moving massive
amounts of data.  As such, vectorization will probably need to be a library
rather than a map-reduce program.


On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil  wrote:

> Lets break it down into milestones. See if you agree on the following(even
> ClassNames ?)
>
> On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning 
> wrote:
>
> > These are good questions.  I see the best course as answering these kinds
> > of
> > questions in phases.
> >
> > First, the only thing that is working right now is the current text =>
> > vector stuff.  We should continue to refine this with alternative forms
> of
> > vectorization (random indexing, stochastic projection as well as the
> > current
> > dictionary approach).
> >
> > The input all these vectorization job is StucturedDocumentWritable format
> which you and Drew will work on(Avro based)
>
> To create the StructuredDocumentWritable format we have to write Mapreduces
> which will convert
> a) SequenceFile => SingleField token array using Analyzer
> I am going with simple Document
> => StucturedDocumentWritable(encapsulating StringTuple)  in   M1.
> Change it to StucturedDocumentWritable( in M2
> b) Lucene Repo  => StucturedDocumentWritable   M2
> c) Structured XML =>  StucturedDocumentWritable  M2
> d) Other Formats/DataSources(RDBMS)  => StucturedDocumentWritable
> M3
>
> Jobs using StructuredDocumentWritable
> a) DictionaryVectorizer -> Makes VectorWritable M1
> b) nGram Generator -> Makes ngrams ->
>  1) Appends to the dictionary -> Creates Partial Vectors -> Merges
> with vectors from Dictionary Vectorizer to create ngram based vectors
> M1
>  2) Appends to  other vectorizers(random indexing, stochastic) M1?
> or M2
> c) Random Indexing Job -> Makes VectorWritable  M1? or M2
> d) StochasticProjection Job -> Makes Vector writable  M1? or M2
>
>
> How does this sound ? Feel free to edit/reorder them
>
>
>
> A second step is to be able to store and represent more general documents
> > similar to what is possible with Lucene.  This is critically important
> for
> > some of the things that I want to do where I need to store and segregate
> > title, publisher, authors, abstracts and body text (and many other
> > characteristics ... we probably have >100 of them).  It is also
> critically
> > important if we want to embrace the dualism between recommendation and
> > search.  Representing documents can be done without discarding the
> simpler
> > approach we have now and it can be done in advance of good vectorization
> of
> > these complex documents.
> >
> > A third step is to define advanced vectorization for complex documents.
>  As
> > an interim step, we can simply vectorize using the dictionary and
> > alternative vectorizers that we have now, but applied to a single field
> of
> > the document.  Shortly, though, we should be able to define cross
> > occurrence
> > features for a multi-field vectorization.
> >
> > The only dependencies here are that the third step depends on the first
> and
> > second.
> >
> > You have been working on the Dictionary vectorizer.  I did a bit of work
> on
> > stochastic projection with some cooccurrence.
> >
> > In parallel Drew and I have been working on building an Avro document
> > schema.  This is driving forward on step 2.  I think that will actually
> > bear
> > some fruit quickly.  Once that is done, we should merge capabilities.  I
> am
> > hoping that the good momentum that you have established on (1) will mean
> > that merging your vectorization with the complex documents will be
> > relatively easy.
> >
> > Is that a workable idea?
> >
> > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil 
> wrote:
> >
> > > And how does it
> > > work with our sequence file format(string docid => string document>.
> All
> > we
> > > have is text=>text ?
> > > and finally its all vectors. How does same word in two different fields
> > > translate into vector?
> > >
> > > if you have a clear plan lets do it or lets do the first version with
> > just
> > >
> > > document -> analyzer -> token array -> vector
> > >  |-> ngram ->
> vector
> > >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Robin Anil
Lets break it down into milestones. See if you agree on the following(even
ClassNames ?)

On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning  wrote:

> These are good questions.  I see the best course as answering these kinds
> of
> questions in phases.
>
> First, the only thing that is working right now is the current text =>
> vector stuff.  We should continue to refine this with alternative forms of
> vectorization (random indexing, stochastic projection as well as the
> current
> dictionary approach).
>
> The input to all these vectorization jobs is the StructuredDocumentWritable
format, which you and Drew will work on (Avro-based)

To create the StructuredDocumentWritable format we have to write MapReduces
which will convert:
a) SequenceFile => single-field token array using Analyzer
   I am going with simple Document => StructuredDocumentWritable
   (encapsulating StringTuple) in M1, and will change it to the full
   StructuredDocumentWritable in M2
b) Lucene repo => StructuredDocumentWritable   M2
c) Structured XML => StructuredDocumentWritable   M2
d) Other formats/data sources (RDBMS) => StructuredDocumentWritable   M3

Jobs using StructuredDocumentWritable:
a) DictionaryVectorizer -> makes VectorWritable   M1
b) nGram generator -> makes ngrams ->
   1) appends to the dictionary -> creates partial vectors -> merges
      with vectors from DictionaryVectorizer to create ngram-based vectors   M1
   2) appends to other vectorizers (random indexing, stochastic)   M1? or M2
c) Random indexing job -> makes VectorWritable   M1? or M2
d) Stochastic projection job -> makes VectorWritable   M1? or M2


How does this sound? Feel free to edit/reorder them.



A second step is to be able to store and represent more general documents
> similar to what is possible with Lucene.  This is critically important for
> some of the things that I want to do where I need to store and segregate
> title, publisher, authors, abstracts and body text (and many other
> characteristics ... we probably have >100 of them).  It is also critically
> important if we want to embrace the dualism between recommendation and
> search.  Representing documents can be done without discarding the simpler
> approach we have now and it can be done in advance of good vectorization of
> these complex documents.
>
> A third step is to define advanced vectorization for complex documents.  As
> an interim step, we can simply vectorize using the dictionary and
> alternative vectorizers that we have now, but applied to a single field of
> the document.  Shortly, though, we should be able to define cross
> occurrence
> features for a multi-field vectorization.
>
> The only dependencies here are that the third step depends on the first and
> second.
>
> You have been working on the Dictionary vectorizer.  I did a bit of work on
> stochastic projection with some cooccurrence.
>
> In parallel Drew and I have been working on building an Avro document
> schema.  This is driving forward on step 2.  I think that will actually
> bear
> some fruit quickly.  Once that is done, we should merge capabilities.  I am
> hoping that the good momentum that you have established on (1) will mean
> that merging your vectorization with the complex documents will be
> relatively easy.
>
> Is that a workable idea?
>
> On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil  wrote:
>
> > And how does it
> > work with our sequence file format(string docid => string document>. All
> we
> > have is text=>text ?
> > and finally its all vectors. How does same word in two different fields
> > translate into vector?
> >
> > if you have a clear plan lets do it or lets do the first version with
> just
> >
> > document -> analyzer -> token array -> vector
> >  |-> ngram -> vector
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Drew Farris
On Thu, Feb 4, 2010 at 1:45 PM, Robin Anil  wrote:
>
> if you have a clear plan lets do it or lets do the first version with just
>
> document -> analyzer -> token array -> vector
>                                                      |-> ngram -> vector
>

Ted summed it up perfectly. I think this is great until we get further
along with the document work.

>
> Lets not have overlapping ids otherwise it becomes a pain to merge. have
> unique ids in sequence file, and a file with last id used ?
>

Ok, I will read the partial vector/dictionary code to get my head around this.


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread deneche abdelhakim
> One important question in my mind here is how does this effect 0.20 based
> jobs and pre 0.20 based jobs. I had written pfpgrowth in pure 0.20 api. and
> deneche is also maintaining two version it seems. I will check the
> AbstractJob and see

although I maintain two versions of Decision Forests, one with the old
API and one with the new, the differences between the two APIs are so
significant that I can't keep working on both versions. Thus all
the new stuff is being committed using the new API, and as far as I can
tell it seems to work great.

On Thu, Feb 4, 2010 at 4:48 PM, Robin Anil  wrote:
> On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen  wrote:
>
>> On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil  wrote:
>> > 3rd thing:
>> > I am planning to convert the launcher code to implement ToolRunner.
>> Anyone
>> > volunteer to help me with that?
>>
>> I had wished to begin standardizing how we write these jobs, yes.
>>
>> If you see AbstractJob, you'll see how I've unified my three jobs and
>> how I'm trying to structure them. It implements ToolRunner so all that
>> is already taken care of.
>>
>> I think some standardization is really useful, to solve problems like
>> this and others, and I'll offer this as a 'draft' for further work. No
>> real point in continuing to solve these things individually.
>
> One important question in my mind here is how does this effect 0.20 based
> jobs and pre 0.20 based jobs. I had written pfpgrowth in pure 0.20 api. and
> deneche is also maintaining two version it seems. I will check the
> AbstractJob and see
>
>
>> > 5th The release:
>> > Fix a date for 0.3 release? We should look to improve quality in this
>> > release. i.e In-terms of running the parts of the code each of us haven't
>> > tested (like I have run bayes and fp growth many a time, So, I will focus
>> on
>> > running clustering algorithms and try out various options see if there is
>> > any issue) provide feedback so that the one who wrote it can help tweak
>> it?
>>
>> Maybe, maybe not. There are always 100 things that could be worked on,
>> and that will never change -- it'll never be 'done'. The question of a
>> release, at this point, is more like, has enough time elapsed / has
>> enough progress been made to warrant a new point release? I think we
>> are at that point now.
>>
>> The question is not what big things can we do -- 'big' is for 0.4 or
>> beyond now -- but what small wins can we get in, or what small changes
>> are necessary to tie up loose ends to make a roughly coherent release.
>> In that sense, no, I'm not sure I'd say things like what you describe
>> should be in for 0.3. I mean we could, but then it's months away, and
>> isn't that just what we call "0.4"?
>>
>> Everyone's had a week or two to move towards 0.3 so I believe it's
>> time to begin pushing on these issues, closing then / resolving them /
>> moving to 0.4 by end of week. Then set the wheel in motion first thing
>> next week, since it'll still be some time before everyone's on board.
>>
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Ted Dunning
These are good questions.  I see the best course as answering these kinds of
questions in phases.

First, the only thing that is working right now is the current text =>
vector stuff.  We should continue to refine this with alternative forms of
vectorization (random indexing, stochastic projection as well as the current
dictionary approach).

A second step is to be able to store and represent more general documents
similar to what is possible with Lucene.  This is critically important for
some of the things that I want to do where I need to store and segregate
title, publisher, authors, abstracts and body text (and many other
characteristics ... we probably have >100 of them).  It is also critically
important if we want to embrace the dualism between recommendation and
search.  Representing documents can be done without discarding the simpler
approach we have now and it can be done in advance of good vectorization of
these complex documents.

A third step is to define advanced vectorization for complex documents.  As
an interim step, we can simply vectorize using the dictionary and
alternative vectorizers that we have now, but applied to a single field of
the document.  Shortly, though, we should be able to define cross occurrence
features for a multi-field vectorization.
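
One simple way to answer the "same word in two different fields" question that
comes up elsewhere in this thread is to prefix each term with its field name so
each (field, term) pair gets its own dictionary entry and dimension. A toy
sketch only, not necessarily the cross-occurrence scheme described above.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: field-prefixed term counts, so "mahout" in the title and "mahout"
// in the body become distinct features. Illustration only.
public class FieldAwareTermsSketch {
  public static Map<String, Integer> termCounts(Map<String, List<String>> fields) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (Map.Entry<String, List<String>> field : fields.entrySet()) {
      for (String token : field.getValue()) {
        String key = field.getKey() + ":" + token;   // e.g. "title:mahout" vs "body:mahout"
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
      }
    }
    return counts;
  }
}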

The only dependencies here are that the third step depends on the first and
second.

You have been working on the Dictionary vectorizer.  I did a bit of work on
stochastic projection with some cooccurrence.

In parallel Drew and I have been working on building an Avro document
schema.  This is driving forward on step 2.  I think that will actually bear
some fruit quickly.  Once that is done, we should merge capabilities.  I am
hoping that the good momentum that you have established on (1) will mean
that merging your vectorization with the complex documents will be
relatively easy.

Is that a workable idea?

On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil  wrote:

> And how does it
> work with our sequence file format(string docid => string document>. All we
> have is text=>text ?
> and finally its all vectors. How does same word in two different fields
> translate into vector?
>
> if you have a clear plan lets do it or lets do the first version with just
>
> document -> analyzer -> token array -> vector
>  |-> ngram -> vector
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Robin Anil
On Thu, Feb 4, 2010 at 10:29 PM, Drew Farris  wrote:

> On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil  wrote:
>
> >>
> >> Document Directory -> Document Sequence File
> >> Document Sequence File -> Document Token Streams
> >> Document Token Streams -> Document Vectors + Dictionary
> >>
> > Ok I will work on this Job.
>
> FWIW, Ted had proposed something on the order of allowing Documents to
> have multiple named Fields, where each field has an independent token
> stream. Likewise, Document sequence files could have multiple fields
> per Document where each field is a string. What do you think about
> something like this? The documents I work with day to day in
> production are more frequently field structured than flat and in some
> cases fields are tokenized while others are simply untouched. I
>
>  Tell me what the schema should be, List> ? And how does it
work with our sequence file format (string docid => string document)? All we
have is text => text.
And finally it's all vectors. How does the same word in two different fields
translate into a vector?

if you have a clear plan let's do it, or let's do the first version with just

document -> analyzer -> token array -> vector
                                  |-> ngram -> vector

> Also partial Vector merger could be reused by colloc when creating ngram
> > only vectors. But we need to keep adding to the dictionary file. If you
> can
> > work on a dictionary merger + chunker, it will be great. I think we can
> do
> > this integration quickly
>
> I'll take a closer look at the Dictionary code you're produced and see
> what I can come up with -- is the basic idea here to take multiple
> dictionaries with potentially overlapping ID's and merge them into a
> single dictionary? What needs to happen with regards to chunking?

Let's not have overlapping ids, otherwise it becomes a pain to merge. Have
unique ids in the sequence file, and a file with the last id used?

>
> Drew
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Jake Mannix
On Thu, Feb 4, 2010 at 9:15 AM, Drew Farris  wrote:
>
>
> Ok, this makes sense to me. I think running on Amazon EMR is a good
> goal, not to mention I'm sure there are people out there with
> installations running on pre 0.20 hadoop too. As much as I hate to to
> see the deprecation warnings all of the time your reasoning behind
> sticking with the old apis sounds solid.
>
>
Amazon EMR and people's old installations are why I lean toward "-1" for
moving wholly toward 0.20 standardization just yet.  I *strongly* prefer the
new API myself, and there are also bugs (perf and func) fixed in 0.20, but I
would hate to basically cut off support for those Hadoop installations until
people and Amazon upgrade.


  -jake


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Sean Owen
On Thu, Feb 4, 2010 at 5:15 PM, Drew Farris  wrote:
> Which jobs specifically? It would be great to use these for reference.

All of the recommender-related ones. Try
org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. Use of
TextInputFormat triggers the problem.

It's possible I'm misusing the new APIs, but I kind of doubt it, and I
haven't seen any examples for them yet that tell me to do something else.


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Drew Farris
On Thu, Feb 4, 2010 at 12:10 PM, Sean Owen  wrote:
>
> ... so therefore what's the actual use in upgrading yet. I also
> figured we'd spend some time consolidating our own approach to Hadoop
> -- I've refactored my 3 jobs into one approach -- making the eventual
> transition simpler.
>

Which jobs specifically? It would be great to use these for reference.

>
> No harm in having new-API code alongside the old-API code, but I still
> suggest we should stick on the old APIs.

Ok, this makes sense to me. I think running on Amazon EMR is a good
goal, not to mention I'm sure there are people out there with
installations running on pre-0.20 Hadoop too. As much as I hate to
see the deprecation warnings all of the time, your reasoning behind
sticking with the old APIs sounds solid.

Drew


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Sean Owen
It's this crazy thing where the new APIs call into old APIs and checks
fail as a result -- for example, try setting an InputFormat class that
implements the 'new' InputFormat. Somewhere in the code it checks to
see if you're implementing the *old* InputFormat.

It may so happen that only my jobs hit this. I don't see it fixed in
any branch yet.
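
For anyone following along, the two InputFormat hierarchies are unrelated Java
types, which is roughly why a new-API driver can trip checks written against
the old API. A sketch of the general shape, not the exact failing Mahout code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;  // new-API class
// old-API equivalent (an unrelated type): org.apache.hadoop.mapred.TextInputFormat

public class NewApiDriverSketch {
  public static void main(String[] args) throws Exception {
    // Driver written purely against the new org.apache.hadoop.mapreduce API.
    Job job = new Job(new Configuration(), "new-api example");
    job.setJarByClass(NewApiDriverSketch.class);
    job.setInputFormatClass(TextInputFormat.class);  // only accepts the new-API type
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // ... mapper/reducer/output configuration and job.waitForCompletion(true) ...
  }
}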

Actually I failed to mention what I think is a far bigger reason to
not move to the new API just yet -- it won't run on Amazon Elastic
MapReduce.

I suppose the thinking is that the old APIs
- work with stuff like Amazon
- work with Hadoop's latest release
- work -- don't have a bug that's stopping us

... so therefore what's the actual use in upgrading yet. I also
figured we'd spend some time consolidating our own approach to Hadoop
-- I've refactored my 3 jobs into one approach -- making the eventual
transition simpler.

And so I stopped thinking about it.

No harm in having new-API code alongside the old-API code, but I still
suggest we should stick on the old APIs.

On Thu, Feb 4, 2010 at 5:01 PM, Drew Farris  wrote:
> Sean,
>
> What sort of problems have you run into, are there Hadoop JIRA issues
> open for them?
>
> It would be nice to commit to the 0.20.x api in Mahout, but I agree,
> not very nice if we back the users into a corner wrt what they can and
> can't do due to bugs in Hadoop.
>
> Drew
>
> On Thu, Feb 4, 2010 at 10:57 AM, Sean Owen  wrote:
>> Yeah I'm still on the old API because of problems in Hadoop. I'm still
>> hoping they get fixed in 0.20.x  We may need two-track support for a
>> while.
>>
>> On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil  wrote:
>>> One important question in my mind here is how does this effect 0.20 based
>>> jobs and pre 0.20 based jobs. I had written pfpgrowth in pure 0.20 api. and
>>> deneche is also maintaining two version it seems. I will check the
>>> AbstractJob and see
>>>
>>>
>>
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Drew Farris
Sean,

What sort of problems have you run into, are there Hadoop JIRA issues
open for them?

It would be nice to commit to the 0.20.x api in Mahout, but I agree,
not very nice if we back the users into a corner wrt what they can and
can't do due to bugs in Hadoop.

Drew

On Thu, Feb 4, 2010 at 10:57 AM, Sean Owen  wrote:
> Yeah I'm still on the old API because of problems in Hadoop. I'm still
> hoping they get fixed in 0.20.x  We may need two-track support for a
> while.
>
> On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil  wrote:
>> One important question in my mind here is how does this effect 0.20 based
>> jobs and pre 0.20 based jobs. I had written pfpgrowth in pure 0.20 api. and
>> deneche is also maintaining two version it seems. I will check the
>> AbstractJob and see
>>
>>
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Drew Farris
On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil  wrote:

>>
>> Document Directory -> Document Sequence File
>> Document Sequence File -> Document Token Streams
>> Document Token Streams -> Document Vectors + Dictionary
>>
> Ok I will work on this Job.

FWIW, Ted had proposed something on the order of allowing Documents to
have multiple named Fields, where each field has an independent token
stream. Likewise, Document sequence files could have multiple fields
per Document where each field is a string. What do you think about
something like this? The documents I work with day to day in
production are more frequently field structured than flat and in some
cases fields are tokenized while others are simply untouched. I

> Also partial Vector merger could be reused by colloc when creating ngram
> only vectors. But we need to keep adding to the dictionary file. If you can
> work on a dictionary merger + chunker, it will be great. I think we can do
> this integration quickly

I'll take a closer look at the Dictionary code you've produced and see
what I can come up with -- is the basic idea here to take multiple
dictionaries with potentially overlapping ID's and merge them into a
single dictionary? What needs to happen with regards to chunking?

Drew


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Sean Owen
Yeah I'm still on the old API because of problems in Hadoop. I'm still
hoping they get fixed in 0.20.x. We may need two-track support for a
while.

On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil  wrote:
> One important question in my mind here is how does this effect 0.20 based
> jobs and pre 0.20 based jobs. I had written pfpgrowth in pure 0.20 api. and
> deneche is also maintaining two version it seems. I will check the
> AbstractJob and see
>
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Robin Anil
On Thu, Feb 4, 2010 at 8:13 PM, Drew Farris  wrote:

> On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil  wrote:
>
> > Since I was converting vectorization in to sequence files. I was going to
> > change the lucene Driver to write dictionary to sequence file instead of
> tab
> > separated text file. Also I will change the cluster dumper to read the
> > dictionary from the sequence File.
>
> Sounds good.
>
> > Iterator interface where SequenceFile reader/writer
> is
> > one implementation, Tab separated file reader/writer is another
>
> I like this, but also how about providing a utility to go from
> tab-delimited dict format to SequenceFile format. This way there's a
> migration path for old datasets.
>
> > 2nd Thing:
> >  Lucene seems too slow for querying dictionary vectorization
> > 1 x 6 hours m/r as opposed to 2 x 1.5 hour on wikipedia. i.e. Double read
> of
> > wikipedia dump with a hashmap is faster than single read using a lucene
> > index
>
> Is the double read approach described in one of the previous threads
> discussion this issue? Just curious how it works..
>
> > 3rd thing:
> > I am planning to convert the launcher code to implement ToolRunner.
> Anyone
> > volunteer to help me with that?
>
> Sure, I can help out. What classes need to be updated? I've patched
> the clustering code in the past, that's probably a natural start.
> Sean, I'll take a look at AbstractJob and what would be involved in
> re-using it in the Clustering code.
>
> With ToolRunner, we get GenericOptionsParser for free, and the
> launcher classes must implement Tool and Configurable, right?
> ToolRunner is specific to the 0.20 api, isn't it?
>
> I did notice that Eclipse was complaining about GenericOptionsParser
> last night because commons-cli 1.x wasn't available. I had to remove
> its exclusion in the parent pom to get things to work properly, anyone
> else run into this or is this something funky in my environment.
>
> > 4th thing:
> > Any thoughts how we can integrate output of n-gram map/reduce to generate
> > vectors from dataset
>
> So are you speaking of n-grams in general, or the output of the colloc
> work? I suppose I should wrap up the process of writing the top
> collocations to a file which can be read into a bloom filter which can
> be integrated into phase of the document vectorization process that
> performs tokenization. The document vectorization code could use the
> shingle filter to produce ngrams and emit those that passed the bloom
> filter.
>
> There's some feedback I'm looking for on MAHOUT-242 related to this,
> that would be helpful, questions about the best way to produce the set
> of top collocations.
>
> Robin, have you considered adding a step to the document vectorization
> process that would produce output that's a token stream instead of a
> vector?
>
> Instead of:
> Document Drectory -> Document Sequence File
> Document Sequence File -> Document Vectors + Dictionary
>
> Document Directory -> Document Sequence File
> Document Sequence File -> Document Token Streams
> Document Token Streams -> Document Vectors + Dictionary
>
OK, I will work on this job.
Also, the partial vector merger could be reused by colloc when creating
ngram-only vectors, but we need to keep adding to the dictionary file. If you
can work on a dictionary merger + chunker, that would be great. I think we can
do this integration quickly.



> This way, something like the colloc/n-gram process would read the
> output of the second pass (Document Token Streams file) instead of
> having to re-tokenize everything simply to obtain token streams.
>
> > 5th The release:
> > Fix a date for 0.3 release? We should look to improve quality in this
> > release. i.e In-terms of running the parts of the code each of us haven't
> > tested (like I have run bayes and fp growth many a time, So, I will focus
> on
> > running clustering algorithms and try out various options see if there is
> > any issue) provide feedback so that the one who wrote it can help tweak
> it?
>
> It is probably time to resurrect Sean's thread from last week and see
> how we stand on the issues listed there.
>
> Drew
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Robin Anil
On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen  wrote:

> On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil  wrote:
> > 3rd thing:
> > I am planning to convert the launcher code to implement ToolRunner.
> Anyone
> > volunteer to help me with that?
>
> I had wished to begin standardizing how we write these jobs, yes.
>
> If you see AbstractJob, you'll see how I've unified my three jobs and
> how I'm trying to structure them. It implements ToolRunner so all that
> is already taken care of.
>
> I think some standardization is really useful, to solve problems like
> this and others, and I'll offer this as a 'draft' for further work. No
> real point in continuing to solve these things individually.

One important question in my mind here is how this affects 0.20-based
jobs and pre-0.20 jobs. I had written pfpgrowth in the pure 0.20 API, and
deneche is also maintaining two versions, it seems. I will check
AbstractJob and see.


> > 5th The release:
> > Fix a date for 0.3 release? We should look to improve quality in this
> > release. i.e In-terms of running the parts of the code each of us haven't
> > tested (like I have run bayes and fp growth many a time, So, I will focus
> on
> > running clustering algorithms and try out various options see if there is
> > any issue) provide feedback so that the one who wrote it can help tweak
> it?
>
> Maybe, maybe not. There are always 100 things that could be worked on,
> and that will never change -- it'll never be 'done'. The question of a
> release, at this point, is more like, has enough time elapsed / has
> enough progress been made to warrant a new point release? I think we
> are at that point now.
>
> The question is not what big things can we do -- 'big' is for 0.4 or
> beyond now -- but what small wins can we get in, or what small changes
> are necessary to tie up loose ends to make a roughly coherent release.
> In that sense, no, I'm not sure I'd say things like what you describe
> should be in for 0.3. I mean we could, but then it's months away, and
> isn't that just what we call "0.4"?
>
> Everyone's had a week or two to move towards 0.3 so I believe it's
> time to begin pushing on these issues, closing then / resolving them /
> moving to 0.4 by end of week. Then set the wheel in motion first thing
> next week, since it'll still be some time before everyone's on board.
>


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Drew Farris
On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil  wrote:

> Since I was converting vectorization in to sequence files. I was going to
> change the lucene Driver to write dictionary to sequence file instead of tab
> separated text file. Also I will change the cluster dumper to read the
> dictionary from the sequence File.

Sounds good.

> Iterator interface where SequenceFile reader/writer is
> one implementation, Tab separated file reader/writer is another

I like this, but also how about providing a utility to go from
tab-delimited dict format to SequenceFile format. This way there's a
migration path for old datasets.
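
Something like the following could serve as that migration utility. It assumes
a term<TAB>id layout for the old dictionary file; the real column order and
value types written by the Lucene driver may differ.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch of a tab-delimited-dictionary -> SequenceFile converter.
// args[0] = old tab-separated dictionary, args[1] = output SequenceFile path.
public class DictionaryMigration {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, IntWritable.class);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t");
        if (parts.length < 2) {
          continue; // skip malformed lines
        }
        writer.append(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
      }
    } finally {
      in.close();
      writer.close();
    }
  }
}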

> 2nd Thing:
>  Lucene seems too slow for querying dictionary vectorization
> 1 x 6 hours m/r as opposed to 2 x 1.5 hour on wikipedia. i.e. Double read of
> wikipedia dump with a hashmap is faster than single read using a lucene
> index

Is the double read approach described in one of the previous threads
discussing this issue? Just curious how it works.

> 3rd thing:
> I am planning to convert the launcher code to implement ToolRunner. Anyone
> volunteer to help me with that?

Sure, I can help out. What classes need to be updated? I've patched
the clustering code in the past, that's probably a natural start.
Sean, I'll take a look at AbstractJob and what would be involved in
re-using it in the Clustering code.

With ToolRunner, we get GenericOptionsParser for free, and the
launcher classes must implement Tool and Configurable, right?
ToolRunner is specific to the 0.20 API, isn't it?

I did notice that Eclipse was complaining about GenericOptionsParser
last night because commons-cli 1.x wasn't available. I had to remove
its exclusion in the parent pom to get things to work properly. Has anyone
else run into this, or is this something funky in my environment?

> 4th thing:
> Any thoughts how we can integrate output of n-gram map/reduce to generate
> vectors from dataset

So are you speaking of n-grams in general, or the output of the colloc
work? I suppose I should wrap up the process of writing the top
collocations to a file which can be read into a bloom filter which can
be integrated into the phase of the document vectorization process that
performs tokenization. The document vectorization code could use the
shingle filter to produce ngrams and emit those that passed the bloom
filter.
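
A toy illustration of that idea in plain Java: a HashSet stands in for the
Bloom filter and a simple loop stands in for Lucene's ShingleFilter, so this is
only the shape of the logic, not the eventual implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: emit only those bigrams that appear in a precomputed set of
// "top collocations". A HashSet stands in for the Bloom filter here.
public class NGramFilterSketch {
  public static List<String> keepTopBigrams(List<String> tokens, Set<String> topCollocations) {
    List<String> kept = new ArrayList<String>();
    for (int i = 0; i + 1 < tokens.size(); i++) {
      String bigram = tokens.get(i) + " " + tokens.get(i + 1);
      if (topCollocations.contains(bigram)) {   // Bloom-filter membership test
        kept.add(bigram);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Set<String> top = new HashSet<String>(Arrays.asList("machine learning", "sequence file"));
    List<String> tokens = Arrays.asList("apache", "mahout", "machine", "learning", "on", "hadoop");
    System.out.println(keepTopBigrams(tokens, top));   // prints [machine learning]
  }
}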

There's some feedback I'm looking for on MAHOUT-242 related to this,
that would be helpful, questions about the best way to produce the set
of top collocations.

Robin, have you considered adding a step to the document vectorization
process that would produce output that's a token stream instead of a
vector?

Instead of:
Document Directory -> Document Sequence File
Document Sequence File -> Document Vectors + Dictionary

Document Directory -> Document Sequence File
Document Sequence File -> Document Token Streams
Document Token Streams -> Document Vectors + Dictionary

This way, something like the colloc/n-gram process would read the
output of the second pass (Document Token Streams file) instead of
having to re-tokenize everything simply to obtain token streams.
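
For concreteness, a sketch of how a downstream job could iterate such a
Document Token Streams file if it were simply a SequenceFile of doc id =>
space-delimited tokens; the actual value type (StringTuple or otherwise) is
still an open question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: a colloc/n-gram style job reading pre-tokenized documents instead
// of re-running the Analyzer. Assumes Text keys (doc ids) and Text values
// (space-delimited tokens) purely for illustration.
public class TokenStreamReaderSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text docId = new Text();
      Text tokens = new Text();
      while (reader.next(docId, tokens)) {
        String[] terms = tokens.toString().split(" ");
        System.out.println(docId + ": " + terms.length + " tokens");
      }
    } finally {
      reader.close();
    }
  }
}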

> 5th The release:
> Fix a date for 0.3 release? We should look to improve quality in this
> release. i.e In-terms of running the parts of the code each of us haven't
> tested (like I have run bayes and fp growth many a time, So, I will focus on
> running clustering algorithms and try out various options see if there is
> any issue) provide feedback so that the one who wrote it can help tweak it?

It is probably time to resurrect Sean's thread from last week and see
how we stand on the issues listed there.

Drew


Re: Mahout 0.3 Plan and other changes

2010-02-04 Thread Sean Owen
On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil  wrote:
> 3rd thing:
> I am planning to convert the launcher code to implement ToolRunner. Anyone
> volunteer to help me with that?

I had wished to begin standardizing how we write these jobs, yes.

If you see AbstractJob, you'll see how I've unified my three jobs and
how I'm trying to structure them. It implements Tool and is run via
ToolRunner, so all that is already taken care of.

I think some standardization is really useful, to solve problems like
this and others, and I'll offer this as a 'draft' for further work. No
real point in continuing to solve these things individually.
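
For reference, the bare Tool/ToolRunner pattern being standardized on; a
minimal self-contained skeleton rather than Mahout's actual AbstractJob.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Minimal Tool/ToolRunner skeleton: ToolRunner parses the generic Hadoop
// options (-D, -fs, -jt, ...) via GenericOptionsParser and hands the job a
// populated Configuration. This is the pattern, not Mahout's AbstractJob.
public class ExampleJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // already contains the parsed -D options
    // ... parse job-specific args, configure and submit the MapReduce job here ...
    return 0;                         // non-zero signals failure to the caller
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new ExampleJob(), args);
    System.exit(exitCode);
  }
}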


> 5th The release:
> Fix a date for 0.3 release? We should look to improve quality in this
> release. i.e In-terms of running the parts of the code each of us haven't
> tested (like I have run bayes and fp growth many a time, So, I will focus on
> running clustering algorithms and try out various options see if there is
> any issue) provide feedback so that the one who wrote it can help tweak it?

Maybe, maybe not. There are always 100 things that could be worked on,
and that will never change -- it'll never be 'done'. The question of a
release, at this point, is more like, has enough time elapsed / has
enough progress been made to warrant a new point release? I think we
are at that point now.

The question is not what big things can we do -- 'big' is for 0.4 or
beyond now -- but what small wins can we get in, or what small changes
are necessary to tie up loose ends to make a roughly coherent release.
In that sense, no, I'm not sure I'd say things like what you describe
should be in for 0.3. I mean we could, but then it's months away, and
isn't that just what we call "0.4"?

Everyone's had a week or two to move towards 0.3 so I believe it's
time to begin pushing on these issues, closing then / resolving them /
moving to 0.4 by end of week. Then set the wheel in motion first thing
next week, since it'll still be some time before everyone's on board.


Mahout 0.3 Plan and other changes

2010-02-04 Thread Robin Anil
1st Thing:

Since I was converting vectorization into sequence files, I was going to
change the Lucene driver to write the dictionary to a sequence file instead
of a tab-separated text file. I will also change the cluster dumper to read
the dictionary from the sequence file.

I can go about this in one of two ways:

Stick to only the SequenceFile format for the dictionary and remove the
tab-separated format from the system

OR

An Iterator-style interface where a SequenceFile reader/writer is
one implementation and a tab-separated file reader/writer is another
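
For example, something along these lines; the names here are made up for
illustration and are not a proposal for the actual class names.

import java.io.Closeable;
import java.util.Iterator;

/** One term -> id mapping from the dictionary. Hypothetical, not a Mahout class. */
class DictionaryEntry {
  final String term;
  final int id;
  DictionaryEntry(String term, int id) {
    this.term = term;
    this.id = id;
  }
}

/**
 * Hypothetical reader abstraction so callers (ClusterDumper, the vectorizers)
 * don't care whether the dictionary lives in a SequenceFile or a tab-separated
 * text file. Implementations (also hypothetical): SequenceFileDictionaryReader,
 * TabSeparatedDictionaryReader. A matching DictionaryWriter would mirror it.
 */
interface DictionaryReader extends Iterator<DictionaryEntry>, Closeable {
}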


2nd Thing:
 Lucene seems too slow for querying during dictionary vectorization:
1 x 6-hour m/r as opposed to 2 x 1.5-hour on Wikipedia, i.e. a double read of
the Wikipedia dump with a hashmap is faster than a single read using a Lucene
index


3rd thing:
I am planning to convert the launcher code to implement ToolRunner. Anyone
volunteer to help me with that?

4th thing:
Any thoughts on how we can integrate the output of the n-gram map/reduce to
generate vectors from the dataset?

5th The release:
Fix a date for the 0.3 release? We should look to improve quality in this
release, i.e. in terms of running the parts of the code each of us hasn't
tested (for example, I have run Bayes and FP-growth many times, so I will
focus on running the clustering algorithms and trying out various options to
see if there are any issues) and providing feedback so that whoever wrote it
can help tweak it.

Maybe time the code when we run it and put it on the wiki?

Can we set a sprint week when we will be doing this?



Comments awaited
Robin