Re: Mahout 0.3 Plan and other changes
I actually want to try and see how much runs on Amazon EMR (0.18.3*), as that would be good to document. I like running on 0.20 better, and I certainly think we should recommend people use it, but there are certainly some jobs which simply won't run on 0.18, although it would be good to document which ones those are. -jake On Wed, Feb 10, 2010 at 10:51 AM, Ted Dunning wrote: > +1 from me even though I am still on 19 at work. > > On Wed, Feb 10, 2010 at 3:53 AM, Isabel Drost wrote: > > > On Wed Sean Owen wrote: > > > > > I'd say we recommend 0.20, since that's what we develop against and > > > it's the current stable release, and everything we have works on it. > > > > > > We can also say it should work on 0.19 and 0.18, but we don't > > > guarantee or support that. (Slightly different than my last suggestion > > > -- we don't actually know how it all goes on 0.19) > > > > Sounds good to me. > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: Mahout 0.3 Plan and other changes
+1 from me even though I am still on 19 at work. On Wed, Feb 10, 2010 at 3:53 AM, Isabel Drost wrote: > On Wed Sean Owen wrote: > > > I'd say we recommend 0.20, since that's what we develop against and > > it's the current stable release, and everything we have works on it. > > > > We can also say it should work on 0.19 and 0.18, but we don't > > guarantee or support that. (Slightly different than my last suggestion > > -- we don't actually know how it all goes on 0.19) > > Sounds good to me. -- Ted Dunning, CTO DeepDyve
Re: Mahout 0.3 Plan and other changes
We could have a profile for that. On Wed, Feb 10, 2010 at 11:17 AM, Drew Farris wrote: > On Wed, Feb 10, 2010 at 6:40 AM, Sean Owen wrote: >> >> We can also say it should work on 0.19 and 0.18, but we don't >> guarantee or support that. (Slightly different than my last suggestion >> -- we don't actually know how it all goes on 0.19) >> > > +1 -- we can't really know how it will work unless we build against > the 0.19 jars and run the unit tests. >
Re: Mahout 0.3 Plan and other changes
On Wed, Feb 10, 2010 at 6:40 AM, Sean Owen wrote: > > We can also say it should work on 0.19 and 0.18, but we don't > guarantee or support that. (Slightly different than my last suggestion > -- we don't actually know how it all goes on 0.19) > +1 -- we can't really know how it will work unless we build against the 0.19 jars and run the unit tests.
Re: Mahout 0.3 Plan and other changes
On Wed Sean Owen wrote: > I'd say we recommend 0.20, since that's what we develop against and > it's the current stable release, and everything we have works on it. > > We can also say it should work on 0.19 and 0.18, but we don't > guarantee or support that. (Slightly different than my last suggestion > -- we don't actually know how it all goes on 0.19) Sounds good to me. Isabel
Re: Mahout 0.3 Plan and other changes
I'd say we recommend 0.20, since that's what we develop against and it's the current stable release, and everything we have works on it. We can also say it should work on 0.19 and 0.18, but we don't guarantee or support that. (Slightly different than my last suggestion -- we don't actually know how it all goes on 0.19) On Wed, Feb 10, 2010 at 11:36 AM, Isabel Drost wrote: > +1 > > Assuming that the majority of the algorithms may work on e.g. 0.19, we > could tell users something along the lines of "works with Hadoop 0.19, > except $algorithms_for_20, may work with 0.18, no guarantee given". > > Isabel >
Re: Mahout 0.3 Plan and other changes
On Wed, 10 Feb 2010 11:10:41 +0000 Sean wrote: > For simplicity, I'd document that Mahout works on 0.19 and 0.20, and > may work on 0.18 +1 Assuming that the majority of the algorithms may work on e.g. 0.19, we could tell users something along the lines of "works with Hadoop 0.19, except $algorithms_for_20, may work with 0.18, no guarantee given". Isabel
Re: Mahout 0.3 Plan and other changes
FPM is purely based on the 0.20.x API and works perfectly fine on it On Wed, Feb 10, 2010 at 4:40 PM, Sean wrote: > For simplicity, I'd document that Mahout works on 0.19 and 0.20, and > may work on 0.18. That's more what people need to know, rather than > confusing the issue with talk of old/new APIs, since even I am confused > about what's going on. The two are blending together, while one is > deprecated, and it causes problems. > > In the one case here, there are two implementations, covering all bases. >
Re: Mahout 0.3 Plan and other changes
For simplicity, I'd document that Mahout works on 0.19 and 0.20, and may work on 0.18. That's more what people need to know, rather than confusing the issue with talk of old/new APIs, since even I am confused about what's going on. The two are blending together, while one is deprecated, and it causes problems. In the one case here, there are two implementations, covering all bases.
Re: Mahout 0.3 Plan and other changes
On Thu deneche abdelhakim wrote: > although I maintain two versions of Decision Forests, one with the old > API and one with the new one, the differences between the two APIs are so > significant that I can't just keep working on the two versions. Thus all > the new stuff is being committed using the new API and as far as I can > tell it seems to work great. If I understand you correctly, there is code in Mahout that still works with the old API but also bits and pieces that depend on the new API. Do we have some documentation we can include in the release that tells users for which algorithms/implementations they need to make sure they are running a Hadoop version that provides the new API? Isabel
Re: Mahout 0.3 Plan and other changes
I am committing the first level of changes so that Drew can work on it. I have updated the patch on the issue as a reference. Ted, please take a look when you get time. The names will change correspondingly. What I have right now is 4 main entry points:
DocumentProcessor - does SequenceFile => StringTuple (later replaced by StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => tf vector
PartialVectorMerger - merges partial vectors based on their doc id. Does optional normalizing (used by both DictionaryVectorizer (no normalizing) and TfidfConverter (optional normalizing))
TfidfConverter - converts tf vector to tfidf vector with optional normalizing
An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2 (works only with tfidf for now)
Robin
On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning wrote: > Drew has an early code drop that should be posted shortly. He has a > generic > AvroWritable that can serialize anything with an appropriate schema. That > changes your names and philosophy a bit. > > Regarding n-grams, I think that will be best combined with a non-dictionary > based vectorizer because of the large implied vocabulary that would > otherwise result. Also, in many cases vectorization and n-gram generation > is best done in the learning algorithm itself to avoid moving massive > amounts of data. As such, vectorization will probably need to be a library > rather than a map-reduce program. > > > On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil wrote: > > > Let's break it down into milestones. See if you agree on the > following (even > > ClassNames?) > > > > On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning > > wrote: > > > > > These are good questions. I see the best course as answering these > kinds > > > of > > > questions in phases. > > > > > > First, the only thing that is working right now is the current text => > > > vector stuff. We should continue to refine this with alternative forms > > of > > > vectorization (random indexing, stochastic projection as well as the > > > current > > > dictionary approach). > > > > > > The input to all these vectorization jobs is the StructuredDocumentWritable > format > > which you and Drew will work on (Avro based) > > > > To create the StructuredDocumentWritable format we have to write > MapReduces > > which will convert > > a) SequenceFile => SingleField token array using Analyzer > > I am going with simple Document > > => StructuredDocumentWritable (encapsulating StringTuple) in M1. > > Change it to StructuredDocumentWritable(...) in M2 > > b) Lucene Repo => StructuredDocumentWritable M2 > > c) Structured XML => StructuredDocumentWritable M2 > > d) Other Formats/DataSources(RDBMS) => StructuredDocumentWritable > > M3 > > > > Jobs using StructuredDocumentWritable > > a) DictionaryVectorizer -> Makes VectorWritable M1 > > b) nGram Generator -> Makes ngrams -> > > 1) Appends to the dictionary -> Creates Partial Vectors -> > Merges > > with vectors from Dictionary Vectorizer to create ngram based vectors > > M1 > > 2) Appends to other vectorizers(random indexing, stochastic) > M1? > > or M2 > > c) Random Indexing Job -> Makes VectorWritable M1? or M2 > > d) StochasticProjection Job -> Makes VectorWritable M1? or M2 > > > > > > How does this sound?
Feel free to edit/reorder them > > > > > > A second step is to be able to store and represent more general documents > > > similar to what is possible with Lucene. This is critically important > > for > > > some of the things that I want to do where I need to store and > segregate > > > title, publisher, authors, abstracts and body text (and many other > > > characteristics ... we probably have >100 of them). It is also > > critically > > > important if we want to embrace the dualism between recommendation and > > > search. Representing documents can be done without discarding the > > simpler > > > approach we have now and it can be done in advance of good > vectorization > > of > > > these complex documents. > > > > > > A third step is to define advanced vectorization for complex documents. > > As > > > an interim step, we can simply vectorize using the dictionary and > > > alternative vectorizers that we have now, but applied to a single field > > of > > > the document. Shortly, though, we should be able to define cross > > > occurrence > > > features for a multi-field vectorization. > > > > > > The only dependencies here are that the third step depends on the first > > and > > > second. > > > > > > You have been working on the Dictionary vectorizer. I did a bit of > work > > on > > > stochastic projection with some cooccurrence. > > > > In parallel Drew and I have been working on building an Avro document schema.
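To make the pipeline above concrete, here is a rough sketch of how the four entry points could be chained from a driver. The method names and signatures below are illustrative guesses, not the committed API; check the patch on the issue for the real ones.

    // Hypothetical driver chaining the four entry points named above.
    // All method names and parameters are assumptions for illustration.
    Path input = new Path("reuters-seqfiles");       // SequenceFile of (docid => text)
    Path tokenized = new Path("reuters-tokenized");  // StringTuple per document
    Path tfVectors = new Path("reuters-tf");
    Path tfidfVectors = new Path("reuters-tfidf");

    // SequenceFile => StringTuple (later StructuredDocumentWritable)
    DocumentProcessor.tokenizeDocuments(input, StandardAnalyzer.class, tokenized);
    // StringTuple documents => tf vectors; PartialVectorMerger merges the
    // partial vectors by doc id internally, with no normalizing at this stage
    DictionaryVectorizer.createTermFrequencyVectors(tokenized, tfVectors);
    // tf => tfidf, with optional normalizing (the --norm 2 case above)
    TfidfConverter.processTfIdf(tfVectors, tfidfVectors, 2.0f);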
Re: Mahout 0.3 Plan and other changes
Drew has an early code drop that should be posted shortly. He has a generic AvroWritable that can serialize anything with an appropriate schema. That changes your names and philosophy a bit. Regarding n-grams, I think that will be best combined with a non-dictionary based vectorizer because of the large implied vocabulary that would otherwise result. Also, in many cases vectorization and n-gram generation is best done in the learning algorithm itself to avoid moving massive amounts of data. As such, vectorization will probably need to be a library rather than a map-reduce program. On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil wrote: > Let's break it down into milestones. See if you agree on the following (even > ClassNames?) > > On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning > wrote: > > > These are good questions. I see the best course as answering these kinds > > of > > questions in phases. > > > > First, the only thing that is working right now is the current text => > > vector stuff. We should continue to refine this with alternative forms > of > > vectorization (random indexing, stochastic projection as well as the > > current > > dictionary approach). > > > > The input to all these vectorization jobs is the StructuredDocumentWritable format > which you and Drew will work on (Avro based) > > To create the StructuredDocumentWritable format we have to write MapReduces > which will convert > a) SequenceFile => SingleField token array using Analyzer > I am going with simple Document > => StructuredDocumentWritable (encapsulating StringTuple) in M1. > Change it to StructuredDocumentWritable(...) in M2 > b) Lucene Repo => StructuredDocumentWritable M2 > c) Structured XML => StructuredDocumentWritable M2 > d) Other Formats/DataSources(RDBMS) => StructuredDocumentWritable > M3 > > Jobs using StructuredDocumentWritable > a) DictionaryVectorizer -> Makes VectorWritable M1 > b) nGram Generator -> Makes ngrams -> > 1) Appends to the dictionary -> Creates Partial Vectors -> Merges > with vectors from Dictionary Vectorizer to create ngram based vectors > M1 > 2) Appends to other vectorizers(random indexing, stochastic) M1? > or M2 > c) Random Indexing Job -> Makes VectorWritable M1? or M2 > d) StochasticProjection Job -> Makes VectorWritable M1? or M2 > > > How does this sound? Feel free to edit/reorder them > > > > A second step is to be able to store and represent more general documents > > similar to what is possible with Lucene. This is critically important > for > > some of the things that I want to do where I need to store and segregate > > title, publisher, authors, abstracts and body text (and many other > > characteristics ... we probably have >100 of them). It is also > critically > > important if we want to embrace the dualism between recommendation and > > search. Representing documents can be done without discarding the > simpler > > approach we have now and it can be done in advance of good vectorization > of > > these complex documents. > > > > A third step is to define advanced vectorization for complex documents. > As > > an interim step, we can simply vectorize using the dictionary and > > alternative vectorizers that we have now, but applied to a single field > of > > the document. Shortly, though, we should be able to define cross > > occurrence > > features for a multi-field vectorization. > > > > The only dependencies here are that the third step depends on the first > and > > second. > > > > You have been working on the Dictionary vectorizer. 
I did a bit of work > on > > stochastic projection with some cooccurrence. > > > > In parallel Drew and I have been working on building an Avro document > > schema. This is driving forward on step 2. I think that will actually > > bear > > some fruit quickly. Once that is done, we should merge capabilities. I > am > > hoping that the good momentum that you have established on (1) will mean > > that merging your vectorization with the complex documents will be > > relatively easy. > > > > Is that a workable idea? > > > > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil > wrote: > > > > > And how does it > > > work with our sequence file format (string docid => string document). > All > > we > > > have is text => text? > > > and finally it's all vectors. How does the same word in two different fields > > > translate into vector? > > > > > > if you have a clear plan let's do it or let's do the first version with > > just > > > > > > document -> analyzer -> token array -> vector > > > |-> ngram -> > vector > > > > > > > > > > > -- > > Ted Dunning, CTO > > DeepDyve > > > -- > Ted Dunning, CTO > DeepDyve >
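For readers following the AvroWritable idea, here is one way such a class could be put together: wrap the Avro binary encoding of a record in a Hadoop Writable. This is a sketch only, written against a later Avro API (EncoderFactory/DecoderFactory) for illustration; Drew's actual code drop may look quite different.

    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.Decoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.hadoop.io.Writable;

    // Sketch: any record with an Avro schema becomes a Writable by
    // length-prefixing its Avro binary encoding.
    public class AvroWritable implements Writable {
      private final Schema schema;
      private GenericRecord record;

      public AvroWritable(Schema schema) { this.schema = schema; }
      public void set(GenericRecord record) { this.record = record; }
      public GenericRecord get() { return record; }

      @Override
      public void write(DataOutput out) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        byte[] bytes = baos.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
      }
    }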
Re: Mahout 0.3 Plan and other changes
Let's break it down into milestones. See if you agree on the following (even ClassNames?) On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning wrote: > These are good questions. I see the best course as answering these kinds > of > questions in phases. > > First, the only thing that is working right now is the current text => > vector stuff. We should continue to refine this with alternative forms of > vectorization (random indexing, stochastic projection as well as the > current > dictionary approach). > > The input to all these vectorization jobs is the StructuredDocumentWritable format which you and Drew will work on (Avro based) To create the StructuredDocumentWritable format we have to write MapReduces which will convert a) SequenceFile => SingleField token array using Analyzer I am going with simple Document => StructuredDocumentWritable (encapsulating StringTuple) in M1. Change it to StructuredDocumentWritable(...) in M2 b) Lucene Repo => StructuredDocumentWritable M2 c) Structured XML => StructuredDocumentWritable M2 d) Other Formats/DataSources(RDBMS) => StructuredDocumentWritable M3 Jobs using StructuredDocumentWritable a) DictionaryVectorizer -> Makes VectorWritable M1 b) nGram Generator -> Makes ngrams -> 1) Appends to the dictionary -> Creates Partial Vectors -> Merges with vectors from Dictionary Vectorizer to create ngram based vectors M1 2) Appends to other vectorizers(random indexing, stochastic) M1? or M2 c) Random Indexing Job -> Makes VectorWritable M1? or M2 d) StochasticProjection Job -> Makes VectorWritable M1? or M2 How does this sound? Feel free to edit/reorder them A second step is to be able to store and represent more general documents > similar to what is possible with Lucene. This is critically important for > some of the things that I want to do where I need to store and segregate > title, publisher, authors, abstracts and body text (and many other > characteristics ... we probably have >100 of them). It is also critically > important if we want to embrace the dualism between recommendation and > search. Representing documents can be done without discarding the simpler > approach we have now and it can be done in advance of good vectorization of > these complex documents. > > A third step is to define advanced vectorization for complex documents. As > an interim step, we can simply vectorize using the dictionary and > alternative vectorizers that we have now, but applied to a single field of > the document. Shortly, though, we should be able to define cross > occurrence > features for a multi-field vectorization. > > The only dependencies here are that the third step depends on the first and > second. > > You have been working on the Dictionary vectorizer. I did a bit of work on > stochastic projection with some cooccurrence. > > In parallel Drew and I have been working on building an Avro document > schema. This is driving forward on step 2. I think that will actually > bear > some fruit quickly. Once that is done, we should merge capabilities. I am > hoping that the good momentum that you have established on (1) will mean > that merging your vectorization with the complex documents will be > relatively easy. > > Is that a workable idea? > > On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil wrote: > > > And how does it > > work with our sequence file format (string docid => string document). All > we > > have is text => text? > > and finally it's all vectors. How does the same word in two different fields > > translate into vector?
> > > > if you have a clear plan let's do it or let's do the first version with > just > > > > document -> analyzer -> token array -> vector > > |-> ngram -> vector > > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 1:45 PM, Robin Anil wrote: > > if you have a clear plan let's do it or let's do the first version with just > > document -> analyzer -> token array -> vector > |-> ngram -> vector > Ted summed it up perfectly. I think this is great until we get further along with the document work. > > Let's not have overlapping ids, otherwise it becomes a pain to merge. Have > unique ids in the sequence file, and a file with the last id used? > Ok, I will read the partial vector/dictionary code to get my head around this.
Re: Mahout 0.3 Plan and other changes
> One important question in my mind here is how does this affect 0.20-based > jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and > deneche is also maintaining two versions, it seems. I will check the > AbstractJob and see although I maintain two versions of Decision Forests, one with the old API and one with the new one, the differences between the two APIs are so significant that I can't just keep working on the two versions. Thus all the new stuff is being committed using the new API and as far as I can tell it seems to work great. On Thu, Feb 4, 2010 at 4:48 PM, Robin Anil wrote: > On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen wrote: >> On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil wrote: >> > 3rd Thing: >> > I am planning to convert the launcher code to use ToolRunner. >> Anyone >> > volunteer to help me with that? >> >> I had wished to begin standardizing how we write these jobs, yes. >> >> If you see AbstractJob, you'll see how I've unified my three jobs and >> how I'm trying to structure them. It implements Tool and runs via ToolRunner, so all that >> is already taken care of. >> >> I think some standardization is really useful, to solve problems like >> this and others, and I'll offer this as a 'draft' for further work. No >> real point in continuing to solve these things individually. > One important question in my mind here is how does this affect 0.20-based > jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and > deneche is also maintaining two versions, it seems. I will check the > AbstractJob and see > > >> > 5th Thing, the release: >> > Fix a date for the 0.3 release? We should look to improve quality in this >> > release, i.e. in terms of running the parts of the code each of us haven't >> > tested (like I have run bayes and fp growth many a time, so I will focus >> on >> > running clustering algorithms and try out various options to see if there is >> > any issue) and provide feedback so that the one who wrote it can help tweak >> it? >> >> Maybe, maybe not. There are always 100 things that could be worked on, >> and that will never change -- it'll never be 'done'. The question of a >> release, at this point, is more like, has enough time elapsed / has >> enough progress been made to warrant a new point release? I think we >> are at that point now. >> >> The question is not what big things can we do -- 'big' is for 0.4 or >> beyond now -- but what small wins can we get in, or what small changes >> are necessary to tie up loose ends to make a roughly coherent release. >> In that sense, no, I'm not sure I'd say things like what you describe >> should be in for 0.3. I mean we could, but then it's months away, and >> isn't that just what we call "0.4"? >> >> Everyone's had a week or two to move towards 0.3 so I believe it's >> time to begin pushing on these issues, closing them / resolving them / >> moving to 0.4 by end of week. Then set the wheel in motion first thing >> next week, since it'll still be some time before everyone's on board. >
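For anyone who has not compared the two APIs side by side, the gap deneche describes shows up even in a trivial mapper. Both snippets below use only standard Hadoop classes (each would live in its own source file):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Old API (org.apache.hadoop.mapred): output goes through an
    // OutputCollector, progress through a Reporter, lifecycle via MapReduceBase.
    public class OldStyleMapper extends org.apache.hadoop.mapred.MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      public void map(LongWritable key, Text value,
                      org.apache.hadoop.mapred.OutputCollector<Text, IntWritable> output,
                      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
        output.collect(value, new IntWritable(1));
      }
    }

    // New API (org.apache.hadoop.mapreduce): everything flows through a single
    // Context, and setup/cleanup are overridable on the Mapper base class itself.
    public class NewStyleMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, new IntWritable(1));
      }
    }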
Re: Mahout 0.3 Plan and other changes
These are good questions. I see the best course as answering these kinds of questions in phases. First, the only thing that is working right now is the current text => vector stuff. We should continue to refine this with alternative forms of vectorization (random indexing, stochastic projection as well as the current dictionary approach). A second step is to be able to store and represent more general documents similar to what is possible with Lucene. This is critically important for some of the things that I want to do where I need to store and segregate title, publisher, authors, abstracts and body text (and many other characteristics ... we probably have >100 of them). It is also critically important if we want to embrace the dualism between recommendation and search. Representing documents can be done without discarding the simpler approach we have now and it can be done in advance of good vectorization of these complex documents. A third step is to define advanced vectorization for complex documents. As an interim step, we can simply vectorize using the dictionary and alternative vectorizers that we have now, but applied to a single field of the document. Shortly, though, we should be able to define cross occurrence features for a multi-field vectorization. The only dependencies here are that the third step depends on the first and second. You have been working on the Dictionary vectorizer. I did a bit of work on stochastic projection with some cooccurrence. In parallel Drew and I have been working on building an Avro document schema. This is driving forward on step 2. I think that will actually bear some fruit quickly. Once that is done, we should merge capabilities. I am hoping that the good momentum that you have established on (1) will mean that merging your vectorization with the complex documents will be relatively easy. Is that a workable idea? On Thu, Feb 4, 2010 at 10:45 AM, Robin Anil wrote: > And how does it > work with our sequence file format (string docid => string document). All we > have is text => text? > and finally it's all vectors. How does the same word in two different fields > translate into vector? > > if you have a clear plan let's do it or let's do the first version with just > > document -> analyzer -> token array -> vector > |-> ngram -> vector > -- Ted Dunning, CTO DeepDyve
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 10:29 PM, Drew Farris wrote: > On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil wrote: > > >> > >> Document Directory -> Document Sequence File > >> Document Sequence File -> Document Token Streams > >> Document Token Streams -> Document Vectors + Dictionary > >> > > Ok, I will work on this job. > > FWIW, Ted had proposed something on the order of allowing Documents to > have multiple named Fields, where each field has an independent token > stream. Likewise, Document sequence files could have multiple fields > per Document where each field is a string. What do you think about > something like this? The documents I work with day to day in > production are more frequently field structured than flat and in some > cases fields are tokenized while others are simply untouched. > > Tell me what the schema should be: List<...>? And how does it work with our sequence file format (string docid => string document). All we have is text => text? And finally it's all vectors. How does the same word in two different fields translate into vector? if you have a clear plan let's do it or let's do the first version with just document -> analyzer -> token array -> vector |-> ngram -> vector > Also, the PartialVectorMerger could be reused by colloc when creating ngram > > only vectors. But we need to keep adding to the dictionary file. If you > can > > work on a dictionary merger + chunker, it will be great. I think we can > do > > this integration quickly > > I'll take a closer look at the Dictionary code you've produced and see > what I can come up with -- is the basic idea here to take multiple > dictionaries with potentially overlapping IDs and merge them into a > single dictionary? What needs to happen with regards to chunking? Let's not have overlapping ids, otherwise it becomes a pain to merge. Have unique ids in the sequence file, and a file with the last id used? > Drew >
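A minimal sketch of the scheme Robin suggests: discard the per-chunk ids while merging and assign fresh sequential ids, then persist the last id used. The (Text term => IntWritable id) SequenceFile layout is an assumption here, and this version reads everything through one in-memory map rather than chunking:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Merge several dictionary chunks into one dictionary with no overlapping
    // ids; returns the number of ids handed out so the caller can record the
    // "last id used" for the next merge.
    public final class DictionaryMerger {
      private DictionaryMerger() { }

      public static int merge(Configuration conf, Path[] chunks, Path merged)
          throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Map<String, Integer> dict = new LinkedHashMap<String, Integer>();
        int nextId = 0;
        Text term = new Text();
        IntWritable oldId = new IntWritable();
        for (Path chunk : chunks) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk, conf);
          while (reader.next(term, oldId)) {  // old, possibly overlapping ids are dropped
            if (!dict.containsKey(term.toString())) {
              dict.put(term.toString(), nextId++);
            }
          }
          reader.close();
        }
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, merged, Text.class, IntWritable.class);
        for (Map.Entry<String, Integer> e : dict.entrySet()) {
          writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
        writer.close();
        return nextId;
      }
    }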
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 9:15 AM, Drew Farris wrote: > > > Ok, this makes sense to me. I think running on Amazon EMR is a good > goal, not to mention I'm sure there are people out there with > installations running on pre-0.20 Hadoop too. As much as I hate to > see the deprecation warnings all of the time your reasoning behind > sticking with the old APIs sounds solid. > > Amazon EMR and people's old installations are why I lean toward "-1" for moving wholly toward 0.20 standardization just yet. I *strongly* prefer the new API myself, and there are also bugs (perf and func) fixed in 0.20, but I would hate to basically cut off support for these Hadoop installations until people and Amazon upgrade. -jake
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 5:15 PM, Drew Farris wrote: > Which jobs specifically? It would be great to use these for reference. All of the recommender-related ones. Try org.apache.mahout.cf.taste.hadoop.item.RecommenderJob. Use of TextInputFormat triggers the problem. It's possible I'm misusing the new APIs, but I kind of doubt it, and there are no examples for them yet anyhow that I've seen that tell me to do something else.
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 12:10 PM, Sean Owen wrote: > > ... so therefore what's the actual use in upgrading yet. I also > figured we'd spend some time consolidating our own approach to Hadoop > -- I've refactored my 3 jobs into one approach -- making the eventual > transition simpler. > Which jobs specifically? It would be great to use these for reference. > > No harm in having new-API code alongside the old-API code, but I still > suggest we should stick on the old APIs. Ok, this makes sense to me. I think running on Amazon EMR is a good goal, not to mention I'm sure there are people out there with installations running on pre-0.20 Hadoop too. As much as I hate to see the deprecation warnings all of the time your reasoning behind sticking with the old APIs sounds solid. Drew
Re: Mahout 0.3 Plan and other changes
It's this crazy thing where the new APIs call into old APIs and checks fail as a result -- for example, try setting an InputFormat class that implements the 'new' InputFormat. Somewhere in the code it checks to see if you're implementing the *old* InputFormat. It may so happen that only my jobs hit this. I don't see it fixed in any branch yet. Actually I failed to mention what I think is a far bigger reason not to move to the new API just yet -- it won't run on Amazon Elastic MapReduce. I suppose the thinking is that the old APIs
- work with stuff like Amazon
- work with Hadoop's latest release
- work -- don't have a bug that's stopping us
... so therefore what's the actual use in upgrading yet. I also figured we'd spend some time consolidating our own approach to Hadoop -- I've refactored my 3 jobs into one approach -- making the eventual transition simpler. And so I stopped thinking about it. No harm in having new-API code alongside the old-API code, but I still suggest we should stick on the old APIs. On Thu, Feb 4, 2010 at 5:01 PM, Drew Farris wrote: > Sean, > > What sort of problems have you run into? Are there Hadoop JIRA issues > open for them? > > It would be nice to commit to the 0.20.x API in Mahout, but I agree, > not very nice if we back the users into a corner wrt what they can and > can't do due to bugs in Hadoop. > > Drew > > On Thu, Feb 4, 2010 at 10:57 AM, Sean Owen wrote: >> Yeah I'm still on the old API because of problems in Hadoop. I'm still >> hoping they get fixed in 0.20.x. We may need two-track support for a >> while. >> >> On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil wrote: >>> One important question in my mind here is how does this affect 0.20-based >>> jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and >>> deneche is also maintaining two versions, it seems. I will check the >>> AbstractJob and see >>> >> >
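For reference, a fragment showing the shape of the failure Sean describes, assuming a plain new-API job setup of the kind the recommender jobs use (where exactly the old-API check fires inside 0.20.x may differ):

    // Fragment (new-API classes: org.apache.hadoop.mapreduce.*). On the
    // affected 0.20.x releases, submission can fail because part of the
    // framework still validates against the old org.apache.hadoop.mapred.InputFormat.
    Job job = new Job(new Configuration(), "repro");
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(
        job, new org.apache.hadoop.fs.Path("in"));
    job.waitForCompletion(true);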
Re: Mahout 0.3 Plan and other changes
Sean, What sort of problems have you run into? Are there Hadoop JIRA issues open for them? It would be nice to commit to the 0.20.x API in Mahout, but I agree, not very nice if we back the users into a corner wrt what they can and can't do due to bugs in Hadoop. Drew On Thu, Feb 4, 2010 at 10:57 AM, Sean Owen wrote: > Yeah I'm still on the old API because of problems in Hadoop. I'm still > hoping they get fixed in 0.20.x. We may need two-track support for a > while. > > On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil wrote: >> One important question in my mind here is how does this affect 0.20-based >> jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and >> deneche is also maintaining two versions, it seems. I will check the >> AbstractJob and see >> >> >
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 10:51 AM, Robin Anil wrote: >> >> Document Directory -> Document Sequence File >> Document Sequence File -> Document Token Streams >> Document Token Streams -> Document Vectors + Dictionary >> > Ok, I will work on this job. FWIW, Ted had proposed something on the order of allowing Documents to have multiple named Fields, where each field has an independent token stream. Likewise, Document sequence files could have multiple fields per Document where each field is a string. What do you think about something like this? The documents I work with day to day in production are more frequently field structured than flat, and in some cases fields are tokenized while others are simply untouched. > Also, the PartialVectorMerger could be reused by colloc when creating ngram > only vectors. But we need to keep adding to the dictionary file. If you can > work on a dictionary merger + chunker, it will be great. I think we can do > this integration quickly I'll take a closer look at the Dictionary code you've produced and see what I can come up with -- is the basic idea here to take multiple dictionaries with potentially overlapping IDs and merge them into a single dictionary? What needs to happen with regards to chunking? Drew
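To make the proposal concrete, the shape under discussion is roughly the following. Every name here is invented for illustration; the real Avro schema Drew and Ted are building will differ:

    import java.util.List;
    import java.util.Map;

    // A document is a set of named fields; a field is either left raw
    // (untokenized) or carries its own independent token stream.
    public class StructuredDocument {
      private String docId;
      private Map<String, Field> fields;  // e.g. "title", "author", "body"

      public static class Field {
        private String rawText;       // untouched fields keep just the string
        private List<String> tokens;  // tokenized fields keep their tokens
      }
    }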
Re: Mahout 0.3 Plan and other changes
Yeah I'm still on the old API because of problems in Hadoop. I'm still hoping they get fixed in 0.20.x. We may need two-track support for a while. On Thu, Feb 4, 2010 at 3:48 PM, Robin Anil wrote: > One important question in my mind here is how does this affect 0.20-based > jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and > deneche is also maintaining two versions, it seems. I will check the > AbstractJob and see > >
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 8:13 PM, Drew Farris wrote: > On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil wrote: > > > Since I was converting vectorization into sequence files, I was going to > > change the lucene Driver to write the dictionary to a sequence file instead of a > tab > > separated text file. Also I will change the cluster dumper to read the > > dictionary from the sequence file. > > Sounds good. > > > Iterator interface where SequenceFile reader/writer > is > > one implementation, tab-separated file reader/writer is another > > I like this, but also how about providing a utility to go from > tab-delimited dict format to SequenceFile format. This way there's a > migration path for old datasets. > > > 2nd Thing: > > Lucene seems too slow for querying during dictionary vectorization: > > 1 x 6 hours m/r as opposed to 2 x 1.5 hours on wikipedia, i.e. a double read > of > > the wikipedia dump with a hashmap is faster than a single read using a lucene > > index > > Is the double read approach described in one of the previous threads > discussing this issue? Just curious how it works.. > > > 3rd Thing: > > I am planning to convert the launcher code to use ToolRunner. > Anyone > > volunteer to help me with that? > > Sure, I can help out. What classes need to be updated? I've patched > the clustering code in the past, that's probably a natural start. > Sean, I'll take a look at AbstractJob and what would be involved in > re-using it in the Clustering code. > > With ToolRunner, we get GenericOptionsParser for free, and the > launcher classes must implement Tool and Configurable, right? > ToolRunner is specific to the 0.20 API, isn't it? > > I did notice that Eclipse was complaining about GenericOptionsParser > last night because commons-cli 1.x wasn't available. I had to remove > its exclusion in the parent pom to get things to work properly. Anyone > else run into this, or is this something funky in my environment? > > > 4th Thing: > > Any thoughts on how we can integrate the output of the n-gram map/reduce to generate > > vectors from the dataset > > So are you speaking of n-grams in general, or the output of the colloc > work? I suppose I should wrap up the process of writing the top > collocations to a file which can be read into a bloom filter which can > be integrated into the phase of the document vectorization process that > performs tokenization. The document vectorization code could use the > shingle filter to produce ngrams and emit those that passed the bloom > filter. > > There's some feedback I'm looking for on MAHOUT-242 related to this, > that would be helpful, questions about the best way to produce the set > of top collocations. > > Robin, have you considered adding a step to the document vectorization > process that would produce output that's a token stream instead of a > vector? > > Instead of:
> Document Directory -> Document Sequence File
> Document Sequence File -> Document Vectors + Dictionary
>
> Document Directory -> Document Sequence File
> Document Sequence File -> Document Token Streams
> Document Token Streams -> Document Vectors + Dictionary
Ok, I will work on this job. Also, the PartialVectorMerger could be reused by colloc when creating ngram-only vectors. But we need to keep adding to the dictionary file. If you can work on a dictionary merger + chunker, it will be great. 
I think we can do this integration quickly. > This way, something like the colloc/n-gram process would read the > output of the second pass (Document Token Streams file) instead of > having to re-tokenize everything simply to obtain token streams. > > > 5th Thing, the release: > > Fix a date for the 0.3 release? We should look to improve quality in this > > release, i.e. in terms of running the parts of the code each of us haven't > > tested (like I have run bayes and fp growth many a time, so I will focus > on > > running clustering algorithms and try out various options to see if there is > > any issue) and provide feedback so that the one who wrote it can help tweak > it? > > It is probably time to resurrect Sean's thread from last week and see > how we stand on the issues listed there. > > Drew >
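A sketch of what emitting that second-pass "Document Token Streams" file could look like for one document, using the Lucene analysis API of the day and Mahout's StringTuple; the (docid => token list) SequenceFile layout is an assumption:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.mahout.common.StringTuple;

    // Tokenize one document once and append (docId => token list) to a
    // SequenceFile, so the vectorizer and the colloc/n-gram jobs can both
    // read tokens without re-running the Analyzer.
    void writeTokenStream(Analyzer analyzer, String docId, String text,
                          SequenceFile.Writer writer) throws IOException {
      TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
      TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
      StringTuple tokens = new StringTuple();
      while (stream.incrementToken()) {
        tokens.add(termAtt.term());
      }
      writer.append(new Text(docId), tokens);
    }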
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 7:28 PM, Sean Owen wrote: > On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil wrote: > > 3rd Thing: > > I am planning to convert the launcher code to use ToolRunner. > Anyone > > volunteer to help me with that? > > I had wished to begin standardizing how we write these jobs, yes. > > If you see AbstractJob, you'll see how I've unified my three jobs and > how I'm trying to structure them. It implements Tool and runs via ToolRunner, so all that > is already taken care of. > > I think some standardization is really useful, to solve problems like > this and others, and I'll offer this as a 'draft' for further work. No > real point in continuing to solve these things individually. One important question in my mind here is how does this affect 0.20-based jobs and pre-0.20-based jobs. I had written pfpgrowth in the pure 0.20 API, and deneche is also maintaining two versions, it seems. I will check the AbstractJob and see > > 5th Thing, the release: > > Fix a date for the 0.3 release? We should look to improve quality in this > > release, i.e. in terms of running the parts of the code each of us haven't > > tested (like I have run bayes and fp growth many a time, so I will focus > on > > running clustering algorithms and try out various options to see if there is > > any issue) and provide feedback so that the one who wrote it can help tweak > it? > > Maybe, maybe not. There are always 100 things that could be worked on, > and that will never change -- it'll never be 'done'. The question of a > release, at this point, is more like, has enough time elapsed / has > enough progress been made to warrant a new point release? I think we > are at that point now. > > The question is not what big things can we do -- 'big' is for 0.4 or > beyond now -- but what small wins can we get in, or what small changes > are necessary to tie up loose ends to make a roughly coherent release. > In that sense, no, I'm not sure I'd say things like what you describe > should be in for 0.3. I mean we could, but then it's months away, and > isn't that just what we call "0.4"? > > Everyone's had a week or two to move towards 0.3 so I believe it's > time to begin pushing on these issues, closing them / resolving them / > moving to 0.4 by end of week. Then set the wheel in motion first thing > next week, since it'll still be some time before everyone's on board. >
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 7:28 AM, Robin Anil wrote: > Since I was converting vectorization into sequence files, I was going to > change the lucene Driver to write the dictionary to a sequence file instead of a tab > separated text file. Also I will change the cluster dumper to read the > dictionary from the sequence file. Sounds good. > Iterator interface where SequenceFile reader/writer is > one implementation, tab-separated file reader/writer is another I like this, but also how about providing a utility to go from tab-delimited dict format to SequenceFile format. This way there's a migration path for old datasets. > 2nd Thing: > Lucene seems too slow for querying during dictionary vectorization: > 1 x 6 hours m/r as opposed to 2 x 1.5 hours on wikipedia, i.e. a double read of > the wikipedia dump with a hashmap is faster than a single read using a lucene > index Is the double read approach described in one of the previous threads discussing this issue? Just curious how it works.. > 3rd Thing: > I am planning to convert the launcher code to use ToolRunner. Anyone > volunteer to help me with that? Sure, I can help out. What classes need to be updated? I've patched the clustering code in the past, that's probably a natural start. Sean, I'll take a look at AbstractJob and what would be involved in re-using it in the Clustering code. With ToolRunner, we get GenericOptionsParser for free, and the launcher classes must implement Tool and Configurable, right? ToolRunner is specific to the 0.20 API, isn't it? I did notice that Eclipse was complaining about GenericOptionsParser last night because commons-cli 1.x wasn't available. I had to remove its exclusion in the parent pom to get things to work properly. Anyone else run into this, or is this something funky in my environment? > 4th Thing: > Any thoughts on how we can integrate the output of the n-gram map/reduce to generate > vectors from the dataset So are you speaking of n-grams in general, or the output of the colloc work? I suppose I should wrap up the process of writing the top collocations to a file which can be read into a bloom filter which can be integrated into the phase of the document vectorization process that performs tokenization. The document vectorization code could use the shingle filter to produce ngrams and emit those that passed the bloom filter. There's some feedback I'm looking for on MAHOUT-242 related to this, that would be helpful, questions about the best way to produce the set of top collocations. Robin, have you considered adding a step to the document vectorization process that would produce output that's a token stream instead of a vector? Instead of:
Document Directory -> Document Sequence File
Document Sequence File -> Document Vectors + Dictionary

Document Directory -> Document Sequence File
Document Sequence File -> Document Token Streams
Document Token Streams -> Document Vectors + Dictionary
This way, something like the colloc/n-gram process would read the output of the second pass (Document Token Streams file) instead of having to re-tokenize everything simply to obtain token streams. > 5th Thing, the release: > Fix a date for the 0.3 release? We should look to improve quality in this > release, i.e. in terms of running the parts of the code each of us haven't > tested (like I have run bayes and fp growth many a time, so I will focus on > running clustering algorithms and try out various options to see if there is > any issue) and provide feedback so that the one who wrote it can help tweak it? 
It is probably time to resurrect Sean's thread from last week and see how we stand on the issues listed there. Drew
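The migration utility suggested above could be as small as this, assuming the tab-delimited format is one 'term<TAB>id' pair per line and the SequenceFile layout is (Text term => IntWritable id):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Usage: DictToSequenceFile <tab-separated dict> <output SequenceFile>
    // Converts an old tab-separated dictionary so existing datasets keep
    // working after the dictionary moves to SequenceFile format.
    public final class DictToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path(args[1]), Text.class, IntWritable.class);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t");
          writer.append(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
        }
        in.close();
        writer.close();
      }
    }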
Re: Mahout 0.3 Plan and other changes
On Thu, Feb 4, 2010 at 12:28 PM, Robin Anil wrote: > 3rd Thing: > I am planning to convert the launcher code to use ToolRunner. Anyone > volunteer to help me with that? I had wished to begin standardizing how we write these jobs, yes. If you see AbstractJob, you'll see how I've unified my three jobs and how I'm trying to structure them. It implements Tool and runs via ToolRunner, so all that is already taken care of. I think some standardization is really useful, to solve problems like this and others, and I'll offer this as a 'draft' for further work. No real point in continuing to solve these things individually. > 5th Thing, the release: > Fix a date for the 0.3 release? We should look to improve quality in this > release, i.e. in terms of running the parts of the code each of us haven't > tested (like I have run bayes and fp growth many a time, so I will focus on > running clustering algorithms and try out various options to see if there is > any issue) and provide feedback so that the one who wrote it can help tweak it? Maybe, maybe not. There are always 100 things that could be worked on, and that will never change -- it'll never be 'done'. The question of a release, at this point, is more like, has enough time elapsed / has enough progress been made to warrant a new point release? I think we are at that point now. The question is not what big things can we do -- 'big' is for 0.4 or beyond now -- but what small wins can we get in, or what small changes are necessary to tie up loose ends to make a roughly coherent release. In that sense, no, I'm not sure I'd say things like what you describe should be in for 0.3. I mean we could, but then it's months away, and isn't that just what we call "0.4"? Everyone's had a week or two to move towards 0.3 so I believe it's time to begin pushing on these issues, closing them / resolving them / moving to 0.4 by end of week. Then set the wheel in motion first thing next week, since it'll still be some time before everyone's on board.
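For anyone picking up the ToolRunner conversion: Tool and ToolRunner live in org.apache.hadoop.util and work with both the old and new job APIs, so this is not tied to 0.20. The standard pattern AbstractJob builds on is just this (AbstractJob's own option handling is more involved):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // ToolRunner feeds generic options (-D, -fs, -jt, -libjars, ...) through
    // GenericOptionsParser for free; run() only sees the job-specific args.
    public class MyDriver extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // parse the remaining args, then configure and launch the job(s),
        // picking up cluster settings from getConf()
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
      }
    }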
Mahout 0.3 Plan and other changes
1st Thing: Since I was converting vectorization into sequence files, I was going to change the lucene Driver to write the dictionary to a sequence file instead of a tab-separated text file. Also I will change the cluster dumper to read the dictionary from the sequence file. I can go about it in a few ways: stick to only the SequenceFile format for the dictionary and remove the tab-separated thing from the system, OR an Iterator interface where a SequenceFile reader/writer is one implementation and a tab-separated file reader/writer is another.

2nd Thing: Lucene seems too slow for querying during dictionary vectorization: 1 x 6 hours m/r as opposed to 2 x 1.5 hours on wikipedia, i.e. a double read of the wikipedia dump with a hashmap is faster than a single read using a lucene index.

3rd Thing: I am planning to convert the launcher code to use ToolRunner. Anyone volunteer to help me with that?

4th Thing: Any thoughts on how we can integrate the output of the n-gram map/reduce to generate vectors from the dataset?

5th Thing, the release: Fix a date for the 0.3 release? We should look to improve quality in this release, i.e. in terms of running the parts of the code each of us haven't tested (like I have run bayes and fp growth many a time, so I will focus on running clustering algorithms and try out various options to see if there is any issue) and provide feedback so that the one who wrote it can help tweak it? Maybe time the code when we run it and put it on the wiki? Can we set a sprint week when we will be doing this? Comments awaited. Robin
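For the second option under the first item, the abstraction could be as small as the following; the names are invented for illustration:

    import java.io.Closeable;
    import java.util.Iterator;

    // One (term, id) entry of a dictionary.
    public class DictionaryEntry {
      public final String term;
      public final int id;
      public DictionaryEntry(String term, int id) {
        this.term = term;
        this.id = id;
      }
    }

    // A dictionary source is just an iterator of entries plus close(); one
    // implementation reads SequenceFiles, another reads the old tab-separated
    // text files, and callers never care which.
    public interface DictionaryIterator
        extends Iterator<DictionaryEntry>, Closeable {
    }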