Only vw itself.

Sent from my iPhone
On Feb 22, 2012, at 9:01 PM, Jeff Eastman <[email protected]> wrote:
> Got any code that does this I could look at?
>
> On 2/22/12 9:23 PM, Ted Dunning wrote:
>> All reduce is a non-map-reduce primitive stolen from MPI. It is used, for
>> example, in vw to accumulate gradient information without additional
>> map-reduce iterations.
>>
>> The all-reduce operation works by building a tree of all tasks. A state is
>> sent up the tree from the leaves. Each internal node adds together the
>> children's states and adds in its own. At the root we have the combination
>> of all states, and that result is sent back down the tree.
>>
>> In practice all mappers iterate through their input slice and do an all
>> reduce. Then they reset their input and repeat. Commonly the root node will
>> include a termination flag to signal convergence.
>>
>> The effect is that iterations don't require spawning a new map-reduce job
>> and thus we save considerable time at each step. Indeed, if the input can
>> fit into memory, we can gain even more speed. With in-memory operation we
>> may get two orders of magnitude speed-up. With data too large to fit in
>> memory, gains will be more modest.
>>
>> Sent from my iPhone
>>
>> On Feb 22, 2012, at 4:01 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> Hey Ted,
>>>
>>> Could you elaborate on this approach? I don't grok how an "all reduce
>>> implementation" can be done with a "map-only job", or how a mapper could
>>> do "all iteration[s] internally".
>>>
>>> I've just gotten the ClusterIterator working in MR mode and it does what
>>> I thought we'd been talking about earlier: in each iteration, each mapper
>>> loads all the prior clusters and then iterates through all its input
>>> points, training each of the prior clusters in the process. Then, in the
>>> cleanup() method, all the trained clusters are sent to the reducers keyed
>>> by their model indexes. This eliminates the need for a combiner and means
>>> each reducer only has to merge n-mappers' worth of trained clusters into
>>> a posterior trained cluster before it is output. If numReducers == k then
>>> the current reduce-step overloads should disappear.
>>>
>>> The secret to this implementation is to allow clusters to observe other
>>> clusters in addition to observing vectors, thereby accumulating all of
>>> those clusters' observation statistics before recomputing posterior
>>> parameters.
>>>
>>> On 2/22/12 1:42 PM, Ted Dunning wrote:
>>>> I would also like to see if we can put an all-reduce implementation into
>>>> this effort. The idea is that we can use a map-only job that does all
>>>> iteration internally. I think that this could result in more than an
>>>> order of magnitude speed-up for our clustering codes. It could also
>>>> provide similar benefits for the nascent parallel classifier training
>>>> work.
>>>>
>>>> This seems to be a cleanup of a long-standing wart in our code, but it
>>>> is reasonable that others may feel differently.
>>>>
>>>> Sent from my iPhone
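[Editor's note: to make the all-reduce flow Ted describes above concrete, here is a
minimal, single-process sketch of the tree combine-and-broadcast step. Everything in
it (AllReduceNode, the double[] states, reduceUp/broadcastDown) is a hypothetical
illustration, not vw's or Mahout's actual API; a real map-only implementation would
ship the states between running map tasks over sockets rather than through object
references.]

import java.util.ArrayList;
import java.util.List;

// Sketch only: names are hypothetical, not taken from vw or Mahout.
class AllReduceNode {
  final List<AllReduceNode> children = new ArrayList<AllReduceNode>();
  double[] localState;   // e.g. a partial gradient computed from this task's input slice
  double[] globalState;  // the combined state pushed back down from the root

  // Phase 1: each node sums its children's states into its own and passes the
  // partial sum up the tree; the root ends up with the total over all tasks.
  double[] reduceUp() {
    double[] sum = localState.clone();
    for (AllReduceNode child : children) {
      double[] childSum = child.reduceUp();
      for (int i = 0; i < sum.length; i++) {
        sum[i] += childSum[i];
      }
    }
    return sum;
  }

  // Phase 2: the root's total is broadcast back down so every task sees the
  // same combined state before starting its next pass over the data.
  void broadcastDown(double[] total) {
    globalState = total;
    for (AllReduceNode child : children) {
      child.broadcastDown(total);
    }
  }

  // One all-reduce round: combine at the root, then push the result everywhere.
  static void allReduce(AllReduceNode root) {
    root.broadcastDown(root.reduceUp());
  }
}

[In the pattern described above, each mapper would repeat: compute its localState from
its input slice, run one all-reduce round, update its model from globalState, and stop
when a convergence flag set by the root comes back down, so no new map-reduce job is
launched per iteration.]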
>>>> On Feb 22, 2012, at 10:32 AM, Jeff Eastman <[email protected]> wrote:
>>>>
>>>>> This refactoring is focused on some of the iterative clustering
>>>>> algorithms which, in each iteration, load a prior set of clusters (e.g.
>>>>> clusters-0) and process each input vector against them to produce a
>>>>> posterior set of clusters (e.g. clusters-1) for the next iteration. This
>>>>> will result in k-Means, fuzzyK and Dirichlet being collapsed into a
>>>>> ClusterIterator iterating over a ClusterClassifier using a
>>>>> ClusteringPolicy. You can see these classes in o.a.m.clustering. They
>>>>> are a work in progress, but the in-memory, sequential-from-SequenceFiles
>>>>> and k-means MR modes work in tests and can be demonstrated in the
>>>>> DisplayXX examples which employ them.
>>>>>
>>>>> Paritosh has also been building a ClusterClassificationDriver
>>>>> (o.a.m.clustering.classify) which we want to use to factor all of the
>>>>> redundant cluster-data implementations (-cl option) out of the
>>>>> respective cluster drivers. This will affect Canopy in addition to the
>>>>> above algorithms.
>>>>>
>>>>> An imagined benefit of this refactoring comes from the fact that
>>>>> ClusterClassifier extends AbstractVectorClassifier and implements
>>>>> OnlineLearner. We think this means that a posterior set of trained
>>>>> Clusters can be used as a component classifier in a semi-supervised
>>>>> classifier implementation. I suppose we will need to demonstrate this
>>>>> before we go too much further in the refactoring, but Ted, at least,
>>>>> seems to approve of this integration approach between supervised
>>>>> classification and clustering (unsupervised classification). I don't
>>>>> think it has had a lot of other eyeballs on it.
>>>>>
>>>>> I don't think LDA fits into this subset of clustering algorithms, and
>>>>> neither do Canopy and MeanShift. As you note, it does not produce
>>>>> Clusters, but I'd be interested in your reactions to the above.
>>>>>
>>>>> Jeff
>>>>>
>>>>> On 2/22/12 9:55 AM, Jake Mannix wrote:
>>>>>> So I haven't looked super-carefully at the clustering refactoring
>>>>>> work; can someone give a little overview of what the plan is?
>>>>>>
>>>>>> The NewLDA stuff is technically in "clustering" and generally works by
>>>>>> taking in SeqFile<IW,VW> documents as the training corpus, and spits
>>>>>> out two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one
>>>>>> vector per topic) and a SeqFile<IW,VW> of "classifications" (keyed on
>>>>>> docId, one vector over the topic space for projection onto each topic
>>>>>> dimension).
>>>>>>
>>>>>> This is similar to how SVD clustering/decomposition works, but with
>>>>>> L1-normed outputs instead of L2.
>>>>>>
>>>>>> But this seems very different from all of the structures in the rest
>>>>>> of clustering.
>>>>>>
>>>>>> -jake
>>>>>>
>>>>>> On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Saikat,
>>>>>>>
>>>>>>> I agree with Paritosh that a great place to begin would be to write
>>>>>>> some unit tests. This will familiarize you with the code base and
>>>>>>> help us a lot with our 0.7 housekeeping release. The new clustering
>>>>>>> classification components are going to unify many - but not all - of
>>>>>>> the existing clustering algorithms to reduce their complexity by
>>>>>>> factoring out duplication and streamlining their integration into
>>>>>>> semi-supervised classification engines.
>>>>>>>
>>>>>>> Please feel free to post any questions you may have in reading
>>>>>>> through this code. This is a major refactoring effort and we will
>>>>>>> need all the help we can get. Thanks for the offer,
>>>>>>>
>>>>>>> Jeff
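[Editor's note: as a rough illustration of the map-side flow Jeff describes further
up (load the prior clusters once in setup(), fold every input point into in-memory
statistics in map(), and emit the trained statistics keyed by model index from
cleanup()), here is a sketch of such a mapper. It deliberately simplifies the real
design: it keeps only a running sum per cluster and assigns each point to its nearest
prior centroid, where the actual code keeps full Cluster state and lets a
ClusteringPolicy decide the assignment and weighting. ModelFiles.loadPriorCentroids()
is a hypothetical helper, not part of o.a.m.clustering.]

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterTrainingMapper
    extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private List<Vector> priorCentroids;  // the prior model, e.g. clusters-(n-1)
  private Vector[] sums;                // accumulated observation statistics per cluster

  @Override
  protected void setup(Context context) throws IOException {
    // Every mapper loads the full prior model once per iteration.
    priorCentroids = ModelFiles.loadPriorCentroids(context.getConfiguration()); // hypothetical
    sums = new Vector[priorCentroids.size()];
    for (int i = 0; i < sums.length; i++) {
      sums[i] = priorCentroids.get(i).like();  // empty vector of the same cardinality
    }
  }

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value, Context context) {
    // Nearest-centroid assignment; nothing is written per point, so no combiner is needed.
    // (The real code would also track counts, radii, etc. as part of the cluster state.)
    Vector point = value.get();
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < priorCentroids.size(); i++) {
      double d = priorCentroids.get(i).getDistanceSquared(point);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    sums[best] = sums[best].plus(point);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // One record per model index; a reducer only has to merge n-mappers' worth of
    // statistics per index to produce the posterior cluster.
    for (int i = 0; i < sums.length; i++) {
      context.write(new IntWritable(i), new VectorWritable(sums[i]));
    }
  }
}

[With numReducers == k, each reducer handles exactly one model index, which is where
the point above about the current reduce-step overloads disappearing comes from.]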
>>>>>>> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>>>>>>>
>>>>>>>> Hi Paritosh,
>>>>>>>> Yes, creating the test case would be a great first start. However,
>>>>>>>> are there other tasks you guys need help with that I could do before
>>>>>>>> the test creation? I will sync trunk and start reading through the
>>>>>>>> code in the meantime.
>>>>>>>> Regards
>>>>>>>>
>>>>>>>>> Date: Wed, 22 Feb 2012 10:57:51 +0530
>>>>>>>>> From: [email protected]
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: Re: Helping out with the .7 release
>>>>>>>>>
>>>>>>>>> We are creating clustering-as-classification components which will
>>>>>>>>> help in moving clustering out. Once the component is ready, the
>>>>>>>>> clustering algorithms will need refactoring. The
>>>>>>>>> clustering-as-classification component and the outlier removal
>>>>>>>>> component have been created.
>>>>>>>>>
>>>>>>>>> Most of it is committed, and the rest is available as a patch. See
>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-929
>>>>>>>>> If you apply the latest patch available on MAHOUT-929 you can see
>>>>>>>>> all that is available now.
>>>>>>>>>
>>>>>>>>> If you want, you can help with the test case for
>>>>>>>>> ClusterClassificationMapper, available in the patch.
>>>>>>>>>
>>>>>>>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guys,
>>>>>>>>>> I was interested in helping out with the clustering component of
>>>>>>>>>> Mahout. I looked through the JIRA items below and was wondering if
>>>>>>>>>> there is a specific one that would be good to start with:
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
>>>>>>>>>>
>>>>>>>>>> I was initially thinking of working on MAHOUT-930 or MAHOUT-931
>>>>>>>>>> but could work on others if needed.
>>>>>>>>>> Best Regards
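[Editor's note: as a side note on Jake's description of the NewLDA outputs above,
here is a small sketch of reading one part file of such a SeqFile<IntWritable,
VectorWritable> model (one term-weight vector per topicId) with plain Hadoop APIs.
The path below is hypothetical, and the L1 normalization simply mirrors Jake's point
that these decompositions are L1-normed rather than L2-normed.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TopicModelReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path modelPart = new Path("topic-model/part-r-00000"); // hypothetical location
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, modelPart, conf);
    try {
      IntWritable topicId = new IntWritable();
      VectorWritable termWeights = new VectorWritable();
      while (reader.next(topicId, termWeights)) {
        // Each value is a vector over the term space; dividing by the L1 norm
        // turns it into p(term | topic).
        Vector pTermGivenTopic = termWeights.get().normalize(1);
        System.out.println("topic " + topicId.get()
            + ": heaviest term index = " + pTermGivenTopic.maxValueIndex());
      }
    } finally {
      reader.close();
    }
  }
}

[The "classifications" output would read the same way, except the key is a docId and
the vector is a distribution over the topic space.]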
