Only vw itself.

Sent from my iPhone
On Feb 22, 2012, at 9:01 PM, Jeff Eastman <[email protected]> wrote:
> Got any code that does this I could look at?
>
> On 2/22/12 9:23 PM, Ted Dunning wrote:
>> All reduce is a non-map-reduce primitive stolen from MPI. It is used, for
>> example, in vw to accumulate gradient information without additional
>> map-reduce iterations.
>>
>> The all-reduce operation works by building a tree of all tasks. A state is
>> sent up the tree from the leaves. Each internal node adds together the
>> children's states and adds in its own. At the root we have the combination
>> of all states, and that result is sent back down the tree.
>>
>> In practice all mappers iterate through their input slice and do an all
>> reduce. Then they reset their input and repeat. Commonly the root node will
>> include a termination flag to signal convergence.
>>
>> The effect is that iterations don't require spawning a new map-reduce job
>> and thus we save considerable time at each step. Indeed, if the input can
>> fit into memory, we can gain even more speed. With in-memory operation we
>> may get two orders of magnitude speed-up. With data too large to fit in
>> memory, gains will be more modest.
>>
>> Sent from my iPhone
>>
>> On Feb 22, 2012, at 4:01 PM, Jeff Eastman <[email protected]> wrote:
>>
>>> Hey Ted,
>>>
>>> Could you elaborate on this approach? I don't grok how an "all reduce
>>> implementation" can be done with a "map-only job", or how a mapper could
>>> do "all iteration[s] internally".
>>>
>>> I've just gotten the ClusterIterator working in MR mode and it does what
>>> I thought we'd been talking about earlier: in each iteration, each mapper
>>> loads all the prior clusters and then iterates through all its input
>>> points, training each of the prior clusters in the process. Then, in the
>>> cleanup() method, all the trained clusters are sent to the reducers keyed
>>> by their model indexes. This eliminates the need for a combiner and means
>>> each reducer only has to merge n-mappers' worth of trained clusters into
>>> a posterior trained cluster before it is output. If numReducers == k then
>>> the current reduce-step overloads should disappear.
>>>
>>> The secret to this implementation is to allow clusters to observe other
>>> clusters in addition to observing vectors, thereby accumulating all of
>>> those clusters' observation statistics before recomputing posterior
>>> parameters.
>>>
>>> On 2/22/12 1:42 PM, Ted Dunning wrote:
>>>> I would also like to see if we can put an all-reduce implementation into
>>>> this effort. The idea is that we can use a map-only job that does all
>>>> iteration internally. I think that this could result in more than an
>>>> order of magnitude speed-up for our clustering codes. It could also
>>>> provide similar benefits for the nascent parallel classifier training
>>>> work.
>>>>
>>>> This seems to be a cleanup of a long-standing wart in our code, but it
>>>> is reasonable that others may feel differently.
>>>>
>>>> Sent from my iPhone
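[Editor's note: to make the all-reduce flow Ted describes above concrete, here is a
minimal, single-process sketch of the tree combine-and-broadcast step. Everything in
it (AllReduceNode, the double[] states, reduceUp/broadcastDown) is a hypothetical
illustration, not vw's or Mahout's actual API; a real map-only implementation would
ship the states between running map tasks over sockets rather than through object
references.]

import java.util.ArrayList;
import java.util.List;

// Sketch only: names are hypothetical, not taken from vw or Mahout.
class AllReduceNode {
  final List<AllReduceNode> children = new ArrayList<AllReduceNode>();
  double[] localState;   // e.g. a partial gradient computed from this task's input slice
  double[] globalState;  // the combined state pushed back down from the root

  // Phase 1: each node sums its children's states into its own and passes the
  // partial sum up the tree; the root ends up with the total over all tasks.
  double[] reduceUp() {
    double[] sum = localState.clone();
    for (AllReduceNode child : children) {
      double[] childSum = child.reduceUp();
      for (int i = 0; i < sum.length; i++) {
        sum[i] += childSum[i];
      }
    }
    return sum;
  }

  // Phase 2: the root's total is broadcast back down so every task sees the
  // same combined state before starting its next pass over the data.
  void broadcastDown(double[] total) {
    globalState = total;
    for (AllReduceNode child : children) {
      child.broadcastDown(total);
    }
  }

  // One all-reduce round: combine at the root, then push the result everywhere.
  static void allReduce(AllReduceNode root) {
    root.broadcastDown(root.reduceUp());
  }
}

[In the pattern described above, each mapper would repeat: compute its localState from
its input slice, run one all-reduce round, update its model from globalState, and stop
when a convergence flag set by the root comes back down, so no new map-reduce job is
launched per iteration.]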
>>>> On Feb 22, 2012, at 10:32 AM, Jeff Eastman <[email protected]> wrote:
>>>>
>>>>> This refactoring is focused on some of the iterative clustering
>>>>> algorithms which, in each iteration, load a prior set of clusters (e.g.
>>>>> clusters-0) and process each input vector against them to produce a
>>>>> posterior set of clusters (e.g. clusters-1) for the next iteration. This
>>>>> will result in k-Means, fuzzyK and Dirichlet being collapsed into a
>>>>> ClusterIterator iterating over a ClusterClassifier using a
>>>>> ClusteringPolicy. You can see these classes in o.a.m.clustering. They
>>>>> are a work in progress, but the in-memory, sequential-from-SequenceFiles
>>>>> and k-means MR modes work in tests and can be demonstrated in the
>>>>> DisplayXX examples which employ them.
>>>>>
>>>>> Paritosh has also been building a ClusterClassificationDriver
>>>>> (o.a.m.clustering.classify) which we want to use to factor all of the
>>>>> redundant cluster-data implementations (-cl option) out of the
>>>>> respective cluster drivers. This will affect Canopy in addition to the
>>>>> above algorithms.
>>>>>
>>>>> An imagined benefit of this refactoring comes from the fact that
>>>>> ClusterClassifier extends AbstractVectorClassifier and implements
>>>>> OnlineLearner. We think this means that a posterior set of trained
>>>>> Clusters can be used as a component classifier in a semi-supervised
>>>>> classifier implementation. I suppose we will need to demonstrate this
>>>>> before we go too much further in the refactoring, but Ted, at least,
>>>>> seems to approve of this integration approach between supervised
>>>>> classification and clustering (unsupervised classification). I don't
>>>>> think it has had a lot of other eyeballs on it.
>>>>>
>>>>> I don't think LDA fits into this subset of clustering algorithms, and
>>>>> neither do Canopy and MeanShift. As you note, it does not produce
>>>>> Clusters, but I'd be interested in your reactions to the above.
>>>>>
>>>>> Jeff
>>>>>
>>>>> On 2/22/12 9:55 AM, Jake Mannix wrote:
>>>>>> So I haven't looked super-carefully at the clustering refactoring
>>>>>> work; can someone give a little overview of what the plan is?
>>>>>>
>>>>>> The NewLDA stuff is technically in "clustering" and generally works by
>>>>>> taking in SeqFile<IW,VW> documents as the training corpus, and spits
>>>>>> out two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one
>>>>>> vector per topic) and a SeqFile<IW,VW> of "classifications" (keyed on
>>>>>> docId, one vector over the topic space for projection onto each topic
>>>>>> dimension).
>>>>>>
>>>>>> This is similar to how SVD clustering/decomposition works, but with
>>>>>> L1-normed outputs instead of L2.
>>>>>>
>>>>>> But this seems very different from all of the structures in the rest
>>>>>> of clustering.
>>>>>>
>>>>>> -jake
>>>>>>
>>>>>> On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Saikat,
>>>>>>>
>>>>>>> I agree with Paritosh that a great place to begin would be to write
>>>>>>> some unit tests. This will familiarize you with the code base and
>>>>>>> help us a lot with our 0.7 housekeeping release. The new clustering
>>>>>>> classification components are going to unify many - but not all - of
>>>>>>> the existing clustering algorithms to reduce their complexity by
>>>>>>> factoring out duplication and streamlining their integration into
>>>>>>> semi-supervised classification engines.
>>>>>>>
>>>>>>> Please feel free to post any questions you may have in reading
>>>>>>> through this code. This is a major refactoring effort and we will
>>>>>>> need all the help we can get. Thanks for the offer,
>>>>>>>
>>>>>>> Jeff
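[Editor's note: as a rough illustration of the map-side flow Jeff describes further
up (load the prior clusters once in setup(), fold every input point into in-memory
statistics in map(), and emit the trained statistics keyed by model index from
cleanup()), here is a sketch of such a mapper. It deliberately simplifies the real
design: it keeps only a running sum per cluster and assigns each point to its nearest
prior centroid, where the actual code keeps full Cluster state and lets a
ClusteringPolicy decide the assignment and weighting. ModelFiles.loadPriorCentroids()
is a hypothetical helper, not part of o.a.m.clustering.]

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ClusterTrainingMapper
    extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private List<Vector> priorCentroids;  // the prior model, e.g. clusters-(n-1)
  private Vector[] sums;                // accumulated observation statistics per cluster

  @Override
  protected void setup(Context context) throws IOException {
    // Every mapper loads the full prior model once per iteration.
    priorCentroids = ModelFiles.loadPriorCentroids(context.getConfiguration()); // hypothetical
    sums = new Vector[priorCentroids.size()];
    for (int i = 0; i < sums.length; i++) {
      sums[i] = priorCentroids.get(i).like();  // empty vector of the same cardinality
    }
  }

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value, Context context) {
    // Nearest-centroid assignment; nothing is written per point, so no combiner is needed.
    // (The real code would also track counts, radii, etc. as part of the cluster state.)
    Vector point = value.get();
    int best = 0;
    double bestDistance = Double.POSITIVE_INFINITY;
    for (int i = 0; i < priorCentroids.size(); i++) {
      double d = priorCentroids.get(i).getDistanceSquared(point);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    sums[best] = sums[best].plus(point);
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // One record per model index; a reducer only has to merge n-mappers' worth of
    // statistics per index to produce the posterior cluster.
    for (int i = 0; i < sums.length; i++) {
      context.write(new IntWritable(i), new VectorWritable(sums[i]));
    }
  }
}

[With numReducers == k, each reducer handles exactly one model index, which is where
the point above about the current reduce-step overloads disappearing comes from.]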
>>>>>>> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>>>>>>>
>>>>>>>> Hi Paritosh,
>>>>>>>> Yes, creating the test case would be a great first start. However,
>>>>>>>> are there other tasks you guys need help with that I could do before
>>>>>>>> the test creation? I will sync trunk and start reading through the
>>>>>>>> code in the meantime.
>>>>>>>> Regards
>>>>>>>>
>>>>>>>>> Date: Wed, 22 Feb 2012 10:57:51 +0530
>>>>>>>>> From: [email protected]
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: Re: Helping out with the .7 release
>>>>>>>>>
>>>>>>>>> We are creating clustering-as-classification components which will
>>>>>>>>> help in moving clustering out. Once the component is ready, the
>>>>>>>>> clustering algorithms will need refactoring. The
>>>>>>>>> clustering-as-classification component and the outlier removal
>>>>>>>>> component have been created.
>>>>>>>>>
>>>>>>>>> Most of it is committed, and the rest is available as a patch. See
>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-929
>>>>>>>>> If you apply the latest patch available on MAHOUT-929 you can see
>>>>>>>>> all that is available now.
>>>>>>>>>
>>>>>>>>> If you want, you can help with the test case for
>>>>>>>>> ClusterClassificationMapper, available in the patch.
>>>>>>>>>
>>>>>>>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guys,
>>>>>>>>>> I was interested in helping out with the clustering component of
>>>>>>>>>> Mahout. I looked through the JIRA items below and was wondering if
>>>>>>>>>> there is a specific one that would be good to start with:
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
>>>>>>>>>>
>>>>>>>>>> I was initially thinking of working on MAHOUT-930 or MAHOUT-931
>>>>>>>>>> but could work on others if needed.
>>>>>>>>>> Best Regards
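[Editor's note: as a side note on Jake's description of the NewLDA outputs above,
here is a small sketch of reading one part file of such a SeqFile<IntWritable,
VectorWritable> model (one term-weight vector per topicId) with plain Hadoop APIs.
The path below is hypothetical, and the L1 normalization simply mirrors Jake's point
that these decompositions are L1-normed rather than L2-normed.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TopicModelReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path modelPart = new Path("topic-model/part-r-00000"); // hypothetical location
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, modelPart, conf);
    try {
      IntWritable topicId = new IntWritable();
      VectorWritable termWeights = new VectorWritable();
      while (reader.next(topicId, termWeights)) {
        // Each value is a vector over the term space; dividing by the L1 norm
        // turns it into p(term | topic).
        Vector pTermGivenTopic = termWeights.get().normalize(1);
        System.out.println("topic " + topicId.get()
            + ": heaviest term index = " + pTermGivenTopic.maxValueIndex());
      }
    } finally {
      reader.close();
    }
  }
}

[The "classifications" output would read the same way, except the key is a docId and
the vector is a distribution over the topic space.]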
