I haven't looked very carefully at the clustering refactoring work. Can
someone give a quick overview of what the plan is?

The NewLDA stuff is technically in "clustering". It takes SeqFile<IW,VW>
documents as the training corpus and produces two outputs: a
SeqFile<IW,VW> "model" (keyed on topicId, one vector per topic) and a
SeqFile<IW,VW> of "classifications" (keyed on docId, one vector over the
topic space, giving the document's projection onto each topic dimension).
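
To make the shapes concrete, here is a rough sketch in plain Python. It
does not use Mahout or Hadoop at all: the dicts are hypothetical stand-ins
for the SeqFile<IW,VW> records, the numbers are made up, and I'm assuming
each topic vector runs over the term space (as is usual for LDA).

```python
# Hypothetical stand-ins for SeqFile<IW,VW> records: int key -> dense vector.
# All values below are made up for illustration only.

# Input: training corpus, keyed on docId, one term-count vector per document.
corpus = {
    0: [2.0, 1.0, 0.0, 0.0],
    1: [0.0, 0.0, 3.0, 1.0],
}

# Output 1: the "model", keyed on topicId, one vector per topic
# (assumed here to be over the term space).
model = {
    0: [0.60, 0.30, 0.05, 0.05],
    1: [0.05, 0.05, 0.60, 0.30],
}

# Output 2: the "classifications", keyed on docId, one vector over topics.
classifications = {
    0: [0.9, 0.1],
    1: [0.1, 0.9],
}


def shapes_are_consistent(corpus, model, classifications):
    """Check that the three collections agree on term and topic dimensions."""
    num_terms = len(next(iter(corpus.values())))
    num_topics = len(model)
    model_ok = all(len(vec) == num_terms for vec in model.values())
    docs_ok = set(classifications) == set(corpus) and all(
        len(vec) == num_topics for vec in classifications.values()
    )
    return model_ok and docs_ok
```

The only point is the keying: the model has one row per topic, and the
classifications have one row per input document.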

This is similar to how SVD clustering/decomposition works, but with
L1-normed outputs instead of L2.
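
As a minimal illustration of the L1 vs. L2 difference (plain Python, not
Mahout code; the function names and the sample weights are mine):

```python
import math


def l1_normalize(vec):
    """Scale vec so the absolute values of its entries sum to 1."""
    total = sum(abs(x) for x in vec)
    return [x / total for x in vec] if total else list(vec)


def l2_normalize(vec):
    """Scale vec so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


# Made-up unnormalized weights of one document over two topics.
doc_topic_weights = [3.0, 1.0]

l1 = l1_normalize(doc_topic_weights)  # entries sum to 1
l2 = l2_normalize(doc_topic_weights)  # squared entries sum to 1
```

With L1 norming, the per-document vector can be read as a distribution
over topics; with L2 norming, it is a unit-length direction instead.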

But this seems very different from all of the structures in the rest of
clustering.

  -jake

On Wed, Feb 22, 2012 at 7:56 AM, Jeff Eastman <[email protected]> wrote:

> Hi Saikat,
>
> I agree with Paritosh, that a great place to begin would be to write some
> unit tests. This will familiarize you with the code base and help us a lot
> with our 0.7 housekeeping release. The new clustering classification
> components are going to unify many - but not all - of the existing
> clustering algorithms to reduce their complexity by factoring out
> duplication and streamlining their integration into semi-supervised
> classification engines.
>
> Please feel free to post any questions you may have in reading through
> this code. This is a major refactoring effort and we will need all the help
> we can get. Thanks for the offer,
>
> Jeff
>
>
> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
>
>> Hi Paritosh, Yes, creating the test case would be a great first start.
>> However, are there other tasks you guys need help with that I can do
>> before the test creation? I will sync trunk and start reading through the
>> code in the meantime. Regards
>>
>>  Date: Wed, 22 Feb 2012 10:57:51 +0530
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: Re: Helping out with the .7 release
>>>
>>> We are creating clustering-as-classification components, which will help
>>> in moving clustering out. Once the components are ready, the
>>> clustering algorithms will need refactoring.
>>> The clustering-as-classification component and the outlier removal
>>> component have been created.
>>>
>>> Most of it is committed, and the rest is available as a patch. See
>>> https://issues.apache.org/jira/browse/MAHOUT-929
>>> If you apply the latest patch available on MAHOUT-929, you can see
>>> all that is available now.
>>>
>>> If you want, you can help with the test case of
>>> ClusterClassificationMapper available in the patch.
>>>
>>> On 22-02-2012 10:27, Saikat Kanjilal wrote:
>>>
>>>> Hi Guys, I am interested in helping out with the clustering component
>>>> of Mahout. I looked through the JIRA items below and was wondering if
>>>> there is a specific one that would be good to start with:
>>>>
>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
>>>>
>>>> I was initially thinking of working on MAHOUT-930 or MAHOUT-931, but
>>>> could work on others if needed.
>>>> Best Regards
>>>>
>>>
>>
>
>
