My experience has been that it's best to leave the data processing to Python. I strongly suggest you rewrite your ETL in Python and let Mahout do only the clustering. Mahout's built-in vectorization routines are fairly primitive.
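As a rough sketch of what Python-side preprocessing could look like (not a definitive implementation): the stop word list, the stop phrases, and the function names `clean` and `tfidf` below are all made up for illustration, and the TF-IDF weighting is one common smoothed variant, not anything Mahout prescribes. It also includes the stop word/phrase cleanup I describe next. Getting the resulting vectors into the format Mahout expects is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical, domain-specific lists; build your own from your corpus.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "please", "thanks"}
STOP_PHRASES = ["kind regards", "sent from my iphone"]

TOKEN_RE = re.compile(r"[a-z]+")


def clean(text, stop_words=STOP_WORDS, stop_phrases=STOP_PHRASES):
    """Lower-case, strip stop phrases, tokenize, drop stop words and very short tokens."""
    text = text.lower()
    for phrase in stop_phrases:
        text = text.replace(phrase, " ")
    return [t for t in TOKEN_RE.findall(text)
            if t not in stop_words and len(t) > 2]


def tfidf(docs):
    """Return (vocab, vectors): vocab maps term -> index, and each vector
    is a sparse dict {term_index: tf * smoothed_idf}."""
    tokenized = [clean(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = {term: i for i, term in enumerate(sorted(df))}
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            vocab[t]: count * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, count in tf.items()
        })
    return vocab, vectors


if __name__ == "__main__":
    docs = [
        "Please restart the server. Kind regards",
        "The server crashed again, thanks",
        "Invoice payment overdue",
    ]
    vocab, vectors = tfidf(docs)
    print(vocab)
    print(vectors)
```

The point of doing it this way is that you control every step (tokenization, stop lists, weighting) and can inspect the features before anything gets clustered.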
Then I would clean up the features, basically set up your own list of stop words or phrases, before you let Mahout do anything.

On Dec 4, 2014, at 8:38 AM, Shahid Shaikh <shaikhshah...@gmail.com> wrote:

> Hey Donni, thanks, but I have used the configurations and obtained the
> clusters. The results are not promising enough. I was looking to see if there
> are any known techniques I can follow specifically while generating vectors.
>
> Thanks
>
> On Thursday, December 4, 2014, Donni Khan <prince.don...@googlemail.com>
> wrote:
>> Hi
>> it depends on the nature of the data you are clustering. If you have knowledge
>> about your data, you can figure out the results and you can also set the
>> correct parameters for the clustering algorithm, like number of topics or
>> number of clusters.
>>
>> Cheers,
>> Donni
>>
>> On Thu, Dec 4, 2014 at 2:38 PM, Shahid Shaikh <shaikhshah...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>> I have been trying Mahout clustering on unstructured data, i.e. human-
>>> written data. I have tried Mahout clustering algorithms like
>>> K-means, Canopy + K-means, and LDA, but the results produced are not helpful.
>>>
>>> I see the problem is with the way the data is written. Can someone please
>>> provide me some pointers on how to proceed with unstructured data for
>>> clustering?
>>>
>>> I have also written an analyzer that uses lower-case and stop-word filters.
>>>
>>> Thanks :)
>>>
>>> Regards,
>>> Shaikh Shahid G.
>>> +91 9503954781
>
> --
> Regards,
> Shaikh Shahid G.
> +91 9503954781