In my experience, it's best to leave the data processing to Python.
I strongly suggest you rewrite your ETL and let Mahout do only the clustering.
The built-in vectorization routines are fairly primitive.
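
As a minimal sketch of what vectorizing outside Mahout could look like, here is
a plain TF-IDF weighting using only the Python standard library. The function
names (tokenize, tfidf_vectors) and the sample documents are made up for
illustration, and the weighting (raw count times log(N/df)) is one common
variant, not Mahout's own scheme. Getting the resulting vectors into whatever
input format your Mahout job reads is not shown here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocab, vectors).

    vocab maps each term to a column index; each vector is a sparse dict
    {column_index: weight} with weight = term count * log(N / document freq).
    Terms that occur in every document get weight 0 and are left out.
    """
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = {term: i for i, term in enumerate(sorted(df))}
    n = len(docs)

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {
            vocab[t]: count * math.log(n / df[t])
            for t, count in tf.items()
            if df[t] < n
        }
        vectors.append(vec)
    return vocab, vectors


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = ["disk full error", "disk read error", "login failed"]
    vocab, vectors = tfidf_vectors(docs)
    print(vocab)
    print(vectors[0])
```

Keeping this step in your own code means you can inspect the vocabulary and the
weights directly and see which terms dominate before any clustering runs.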

Then I would wash (clean) the features: set up your own list of stop words or
phrases and filter them out before you let Mahout do anything.
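
A small sketch of that kind of feature washing, again in plain Python. The
stop-word and stop-phrase lists below are placeholders I made up; you would
build your own from the text you are actually clustering. The function name
wash is also just illustrative.

```python
import re

# Hypothetical stop lists; replace them with ones built from your own corpus.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "please", "thanks"}
STOP_PHRASES = ["kind regards", "sent from my phone"]


def wash(text, stop_words=STOP_WORDS, stop_phrases=STOP_PHRASES):
    """Lower-case the text, cut out stop phrases, then drop stop words."""
    cleaned = text.lower()

    # Phrases go first, while the words that form them are still adjacent.
    for phrase in stop_phrases:
        cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", cleaned)

    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [t for t in tokens if t not in stop_words]


if __name__ == "__main__":
    print(wash("Thanks, the server is down. Sent from my phone"))
```

Removing phrases before single words matters: once the individual words are
filtered, a boilerplate phrase such as a signature line can no longer be
matched as a whole.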

On Dec 4, 2014, at 8:38 AM, Shahid Shaikh <shaikhshah...@gmail.com> wrote:

> Hey Donni, thanks, but I have used the configurations and obtained the
> clusters. The results are not promising enough. I was wondering whether there
> are any known techniques I can follow specifically when generating vectors.
> 
> Thanks
> 
> On Thursday, December 4, 2014, Donni Khan <prince.don...@googlemail.com>
> wrote:
>> Hi
>> It depends on the nature of the data you are clustering. If you know your
>> data, you can make sense of the results and also set the correct parameters
>> for the clustering algorithm, such as the number of topics or the number of
>> clusters.
>> 
>> Cheers,
>> Donni
>> 
>> On Thu, Dec 4, 2014 at 2:38 PM, Shahid Shaikh <shaikhshah...@gmail.com>
>> wrote:
>> 
>>> Hi All,
>>>   I have been trying Mahout clustering on unstructured data, i.e.
>>> human-written data. I have tried Mahout clustering algorithms such as
>>> K-means, Canopy + K-means, and LDA, but the results produced are not helpful.
>>> 
>>> I think the problem is with the way the data is written. Can someone please
>>> give me some pointers on how to proceed with unstructured data for
>>> clustering?
>>> 
>>> 
>>> I have also written an analyzer that uses lower-case and stop-word
>>> filters.
>>> 
>>> thanks :)
>>> 
>>> 
>>> Regards,
>>> Shaikh Shahid G .
>>> +91 9503954781
>>> 
>> 
> 
> -- 
> Regards,
> Shaikh Shahid G .
> +91 9503954781
