Re: need help explaining difference in k means output

2014-01-06 Thread Scott C. Cote
Mahesh,

I guess this is what I get for working too long and not recognizing the
diff... Suspected it was something silly.

Changing the driver parameters to EXACTLY the same as the command line
does indeed work.   Thank you.

I now have one file.  I'm not sure whether it was the convergence delta or
the sequential flag, but I have a hunch that the problem was the sequential
flag (as you pointed out, I have plenty of iterations left).
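
A rough illustration of why the sequential flag could produce so many files. It rests on an assumption that has not been checked against the Mahout version used in this thread: that the sequential code path writes one sequence file per cluster (named like "part-00338"), while the MapReduce path writes one file per reduce task using Hadoop's usual "part-r-NNNNN" naming. The class and method names below are hypothetical and only mimic the naming patterns; they do not call Mahout or Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PartFileNames {
    // ASSUMPTION: sequential mode writes one file per cluster model.
    static List<String> sequentialNames(int numClusters) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numClusters; i++) {
            names.add(String.format(Locale.ENGLISH, "part-%05d", i));
        }
        return names;
    }

    // Hadoop reducer outputs are named part-r-NNNNN, one per reduce task.
    static List<String> reducerNames(int numReducers) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            names.add(String.format(Locale.ENGLISH, "part-r-%05d", i));
        }
        return names;
    }

    public static void main(String[] args) {
        // With thousands of canopy centroids, the first pattern yields thousands of files.
        System.out.println(sequentialNames(339).get(338)); // part-00338
        System.out.println(reducerNames(1));               // [part-r-00000]
    }
}
```

If that assumption holds, the number of files in sequential mode would track the number of initial canopy centroids rather than the convergence delta.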

Cheers!

SCott





Re: need help explaining difference in k means output

2014-01-06 Thread Mahesh Balija
Hi Scott,

I'm not sure why you are getting many part files when running from code. One
difference between your command line and your code is the cd
[Convergence Delta]: 0.1 versus 0.01. In the latter case KMeans might take
more iterations to converge because its convergenceDelta is much smaller,
but the number of iterations is capped anyway (10 on the command line, 20 in
your code).
Another difference is that you are running your code in sequential mode. I
am not sure whether these factors really affect the number of part files
being generated.
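
To make the convergence delta point concrete, here is a minimal, self-contained sketch of a k-means loop that stops once no centroid moves by more than the delta (measured with cosine distance) or once the iteration cap is reached. It is not Mahout's implementation; the class, method names, and sample data are made up for illustration.

```java
class ConvergenceDeltaSketch {
    // Cosine distance: 1 - cos(angle between a and b).
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Runs k-means and returns how many iterations were performed.
    static int iterationsUntilConverged(double[][] points, double[][] initial,
                                        double convergenceDelta, int maxIterations) {
        double[][] centroids = new double[initial.length][];
        for (int c = 0; c < initial.length; c++) centroids[c] = initial[c].clone();
        for (int iter = 1; iter <= maxIterations; iter++) {
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            for (double[] p : points) {
                // Assign the point to its nearest centroid.
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (cosineDistance(p, centroids[c]) < cosineDistance(p, centroids[best])) best = c;
                }
                counts[best]++;
                for (int d = 0; d < p.length; d++) sums[best][d] += p[d];
            }
            boolean converged = true;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                double[] updated = new double[sums[c].length];
                for (int d = 0; d < updated.length; d++) updated[d] = sums[c][d] / counts[c];
                // A centroid that moved by more than the delta blocks convergence.
                if (cosineDistance(centroids[c], updated) > convergenceDelta) converged = false;
                centroids[c] = updated;
            }
            if (converged) return iter;
        }
        return maxIterations;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 0.1}, {1, 0.2}, {0.9, 0.3}, {0.2, 1}, {0.1, 1}, {0.3, 0.9}};
        double[][] initial = {{1, 0.5}, {1, 0.6}};
        System.out.println("delta 0.1  -> " + iterationsUntilConverged(points, initial, 0.1, 10));
        System.out.println("delta 0.01 -> " + iterationsUntilConverged(points, initial, 0.01, 10));
    }
}
```

Because both runs follow the same sequence of centroid updates, the smaller delta can never stop earlier than the larger one. It changes when the loop ends, not how many output files a driver writes.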

In any case, you should check the final number of clusters with
ClusterDumper in both cases; that will give you the number of clusters and
the points associated with each cluster.

The clusteredPoints output is generated in the last iteration and contains
the clusters and the points associated with each cluster.
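
As a sketch of the kind of comparison this enables: assuming each clusteredPoints record can be reduced to a (cluster id, point name) pair, grouping the pairs gives the number of clusters and the points in each. The class, method, and sample values below are hypothetical, and the code does not read Mahout sequence files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ClusterSummarySketch {
    // Groups point names by cluster id; clusterIds[i] is the cluster of pointNames[i].
    static Map<Integer, List<String>> pointsByCluster(int[] clusterIds, String[] pointNames) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (int i = 0; i < clusterIds.length; i++) {
            grouped.computeIfAbsent(clusterIds[i], id -> new ArrayList<>()).add(pointNames[i]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical assignments standing in for clusteredPoints records.
        int[] clusterIds = {7, 42, 7};
        String[] pointNames = {"doc-a", "doc-b", "doc-c"};
        Map<Integer, List<String>> grouped = pointsByCluster(clusterIds, pointNames);
        System.out.println("clusters: " + grouped.size()); // clusters: 2
        grouped.forEach((id, pts) -> System.out.println(id + " -> " + pts));
    }
}
```

Running the same summary on the command-line output and on the Java driver output would show whether the two runs really produce the same clustering, regardless of how many part files each one writes.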

Best,
Mahesh Balija.




need help explaining difference in k means output

2014-01-04 Thread Scott C. Cote
All,

When I run the Kmeans analysis from the command line,

> #
> # added the -cd option per instructions in Mahout in Action (MiA) so the
> # convergence threshold is .1
> #   instead of default value of .5  because cosines lie within 0 and 1.
> #
> # maximum number of iterations is 10
> #
> mahout kmeans -i reuters-vectors/tfidf-vectors/ \
>   -c reuters-canopy-centroids/clusters-0-final/ -cl -ow \
>   -o reuters-kmeans-clusters -x 10 \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

the iterations resolve to a directory with the word "final" in its name that
contains a single file with a name like "part-r-0".
If I run it as a Java routine:

KMeansDriver.run(conf, vectorsFolder,
    new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
    new CosineDistanceMeasure(),
    0.01, 20,          // convergence delta, max iterations
    true, 0.0, true);  // run clustering, classification threshold, run sequential



thousands of files such as "part-00338" are produced.  The same data is
used as input for both, and both are initialized from the canopy centroids.

Why does the command-line form generate a single file while my Java version
generates multiple output files?  What setting/configuration am I missing?

Secondary question:  I assume the sequence files located in the "final"
folder contain the "centroids" of the data, and that the points the
centroids were derived from are in "clusteredPoints".  Please confirm.

Thanks in advance.

SCott