Re: kmeans vectors

Matt Tanquary Thu, 30 Sep 2010 09:48:30 -0700

Hi Jeff,

Thanks for your reply. I just got trunk and started the install. It
ended with this error:


Error loading supplemental data models: Cannot create file-based resource.
org.codehaus.plexus.resource.loader.FileResourceCreationException:
Cannot create file-based resource.


A lot built, so I went ahead and tried your command-line example, but got:

ERROR: Could not find mahout-examples-*.job in
/mnt/install/tools/mahout or
/mnt/install/tools/mahout/examples/target, please run 'mvn install' to
create the .job file

I retrieved trunk as follows: svn co
http://svn.apache.org/repos/asf/mahout/trunk

Then ran 'mvn install' in the trunk folder.

Any issues with trunk today?

Thanks,
Matt

On Wed, Sep 29, 2010 at 12:29 PM, Jeff Eastman
<j...@windwardsolutions.com> wrote:
>  Hi Matt,
>
> From your command arguments, it looks like you are running 0.3. Due to the
> rate of change in Mahout we recommend you check out trunk and use that
> instead. With a little tweaking (added a --charset ASCII on seqdirectory) I
> was able to get as far as you did on trunk but seq2sparse is not what you
> want to use.
>
> The utilities you are using are intended for text preprocessing: they
> word-count documents into term-vector sequence files, then run TF and/or
> TF-IDF processing on the results to produce VectorWritable sequence
> files suitable for clustering. For your problem, I suggest you instead look
> at the Synthetic Control clustering examples, starting with Canopy. These
> use an InputDriver to process text files containing space-delimited numbers
> like your data.dat file and produce the VectorWritable sequence files
> directly.
>
> I was able to run this on your data using trunk and it produced 3 clusters.
> You should be able to run the other synthetic control jobs on it too:
>
> CommandLine:
> ./bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job \
> -i data \
> -o output \
> -t1 3 \
> -t2 2 \
> -ow \
> -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure
>
> Clusters output:
> C-0{n=1 c=[22.000, 21.000] r=[0.000, 0.000]}
>    Weight:  Point:
>    1.0: [22.000, 21.000]
> C-1{n=2 c=[18.250, 21.500] r=[0.250, 0.500]}
>    Weight:  Point:
>    1.0: [19.000, 20.000]
>    1.0: [18.000, 22.000]
> C-2{n=2 c=[2.500, 2.250] r=[0.500, 0.250]}
>    Weight:  Point:
>    1.0: [1.000, 3.000]
>    1.0: [3.000, 2.000]
>
>
> Good hunting,
> Jeff
>
> On 9/29/10 2:26 PM, Matt Tanquary wrote:
>>
>> I was able to run the tutorials, etc. Now I would like to generate my
>> own small test.
>>
>> I have created a data.dat file and put these contents:
>> 22 21
>> 19 20
>> 18 22
>> 1 3
>> 3 2
>>
>> Then I ran: mahout seqdirectory -i ~/data/kmeans/data.dat -o kmeans/seqdir
>>
>> This created kmeans/seqdir/chunk-0 in my dfs with the following content:
>> ź/%
>>         /data.dat22 21
>> 19 20
>> 18 22
>> 1 3
>> 3 2
>>
>> Next I ran:  mahout seq2sparse -i kmeans/seqdir -o kmeans/input
>>
>> This generated several things in kmeans/input including the
>> 'tfidf/vectors' folder. Inside the vectors folder I get: part-00000
>> which contains:
>> řĎân
>>         /data.dat7org.apache.mahout.math.RandomAccessSparseVectorWritable
>>      /data.dat@@
>>
>> It does not seem to have the numeric data at this point.
>>
>> I am hoping someone can shed some light on how I can get my datapoint
>> file into the proper vector format for running mahout kmeans.
>>
>> Just fyi, when I run kmeans against that file (mahout kmeans -i
>> kmeans/input/tfidf/vectors -c kmeans/clusters -o kmeans/output -k 2
>> -w) I get:
>>
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index:
>> 1, Size: 1
>>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>
>> which tells me it was unable to find even 1 vector in the given input
>> folder.
>>
>> Thanks for any comments you provide.
>> -M@
>
>
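
For anyone following along, the canopy output quoted above can be reproduced by hand. The sketch below is plain Python, not Mahout code. It parses the five space-delimited points from data.dat, then applies a simple canopy pass twice (once over the points, once over the resulting centroids), assuming T1=3, T2=2 and Euclidean distance as in the quoted command line. The two-pass structure and the names parse_points and canopy_pass are assumptions made for illustration; they are not taken from the Mahout source.

```python
import math

# Contents of data.dat as given in the original message.
DATA = "22 21\n19 20\n18 22\n1 3\n3 2\n"


def parse_points(text):
    """Turn lines of space-delimited numbers into tuples of floats."""
    return [tuple(float(x) for x in line.split())
            for line in text.splitlines() if line.strip()]


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_pass(points, t1, t2):
    """One sequential canopy pass (illustrative, hypothetical helper).

    A point joins every canopy whose founding point is closer than t1.
    If it is closer than t2 to any founding point it is "strongly bound"
    and does not found a new canopy. Returns the centroid (mean of the
    members) of each canopy, in creation order.
    """
    canopies = []  # list of (founding point, member points)
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))
    return [tuple(sum(axis) / len(members) for axis in zip(*members))
            for _, members in canopies]


points = parse_points(DATA)
# First pass over the raw points, second pass over the resulting centroids.
first_pass_centers = canopy_pass(points, t1=3.0, t2=2.0)
final_centers = canopy_pass(first_pass_centers, t1=3.0, t2=2.0)

for i, center in enumerate(final_centers):
    print("C-%d c=%s" % (i, list(center)))
```

Under those assumptions the final centers come out as [22.0, 21.0], [18.25, 21.5] and [2.5, 2.25], which agree with the c=[...] values in the quoted clusters output.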



-- 
Have you thanked a teacher today? ---> http://www.liftateacher.org
