I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Did you manage to run Hadoop on Windows? Do the Hadoop test jobs (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has
lots of details about this.
Some of the possible problems are Cygwin paths (which are not the same as
Linux paths), HDFS/local filesystem confusion, your hadoop user (which,
permissions-wise, is not the same as your own user), or the other pitfalls
listed at the link above. A quick way to rule out the path and filesystem
issues is sketched below.
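Here is a minimal sketch (assuming a stock Hadoop 1.x classpath; the class
name FsCheck is just illustrative) that prints which filesystem a given path
actually resolves to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A path without a scheme resolves against fs.default.name;
        // file:///... or hdfs://... forces a particular filesystem.
        Path p = new Path(args.length > 0 ? args[0] : "/");
        FileSystem fs = p.getFileSystem(conf);
        System.out.println("Filesystem: " + fs.getUri());
        System.out.println("Qualified:  " + fs.makeQualified(p));
        System.out.println("Exists:     " + fs.exists(p));
    }
}

If this prints file:/// where you expected hdfs://, your jobs are reading the
local disk rather than HDFS.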
Good luck,
Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana
<svetlana.viden...@logica.com> wrote:
>
> Hello,
>
> I’m writing a Java app that clusters my data with k-means.
>
> These are the steps:
>
> 1)
>
> LuceneDemo: creates the index and the vectors using the lucene.vector
> library. Input: the path of my .txt file. Output: the index files
> (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis,
> .tvd, .tvx and, most importantly since Mahout will use it, .tvf) and vectors
> that look like this:
>
> (SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text______t€ðàó^æVG²RŸ˜Õ_________Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_________Ž__P(2):{
>  [… and others])
>
> Can anyone please confirm that this output format looks right? If not, what
> should the vectors generated by lucene.vector look like?
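>
> Rather than eyeballing the raw bytes, it is easier to read the SequenceFile
> back as <key, value> pairs; Mahout's seqdumper utility does this too. A
> minimal sketch, assuming the header above is right that both key and value
> are org.apache.hadoop.io.Text:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
>
> public class SeqDump {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         // Local filesystem here; use FileSystem.get(conf) for HDFS.
>         FileSystem fs = FileSystem.getLocal(conf);
>         SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
>         Text key = new Text();
>         Text value = new Text();
>         while (reader.next(key, value)) {
>             // Prints one <key, vector> pair per line in readable form.
>             System.out.println(key + " => " + value);
>         }
>         reader.close();
>     }
> }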
>
> This is part of the code:
>
> /* Creating vectors: walk every document in the index and collect its
>    term-frequency vector for the "content" field. */
> Map vectorMap = new TreeMap();
> IndexReader reader = IndexReader.open(index);
> int numDoc = reader.maxDoc();
> for (int i = 0; i < numDoc; i++) {
>     // Returns null if "content" was not indexed with term vectors.
>     TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
>     addTermFreqToMap(vectorMap, termFreqVector);
> }
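>
> The addTermFreqToMap helper is not shown in this mail. A purely hypothetical
> sketch of what such a helper might look like, assuming it accumulates
> per-term frequencies across documents; the real implementation may differ:
>
> // Hypothetical helper (not the actual code from this mail): sums up the
> // frequency of each term across all documents seen so far.
> static void addTermFreqToMap(Map vectorMap, TermFreqVector tfv) {
>     if (tfv == null) {
>         return; // docs without a stored term vector for "content" yield null
>     }
>     String[] terms = tfv.getTerms();
>     int[] freqs = tfv.getTermFrequencies();
>     for (int i = 0; i < terms.length; i++) {
>         Integer prev = (Integer) vectorMap.get(terms[i]);
>         vectorMap.put(terms[i], Integer.valueOf(prev == null ? freqs[i] : prev.intValue() + freqs[i]));
>     }
> }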
>
>
>
>
> 2)
>
>
> MainClass: creates the clusters with Mahout. Input: the path of the vectors
> generated by step 1 (see above). Output: the clusters. For the moment it does
> not create any clusters because of this error:
>
> Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
>     at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
>     at main.MainClass.main(MainClass.java:144)
>
>
> Can anyone please help me solve this exception? I can’t understand why the
> data could not be created… I’m using the hadoop and mahout libs on Windows
> (and I’m an admin, so it should not be a rights problem).
>
>
> This is part of the code:
>
>     Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         conf, chunkSize);
>
>     TFIDFConverter.processTfIdf(
>         new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>         new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
>         sequentialAccessOutput, false, reduceTasks);
>
>     Path vectorFolder = new Path("output");
>     Path canopyCentroids = new Path(outputDir, "canopy-centroids");
>     Path clusterOutput = new Path(outputDir, "clusters");
>
>     CanopyDriver.run(vectorFolder, canopyCentroids,
>         new EuclideanDistanceMeasure(), 250, 120, false, 3, false);
>
>     KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
>         clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
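>
> A sanity check worth running just before the calculateDF call above, to see
> which filesystem and path Hadoop actually looks at. A small sketch reusing
> the same outputDir and conf from the code; the exception message is just
> illustrative:
>
>     // Fail fast, printing the fully qualified path, if the input dir is missing.
>     Path tfVectors = new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER);
>     FileSystem fs = tfVectors.getFileSystem(conf);
>     if (!fs.exists(tfVectors)) {
>         throw new IllegalStateException("Input not found: " + fs.makeQualified(tfVectors));
>     }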
>
>
> Thank you for your time
>
>
>
>
> Regards
>
>
