Re: clustering with kmeans, java app
I spent a week trying to get Hadoop to work on Windows 7, and then gave up. Did you manage to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work? http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this. Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your hadoop user (!= your user, permissions-wise), or other things listed at the link above. Good luck, Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana svetlana.viden...@logica.com wrote:

Hello, I am writing a Java app to cluster my data with k-means. These are the steps:

1) LuceneDemo: create the index and vectors using the lucene.vector lib. Input: the path of my .txt file. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file that Mahout will use) and vectors looking like this:

SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text__t€ðàó^æVG²RŸ˜Õ_Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_Ž__P(2):{ [... and others]

Can anyone please confirm that this output format looks right? If not, what should the vectors generated by lucene.vector look like? This is part of the code:

/* Creating vectors */
Map vectorMap = new TreeMap();
IndexReader reader = IndexReader.open(index);
int numDoc = reader.maxDoc();
for (int i = 0; i < numDoc; i++) {
    TermFreqVector termFreqVector = reader.getTermFreqVector(i, content);
    addTermFreqToMap(vectorMap, termFreqVector);
}

2) MainClass: create clusters with Mahout. Input: the path of the vectors generated by step 1 (see above). Output: clusters. For the moment it does not create any clusters because of this error:

Exception in thread "main" java.io.FileNotFoundException: File file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
at org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
at main.MainClass.main(MainClass.java:144)

Can anyone please help me solve this exception? I can't understand why the data could not be created, since I am using the Hadoop and Mahout libs on Windows (and I am an admin, so it should not be a permissions problem).
This is part of the code:

Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
    conf, chuckSize);
TFIDFConverter.processTfIdf(
    new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
    new Path(outputDir), conf, calculate, minDf, maxDFPercent,
    norm, true, sequentialAccessOutput, false, reduceTasks);
Path vectorFolder = new Path(output);
Path canopyCentroids = new Path(outputDir, "canopy-centroids");
Path clusterOutput = new Path(outputDir, "clusters");
CanopyDriver.run(vectorFolder, canopyCentroids, new EuclideanDistanceMeasure(), 250, 120, false, 3, false);
KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"), clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);

Thank you for your time. Regards.
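The stack trace above shows TFIDFConverter looking for a SequenceFile at file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data on the local filesystem, which suggests the Path handed to calculateDF() does not match what the previous step actually wrote. One way to check this is to list what really exists under the output root before running the TF-IDF step. The following is only a minimal sketch using standard Hadoop filesystem calls; the outputDir value is a placeholder taken from the stack trace and should be replaced with your own output root.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListVectorizerOutput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The file:/ URI in the stack trace means the local filesystem is in play, so inspect it directly.
    FileSystem fs = FileSystem.getLocal(conf);
    Path outputDir = new Path("F:/MAHOUT/TesMahout/clusters"); // placeholder output root
    if (!fs.exists(outputDir)) {
      System.out.println("Output dir does not exist: " + outputDir);
      return;
    }
    listRecursively(fs, outputDir);
  }

  // Print every file the vectorizer wrote, so the Paths passed to
  // TFIDFConverter.calculateDF() can be compared against what really exists.
  private static void listRecursively(FileSystem fs, Path dir) throws IOException {
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath());
      if (status.isDir()) {
        listRecursively(fs, status.getPath());
      }
    }
  }
}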
RE: clustering with kmeans, java app
Hi, Yes, I am using the Mahout and Hadoop libs on Windows. My cluster output is not written to HDFS but to the LOCAL filesystem. Thanks to Cygwin I am able to run Unix commands in order to run Mahout on Windows, and I changed the paths for Windows as well. I did not test whether wordcount works, because I am only using the Mahout libs and have not tried to run the examples. I was not following any tutorial, but I found this, which may help you: http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/06/running-apache-mahout-at-hadoop-on-windows-azure-www-hadooponazure-com.aspx Cheers

-----Original Message----- From: Yuval Feinstein [mailto:yuv...@citypath.com] Sent: Tuesday, August 7, 2012 08:16 To: user@mahout.apache.org Subject: Re: clustering with kmeans, java app

I spent a week trying to get Hadoop to work on Windows 7, and then gave up. Did you manage to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work? http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this. Some of the possible problems are Cygwin paths (!= Linux paths), HDFS/local filesystem confusion, your hadoop user (!= your user, permissions-wise), or other things listed at the link above. Good luck, Yuval [...]
Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.
This is the case: https://issues.apache.org/jira/browse/MAHOUT-973 The bug exists in Mahout 0.6 and was fixed in Mahout 0.7. I also used the workaround of setting a high value for --maxDFPercent (I guess the number of documents in the corpus is enough). Maybe it would be good to fix it in 0.6 as well? Thanks, Yuval

On Fri, Aug 3, 2012 at 11:55 PM, Sean Owen sro...@gmail.com wrote: This sounds a lot like a bug that was fixed by a patch some time ago. Grant, I think it was something I had wanted you to double-check; not sure if you had a look. But I think it was fixed, if it's the same issue.

On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel p.abra...@rambler-co.ru wrote: Thanks for this idea. Looks like a bug: 1) Setting --maxDFPercent to 100 has no effect. 2) Setting --maxDFPercent to 1 000 000 000 makes the TFIDF vectors Ok. seq2sparse cuts terms with DF > maxDFPercent, so maxDFPercent is not a percentage; it is an absolute value. Pavel

On 01.08.12 20:46, Robin Anil robin.a...@gmail.com wrote: The tfidf job is where the document frequency pruning is applied. Try increasing maxDFPercent to 100%.

On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel p.abra...@rambler-co.ru wrote: Hello! I have trouble running the seq2sparse example with TFIDF weights. My TF vectors are Ok, while the TFIDF vectors are 10 times smaller. It looks like seq2sparse cuts my terms during the TFxIDF step: Document1 in the TF vectors has 20 terms, while Document1 in the TFIDF vectors has only 2 terms. What is wrong? I spent 2 days looking for the answer and configuring seq2sparse parameters (( Thanks in advance!

mahout seq2sparse -ow \
  -chunk 512 \
  --maxDFPercent 90 \
  --maxNGramSize 1 \
  --numReducers 128 \
  --minSupport 150 \
  -i --- \
  -o --- \
  -wt tfidf \
  --namedVector \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer

Pavel
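For reference, the behaviour Pavel describes is the difference between interpreting maxDFPercent as a percentage of the corpus and as an absolute document-frequency cutoff. The following is only an illustrative sketch of the two interpretations, not Mahout's actual pruning code; the class and method names are made up for illustration.

public final class MaxDfSemantics {
  // Intended semantics: prune a term that appears in more than maxDFPercent percent of all documents.
  static boolean prunedAsPercentage(long df, long numDocs, int maxDFPercent) {
    return df * 100L > (long) maxDFPercent * numDocs;
  }

  // Behaviour reported in this thread (MAHOUT-973): the same setting applied as an absolute document count.
  static boolean prunedAsAbsoluteCount(long df, int maxDF) {
    return df > maxDF;
  }

  public static void main(String[] args) {
    long df = 5000;          // a term appearing in 5,000 of 1,000,000 documents (0.5%)
    long numDocs = 1000000;
    System.out.println(prunedAsPercentage(df, numDocs, 100));  // false: kept, as intended
    System.out.println(prunedAsAbsoluteCount(df, 100));        // true: wrongly pruned, matching the symptom
  }
}

This also shows why a huge value like 1 000 000 000 "fixes" the output: under the absolute-count interpretation, no realistic document frequency exceeds it.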
RE: ClusterDumper eclipse human readable output kmeans
I already generate the points directory when I run the clustering (k-means in my case). But for the moment I can't generate the cluster dump because of an error on this line: ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf); The second parameter I pass is a double, but it wants an int, yet it does not accept an int either, which is pretty confusing...

-----Original Message----- From: kiran kumar [mailto:kirankumarsm...@gmail.com] Sent: Monday, August 6, 2012 18:01 To: user@mahout.apache.org Subject: Re: ClusterDumper eclipse human readable output kmeans

Hello, clusterdump actually shows you the top terms and vectors of the centroid and of each document. But to identify which vectors belong to your documents, you need to generate the points directory when running the clustering algorithm and use that points directory when generating the cluster dump. Thanks, Kiran Bushireddy.

On Mon, Aug 6, 2012 at 10:33 AM, Videnova, Svetlana svetlana.viden...@logica.com wrote:

Hi, My goal is to transform the vectors created by lucene.vector (after k-means clustering) into a human-readable format. For that I am using the ClusterDumper class in Eclipse, but the code does not generate any files. What am I missing? What is the best approach to transform the k-means output into something human-readable (no Unix commands please; I am on Windows using Eclipse and Cygwin)? This is the code:

Map<Integer, List<WeightedVectorWritable>> result = ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf);
System.out.println(result.get(0).toString());
for (int j = 0; j < result.size(); j++) {
    List<WeightedVectorWritable> list = result.get(j);
    for (WeightedVectorWritable vector : list) {
        System.out.println(vector.getVector().asFormatString());
    }
}

Error:
Exception in thread "main" java.lang.ClassCastException: org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast to org.apache.mahout.clustering.classify.WeightedVectorWritable
at main.LuceneDemo.main(LuceneDemo.java:260)

Thank you

-- Thanks and Regards, Kiran Kumar
Re: ClusterDumper eclipse human readable output kmeans
I don't know why ClusterDumper is not working, but I can offer an alternative solution. Use ClusterOutputPostProcessor (clusterpp) on the clusters-*-final directory: https://cwiki.apache.org/MAHOUT/top-down-clustering.html It will arrange the vectors into their respective directories. However, the output will still be in the form of sequence files. It is very simple to read a sequence file and write it out in a human-readable format; the classes in the org.apache.mahout.common.iterator.sequencefile package can help to read the sequence files easily.

On 07-08-2012 12:50, Videnova, Svetlana wrote: I already generate the points directory when I run the clustering (k-means in my case), but for the moment I can't generate the cluster dump because of an error on this line: ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf); [...]
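To make the suggestion concrete, here is a minimal sketch that reads clustering output with the iterators from org.apache.mahout.common.iterator.sequencefile and prints each point with its cluster id. It assumes the clusteredPoints directory that KMeansDriver writes when clustering is enabled, with IntWritable keys and WeightedVectorWritable values; the path is a placeholder and the value type should be adjusted to whatever your run actually produced. Note that the clusters-* directories hold ClusterWritable centroids, which would be consistent with the ClassCastException above when readPoints is pointed at clusters-0 instead of the points directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

public class DumpClusteredPoints {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Placeholder path: the clusteredPoints directory written when the clustering step is enabled.
    Path clusteredPoints = new Path("output/kmeans/clusteredPoints");
    for (Pair<IntWritable, WeightedVectorWritable> record :
         new SequenceFileDirIterable<IntWritable, WeightedVectorWritable>(
             clusteredPoints, PathType.LIST, conf)) {
      int clusterId = record.getFirst().get();          // which cluster the point was assigned to
      WeightedVectorWritable point = record.getSecond(); // the (possibly named) document vector
      System.out.println(clusterId + "\t" + point.getVector().asFormatString());
    }
  }
}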
RE: ClusterDumper eclipse human readable output kmeans
I just succeeded in making my app work. I had to use ClusterDumperWriter.getTopFeatures(arg1, arg2, arg3), and that gave me the top words in a human-readable format :D

-----Original Message----- From: Paritosh Ranjan [mailto:pran...@xebia.com] Sent: Tuesday, August 7, 2012 10:32 To: user@mahout.apache.org Subject: Re: ClusterDumper eclipse human readable output kmeans

I don't know why ClusterDumper is not working, but I can offer an alternative solution. Use ClusterOutputPostProcessor (clusterpp) on the clusters-*-final directory: https://cwiki.apache.org/MAHOUT/top-down-clustering.html [...]
Re: Tags generation?
Hi All, We have developed an auto-tagging system for our micro-blogging platform. Here is what we have done: the purpose of the system is to look for tags in an article automatically when someone posts a link on our micro-blogging site. The goal was to allow us to follow a tag instead of (or in addition to) a person. So we used some custom code on top of Mahout, UIMA, OpenNLP, etc. If you are interested to see how it works, take a look at: http://www.scoopspot.com/ One more thing: we also created a robot that goes to some well-known web sites (ReadWriteWeb, Hacker News, TechCrunch, etc.), gets the articles from the web, and publishes them to our micro-blog. As we already have tag following, we get the information without any problem. That's very cool (to us at least). You can see the output of the robot at: http://news.scoopspot.com/ I thought this might be an example of what Mahout can do and is related to this thread, so I felt like sharing with you guys. Sorry if it looks off-topic. Regards, Samik

On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog goks...@gmail.com wrote: I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun', 'verb', etc. I removed all words that were not nouns or verbs. In my use case, this is a total win. In other cases, maybe not: Twitter has a quite varied non-grammar.

On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel p...@farfetchers.com wrote: The way back from stem to tag is interesting from the standpoint of making tags human readable. I had assumed a lookup, but this seems much more satisfying and flexible. In order to keep frequencies it will take something like a dictionary-creation step in the analyzer. This in turn seems to imply a join, so a whole new map-reduce job -- maybe not completely trivial? It seems that NLP can be used in two very different ways here: first as a filter (keep only nouns and verbs?), second to differentiate semantics (can:verb, can:noun). One method is a dimensionality reduction technique; the other increases dimensions but can lead to orthogonal dimensions from the same term. I suppose both could be used together, as the above example indicates. It sounds like you are using it to filter (only?). Can you explain what you mean by: 'One thing came through - parts-of-speech selection for nouns and verbs helped 5-10% in every combination of regularizers.'

On Aug 3, 2012, at 6:31 PM, Lance Norskog goks...@gmail.com wrote: Thanks everyone - I hadn't considered the stem/synonym problem. I have code for regularizing a doc/term matrix with tf, binary, log and augmented norm for the cells, and idf, gfidf, entropy, normal (term vector) and probabilistic inverse. Running any of these, and then SVD, on a Reuters article may take 10-20 ms. This uses a sentence/term matrix for document summarization. After doing all of this, I realized that maybe just the regularized matrix was good enough. One thing came through - parts-of-speech selection for nouns and verbs helped 5-10% in every combination of regularizers. All across the board. If you want good tags, select your parts of speech!

On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: I know, I know. :) Just wanted to mention that it could lead to funny results, that's all. There are lots of ways of doing proper form disambiguation, including shallow tagging, which then allows retrieving correct base forms for lemmas, not stems. Stemming is typically good enough (and fast), so your advice was 100% fine. Dawid

On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is definitely just the first step. Similar goofs happen with inappropriate stemming. For instance, AIDS should not stem to aid. A reasonable way to find and classify exceptional cases is to look at cooccurrence statistics. The contexts of original forms can be examined to find cases where there is a clear semantic mismatch between the original and the set of all forms that stem to the same form. But just picking the most common form that is present in the document is a pretty good step, for all that it produces some oddities. The results are much better than showing a user the stemmed forms.

On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote: Unstemming is pretty simple. Just build an unstemming dictionary based on seeing what word forms have led to a stemmed form. Include frequencies. This can lead to very funny (or not, depending on how you look at it) mistakes when different lemmas stem to the same token. How frequent and important this phenomenon is varies from language to language (and can be calculated a priori). Dawid

-- Lance Norskog goks...@gmail.com
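The unstemming dictionary that Dawid and Ted describe can be sketched in a few lines. This is only an illustration: the class is hypothetical, and the caller is assumed to feed in each (original token, stemmed token) pair produced by whatever analyzer or stemmer the pipeline uses.

import java.util.HashMap;
import java.util.Map;

public class UnstemDictionary {
  // stem -> (original surface form -> count)
  private final Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();

  // Call this for every token seen while analyzing the corpus.
  public void record(String original, String stem) {
    Map<String, Integer> forms = counts.get(stem);
    if (forms == null) {
      forms = new HashMap<String, Integer>();
      counts.put(stem, forms);
    }
    Integer c = forms.get(original);
    forms.put(original, c == null ? 1 : c + 1);
  }

  // Return the most frequent surface form that produced this stem,
  // or the stem itself if it was never seen.
  public String unstem(String stem) {
    Map<String, Integer> forms = counts.get(stem);
    if (forms == null || forms.isEmpty()) {
      return stem;
    }
    String best = stem;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : forms.entrySet()) {
      if (e.getValue() > bestCount) {
        bestCount = e.getValue();
        best = e.getKey();
      }
    }
    return best;
  }
}

Picking the most frequent original form per stem is the "pretty good step" Ted describes; the funny collisions Dawid mentions appear when two different lemmas share a stem and the rarer one is displayed as the more common one.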
Re: Tags generation?
Nice stuff. And glad that Mahout was able to help! On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY sam...@gmail.com wrote: Hi All, We have developed an auto-tagging system for our micro-blogging platform. Here is what we have done: [...]
how to deal with multiple preference values for same (user, item)-pair
Hi, I would like to know how I can deal with multiple preference values for the same (user, item)-pair from a machine learning perspective. That means I have more than one rating from a user u for an item i available. Of course, using some kind of average (maybe also taking date information into account, e.g. by using a weighted/exponential moving average) would be possible, but I am interested in whether any more sophisticated methods are used. It would probably already be very helpful to know which term to search for, or to have some papers on the topic. As far as I noticed, Mahout always just takes the newest preference value. Is that correct? Thanks a lot, Dominik
Re: how to deal with multiple preference values for same (user, item)-pair
As far as I remember, Mahout overrides older preference values with the newest one. On Tue, Aug 7, 2012 at 2:14 PM, Dominik Lahmann dominik.lahm...@fu-berlin.de wrote: Hi, I would like to know how I can deal with multiple preference values for the same (user, item)-pair from a machine learning perspective? [...]
Re: how to deal with multiple preference values for same (user, item)-pair
It depends on what the values really mean. If they are something like ratings, using the most recent version makes the most sense. (This is what the implementations do now.) If they are some kind of sampled reading, it might make sense to take an average. If the input is based on observed activity, it may be best to accumulate (sum) the data, perhaps with some decay factor. On Tue, Aug 7, 2012 at 1:14 PM, Dominik Lahmann dominik.lahm...@fu-berlin.de wrote: Hi, I would like to know how I can deal with multiple preference values for the same (user, item)-pair from a machine learning perspective? [...]
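Ted's third option can be made concrete with a small preprocessing sketch that collapses repeated (user, item) events into one preference value using an exponential time decay before the data is handed to a recommender. The class, field names, and the half-life constant are all hypothetical; Mahout does not provide this aggregation itself, so it would run over your raw events as a separate step.

import java.util.ArrayList;
import java.util.List;

public class DecayedPreferenceAggregator {

  // Weight of an event halves every 30 days (an arbitrary, tunable choice).
  private static final double HALF_LIFE_MS = 30.0 * 24 * 60 * 60 * 1000;

  public static final class Event {
    final double value;      // observed preference or activity strength
    final long timestampMs;  // when the event happened
    public Event(double value, long timestampMs) {
      this.value = value;
      this.timestampMs = timestampMs;
    }
  }

  // Sum all events for one (user, item) pair, discounting older ones exponentially.
  public static double aggregate(List<Event> events, long nowMs) {
    double sum = 0.0;
    for (Event e : events) {
      double ageMs = Math.max(0, nowMs - e.timestampMs);
      double weight = Math.pow(0.5, ageMs / HALF_LIFE_MS);
      sum += weight * e.value;
    }
    return sum;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    List<Event> events = new ArrayList<Event>();
    events.add(new Event(4.0, now - 60L * 24 * 60 * 60 * 1000)); // 60 days old
    events.add(new Event(5.0, now));                             // fresh
    System.out.println(aggregate(events, now)); // the old rating counts for roughly a quarter of its value
  }
}

The aggregated value per (user, item) pair could then be written out as the usual comma-separated preference file and loaded with Mahout's FileDataModel, for example.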
Re: Question about recommender database drivers
I have used the same steps to create the dictionary and vector output from Solr using the lucene.vector command. Is there any way to pull only the latest changes from Solr and create vectors? And then, how do we run the clustering algorithms on these incremental vector files? Can you shed some light on this? Thanks, Kiran Bushireddy.

On Thu, Aug 2, 2012 at 3:04 AM, Sean Owen sro...@gmail.com wrote: The backing store doesn't matter much, in the sense that using it for real-time computation needs it all to end up in memory anyway. It can live wherever you want before that, like Solr. It's not going to be feasible to run anything in real time off Solr or any other store. Yes, the trick is to use Solr to figure out what has changed efficiently, much like update files. If you're using Hadoop, same answer mostly: it's going to read serially from wherever the data is, and most stores are fine at listing out all data sequentially.

On Thu, Aug 2, 2012 at 3:52 AM, Matt Mitchell goodie...@gmail.com wrote: Hi, The data I'm using to generate preferences happens to be in a Solr index. Would it be feasible, or make any sense, to write an adapter so that I can use Solr to store the preferences as well? The Solr instance could be embedded since this is all Java, and would probably end up being pretty quick. Our data is coming in fast, and I think we'll outgrow the file-based approach quickly. Thoughts? - Matt

-- Thanks and Regards, Kiran Kumar
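One way to apply Sean's suggestion of letting Solr report what has changed is a delta query on a last-modified field, then re-vectorizing only those documents. This is a rough sketch under several assumptions: it uses SolrJ (3.6+/4.x style HttpSolrServer), the URL is a placeholder, and timestamp_dt is a made-up field name standing in for whatever last-modified field your schema actually has.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PullChangedDocs {
  public static void main(String[] args) throws SolrServerException {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr"); // placeholder URL
    SolrQuery query = new SolrQuery("*:*");
    // "timestamp_dt" is a hypothetical last-modified field in your schema;
    // restrict the pull to documents changed since the last vectorization run.
    query.addFilterQuery("timestamp_dt:[2012-08-01T00:00:00Z TO NOW]");
    query.setRows(1000);
    QueryResponse response = solr.query(query);
    for (SolrDocument doc : response.getResults()) {
      // Collect the ids of changed documents, then re-vectorize just those
      // before re-running the clustering step.
      System.out.println(doc.getFieldValue("id"));
    }
  }
}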
Re: LDA Questions
Hi Jake, Today I submitted the diff. It is available at https://issues.apache.org/jira/browse/MAHOUT-1051 Thanks for the advice.

On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix jake.man...@gmail.com wrote: Sounds great Gokhan!

On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan gkhn...@gmail.com wrote: Jake, I converted the ids to integers with rowid, and then modified InMemoryCollapsedVariationBayes0.loadVectors() so that it returns a SparseMatrix (instead of a SparseRowMatrix) whose row ids are the keys from the <IntWritable, VectorWritable> tf vectors. I am not sure whether it works, since the values of the mapped integer ids (the results of rowid) are in the range [0, #ofDocuments), but I believe it does. Constructing a SparseMatrix needs RandomAccessSparseVector row vectors, and the tf vectors are sparse vectors, so I assumed that an incoming tf vector itself, or getDelegate() if it is a NamedVector, can be cast to RandomAccessSparseVector. I will submit the diff tomorrow, so you can review and commit. Thank you for your help.

On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix jake.man...@gmail.com wrote: Hi Gokhan, This looks like a bug in the InMemoryCollapsedVariationBayes0.loadVectors() method - it takes the SequenceFile<? extends Writable, VectorWritable> and ignores the keys, assigning the rows in order into an in-memory Matrix. If you run $MAHOUT_HOME/bin/mahout rowid -i <your tf-vector path> -o <output path>, this converts Text keys into IntWritable keys (and leaves behind an index file, a mapping of Text -> IntWritable, which tells you which int is assigned to which original text key). Then what you'd want to do is modify InMemoryCollapsedVariationBayes0.loadVectors() to actually use the keys which are given to it, instead of reassigning sequential ids. If you make this change, we'd love to have the diff - not too many people use the cvb0_local path (it's usually used for debugging and testing smaller data sets to see that topics are converging properly), but getting it to actually produce document -> topic outputs which correlate with the original docIds would be very nice! :)

On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan gkhn...@gmail.com wrote: Hi, My question is about interpreting the LDA document-topics output. I am using trunk. I have a directory of documents, each of which is named by an integer, and there is no sub-directory of the data directory. The directory structure is as follows:

$ ls /path/to/data/
1 2 5 ...

From those documents I want to detect topics and output: topic -> top terms, and document -> top topics. To this end, I first run seqdirectory on the directory:

$ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1

Then I run seq2sparse to create tf vectors of the documents:

$ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector

After creating the vectors, I run cvb0_local on those tf vectors:

$ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0

And to interpret the results, I use Mahout's vectordump utility:

$ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
$ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true

The resulting words file consists of #ofTopics lines. I assume each line is in "topicID \t wordsVector" format, where a wordsVector is a sorted vector whose elements are (word, score) pairs. The resulting docs file, on the other hand, consists of #ofDocuments lines. I assume each line is in "documentID \t topicsVector" format, where a topicsVector is a sorted vector whose elements are (topicID, probability) pairs. The problem is that the documentID field does not match the original document ids; it is populated with zero-based auto-incrementing indices. I want to ask whether I am missing something for vectordump to output the correct document ids, whether this is the normal behavior when one runs LDA on a directory of documents, or whether I made a mistake in one of the steps. I suspect the issue is that seqdirectory assigns Text ids to documents, while the CVB algorithm expects documents in another format, <IntWritable, VectorWritable>. If this is the case, could you help me with assigning IntWritable ids to documents in the process of creating vectors from them? Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to do so? Thanks -- Gokhan
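The rowid job Jake mentions leaves behind a docIndex file mapping each IntWritable row id back to the original Text key, which is what lets the zero-based ids in the cvb0_local/vectordump output be translated back to the original document names. The following is a small sketch of reading that mapping with Mahout's sequence-file iterators; the path is a placeholder and the exact output layout (a matrix file plus a docIndex file) is assumed from the description above.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class DocIndexLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Placeholder path: the docIndex file written by "mahout rowid -o rowid-out ..."
    Path docIndex = new Path("rowid-out/docIndex");
    Map<Integer, String> idToName = new HashMap<Integer, String>();
    for (Pair<IntWritable, Text> entry :
         new SequenceFileIterable<IntWritable, Text>(docIndex, true, conf)) {
      // Extract primitives immediately, since the Writable instances may be reused.
      idToName.put(entry.getFirst().get(), entry.getSecond().toString());
    }
    // A row id from the document-topic output can now be mapped back to the original key.
    System.out.println(idToName.get(0));
  }
}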
RE: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.
Hello Yuval, Thanks for the link. But I am sure I am using version 0.7. I will double-check it. Pavel

From: Yuval Feinstein [yuv...@citypath.com] Sent: 7 August 2012 11:08 To: user@mahout.apache.org Subject: Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

This is the case: https://issues.apache.org/jira/browse/MAHOUT-973 The bug exists in Mahout 0.6 and was fixed in Mahout 0.7. I also used the workaround of setting a high value for --maxDFPercent (I guess the number of documents in the corpus is enough). Maybe it would be good to fix it in 0.6 as well? Thanks, Yuval [...]
KMeans job fails during 2nd iteration. Java Heap space
Hello, I am trying to run the KMeans example on 15,000,000 documents (seq2sparse output). There are 1,000 clusters, a 200,000-term dictionary, and documents of 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each. My job fails with a Java heap space error: the 1st iteration passes while the 2nd iteration fails. In the map phase of buildClusters I see a lot of warnings, but it passes; the reduce phase of buildClusters fails with Java heap space. I cannot increase reducer/mapper memory in Hadoop, and my cluster is tuned well. How can I avoid this situation? My cluster has 300 mappers and 220 reducers running on 40 servers, each with 8 cores and 12 GB RAM. Thanks in advance! Here are the KMeans parameters:

mahout kmeans -Dmapred.reduce.tasks=200 \
  -i ...tfidf-vectors/ \
  -o /tmp/clustering_results_kmeans/ \
  --clusters /tmp/clusters/ \
  --numClusters 1000 \
  --numClusters 5 \
  --overwrite \
  --clustering

Pavel