Re: clustering with kmeans, java app

2012-08-07 Thread Yuval Feinstein
I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Did you manage to get Hadoop running on Windows? Do the Hadoop tests (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has
lots of details about this.
Some of the possible problems are Cygwin paths (!= Linux paths),
HDFS/local filesystem confusion, your Hadoop user (!= your own user,
permissions-wise), or other things
listed at the link above.
Good luck,
Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana
svetlana.viden...@logica.com wrote:

 Hello,

 I'm writing a Java app for clustering my data with k-means.

 Those are the steps:

 1)

 LuceneDemo: creates the index and vectors using the Lucene.vector lib. Input: the path of 
 my .txt file. Output: the index files (segments_1, segments.gen, .fdt, .fdx, .fnm, .frq, 
 .nrm, .prx, .tii, .tis, .tvd, .tvx and, most importantly, the .tvf file which will be 
 used by Mahout) and vectors that look like this 
 (SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text__t€ðàó^æVG²RŸ˜Õ_Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_Ž__P(1):{15:1.4650986194610596,14:0.9997141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.4650986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596,0:0.9997141361236572}_Ž__P(2):{
  [… and others])

 Can anyone confirm that this output format looks correct? If not, 
 what should the vectors generated by lucene.vector look like?

 This is part of the code :
 /* Creating vectors */
Map vectorMap = new TreeMap();
IndexReader reader = IndexReader.open(index);
int numDoc = reader.maxDoc();
for (int i = 0; i < numDoc; i++) {
    TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
    addTermFreqToMap(vectorMap, termFreqVector);
}
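
As a sanity check, something like the following sketch (not part of the original code; the
path is only an example) can print the key and value classes of the generated SequenceFile.
Mahout's clustering steps expect VectorWritable values, so this quickly shows whether the
file is what the next step needs:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class InspectSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Example path only: point this at the vector file produced above.
    Path path = new Path("F:/MAHOUT/TesMahout/vectors/part-out.vec");
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Mahout clustering expects values of type org.apache.mahout.math.VectorWritable.
      System.out.println("key class:   " + reader.getKeyClassName());
      System.out.println("value class: " + reader.getValueClassName());
    } finally {
      reader.close();
    }
  }
}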




 2)


 MainClass: creates the clusters with Mahout. Input: the path of the vectors 
 generated by step 1 (see above). Output: the clusters. For the 
 moment it does not create any clusters because of this error:
 Exception in thread "main" java.io.FileNotFoundException: File 
 file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
   at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
   at 
 org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
   at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
   at 
 org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
   at 
 org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
  at main.MainClass.main(MainClass.java:144)


 Can anyone help me solve this exception? I can't understand 
 why the data could not be created… I'm using the Hadoop and Mahout libs on 
 Windows (and I'm an administrator, so it should not be a permissions problem).


 This is part of the code :


 Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
     conf, chuckSize);

 TFIDFConverter.processTfIdf(
     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
     new Path(outputDir), conf, calculate, minDf, maxDFPercent, norm, true,
     sequentialAccessOutput, false, reduceTasks);

 Path vectorFolder = new Path(output);
 Path canopyCentroids = new Path(outputDir, "canopy-centroids");

 Path clusterOutput = new Path(outputDir, "clusters");

 CanopyDriver.run(vectorFolder, canopyCentroids,
     new EuclideanDistanceMeasure(), 250, 120, false, 3, false);

 KMeansDriver.run(conf, vectorFolder, new Path(canopyCentroids, "clusters-0"),
     clusterOutput, new TanimotoDistanceMeasure(), 0.01, 20, true, 3, false);
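
 Since the stack trace above shows the job looking for .../tf-vectors/wordcount/data on the
 local filesystem, a quick check like the sketch below (the directory is only an example) can
 confirm which filesystem the Configuration resolves to and whether the input path is visible
 before TFIDFConverter.calculateDF() is called:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckInputPath {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Example path only; use the same Path object handed to TFIDFConverter.
    Path input = new Path("F:/MAHOUT/TesMahout/clusters/tf-vectors");
    FileSystem fs = FileSystem.get(input.toUri(), conf);
    // Shows whether the path resolves to the local filesystem or HDFS,
    // and lists what the job will actually see there.
    System.out.println("filesystem: " + fs.getUri());
    System.out.println("exists:     " + fs.exists(input));
    if (fs.exists(input)) {
      for (FileStatus status : fs.listStatus(input)) {
        System.out.println("  " + status.getPath());
      }
    }
  }
}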


 Thank you for your time




 Regards


RE: clustering with kmeans, java app

2012-08-07 Thread Videnova, Svetlana
Hi,

Yes, I'm using the Mahout and Hadoop libs on Windows.
My cluster output is not written to HDFS but to the LOCAL filesystem.
Thanks to Cygwin I am able to run Unix commands in order to run Mahout on 
Windows.
I changed the paths for Windows as well.

I didn't test whether wordcount works, because I am only using the Mahout libs and 
did not try to run the examples.
I was not following any tutorial, but I found this, which may help you: 
http://blogs.msdn.com/b/avkashchauhan/archive/2012/03/06/running-apache-mahout-at-hadoop-on-windows-azure-www-hadooponazure-com.aspx



Cheers


-Original Message-
From: Yuval Feinstein [mailto:yuv...@citypath.com] 
Sent: Tuesday, 7 August 2012 08:16
To: user@mahout.apache.org
Subject: Re: clustering with kmeans, java app

I spent a week trying to get Hadoop to work on Windows 7, and then gave up.
Do you manage to run Hadoop on Windows? Do Hadoop tests (e.g. wordcount) work?
http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of 
details about this.
Some of the possible problems are cygwin paths (!= linux paths), hdfs/local 
filesystem confusion, your hadoop user (!= your user permissions-wise), or 
other things listed at the link above.
Good luck,
Yuval

On Thu, Aug 2, 2012 at 11:57 AM, Videnova, Svetlana 
svetlana.viden...@logica.com wrote:

 Hello,

 I’m doing java app for clustering my data with kmeans.

 Those are the steps:

 1)

 LuceneDemo : Create index and vectors using lib Lucene.vector, input 
 path of my .txt, output index (segments_1, segments.gen, .fdt, .fdx, 
 .fnm, .frq, .nrm, .prx, .tii, .tis, .tvd, .tvx and the most important 
 who will be using by mahout .tvf) and vectors looking like that 
 (SEQ__org.apache.hadoop.io.Text_org.apache.hadoop.io.Text__t€ðàó^æ
 VG²RŸ˜Õ_Ž__P(0):{15:1.4650986194610596,14:0.9997141361236572,1
 1:0.9997141361236572,10:0.9997141361236572,9:0.9997141361236572,8:1.46
 50986194610596,7:1.4650986194610596,6:1.4650986194610596,5:0.999714136
 1236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4650986194610596
 ,0:0.9997141361236572}_Ž__P(1):{15:1.4650986194610596,14:0.999
 7141361236572,11:0.9997141361236572,10:0.9997141361236572,9:0.99971413
 61236572,8:1.4650986194610596,7:1.4650986194610596,6:1.465098619461059
 6,5:0.9997141361236572,4:1.4650986194610596,2:3.1613736152648926,1:1.4
 650986194610596,0:0.9997141361236572}_Ž__P(2):{ [… and 
 others])

 Does anyone please can confirm me that the output format looks good? If no, 
 what the vectors generated by lucene.vector should look like?

 This is part of the code :
 /*Creating vectors*/
Map vectorMap = new TreeMap();
IndexReader reader = IndexReader.open(index);
int numDoc = reader.maxDoc();
for (int i = 0; i < numDoc; i++) {
    TermFreqVector termFreqVector = reader.getTermFreqVector(i, "content");
    addTermFreqToMap(vectorMap, termFreqVector);
}




 2)


 MainClass : Create clusters with mahout, input – path of vectors (the vectors 
 generated by step 1 see above) , output -  clusters (looking like : for the 
 moment does not create any clusters cause of this error :
 Exception in thread main java.io.FileNotFoundException: File 
 file:/F:/MAHOUT/TesMahout/clusters/tf-vectors/wordcount/data does not exist.
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
   at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
   at 
 org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
   at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
   at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
   at 
 org.apache.mahout.vectorizer.tfidf.TFIDFConverter.startDFCounting(TFIDFConverter.java:368)
   at 
 org.apache.mahout.vectorizer.tfidf.TFIDFConverter.calculateDF(TFIDFConverter.java:198)
   at main.MainClass.main(MainClass.java:144))


 Does anyone please can help me to solve this exception? I can’t understand 
 why data could not be created… while I’m using hadoop and mahout libs on 
 windows (and I’m admin so should not be problem of rights).


 This is part of the code :


 Pair<Long[], List<Path>> calculate = TFIDFConverter.calculateDF(
     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
     conf, chuckSize);

 TFIDFConverter.processTfIdf(new 
 

Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-07 Thread Yuval Feinstein
This is the case:
https://issues.apache.org/jira/browse/MAHOUT-973
The bug exists in Mahout 0.6 and was fixed in Mahout 0.7.
I also used the workaround of passing a high value for --maxDFPercent
(I guess the number of documents in the corpus is a high enough value).
Maybe it would be good to fix this in 0.6 as well?
Thanks,
Yuval

On Fri, Aug 3, 2012 at 11:55 PM, Sean Owen sro...@gmail.com wrote:
 This sounds a lot like a bug that was fixed by a patch some time ago. Grant
 I think it was something I had wanted you to double-check, not sure if you
 had a look. But I think it was fixed if it's the same issue.

 On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel p.abra...@rambler-co.ruwrote:

 Thanks for this idea.

 Looks like a bug:
 1) Setting --maxDFPercent to 100 has no effect
 2) Setting --maxDFPercent to 1 000 000 000 makes TFIDF vectors Ok.

 seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a
 percentage; maxDFPercent is an absolute value.


 Pavel




 On 01.08.12 at 20:46, Robin Anil robin.a...@gmail.com wrote:

 Tfidf job is where the document frequency pruning is applied. Try
 increasing maxDFPercent to 100 %
 
 On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
 p.abra...@rambler-co.ruwrote:
 
  Hello!
 
  I have trouble running the example seq2sparse with TFIDF weights. My
 TF
  vectors are Ok, while TFIDF vectors are 10 times smaller. Looks like
  seq2sparse cuts my terms during TFxIDF step. Document1 in TF vector has
 20
  terms, while Document1 in TFIDF vector
   has only 2 terms. What is wrong? I spent 2 days finding the answer and
  configuring seq2sparse parameters ((
 
  Thanks in advance!
 
  mahout seq2sparse -ow  \
  -chunk 512 \
  --maxDFPercent 90 \
  --maxNGramSize 1 \
  --numReducers 128 \
  --minSupport 150 \
  -i --- \
  -o --- \
  -wt tfidf \
  --namedVector \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer
 
  Pavel
 
 




RE: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Videnova, Svetlana
I already generated the points directory when I ran the clustering (k-means in my case).
But for the moment I can't generate the cluster dump because of an error on this line:
ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf);
The second parameter is a double, but it wants an int, yet it doesn't accept an int either... well, 
pretty confused ...



-Original Message-
From: kiran kumar [mailto:kirankumarsm...@gmail.com] 
Sent: Monday, 6 August 2012 18:01
To: user@mahout.apache.org
Subject: Re: ClusterDumper eclipse human readable output kmeans

Hello,
Clusterdump actually shows you the top terms and the vectors of the centroid and of each 
document. But to identify which vectors belong to which document, you need to 
generate the points directory when running the clustering algorithm and use that points 
directory when generating the cluster dump.

Thanks,
Kiran Bushireddy.

On Mon, Aug 6, 2012 at 10:33 AM, Videnova, Svetlana  
svetlana.viden...@logica.com wrote:

 Hi,

 My goal is to transform the vectors created by lucene.vector (and clustered 
 by k-means) into a human-readable format. For that I am using the 
 ClusterDumper class in Eclipse. But that code does not generate 
 any files. What am I missing? What is the best approach to transform the 
 output of k-means into something human-readable (no Unix commands please, I am 
 on Windows using Eclipse and Cygwin).
 This is the code:


 Code :

 Map<Integer, List<WeightedVectorWritable>> result = 
 ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, 
 conf);

 System.out.println(result.get(0).toString());
 for (int j = 0; j < result.size(); j++) {
   List<WeightedVectorWritable> list = result.get(j);
   for (WeightedVectorWritable vector : list) {
     System.out.println(vector.getVector().asFormatString());
   }
 }


 Error :

 Exception in thread "main" java.lang.ClassCastException:
 org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast 
 to org.apache.mahout.clustering.classify.WeightedVectorWritable
   at main.LuceneDemo.main(LuceneDemo.java:260)



 Thank you






--
Thanks & Regards,
Kiran Kumar





Re: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Paritosh Ranjan
I don't know why ClusterDumper is not working, but I can offer an 
alternative solution.


Use ClusterOutputPostProcessor (clusterpp) on the clusters-*-final 
directory: https://cwiki.apache.org/MAHOUT/top-down-clustering.html
It will arrange the vectors into their respective cluster directories. However, they will 
still be in the form of sequence files.


It's very simple to read a sequence file and write it out in a human-readable 
format.


Classes in the org.apache.mahout.common.iterator.sequencefile package can 
help to read the sequence files easily.
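
For example, a minimal sketch of reading a directory of sequence files with those iterator
classes might look like this (the clusteredPoints path and the comment about its key/value
types are assumptions, not taken from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

public class DumpSequenceFiles {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Example path only: any directory of sequence files, e.g. the clusterpp
    // output or the clusteredPoints directory written by k-means.
    Path dir = new Path("output/kmeans/clusteredPoints");
    for (Pair<Writable, Writable> record :
         new SequenceFileDirIterable<Writable, Writable>(dir, PathType.LIST, conf)) {
      // Keys and values print via their toString(); for clusteredPoints the key is
      // typically the cluster id and the value the weighted vector.
      System.out.println(record.getFirst() + "\t" + record.getSecond());
    }
  }
}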


On 07-08-2012 12:50, Videnova, Svetlana wrote:

I already generated points directory when i run cluster (kmeans in my case).
But for the moment I can't generate clustedump because of error on this line:
ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf);
Second parameter is double but he wants int but does not accept int  well 
pretty confused ...



-Original Message-
From: kiran kumar [mailto:kirankumarsm...@gmail.com]
Sent: Monday, 6 August 2012 18:01
To: user@mahout.apache.org
Subject: Re: ClusterDumper eclipse human readable output kmeans

Hello,
Clusterdump actually shows you the top terms and vectors of centroid and each 
document. But to identify what vectors are for your document, You need to 
generate points directory when running clustering algorithm and use the points 
directory generated in the above step when generating cluster dump.

Thanks,
Kiran Bushireddy.

On Mon, Aug 6, 2012 at 10:33 AM, Videnova, Svetlana  
svetlana.viden...@logica.com wrote:


Hi,

My goal is to transform the vectors created by lucene.vector (thanks
to kmeans clustering) to a human readable format. For that I am using
ClusterDumper function on eclipse. But that code does not generate
none files. What am I missing? What is the best approach to transform
output of kmeans to a human readable (no unix command please I am on
windows using eclipse and cygwin).
This is the code:


Code :

Map<Integer, List<WeightedVectorWritable>> result =
ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2,
conf);

 System.out.println(result.get(0).toString());
 for (int j = 0; j < result.size(); j++) {
   List<WeightedVectorWritable> list = result.get(j);
   for (WeightedVectorWritable vector : list) {
     System.out.println(vector.getVector().asFormatString());
   }
 }


Error :

Exception in thread main java.lang.ClassCastException:
org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast
to org.apache.mahout.clustering.classify.WeightedVectorWritable
   at main.LuceneDemo.main(LuceneDemo.java:260)



Thank you






--
Thanks  Regards,
Kiran Kumar








RE: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Videnova, Svetlana
I just managed to get my app working. I had to use 
ClusterDumperWriter.gettopfeatures(arg1, arg2, arg3), and that gave me the top 
words in a human-readable format :D



-Original Message-
From: Paritosh Ranjan [mailto:pran...@xebia.com] 
Sent: Tuesday, 7 August 2012 10:32
To: user@mahout.apache.org
Subject: Re: ClusterDumper eclipse human readable output kmeans

I don't know why ClusterDumper is not working, but I can give an alternate 
solution.

Use ClusterOutputPostProcessor  (clusterpp), on the clusters-*-final directory. 
https://cwiki.apache.org/MAHOUT/top-down-clustering.html
It will arrange the vectors in respective directories. However, it will still 
be in the form of sequence files.

Its very simple to read a sequence file and write in a human readable format.

Classes in org.apache.mahout.common.iterator.sequencefile package can help to 
read the sequence files easily.

On 07-08-2012 12:50, Videnova, Svetlana wrote:
 I already generated points directory when i run cluster (kmeans in my case).
 But for the moment I can't generate clustedump because of error on this line:
 ClusterDumper.readPoints(new Path(output/kmeans/clusters-0), 2, 
 conf); Second parameter is double but he wants int but does not accept int 
  well pretty confused ...



 -Original Message-
 From: kiran kumar [mailto:kirankumarsm...@gmail.com]
 Sent: Monday, 6 August 2012 18:01
 To: user@mahout.apache.org
 Subject: Re: ClusterDumper eclipse human readable output kmeans

 Hello,
 Clusterdump actually shows you the top terms and vectors of centroid and each 
 document. But to identify what vectors are for your document, You need to 
 generate points directory when running clustering algorithm and use the 
 points directory generated in the above step when generating cluster dump.

 Thanks,
 Kiran Bushireddy.

 On Mon, Aug 6, 2012 at 10:33 AM, Videnova, Svetlana  
 svetlana.viden...@logica.com wrote:

 Hi,

 My goal is to transform the vectors created by lucene.vector (thanks 
 to kmeans clustering) to a human readable format. For that I am using 
 ClusterDumper function on eclipse. But that code does not generate 
 none files. What am I missing? What is the best approach to transform 
 output of kmeans to a human readable (no unix command please I am on 
 windows using eclipse and cygwin).
 This is the code:


 Code :

 Map<Integer, List<WeightedVectorWritable>> result = 
 ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, 
 conf);

  System.out.println(result.get(0).toString());
  for (int j = 0; j < result.size(); j++) {
    List<WeightedVectorWritable> list = result.get(j);
    for (WeightedVectorWritable vector : list) {
      System.out.println(vector.getVector().asFormatString());
    }
  }


 Error :

 Exception in thread main java.lang.ClassCastException:
 org.apache.mahout.clustering.iterator.ClusterWritable cannot be cast 
 to org.apache.mahout.clustering.classify.WeightedVectorWritable
at main.LuceneDemo.main(LuceneDemo.java:260)



 Thank you





 --
 Thanks  Regards,
 Kiran Kumar











Re: Tags generation?

2012-08-07 Thread SAMIK CHAKRABORTY
Hi All,

We have developed an auto tagging system for our micro-blogging platform.
Here is what we have done:

The purpose of the system was to look for tags in an article automatically
when someone posts a link on our micro-blogging site. The goal was to allow
us to follow a tag instead of (or in addition to) a person. So we used some
custom code on top of Mahout, UIMA, OpenNLP, etc.

If you are interested to see how it works take a look at:
http://www.scoopspot.com/

One more thing: we also created a robot which goes to some well-known
web sites (ReadWriteWeb, Hacker News, TechCrunch, etc.), gets the articles
from the web and publishes them to our micro-blog. As we
already have the tag following, we get the information without any problem.
That's very cool (to us at least). You can see the output of the robot at
this location:

http://news.scoopspot.com/

I thought this might be an example of what Mahout can do, and it is related to
this thread, so I felt like sharing it with you.

Sorry if it seems off-topic.

Regards,
Samik

On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog goks...@gmail.com wrote:

 I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
 'verb', etc. I removed all words that were not nouns or verbs. In my
 use case, this is a total win. In other cases, maybe not: Twitter has
  a quite varied non-grammar.
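
 A minimal sketch of that kind of part-of-speech filter, assuming OpenNLP 1.5.x and a
 downloaded en-pos-maxent.bin model (the model path and the whitespace tokenization are
 placeholders, not from Lance's code):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class NounVerbFilter {
  public static void main(String[] args) throws Exception {
    // Placeholder path to the pre-trained English POS model.
    InputStream modelIn = new FileInputStream("en-pos-maxent.bin");
    POSTaggerME tagger = new POSTaggerME(new POSModel(modelIn));
    modelIn.close();

    String[] tokens = "Mahout clusters documents quickly and reliably".split(" ");
    String[] tags = tagger.tag(tokens);

    // Keep only tokens tagged as nouns (NN*) or verbs (VB*) in the Penn Treebank tag set.
    List<String> kept = new ArrayList<String>();
    for (int i = 0; i < tokens.length; i++) {
      if (tags[i].startsWith("NN") || tags[i].startsWith("VB")) {
        kept.add(tokens[i]);
      }
    }
    System.out.println(kept);
  }
}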

 On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel p...@farfetchers.com wrote:
  The way back from stem to tag is interesting from the standpoint of
 making tags human readable. I had assumed a lookup but this seems much more
 satisfying and flexible. In order to keep frequencies it will take
 something like a dictionary creation step in the analyzer. This in turn
 seems to imply a join so a whole new map reduce job--maybe not completely
 trivial?
 
  It seems that NLP can be used in two very different ways here. First as
 a filter (keep only nouns and verbs?) second to differentiate semantics
 (can:verb, can:noun). One method is a dimensional reduction technique the
 other increases dimensions but can lead to orthogonal dimensions from the
 same term. I suppose both could be used together as the above example
 indicates.
 
  It sounds like you are using it to filter (only?) Can you explain what
 you mean by:
  One thing came through- parts-of-speech selection for nouns and verbs
  helped 5-10% in every combination of regularizers.'
 
 
  On Aug 3, 2012, at 6:31 PM, Lance Norskog goks...@gmail.com wrote:
 
  Thanks everyone- I hadn't considered the stem/synonym problem. I have
  code for regularizing a doc/term matrix with tf, binary, log and
  augmented norm for the cells and idf, gfidf, entropy, normal (term
  vector) and probabilistic inverse. Running any of these, and then SVD,
  on a Reuters article may take 10-20 ms. This uses a sentence/term
  matrix for document summarization. After doing all of this, I realized
  that maybe just the regularized matrix was good enough.
 
  One thing came through- parts-of-speech selection for nouns and verbs
  helped 5-10% in every combination of regularizers. All across the
  board. If you want good tags, select your parts of speech!
 
  On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
  dawid.we...@cs.put.poznan.pl wrote:
  I know, I know. :) Just wanted to mention that it could lead to funny
   results, that's all. There are lots of ways of doing proper form
   disambiguation, including shallow tagging, which then allows you to
   retrieve correct base forms for lemmas, not stems. Stemming is
   typically good enough (and fast), so your advice was 100% fine.
 
  Dawid
 
  On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  This is definitely just the first step.  Similar goofs happen with
  inappropriate stemming.  For instance, AIDS should not stem to aid.
 
  A reasonable way to find and classify exceptional cases is to look at
  cooccurrence statistics.  The contexts of original forms can be
 examined to
  find cases where there is a clear semantic mismatch between the
 original
  and the set of all forms that stem to the same form.
 
  But just picking the most common that is present in the document is a
  pretty good step for all that it produces some oddities.  The results
 are
  much better than showing a user the stemmed forms.
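
  A minimal sketch of that approach (an unstemming dictionary with frequencies, then
  picking the most frequent original form); the stem() helper below is a hypothetical
  toy stand-in, not a real stemmer:

import java.util.HashMap;
import java.util.Map;

public class UnstemDictionary {

  // Hypothetical stemmer hook; plug in whatever stemmer the analyzer actually uses.
  static String stem(String word) {
    return word.toLowerCase().replaceAll("(ing|ed|s)$", ""); // toy stand-in only
  }

  // stemmed form -> (original form -> frequency)
  private final Map<String, Map<String, Integer>> dict =
      new HashMap<String, Map<String, Integer>>();

  // Call this for every token seen while analyzing the corpus.
  void observe(String original) {
    String stemmed = stem(original);
    Map<String, Integer> forms = dict.get(stemmed);
    if (forms == null) {
      forms = new HashMap<String, Integer>();
      dict.put(stemmed, forms);
    }
    Integer count = forms.get(original);
    forms.put(original, count == null ? 1 : count + 1);
  }

  // Pick the most frequent original form recorded for a stem.
  String unstem(String stemmed) {
    Map<String, Integer> forms = dict.get(stemmed);
    if (forms == null) {
      return stemmed;
    }
    String best = stemmed;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : forms.entrySet()) {
      if (e.getValue() > bestCount) {
        bestCount = e.getValue();
        best = e.getKey();
      }
    }
    return best;
  }
}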
 
  On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss 
 dawid.we...@cs.put.poznan.plwrote:
 
  Unstemming is pretty simple.  Just build an unstemming dictionary
 based
  on
   seeing what word forms have led to a stemmed form.  Include
 frequencies.
 
  This can lead to very funny (or not, depends how you look at it)
  mistakes when different lemmas stem to the same token. How frequent
  and important this phenomenon is varies from language to language (and
  can be calculated apriori).
 
  Dawid
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Tags generation?

2012-08-07 Thread Ted Dunning
Nice stuff.  And glad that Mahout was able to help!

On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY sam...@gmail.com wrote:

 Hi All,

 We have developed an auto tagging system for our micro-blogging platform.
 Here is what we have done:

 The purpose of the system was to look for tags in an articles automatically
 when someone posts a link in our micro-blogging site. The goal was to allow
 us to follow a tag instead (in addition) of (to) a person. So we used some
 custom code on top of Mahout, UIMA, Open-NLP etc.

 If you are interested to see how it works take a look at:
 http://www.scoopspot.com/

 One more thing, we also created a robot which goes to some of the well
 known web sites like: Read Write Web, Hackers News, Tech Crunch etc which
 gets the article from the web and publishes that to our micro-blog. As we
 already have the tag following, we get the information without any problem.
 That's very cool (to us at least). You can see the output of the robot at
 this location:

 http://news.scoopspot.com/

 I thought, this might be an example of what Mahout can do and related to
 this thread, so felt like sharing with you guys.

 Sorry if it looks like off-topic.

 Regards,
 Samik

 On Tue, Aug 7, 2012 at 6:49 AM, Lance Norskog goks...@gmail.com wrote:

  I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun',
  'verb', etc. I removed all words that were not nouns or verbs. In my
  use case, this is a total win. In other cases, maybe not: Twitter has
  a quite varied non-grammer.
 
  On Sun, Aug 5, 2012 at 10:11 AM, Pat Ferrel p...@farfetchers.com wrote:
   The way back from stem to tag is interesting from the standpoint of
  making tags human readable. I had assumed a lookup but this seems much
 more
  satisfying and flexible. In order to keep frequencies it will take
  something like a dictionary creation step in the analyzer. This in turn
  seems to imply a join so a whole new map reduce job--maybe not completely
  trivial?
  
   It seems that NLP can be used in two very different ways here. First as
  a filter (keep only nouns and verbs?) second to differentiate semantics
  (can:verb, can:noun). One method is a dimensional reduction technique the
  other increases dimensions but can lead to orthogonal dimensions from the
  same term. I suppose both could be used together as the above example
  indicates.
  
   It sounds like you are using it to filter (only?) Can you explain what
  you mean by:
   One thing came through- parts-of-speech selection for nouns and verbs
   helped 5-10% in every combination of regularizers.'
  
  
   On Aug 3, 2012, at 6:31 PM, Lance Norskog goks...@gmail.com wrote:
  
   Thanks everyone- I hadn't considered the stem/synonym problem. I have
   code for regularizing a doc/term matrix with tf, binary, log and
   augmented norm for the cells and idf, gfidf, entropy, normal (term
   vector) and probabilistic inverse. Running any of these, and then SVD,
   on a Reuters article may take 10-20 ms. This uses a sentence/term
   matrix for document summarization. After doing all of this, I realized
   that maybe just the regularized matrix was good enough.
  
   One thing came through- parts-of-speech selection for nouns and verbs
   helped 5-10% in every combination of regularizers. All across the
   board. If you want good tags, select your parts of speech!
  
   On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss
   dawid.we...@cs.put.poznan.pl wrote:
   I know, I know. :) Just wanted to mention that it could lead to funny
   results, that's all. There are lots of way of doing proper form
   disambiguation, including shallow tagging which then allows to
   retrieve correct base forms for lemmas, not stems. Stemming is
   typically good enough (and fast) so your advise was 100% fine.
  
   Dawid
  
   On Fri, Aug 3, 2012 at 9:31 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:
   This is definitely just the first step.  Similar goofs happen with
   inappropriate stemming.  For instance, AIDS should not stem to aid.
  
   A reasonable way to find and classify exceptional cases is to look at
   cooccurrence statistics.  The contexts of original forms can be
  examined to
   find cases where there is a clear semantic mismatch between the
  original
   and the set of all forms that stem to the same form.
  
   But just picking the most common that is present in the document is a
   pretty good step for all that it produces some oddities.  The results
  are
   much better than showing a user the stemmed forms.
  
   On Fri, Aug 3, 2012 at 1:05 PM, Dawid Weiss 
  dawid.we...@cs.put.poznan.plwrote:
  
   Unstemming is pretty simple.  Just build an unstemming dictionary
  based
   on
   seeing what word forms have lead to a stemmed form.  Include
  frequencies.
  
   This can lead to very funny (or not, depends how you look at it)
   mistakes when different lemmas stem to the same token. How frequent
   and important this phenomenon is varies from language to language
 (and
  

how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Dominik Lahmann
Hi,

I would like to know how I can deal with multiple preference values
for the same (user, item)-pair from a machine learning perspective.
That means I have more than one rating from a user u for an item i
available.
Of course, using any kind of average (maybe also taking date information
into account, e.g. by using a weighted/exponential moving average)
would be possible.

I am interested in whether any more sophisticated methods are used.

It would probably already be very helpful to know which term to
search for, or to have some papers on that topic.

As far as I noticed, Mahout always just takes the newest preference
value. Is that correct?

Thanks a lot,
Dominik


Re: how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Julian Ortega
As far as I remember, Mahout overrides older preference values with the
newest one.

On Tue, Aug 7, 2012 at 2:14 PM, Dominik Lahmann 
dominik.lahm...@fu-berlin.de wrote:

 Hi,

 I would like to know how I can deal with multiple preference values
 for the same (user, item)-pair from a machine learning perspective?
 That means, I have got more than one rating from a user u for an item i
 available.
 Of course using any kind of average (maybe also taking date information
 into account, e.g. by using a weighted/exponential moving average)
 would be possible.

 I am interested in if any more sophisticated methods are used.

 Probably it would already be very helpful to know which term to
 look/search for or have some papers on that topic.

 As far a I noticed Mahout would always just take the newest preference
 value. Is that correct?

 Thanks a lot,
 Dominik



Re: how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Sean Owen
It depends on what the values really mean. If they are something like
ratings, using the most recent version makes most sense. (This is what the
implementations do now.) If they are some kind of sampled reading it might
make sense to take an average. If the input is based on observed activity,
it may be best to accumulate (sum) the data, perhaps with some decay factor.
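
As an illustration of the "average" option, a pre-processing sketch that collapses duplicate
(user, item) ratings before handing the file to a DataModel could look like this (the CSV
layout userID,itemID,value and the file name are assumptions, not from this thread):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class AveragePreferences {
  public static void main(String[] args) throws IOException {
    // Assumed CSV layout: userID,itemID,value
    Map<String, double[]> sums = new HashMap<String, double[]>(); // key -> {sum, count}
    BufferedReader in = new BufferedReader(new FileReader("prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split(",");
      String key = parts[0] + ',' + parts[1];
      double value = Double.parseDouble(parts[2]);
      double[] acc = sums.get(key);
      if (acc == null) {
        sums.put(key, new double[] {value, 1});
      } else {
        acc[0] += value;
        acc[1]++;
      }
    }
    in.close();
    // Emit one averaged preference per (user, item) pair.
    for (Map.Entry<String, double[]> e : sums.entrySet()) {
      System.out.println(e.getKey() + ',' + (e.getValue()[0] / e.getValue()[1]));
    }
  }
}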

On Tue, Aug 7, 2012 at 1:14 PM, Dominik Lahmann 
dominik.lahm...@fu-berlin.de wrote:

 Hi,

 I would like to know how I can deal with multiple preference values
 for the same (user, item)-pair from a machine learning perspective?
 That means, I have got more than one rating from a user u for an item i
 available.
 Of course using any kind of average (maybe also taking date information
 into account, e.g. by using a weighted/exponential moving average)
 would be possible.

 I am interested in if any more sophisticated methods are used.

 Probably it would already be very helpful to know which term to
 look/search for or have some papers on that topic.

 As far a I noticed Mahout would always just take the newest preference
 value. Is that correct?

 Thanks a lot,
 Dominik



Re: Question about recommender database drivers

2012-08-07 Thread kiran kumar
I have used the same steps to create the dictionary and vector output from
Solr using the *lucene.vector* command.
Is there any way to pull only the latest changes from Solr and create vectors for them?
And later, how do we run clustering algorithms using these incremental vector
files? Can you shed some light on this?

Thanks,
Kiran Bushireddy.

On Thu, Aug 2, 2012 at 3:04 AM, Sean Owen sro...@gmail.com wrote:

 The backing store doesn't matter much, in the sense that using it for
 real-time computation needs it to all end up in memory anyway. It can live
 wherever you want before that, like Solr. It's not going to be feasible to
 run anything in real-time off Solr or any other store. Yes the trick is to
 use Solr to figure out what has changed efficiently much like update files.

 If you're using Hadoop, same answer mostly. It's going to read serially
 from wherever the data is and most stores are fine at listing out all data
 sequentially.


 On Thu, Aug 2, 2012 at 3:52 AM, Matt Mitchell goodie...@gmail.com wrote:

  Hi,
 
  The data I'm using to generate preferences happens to be in a solr
  index. Would it be feasible, or make any sense, to write an adapter so
  that I can use solr to store the preferences as well? The solr
  instance could be embedded since this is all java, and would probably
  end up being pretty quick. Our data is coming in fast, and I think
  we'll outgrow the file based approach quickly. Thoughts?
 
  - Matt
 




-- 
Thanks & Regards,
Kiran Kumar


Re: LDA Questions

2012-08-07 Thread Gokhan Capan
Hi Jake,

Today I submitted the diff. It is available at
https://issues.apache.org/jira/browse/MAHOUT-1051

Thanks for the advice.

On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix jake.man...@gmail.com wrote:

 Sounds great Gokhan!

 On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan gkhn...@gmail.com wrote:

  Jake,
 
  I converted the ids to integers with rowid, and then
  modified InMemoryCollapsedVariationBayes0.loadVectors() such that it
  returns a SparseMatrix (instead of SparseRowMatrix) whose row ids are
 keys
  from IntWritable, VectorWritable tf vectors. I am not sure if it works,
  since the values of mapped integer ids (results of rowid) are in the
 range
  [0, #ofDocuments), but I
  believe it does.
 
  Constructing SparseMatrix needs RandomAccessSparseVector as row vectors
 and
  tf-vectors are sparse vectors, so I assumed that an incoming tf vector
  itself, or getDelegate if it is a NamedVector, can be cast to
  RandomAccessSparseVector.
  I will submit the diff tomorrow, so you can review and commit.
 
  Thank you for your help.
 
 
  On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix jake.man...@gmail.com
 wrote:
 
   Hi Gokhan,
  
  This looks like a bug in the
    InMemoryCollapsedVariationBayes0.loadVectors()
    method - it takes the SequenceFile<? extends Writable, VectorWritable>
    and ignores
    the keys, assigning the rows in order into an in-memory Matrix.
   
  If you run $MAHOUT_HOME/bin/mahout rowid -i <your tf-vector path> -o
    <output path>,
    this converts Text keys into IntWritable keys (and leaves behind an
    index file, a mapping
    of Text -> IntWritable which tells you which int is assigned to which
    original text key).
  
 Then what you'd want to do is modify
   InMemoryCollapsedVariationBayes0.loadVectors()
   to actually use the keys which are given to it, instead of reassigning
 to
   sequential
   ids.  If you make this change, we'd love to have the diff - not too
 many
   people use
   the cvb0_local path (it's usually used for debugging and testing
 smaller
   data sets to see that topics are converging properly), but getting it
 to
   actually produce
    document -> topic outputs which correlate with original docIds would be
   very nice! :)
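
    For reference, once the rowid step has produced its mapping, reading it back to
    translate the integer row ids in the LDA output into the original Text document ids
    could look roughly like the sketch below (an assumption: the mapping file is named
    docIndex under the rowid output directory):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;

public class ReadDocIndex {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed location of the int -> original-key mapping written by "mahout rowid".
    Path docIndex = new Path("rowid-output/docIndex");
    Map<Integer, String> idToDoc = new HashMap<Integer, String>();
    for (Pair<IntWritable, Text> record :
         new SequenceFileIterable<IntWritable, Text>(docIndex, conf)) {
      idToDoc.put(record.getFirst().get(), record.getSecond().toString());
    }
    // idToDoc.get(0) now returns the original document id for row 0 of the matrix.
    System.out.println(idToDoc);
  }
}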
  
   On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan gkhn...@gmail.com
 wrote:
  
Hi,
   
My question is about interpreting lda document-topics output.
   
I am using trunk.
   
I have a directory of documents, each of which are named by integers,
  and
there is no sub-directory of the data directory.
The directory structure is as follows
$ ls /path/to/data/
   1
   2
   5
   ...
   
From those documents I want to detect topics, and output:
 - topic -> top terms
 - document -> top topics
   
To this end, I first run seqdirectory on the directory:
$ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
   
Then I run seq2sparse to create tf vectors of documents:
$ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF
 --maxDFSigma 3
--namedVector
   
After creating vectors, I run cvb0_local on those tf-vectors:
$ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to
$LDA_OUT/words -top 20 -m 50 --dictionary
 $SPARSEDIR/dictionary.file-0
   
And to interpret the results, I use mahout's vectordump utility:
$ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize
  10
-sort true -p true
   
$ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words
 --dictionary
$SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile
 --vectorSize
   10
-sort true -p true
   
The resulting words file consists of #ofTopics lines.
I assume each line is in topicID \t wordsVector format, where a
 wordsVector is a sorted vector whose elements are <word, score>
  pairs.
   
The resulting docs file on the other hand, consists of #ofDocuments
   lines.
I assume each line is in documentID \t topicsVector format, where a
 topicsVector is a sorted vector whose elements are <topicID,
   probability>
 pairs.
   
The problem is that the documentID field does not match with the
  original
document ids. This field is populated with zero-based
 auto-incrementing
indices.
   
I want to ask if I am missing something for vectordump to output
  correct
document ids, or this is the normal behavior when one runs lda on a
directory of documents, or I make a mistake in one of those steps.
   
I suspect the issue is seqdirectory assigns Text ids to documents,
  while
 the CVB algorithm expects documents in another format, <IntWritable,
 VectorWritable>. If this is the case, could you help me with assigning
IntWritable ids to documents in the process of creating vectors from
   them?
Or should I modify the o.a.m.text.SequenceFilesFromDirectory code to
 do
   so?
   
Thanks
   
--
Gokhan
   
  
  
  
   --
  
 -jake
  
 
 
 
  --
  Gokhan
 



 --

   -jake




-- 
Gokhan


RE: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-07 Thread Abramov Pavel
Hello Yuval, 

Thanks for the link. 
But I am sure I am using version 0.7. I will double-check it.

Pavel

From: Yuval Feinstein [yuv...@citypath.com]
Sent: 7 August 2012, 11:08
To: user@mahout.apache.org
Subject: Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

This is the case:
https://issues.apache.org/jira/browse/MAHOUT-973
The bug exists in Mahout 0.6 and was fixed in Mahout 0.7.
I also used the workaround of using a high value for --maxDFPercent
(I guess the number of documents in the corpus is enough).
Maybe it will be good to fix it on 0.6 as well?
Thanks,
Yuval

On Fri, Aug 3, 2012 at 11:55 PM, Sean Owen sro...@gmail.com wrote:
 This sounds a lot like a bug that was fixed by a patch some time ago. Grant
 I think it was something I had wanted you to double-check, not sure if you
 had a look. But I think it was fixed if it's the same issue.

 On Thu, Aug 2, 2012 at 8:44 AM, Abramov Pavel p.abra...@rambler-co.ruwrote:

 Thanks for this idea.

 Looks like a bug:
 1) Setting --maxDFPercent to 100 has no effect
 2) Setting --maxDFPercent to 1 000 000 000 makes TFIDF vectors Ok.

 seq2sparse cuts terms with DF > maxDFPercent. So maxDFPercent is not a
 percentage; maxDFPercent is an absolute value.


 Pavel




 On 01.08.12 at 20:46, Robin Anil robin.a...@gmail.com wrote:

 Tfidf job is where the document frequency pruning is applied. Try
 increasing maxDFPercent to 100 %
 
 On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel
 p.abra...@rambler-co.ruwrote:
 
  Hello!
 
  I have trouble running the example seq2sparse with TFIDF weights. My
 TF
  vectors are Ok, while TFIDF vectors are 10 times smaller. Looks like
  seq2sparse cuts my terms during TFxIDF step. Document1 in TF vector has
 20
  terms, while Document1 in TFIDF vector
   has only 2 terms. What is wrong? I spent 2 days finding the answer and
  configuring seq2sparse parameters ((
 
  Thanks in advance!
 
  mahout seq2sparse -ow  \
  -chunk 512 \
  --maxDFPercent 90 \
  --maxNGramSize 1 \
  --numReducers 128 \
  --minSupport 150 \
  -i --- \
  -o --- \
  -wt tfidf \
  --namedVector \
  -a org.apache.lucene.analysis.WhitespaceAnalyzer
 
  Pavel
 
 




KMeans job fails during 2nd iteration. Java Heap space

2012-08-07 Thread Abramov Pavel
Hello, 

I am trying to run the KMeans example on 15,000,000 documents (seq2sparse output).
There are 1,000 clusters, a 200,000-term dictionary, and documents of 3-10 terms 
(titles). seq2sparse produces 200 files of 80 MB each.

My job fails with a Java heap space error. The 1st iteration passes while the 2nd 
iteration fails. In the map phase of buildClusters I see a lot of warnings, but 
it passes. The reduce phase of buildClusters fails with "Java heap space".

I cannot increase reducer/mapper memory in Hadoop. My cluster is tuned well.

How can I avoid this situation? My cluster has 300 mappers and 220 reducers 
running on 40 servers, each with 8 cores and 12 GB RAM.

Thanks in advance!

Here is KMeans parameters:


mahout kmeans -Dmapred.reduce.tasks=200 \
-i ...tfidf-vectors/  \
-o /tmp/clustering_results_kmeans/ \
--clusters /tmp/clusters/ \
--numClusters 1000 \
--numClusters 5 \
--overwrite \
--clustering


Pavel