Question on OnlineLogisticRegression.iris() test case

2014-01-06 Thread Frank Scholten
Hi, I am studying the LR / SGD code and I was wondering why in the iris test case the first element of each vector is set to 1 in the loop parsing the CSV file via v.set(0,1) for (String line : raw.subList(1, raw.size())) { // order gets a list of indexes order.add(order.size());

Re: Question on OnlineLogisticRegression.iris() test case

2014-01-06 Thread Frank Scholten
set element which allows the model to have an intercept term > in addition to terms for the predictor variables. > > > > > On Mon, Jan 6, 2014 at 8:31 AM, Frank Scholten >wrote: > > > Hi, > > > > I am studying the LR / SGD code and I was wondering why in the
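A minimal sketch of the idea in the reply above, in plain Java rather than Mahout's Vector API (the class and method names here are hypothetical, chosen only for illustration): if element 0 of every input vector is fixed to 1, the weight at index 0 behaves as the intercept.

```java
class InterceptDemo {
    // Hypothetical helper: prepends a constant 1 so that weights[0] acts as the intercept.
    static double[] withBias(double[] features) {
        double[] v = new double[features.length + 1];
        v[0] = 1.0; // plays the role of v.set(0, 1) in the test case
        System.arraycopy(features, 0, v, 1, features.length);
        return v;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Logistic prediction: sigmoid of the dot product of weights and inputs.
    static double predict(double[] weights, double[] v) {
        double z = 0.0;
        for (int i = 0; i < v.length; i++) {
            z += weights[i] * v[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // With all predictors at zero, only the intercept weight contributes.
        double[] v = withBias(new double[] {0.0, 0.0});
        double[] weights = {2.0, 0.5, -0.3}; // made-up weights for illustration
        System.out.println(predict(weights, v) == sigmoid(weights[0]));
    }
}
```

Without the constant element, a model of this form would be forced to predict 0.5 whenever all predictors are zero.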

Logistic Regression cost function

2014-01-13 Thread Frank Scholten
Hi, I followed the Coursera Machine Learning course quite a while ago and I am trying to find out how Mahout implements the Logistic Regression cost function in the code surrounding AbstractOnlineLogisticRegression. I am looking at the train method in AbstractOnlineLogisticRegression and I see on

Re: Logistic Regression cost function

2014-01-13 Thread Frank Scholten
uneel Marthi wrote: > Mahout's impl is based off of Leon Bottou's paper on this subject. I > don't have the link handy but it's referenced in the code or try google > search > > Sent from my iPhone > > > On Jan 13, 2014, at 7:14 AM, Frank Scholten >

Re: Logistic Regression cost function

2014-01-13 Thread Frank Scholten
df > > > On Mon, Jan 13, 2014 at 1:14 PM, Suneel Marthi >wrote: > > > I think this is the one. Yes, I don't see this paper referenced in the > > code sorry about that. > > http://leon.bottou.org/publications/pdf/compstat-2010.pdf > > > > > >

Re: Logistic Regression cost function

2014-01-14 Thread Frank Scholten
iteseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf double newValue = beta.getQuick(i, j) + learningRate * perTermLearningRate(j) * instance.get(j) * gradientBase; Cheers, Frank On Mon, Jan 13, 2014 at 10:54 PM, Frank Scholten wrote: > Thanks guys, I h
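A rough sketch of what an update of this shape does, written in plain Java and not taken from Mahout. It assumes the binary case, where gradientBase is the difference between the label and the predicted probability, and it leaves out the per-term learning rate and any regularization; the class and method names are hypothetical.

```java
class SgdStepDemo {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One stochastic gradient step for binary logistic regression.
    // Assumption: gradientBase = y - p, the gradient of the log-likelihood
    // with respect to the linear score.
    static void step(double[] beta, double[] x, int y, double learningRate) {
        double z = 0.0;
        for (int j = 0; j < x.length; j++) {
            z += beta[j] * x[j];
        }
        double gradientBase = y - sigmoid(z);
        for (int j = 0; j < x.length; j++) {
            // Same shape as the quoted line, minus perTermLearningRate(j).
            beta[j] = beta[j] + learningRate * x[j] * gradientBase;
        }
    }

    public static void main(String[] args) {
        double[] beta = {0.0, 0.0};
        // x[0] = 1 is the intercept input; the label is 1.
        step(beta, new double[] {1.0, 2.0}, 1, 0.1);
        System.out.println(beta[0] + " " + beta[1]);
    }
}
```

Each step moves the weights in the direction that raises the predicted probability of the observed label, scaled by how wrong the current prediction is.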

SGD classifier demo app

2014-02-03 Thread Frank Scholten
Hi all, I am exploring Mahout's SGD classifier and would like some feedback because I think I didn't properly configure things. I created an example app that trains an SGD classifier on the 'bank marketing' dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing My app is at: https://g

Re: Data(Set) creation of for train and test.

2014-02-03 Thread Frank Scholten
Have a look at OnlineLogisticRegressionTest.iris(). Here List.subList() is used in combination with Collections.shuffle() to make the train and test dataset split. So you could first read the dataset in a list and then use this trick. I just pushed an example to Github that also uses this approa
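The shuffle-then-subList trick can be sketched as follows. This is a generic illustration using only java.util, not the code from the Mahout test; the class name, the trainFraction parameter and the fixed seed are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainTestSplitDemo {
    // Returns two views over a shuffled copy: index 0 is train, index 1 is test.
    static <T> List<List<T>> split(List<T> data, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(data); // leave the caller's list untouched
        Collections.shuffle(shuffled, new Random(seed)); // seeded for repeatable splits
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, cut));
        parts.add(shuffled.subList(cut, shuffled.size()));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.8, 42L);
        System.out.println("train=" + parts.get(0).size() + " test=" + parts.get(1).size());
    }
}
```

Note that subList returns views backed by the shuffled copy, so the two halves never overlap and no extra copying is needed.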

Re: Data(Set) creation of for train and test.

2014-02-03 Thread Frank Scholten
Sorry I didn't properly read your message. The random forest code is quite different and what I suggested is not applicable. The DataConverter converts a String to a Vector wrapped by Instance. With this you can create your training set I think. On Mon, Feb 3, 2014 at 10:09 PM, Frank Sch

Annotation based vectorizer

2014-02-03 Thread Frank Scholten
Hi all, I put together a utility which vectorizes plain old Java objects annotated with @Feature and @Target via Mahout's vector encoders. See my Github branch: https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer and the unit test: https://github.com/frankscholten/mahout/blo

Re: Annotation based vectorizer

2014-02-03 Thread Frank Scholten
The second field of Newsgroup should be called bodyText of course. On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten wrote: > Hi all, > > I put together a utility which vectorizes plain old Java objects annotated > with @Feature and @Target via Mahout's vector encoders. >

Re: SGD classifier demo app

2014-02-04 Thread Frank Scholten
every unique value should end up in a different location because the > >>>> continuous value is part of the hashing. Try adding the weight > directly > >>>> using a static word value encoder, addToVector("pDays",v,pDays) > >>>> > >>>>

Re: SGD classifier demo app

2014-02-04 Thread Frank Scholten
Thanks to you too, Johannes, for your comments! On Tue, Feb 4, 2014 at 7:39 PM, Frank Scholten wrote: > Thanks Ted! > > Would indeed be a nice example to add. > > > On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning wrote: > >> Yes. >> >> >> On Tu

Re: Rework our website

2014-03-05 Thread Frank Scholten
+1 for design 2 On Wed, Mar 5, 2014 at 6:00 PM, Suneel Marthi wrote: > +1 for Option# 2. > > > > > > On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter > wrote: > > Hi everyone, > > In our latest discussion, I argued that the lack (and errors) of > documentation on our website is one of th

Re: Welcome Andrew Musselman as new comitter

2014-03-07 Thread Frank Scholten
Congratulations Andrew! On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter wrote: > Hi, > > this is to announce that the Project Management Committee (PMC) for Apache > Mahout has asked Andrew Musselman to become committer and we are pleased to > announce that he has accepted. > > Being a commi

Re: Problem with K-Means clustering on Amazon EMR

2014-03-16 Thread Frank Scholten
Hi Konstantin, Good to hear from you. The link you mentioned points to EigenSeedGenerator not RandomSeedGenerator. The problem seems to be with the call to fs.getFileStatus(input).isDir() It's been a while and I don't remember but perhaps you have to set additional Hadoop fs properties to use

Re: Naive Bayes classification

2014-03-18 Thread Frank Scholten
Hi Tharindu, If I understand correctly seqdirectory creates labels based on the file name but this is not what you want. What do you want the labels to be? Cheers, Frank On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira wrote: > Hi everyone, > I'm developing an application where I need to trai

Text clustering with hashing vector encoders

2014-03-18 Thread Frank Scholten
Hi all, Would it be possible to use hashing vector encoders for text clustering just like when classifying? Currently we vectorize using a dictionary where we map each token to a fixed position in the dictionary. After the clustering we have to retrieve the dictionary to determine the cluster

Re: Text clustering with hashing vector encoders

2014-03-19 Thread Frank Scholten
would like to code up a Java non-Hadoop example using the Reuters dataset which vectorizes each doc using the hashing encoders, configures KMeans with Hamming distance and then write some code to get the labels. Cheers, Frank > > > > On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten

Re: Text clustering with hashing vector encoders

2014-03-21 Thread Frank Scholten
eers, > > Johannes > > > On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning > wrote: > > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten > >wrote: > > > > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning > > > wrote: > > > > >

Re: Text clustering with hashing vector encoders

2014-03-21 Thread Frank Scholten
e no need in starting a map reduce job for that, with some > ram you can just stream the documents from the hdfs > > > > > On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten >wrote: > > > Hi Johannes, > > > > Sounds good. > > > > The step fo

Difference between CiMapper and ClusterIterator

2014-03-31 Thread Frank Scholten
Hi all, I noticed in the CIMapper that the policy.update() call is done in the setup of the mapper, while in the ClusterIterator it is called for every vector in the iteration. In the sequential version there is only a single policy while in the MR version we will get a policy per mapper. Which i

Re: lucene2seq error: field does not exist in the index

2014-04-16 Thread Frank Scholten
Hi Terry, What happens when you make the 'body' field indexed in your schema? LuceneIndexHelper checks the field using an IndexSearcher so it might be that the field has to be indexed as well as being stored, which would be a bug because lucene2seq is designed to load stored fields. Cheers, Fra

Re: Setting up a recommender

2014-04-21 Thread Frank Scholten
Pat and Ted: I am late to the party but this is very interesting! I am not sure I understand all the steps, though. Do you still create a cooccurrence matrix and compute LLR scores during this process or do you only compute matrix multiplication times the history vector: B'B * h and B'A * h? Chee

[Announcement] SearchWorkings.org is live!

2011-09-12 Thread Frank Scholten
Hi all, This is an announcement of the community site SearchWorkings.org [1] SearchWorkings.org offers search professionals a point of contact or comprehensive resource to learn and discuss all the new developments in the world of open source search and related subjects like Mahout and Hadoop. T

Cluster labeling

2011-11-08 Thread Frank Scholten
Hi all, Sometimes my cluster labels are terms that hardly occur in the combined text of the documents of a cluster. I would expect to see a label of a term that occurs very frequently across documents of the cluster. For example, suppose there is a cluster of tweets about Mahout. You would see a

Re: New User to Mahout

2011-11-12 Thread Frank Scholten
Hi Sachin, Most Mahout jobs have several overloaded run methods. For example: KMeansDriver.run(configuration, input, clustersIn, output, measure, convergenceDelta, maxIterations, runClustering, runSequential) Also, most of them extend AbstractJob and implement Hadoop's Tool interface, so you c

[ANNOUNCE] Apache Whirr 0.7.0 includes Mahout support

2011-12-22 Thread Frank Scholten
Hi all, Apache Whirr 0.7.0, which was released yesterday, includes Mahout support. You can install the Mahout binary distribution via the 'mahout-client' role. For more details see the following blog: http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support Cheers, Frank

Re: How to present mahout cluster in combination with Solr results

2012-01-19 Thread Frank Scholten
Hi Vikas, I suggest indexing the cluster label, cluster size and cluster-document mappings so you can use that information to build a tag cloud of your data. Check out this presentation http://java.dzone.com/videos/configuring-mahout-clustering Cheers, Frank On Thu, Jan 19, 2012 at 4:18 AM, Vika

Re: How to present mahout cluster in combination with Solr results

2012-01-20 Thread Frank Scholten
ave more attributes then you could indeed look into clustering, Cheers, Frank > Any thoughts? > > > From: Vikas Pandya > To: Frank Scholten ; "user@mahout.apache.org" > > Sent: Thursday, January 19, 2012 11:05 AM > > Subje

FOSDEM 2012 Brussels 4/5 february

2012-01-22 Thread Frank Scholten
Hi all, I will be visiting FOSDEM in Brussels 4/5 February. Anybody from this group planning to go there? Would be cool to meet a few of you there! I think the graph processing devroom and the virtualization and cloud devroom will be interesting. See http://fosdem.org/2012/ and of course the be

Re: How to present mahout cluster in combination with Solr results

2012-02-01 Thread Frank Scholten
uirements, to be precise it created three different clusters (if you pick > above mentioned example). > > can clustering be done the way I need it to work in Mahout? or any other > ideas that can be explore further? > > Thanks, On Fri, Jan 20, 2012 at 6:48 PM, Frank Scholten wrote

Re: How to present mahout cluster in combination with Solr results

2012-02-02 Thread Frank Scholten
ned? > > > RiskLevel1,RiskLevel2,RiskLevel3 all are having actual lookup values (High, > Medium,Low etc) in Solr index (Index is stored flatten) > > -Vikas > > > >  From: Frank Scholten > To: user@mahout.apache.org > Sent: Wednesda

Re: only single cluster per document

2012-02-06 Thread Frank Scholten
Hi Lokesh, Could you provide more details on the commands you are running, including parameters? If you use seqdirectory on one csv file it will generate one vector and then you end up with one cluster On Feb 6, 2012, at 14:55, Lokesh wrote: > hi, > I am new to mahout kmeans clustering

Re: Mahout 0.5 java.lang.IllegalStateException: No clusters found. Check your -c path.

2012-02-15 Thread Frank Scholten
You must either specify -k to have kmeans randomly pick k initial clusters from the input vectors or use -c to point to a directory of initial clusters, generated by canopy for example. 2012/2/15 Qiang Xu : > > Note, this problem is only happen in hadoop cluster.Mahout Standalone modle > is no s

Re: Mahout Hosting Provider

2012-02-17 Thread Frank Scholten
Check out http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support to set up Mahout and Hadoop on Amazon AWS. You can then SSH into the cluster and submit jobs from the command line. Frank On Thu, Feb 16, 2012 at 9:30 AM, VIGNESH PRAJAPATI wrote: > Hi Folks, > >  I am ne

Re: Pre-configured Mahout on the cloud

2012-04-03 Thread Frank Scholten
An alternative is to use Apache Whirr to quickly set up a Hadoop cluster on AWS and install the Mahout binary distribution on one of the nodes. Check out http://whirr.apache.org/ and http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support for the mahout-client role Frank O

Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
Hi all, I am working on a collusion detection system for online bridge. My plan was to use a user-based recommender using TanimotoCoefficient for looking up users that have played many games together as a starting point. I want to use this score as well as other features and feed this into an SGD

Re: Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
g fair coins pretty > directly to this case: > http://en.wikipedia.org/wiki/Likelihood-ratio_test > > On Tue, Apr 24, 2012 at 11:55 AM, Frank Scholten > wrote: >> Hi all, >> >> I am working on a collusion detection system for online bridge. >> >> My plan

Re: Collusion detection in online bridge

2012-04-24 Thread Frank Scholten
tual change is highly unlikely (too high) given this, > like +3 standard deviations above expectation. That seems like a good approach. Thanks! Cheers, Frank > > How's that? > > On Tue, Apr 24, 2012 at 3:13 PM, Frank Scholten > wrote: >> Interesting. However, w

Re: Collusion detection in online bridge

2012-04-28 Thread Frank Scholten
on. I am not sure how to work these factors into a loglikelihood ratio test. Perhaps there is a different, more suitable method for this type of problem? Cheers, Frank On Tue, Apr 24, 2012 at 7:32 PM, Frank Scholten wrote: > On Tue, Apr 24, 2012 at 5:20 PM, Sean Owen wrote: >> OK, t

Re: general mahout working / some solr questions / last version tests

2012-07-07 Thread Frank Scholten
First make sure you can do a normal build. It seems you have some local changes to the pom because trunk builds fine on my machine. Do a clean checkout and run $ mvn clean install -DskipTests=true Second, the type of input and output depends on the job you want to run. If you want to do cluster

Re: 20NewsGroups Error: Illegal Capacity: -40

2011-04-13 Thread Frank Scholten
This sh error also occurred for the reuters script but has been fixed. Maybe good to update all scripts to bash? On Apr 13, 2011, at 18:34, Ken Williams wrote: > Ted Dunning gmail.com> writes: > >> >> This may be a bit of regression. > > Thanks for the reply. > > Just out of interest, I al

Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
Hi everyone, At the moment seq2sparse can generate vectors from sequence values of type Text. More specifically, SequenceFileTokenizerMapper handles Text values. Would it be useful if seq2sparse could be configured to vectorize value types such as a Blog article with several textual fields like t

Re: Vectorizing arbitrary value types with seq2sparse

2011-05-06 Thread Frank Scholten
resentations would make that easier, but still not > trivial.  Dictionary based methods add multiple dictionary specifications > and also require that we figure out how to combine vectors by concatenation > or overlay. > > On Fri, May 6, 2011 at 1:02 PM, Frank Scholten wrote: > >

Re: The perennial "Error: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector" problem

2011-05-11 Thread Frank Scholten
Just ran seq2sparse on a clean checkout of trunk with a cluster started by Whirr. This works without problems. frank@franktop:~/Desktop/mahout$ bin/mahout seq2sparse --input target/posts --output target/seq2sparse --weight tfidf --namedVector Running on hadoop, using HADOOP_HOME=/usr/local/hadoop

Re: AW: Incremental clustering

2011-05-12 Thread Frank Scholten
What do you recommend for vectorizing the new docs? Run seq2sparse on a batch of them? Seems there's no code at the moment for quickly vectorizing a few new documents based on the existing dictionary. Frank On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll wrote: > From what I've seen, using Mah

Re: Finding thresholds for canopy

2011-05-17 Thread Frank Scholten
Hi Jeff, After building this distance matrix, what would then be a good value for T2? The average distance in the matrix? Frank On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman wrote: > Worth a try, but it ultimately boils down to the distance measure you've > chosen, the distributions of input

Re: fkmeans or Cluster Dumper not working?

2011-07-21 Thread Frank Scholten
Hi Jeffrey, Fuzzy kmeans outputs a [Cluster ID, WeightedVectorWritable] file under clusters/clusteredPoints and a [Cluster ID, SoftCluster] file under clusters/clusters-*, you don't need to write code for that. However if you want to display your clusters in an application, along with nice labels

Re: Doubt regarding the kmeans clustering results on mahout

2011-07-30 Thread Frank Scholten
Maybe it should produce NamedVectors by default as well. This is another of those optional settings that is often needed in practice. On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman wrote: > No problem. I really think the default needs to be changed anyway. Perhaps > this will get me to do it. >

Re: Doubt regarding the kmeans clustering results on mahout

2011-08-01 Thread Frank Scholten
but you > are free to send it points which are named. Those points will pass through > the clustering process and be available in the output. > > -Original Message- > From: Frank Scholten [mailto:fr...@frankscholten.nl] > Sent: Saturday, July 30, 2011 4:21 AM > To: use

MultithreadedBatchItemSimilarities with LLR versus Spark co-occurrence

2014-08-01 Thread Frank Scholten
Hi all, I noticed the development of the Spark co-occurrence of MAHOUT-1464 and I wondered if I could get similar results but with less scalability when I use MultithreadedBatchItemSimilarities with LLRSimilarity. I want to use a co-occurrence recommender on a smallish dataset of a few GBs that

Using ItemSimilarity.scala from Java

2014-09-12 Thread Frank Scholten
Hi all, Trying out the new spark-itemsimilarity code, but I am new to Scala and have a hard time calling certain methods from Java. Here is a Gist with a Java main that runs the cooccurrence analysis: https://gist.github.com/frankscholten/d373c575ad721dd0204e When I run this I get an exception:

Re: Using ItemSimilarity.scala from Java

2014-09-26 Thread Frank Scholten
ted out a bug in mine, a bad > > value in the default schema. I’d be interested in helping with this as a > > way to work out the kinks in creating drivers. > > > > Are you interested in this or are you set on using java? Either way I’ll > > post a gist of your code us