Hi,
I am studying the LR / SGD code and I was wondering why, in the iris test
case, the first element of each vector is set to 1 in the loop parsing the
CSV file via v.set(0, 1):
for (String line : raw.subList(1, raw.size())) {
    // order gets a list of indexes
    order.add(order.size());
That element is set to 1 so the model has an intercept term
in addition to terms for the predictor variables.
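A plain-Java sketch (not the Mahout code itself, variable names made up) of why element 0 is fixed at 1: with x[0] = 1, the weight beta[0] acts as the intercept, so the linear part of the model needs no special case.

```java
// Sketch only: with x[0] fixed at 1, beta . x = beta[0] + beta[1]*x1 + ...
// so beta[0] becomes the intercept without special-casing it.
public class InterceptSketch {
    static double dot(double[] beta, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < beta.length; i++) {
            sum += beta[i] * x[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] beta = {0.5, 2.0, -1.0};   // beta[0] is the intercept weight
        double[] x = {1.0, 3.0, 4.0};       // x[0] = 1, as in v.set(0, 1)
        // 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5
        System.out.println(dot(beta, x));
    }
}
```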
>
>
>
>
> On Mon, Jan 6, 2014 at 8:31 AM, Frank Scholten wrote:
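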
>
> > Hi,
> >
> > I am studying the LR / SGD code and I was wondering why in the
Hi,
I followed the Coursera Machine Learning course quite a while ago and I am
trying to find out how Mahout implements the Logistic Regression cost
function in the code surrounding AbstractOnlineLogisticRegression.
I am looking at the train method in AbstractOnlineLogisticRegression and I
see on
Suneel Marthi wrote:
> Mahout's impl is based off of Leon Bottou's paper on this subject. I
> don't have the link handy but it's referenced in the code, or try a Google
> search
>
> Sent from my iPhone
>
> > On Jan 13, 2014, at 7:14 AM, Frank Scholten
>
>
>
> On Mon, Jan 13, 2014 at 1:14 PM, Suneel Marthi wrote:
>
> > I think this is the one. Yes, I don't see this paper referenced in the
> > code sorry about that.
> > http://leon.bottou.org/publications/pdf/compstat-2010.pdf
> >
> >
> >
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.177.3514&rep=rep1&type=pdf
double newValue = beta.getQuick(i, j) + learningRate *
perTermLearningRate(j) * instance.get(j) * gradientBase;
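The quoted line from AbstractOnlineLogisticRegression updates a single coefficient. A standalone arithmetic mirror of that one step (hypothetical names, not Mahout's actual fields):

```java
// Standalone mirror of the quoted SGD update: each coefficient moves by
// learning rate * per-term rate * feature value * gradient.
public class SgdUpdateSketch {
    static double update(double beta, double learningRate,
                         double perTermRate, double xj, double gradientBase) {
        return beta + learningRate * perTermRate * xj * gradientBase;
    }

    public static void main(String[] args) {
        // One step: beta = 0.1, rate = 0.01, per-term rate = 1.0, x_j = 2.0,
        // gradientBase = 0.5  ->  0.1 + 0.01 * 1.0 * 2.0 * 0.5 = 0.11
        System.out.println(update(0.1, 0.01, 1.0, 2.0, 0.5));
    }
}
```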
Cheers,
Frank
On Mon, Jan 13, 2014 at 10:54 PM, Frank Scholten wrote:
> Thanks guys, I h
Hi all,
I am exploring Mahout's SGD classifier and would like some feedback, because
I think I didn't properly configure things.
I created an example app that trains an SGD classifier on the 'bank
marketing' dataset from UCI:
http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
My app is at: https://g
Have a look at OnlineLogisticRegressionTest.iris().
Here List.subList() is used in combination with Collections.shuffle() to
make the train and test dataset split.
So you could first read the dataset in a list and then use this trick.
I just pushed an example to Github that also uses this approa
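The shuffle-then-subList trick from OnlineLogisticRegressionTest.iris(), reduced to plain JDK collections (the data and the 80/20 split are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Shuffle a list of row indexes, then carve out train and test views
// with subList -- the same approach as in the iris test.
public class SplitSketch {
    public static void main(String[] args) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            order.add(i);                    // one index per parsed CSV row
        }
        Collections.shuffle(order, new Random(42));
        List<Integer> train = order.subList(0, 80);   // first 80% for training
        List<Integer> test = order.subList(80, 100);  // last 20% held out
        System.out.println(train.size() + " " + test.size());
    }
}
```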
Sorry I didn't properly read your message. The random forest code is quite
different and what I suggested is not applicable.
The DataConverter converts a String to a Vector wrapped by Instance. With
this you can create your training set I think.
On Mon, Feb 3, 2014 at 10:09 PM, Frank Sch
Hi all,
I put together a utility which vectorizes plain old Java objects annotated
with @Feature and @Target via Mahout's vector encoders.
See my Github branch:
https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
and the unit test:
https://github.com/frankscholten/mahout/blo
The second field of Newsgroup should be called bodyText of course.
On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten wrote:
> Hi all,
>
> I put together a utility which vectorizes plain old Java objects annotated
> with @Feature and @Target via Mahout's vector encoders.
>
>
every unique value should end up in a different location because the
> >>>> continuous value is part of the hashing. Try adding the weight
> directly
> >>>> using a static word value encoder, addToVector("pDays",v,pDays)
> >>>>
> >>>>
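The advice above is about Mahout's StaticWordValueEncoder; the core trick can be shown in miniature in plain Java (a simplified, hypothetical helper -- the real encoders use multiple hash probes and more careful hashing):

```java
// The idea behind addToVector("pDays", v, pDays): the feature *name* is
// hashed to pick a slot, and the continuous value is added as the weight.
public class HashingSketch {
    static void addToVector(String name, double weight, double[] v) {
        int slot = Math.floorMod(name.hashCode(), v.length);
        v[slot] += weight;
    }

    public static void main(String[] args) {
        double[] v = new double[1000];
        addToVector("pDays", 7.0, v);       // continuous value as the weight
        addToVector("balance", 1500.0, v);
        double sum = 0.0;
        for (double x : v) sum += x;
        System.out.println(sum);            // all weight lands somewhere in v
    }
}
```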
Thanks to you too, Johannes, for your comments!
On Tue, Feb 4, 2014 at 7:39 PM, Frank Scholten wrote:
> Thanks Ted!
>
> Would indeed be a nice example to add.
>
>
> On Tue, Feb 4, 2014 at 10:40 AM, Ted Dunning wrote:
>
>> Yes.
>>
>>
>> On Tu
+1 for design 2
On Wed, Mar 5, 2014 at 6:00 PM, Suneel Marthi wrote:
> +1 for Option# 2.
>
>
>
>
>
> On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter
> wrote:
>
> Hi everyone,
>
> In our latest discussion, I argued that the lack (and errors) of
> documentation on our website is one of th
Congratulations Andrew!
On Fri, Mar 7, 2014 at 6:12 PM, Sebastian Schelter wrote:
> Hi,
>
> this is to announce that the Project Management Committee (PMC) for Apache
> Mahout has asked Andrew Musselman to become committer and we are pleased to
> announce that he has accepted.
>
> Being a commi
Hi Konstantin,
Good to hear from you.
The link you mentioned points to EigenSeedGenerator not
RandomSeedGenerator. The problem seems to be with the call to
fs.getFileStatus(input).isDir()
It's been a while and I don't remember but perhaps you have to set
additional Hadoop fs properties to use
Hi Tharindu,
If I understand correctly seqdirectory creates labels based on the file
name but this is not what you want. What do you want the labels to be?
Cheers,
Frank
On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
wrote:
> Hi everyone,
> I'm developing an application where I need to trai
Hi all,
Would it be possible to use hashing vector encoders for text clustering
just like when classifying?
Currently we vectorize using a dictionary where we map each token to a
fixed position in the dictionary. After the clustering we have to
retrieve the dictionary to determine the cluster
would like to code up a Java non-Hadoop
example using the Reuters dataset which vectorizes each doc using the
hashing encoders, configures KMeans with Hamming distance and then write
some code to get the labels.
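A sketch of the vectorization step being proposed: hash each token into a fixed-size vector instead of looking it up in a dictionary (heavily simplified -- one probe, raw term counts, no TF-IDF weighting):

```java
import java.util.Locale;

// Vectorize a document with hashed token features instead of a dictionary.
public class HashedDocSketch {
    static double[] vectorize(String text, int dims) {
        double[] v = new double[dims];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            v[Math.floorMod(token.hashCode(), dims)] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = vectorize("mahout kmeans clustering with mahout", 256);
        double total = 0.0;
        for (double x : v) total += x;
        System.out.println(total);   // 5 tokens hashed into 256 slots
    }
}
```

The downside discussed in this thread follows directly: the hash is one-way, so recovering human-readable cluster labels takes extra bookkeeping that the dictionary approach gives you for free.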
Cheers,
Frank
>
>
>
> On Tue, Mar 18, 2014 at 2:40 PM, Frank Scholten
Cheers,
>
> Johannes
>
>
> On Wed, Mar 19, 2014 at 8:35 PM, Ted Dunning
> wrote:
>
> > > On Wed, Mar 19, 2014 at 11:34 AM, Frank Scholten wrote:
> >
> > > On Wed, Mar 19, 2014 at 12:13 AM, Ted Dunning
> > > wrote:
> > >
> > &g
e no need in starting a map reduce job for that, with some
> ram you can just stream the documents from the hdfs
>
>
>
>
> On Fri, Mar 21, 2014 at 5:29 PM, Frank Scholten wrote:
>
> > Hi Johannes,
> >
> > Sounds good.
> >
> > The step fo
Hi all,
I noticed in the CIMapper that the policy.update() call is done in the
setup of the mapper, while
in the ClusterIterator it is called for every vector in the iteration.
In the sequential version there is only a single policy while in the MR
version we will get a policy per mapper. Which i
Hi Terry,
What happens when you make the 'body' field indexed in your schema?
LuceneIndexHelper checks the field using an IndexSearcher so it might be
that the field has to be indexed as well as being stored, which would be a
bug because lucene2seq is designed to load stored fields.
Cheers,
Fra
Pat and Ted: I am late to the party but this is very interesting!
I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute the matrix products against the history vector: B'B * h and
B'A * h?
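For concreteness, the B'B * h computation in the question can be sketched with tiny dense arrays (real implementations keep everything sparse and score with LLR; this only shows the shape of the computation, computed as B' * (B * h)):

```java
// Tiny dense sketch of (B'B) * h: B is users x items, h is a user history.
public class CooccurrenceSketch {
    static double[] times(double[][] m, double[] x) {
        double[] y = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < x.length; j++) {
                y[i] += m[i][j] * x[j];
            }
        }
        return y;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    public static void main(String[] args) {
        double[][] b = {{1, 1, 0}, {0, 1, 1}};   // 2 users x 3 items
        double[] h = {1, 0, 0};                  // history: item 0 only
        double[] scores = times(transpose(b), times(b, h));
        // item 1 co-occurs with item 0, so it gets a nonzero score
        System.out.println(scores[0] + " " + scores[1] + " " + scores[2]);
    }
}
```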
Chee
Hi all,
This is an announcement of the community site SearchWorkings.org [1]
SearchWorkings.org offers search professionals a point of contact and a
comprehensive resource to learn about and discuss all the
new developments in the world of open source search and related
subjects like Mahout and Hadoop.
T
Hi all,
Sometimes my cluster labels are terms that hardly occur in the
combined text of the documents of a cluster. I would expect the
label to be a term that occurs very frequently across the documents
of the cluster.
For example, suppose there is a cluster of tweets about Mahout. You
would see a
Hi Sachin,
Most Mahout jobs have several overloaded run methods. For example:
KMeansDriver.run(configuration, input, clustersIn, output, measure,
convergenceDelta, maxIterations, runClustering, runSequential)
Also, most of them extend AbstractJob and implement Hadoop's Tool
interface, so you c
Hi all,
Apache Whirr 0.7.0, which was released yesterday, includes Mahout
support. You can install the Mahout binary distribution via the
'mahout-client' role.
For more details see the following blog:
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support
Cheers,
Frank
Hi Vikas,
I suggest indexing the cluster label, cluster size and
cluster-document mappings so you can use that information to build a
tag cloud of your data. Check out this presentation:
http://java.dzone.com/videos/configuring-mahout-clustering
Cheers,
Frank
On Thu, Jan 19, 2012 at 4:18 AM, Vika
have more attributes, then you could indeed look into clustering.
Cheers,
Frank
> Any thoughts?
>
>
> From: Vikas Pandya
> To: Frank Scholten ; "user@mahout.apache.org"
>
> Sent: Thursday, January 19, 2012 11:05 AM
>
> Subje
Hi all,
I will be visiting FOSDEM in Brussels 4/5 february.
Anybody from this group planning to go there? Would be cool to meet a
few of you there!
I think the graph processing devroom and the virtualization and cloud
devroom will be interesting.
See http://fosdem.org/2012/ and of course the be
uirements, to be precise it created three different clusters (if you pick
> above mentioned example).
>
> can clustering be done the way I need it to work in Mahout? or any other
> ideas that can be explore further?
>
> Thanks,
On Fri, Jan 20, 2012 at 6:48 PM, Frank Scholten wrote
ned?
>
>
> RiskLevel1,RiskLevel2,RiskLevel3 all are having actual lookup values (High,
> Medium,Low etc) in Solr index (Index is stored flatten)
>
> -Vikas
>
>
>
> From: Frank Scholten
> To: user@mahout.apache.org
> Sent: Wednesda
Hi Lokesh,
Could you provide more details on the commands you are running, including
parameters?
If you use seqdirectory on one CSV file it will generate one vector, and then
you end up with one cluster.
On Feb 6, 2012, at 14:55, Lokesh wrote:
> hi,
> I am new to mahout kmeans clustering
You must either specify -k to have kmeans randomly pick k
initial clusters from the input vectors or use -c to point to a
directory of initial clusters, generated by canopy for example.
2012/2/15 Qiang Xu :
>
> Note, this problem only happens in a Hadoop cluster. Mahout standalone mode
> is no s
Check out
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support
to set up Mahout and Hadoop on Amazon AWS.
You can then SSH into the cluster and submit jobs from the command line.
Frank
On Thu, Feb 16, 2012 at 9:30 AM, VIGNESH PRAJAPATI
wrote:
> Hi Folks,
>
> I am ne
An alternative is to use Apache Whirr to quickly set up a Hadoop
cluster on AWS and install the Mahout binary distribution on one of
the nodes.
Check out http://whirr.apache.org/ and
http://www.searchworkings.org/blog/-/blogs/apache-whirr-includes-mahout-support
for the mahout-client role
Frank
O
Hi all,
I am working on a collusion detection system for online bridge.
My plan was to use a user-based recommender using TanimotoCoefficient
for looking up users that have played many games together as a
starting point. I want to use this score as well as other features and
feed this into an SGD
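The "played many games together" score can be written directly as a Tanimoto (Jaccard) coefficient over game-ID sets. A plain-Java sketch with hypothetical data (in Mahout this would go through TanimotoCoefficientSimilarity over a preference DataModel instead):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Tanimoto coefficient: |A intersect B| / |A union B| over game-ID sets.
public class TanimotoSketch {
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return union == 0 ? 0.0 : (double) intersection.size() / union;
    }

    public static void main(String[] args) {
        Set<Long> alice = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> bob = new HashSet<>(Arrays.asList(3L, 4L, 5L));
        // |{3,4}| / |{1,2,3,4,5}| = 2 / 5
        System.out.println(tanimoto(alice, bob));
    }
}
```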
g fair coins pretty
> directly to this case:
> http://en.wikipedia.org/wiki/Likelihood-ratio_test
>
> On Tue, Apr 24, 2012 at 11:55 AM, Frank Scholten
> wrote:
>> Hi all,
>>
>> I am working on a collusion detection system for online bridge.
>>
>> My plan
tual change is highly unlikely (too high) given this,
> like +3 standard deviations above expectation.
That seems like a good approach. Thanks!
Cheers,
Frank
>
> How's that?
>
> On Tue, Apr 24, 2012 at 3:13 PM, Frank Scholten
> wrote:
>> Interesting. However, w
on.
I am not sure how to work these factors into a loglikelihood ratio
test. Perhaps there is a different, more suitable method for this type
of problem?
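For reference, the 2x2 log-likelihood ratio (G-test) statistic discussed in this thread, written out from the textbook formula (this is the standard statistic, not Mahout's LogLikelihood class; the count labels are my own reading of the collusion setup):

```java
// 2x2 G-test: G^2 = 2 * sum O*ln(O/E), expanded via xlx(x) = x*ln(x).
public class LlrSketch {
    static double xlx(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // k11: games together; k12/k21: games each played with others;
    // k22: all remaining games.
    static double llr(long k11, long k12, long k21, long k22) {
        long n = k11 + k12 + k21 + k22;
        return 2.0 * (xlx(k11) + xlx(k12) + xlx(k21) + xlx(k22) + xlx(n)
                - xlx(k11 + k12) - xlx(k21 + k22)
                - xlx(k11 + k21) - xlx(k12 + k22));
    }

    public static void main(String[] args) {
        // Perfectly independent counts score ~0 ...
        System.out.println(llr(10, 10, 10, 10));
        // ... a strong together-count scores high.
        System.out.println(llr(100, 10, 10, 100));
    }
}
```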
Cheers,
Frank
On Tue, Apr 24, 2012 at 7:32 PM, Frank Scholten wrote:
> On Tue, Apr 24, 2012 at 5:20 PM, Sean Owen wrote:
>> OK, t
First make sure you can do a normal build.
It seems you have some local changes to the pom because trunk builds
fine on my machine. Do a clean checkout and run
$ mvn clean install -DskipTests=true
Second, the type of input and output depends on the job you want to run.
If you want to do cluster
This sh error also occurred for the Reuters script but has been fixed. Maybe
it would be good to update all scripts to bash?
On Apr 13, 2011, at 18:34, Ken Williams wrote:
> Ted Dunning gmail.com> writes:
>
>>
>> This may be a bit of regression.
>
> Thanks for the reply.
>
> Just out of interest, I al
Hi everyone,
At the moment seq2sparse can generate vectors from sequence values of
type Text. More specifically, SequenceFileTokenizerMapper handles Text
values.
Would it be useful if seq2sparse could be configured to vectorize
value types such as a Blog article with several textual fields like
t
resentations would make that easier, but still not
> trivial. Dictionary based methods add multiple dictionary specifications
> and also require that we figure out how to combine vectors by concatenation
> or overlay.
>
> On Fri, May 6, 2011 at 1:02 PM, Frank Scholten wrote:
>
>&
Just ran seq2sparse on a clean checkout of trunk with a cluster
started by Whirr. This works without problems.
frank@franktop:~/Desktop/mahout$ bin/mahout seq2sparse --input
target/posts --output target/seq2sparse --weight tfidf --namedVector
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
What do you recommend for vectorizing the new docs? Run seq2sparse on
a batch of them? Seems there's no code at the moment for quickly
vectorizing a few new documents based on the existing dictionary.
Frank
On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll wrote:
> From what I've seen, using Mah
Hi Jeff,
After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?
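The "average distance" heuristic in the question could be computed like this (a hypothetical helper over a sample of points, not part of Mahout's canopy code):

```java
// Mean pairwise Euclidean distance over a set of points, as a rough
// starting value for canopy's T2 threshold.
public class AvgDistanceSketch {
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double averagePairwiseDistance(double[][] points) {
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                total += distance(points[i], points[j]);
                pairs++;
            }
        }
        return total / pairs;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {3, 4}, {6, 8}};
        // pairwise distances: 5, 10, 5 -> average 20/3
        System.out.println(averagePairwiseDistance(points));
    }
}
```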
Frank
On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman wrote:
> Worth a try, but it ultimately boils down to the distance measure you've
> chosen, the distributions of input
Hi Jeffrey,
Fuzzy kmeans outputs a [Cluster ID, WeightedVectorWritable] file under
clusters/clusteredPoints and a [Cluster ID, SoftCluster] file under
clusters/clusters-*, you don't need to write code for that.
However if you want to display your clusters in an application, along
with nice labels
Maybe it should produce NamedVectors by default as well. This is
another of those optional settings
that is often needed in practice.
On Fri, Jul 29, 2011 at 11:42 PM, Jeff Eastman wrote:
> No problem. I really think the default needs to be changed anyway. Perhaps
> this will get me to do it.
>
but you
> are free to send it points which are named. Those points will pass through
> the clustering process and be available in the output.
>
> -Original Message-
> From: Frank Scholten [mailto:fr...@frankscholten.nl]
> Sent: Saturday, July 30, 2011 4:21 AM
> To: use
Hi all,
I noticed the development of the Spark co-occurrence work in MAHOUT-1464 and I
wondered if I could get similar results, with less scalability, when I
use MultithreadedBatchItemSimilarities with LLRSimilarity.
I want to use a co-occurrence recommender on a smallish datasets of a few
GBs that
Hi all,
Trying out the new spark-itemsimilarity code, but I am new to Scala and
have a hard time calling certain methods from Java.
Here is a Gist with a Java main that runs the cooccurrence analysis:
https://gist.github.com/frankscholten/d373c575ad721dd0204e
When I run this I get an exception:
ted out a bug in mine, a bad
> > value in the default schema. I’d be interested in helping with this as a
> > way to work out the kinks in creating drivers.
> >
> > Are you interested in this or are you set on using java? Either way I’ll
> > post a gist of your code us