Woohoo!
On Wed, Apr 13, 2011 at 7:15 AM, Eric Charles wrote:
> Can't wait for that :)
> Just bought PDF.
> Tks,
> - Eric
>
> On 13/04/2011 06:57, Ted Dunning wrote:
>>
>> Yes. That's the one.
>>
>> The hard copy should be out before long. The final passes by the
>> production editors are happening now.
I think that our estimates of whether this would work differ a bit. In the
very high-dimensional spaces we are working in, proximities can be a bit
surprising.
For one thing, the bias term provides a mechanism so that a logistic
regression can attribute score to an "other" category. This all
The official solution is to assign outliers in the training set to 'other';
these are defined as points with a high mean distance to the other points. A
hack to get this to work would be to perform a kNN-like distance comparison
against all trained classes and classify as 'other' anything that exceeds
the threshold distance.
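For illustration, a minimal single-machine sketch of that hack (all names
hypothetical, plain Euclidean distance on dense vectors; any other distance
measure would substitute directly):

    import java.util.List;
    import java.util.Map;

    public class OtherDetector {

        // Plain Euclidean distance; swap in any other measure.
        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // Mean distance from doc to each class's training vectors; if even
        // the closest class is farther than the threshold, return "other".
        static String classify(double[] doc,
                               Map<String, List<double[]>> trainingByLabel,
                               double threshold) {
            String best = null;
            double bestMean = Double.POSITIVE_INFINITY;
            for (Map.Entry<String, List<double[]>> e : trainingByLabel.entrySet()) {
                double sum = 0;
                for (double[] t : e.getValue()) {
                    sum += distance(doc, t);
                }
                double mean = sum / e.getValue().size();
                if (mean < bestMean) {
                    bestMean = mean;
                    best = e.getKey();
                }
            }
            return bestMean > threshold ? "other" : best;
        }
    }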
I suspect part of the problem might be creating the training set for the
'other' category, since the documents are distinctly 'different' from
anything else, including from each other.
I guess the definition of the 'other' category is 'low relevance to
everything trained so far' but not 'high relevance to so
I filed https://issues.apache.org/jira/browse/MAHOUT-669 for this.
Anybody who would like to help, please file a patch to fix one or more of
the scripts.
On Wed, Apr 13, 2011 at 9:34 AM, Ken Williams wrote:
> Ted Dunning writes:
>
> >
> > This may be a bit of a regression.
>
> Thanks for the reply.
Very good idea.
On Wed, Apr 13, 2011 at 9:49 AM, Frank Scholten wrote:
> This sh error also occurred for the reuters script but has been fixed.
> Maybe good to update all scripts to bash?
>
> On Apr 13, 2011, at 18:34, Ken Williams wrote:
>
> > Ted Dunning writes:
> >
> >>
> >> This may be a bit of a regression.
On Wed, Apr 13, 2011 at 8:56 AM, Claudia Grieco wrote:
> Thanks for the help :)
> > Why not just train with those documents and put a category tag of "other" on
> > them and run normal categorization? If you can distinguish these documents
> > by word frequencies, then this should do the trick.
This sh error also occurred for the reuters script but has been fixed. Maybe
good to update all scripts to bash?
On Apr 13, 2011, at 18:34, Ken Williams wrote:
> Ted Dunning writes:
>
>>
>> This may be a bit of a regression.
>
> Thanks for the reply.
>
> Just out of interest, I also reckon your 'build-cluster-syntheticcontrol.sh'
> script should be a bash script (#!/bin/bash) rather than a standard
> shell (#!/bin/sh) script.
Ted Dunning writes:
>
> This may be a bit of a regression.
Thanks for the reply.
Just out of interest, I also reckon your 'build-cluster-syntheticcontrol.sh'
script should be a bash script (#!/bin/bash) rather than a standard
shell (#!/bin/sh) script.
$ trunk/examples/bin/build-cluster-syntheticcontrol.sh
Claudia,
The term to look up is 'one-class classifier'. It's built around exactly
this problem, with a set of solutions pre-made. I don't know if anyone has
put one into a general classifier before, but the theory is there.
Daniel.
On Wed, Apr 13, 2011 at 11:56 AM, Claudia Grieco wrote:
> Thanks for the help
The T2 value you select will determine the number of clusters you get. The T1
value determines how much the points that are near each cluster will influence
it in the final centroid calculation. Your choice of distance measure will also
have a big impact on the outcome. If T2 is too small you will end up with a
very large number of clusters.
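For readers following along, here is a rough single-machine sketch of the
canopy logic that makes the T1/T2 roles concrete (not Mahout's
implementation; dense points and Euclidean distance are assumptions):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;

    public class CanopySketch {

        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // T1 > T2. Points within T1 of a center influence that canopy's
        // centroid; points within T2 are bound to it and can never start a
        // new canopy, so a smaller T2 removes fewer points per pass and
        // yields more canopies.
        static List<List<double[]>> canopies(List<double[]> points,
                                             double t1, double t2) {
            List<List<double[]>> result = new ArrayList<>();
            List<double[]> remaining = new LinkedList<>(points);
            while (!remaining.isEmpty()) {
                double[] center = remaining.remove(0);
                List<double[]> canopy = new ArrayList<>();
                canopy.add(center);
                Iterator<double[]> it = remaining.iterator();
                while (it.hasNext()) {
                    double[] p = it.next();
                    double d = distance(center, p);
                    if (d < t1) canopy.add(p);
                    if (d < t2) it.remove();
                }
                result.add(canopy);
            }
            return result;
        }
    }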
Thanks for the help :)
> Why not just train with those documents and put a category tag of "other" on
> them and run normal categorization? If you can distinguish these documents
> by word frequencies, then this should do the trick.
I don't know if this will help:
1) I'm still not sure where to put th
This may be a bit of a regression.
On Wed, Apr 13, 2011 at 4:48 AM, Ken Williams wrote:
> I'm not sure what to try next. Any help would be very welcome.
>
I think that what you are doing is inventing an "other" category and
building a classifier for that category.
Why not just train with those documents and put a category tag of "other" on
them and run normal categorization? If you can distinguish these documents
by word frequencies, then this should do the trick.
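A hedged sketch of what that looks like with Mahout's SGD classifier (class
and method names as I recall them from the 0.4/0.5 era; the encoding of word
frequencies into a feature Vector is assumed to happen elsewhere):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    public class OtherAsACategory {

        static final int OTHER = 3;       // reserve one label index for "other"

        public static void main(String[] args) {
            int numCategories = 4;        // 3 real categories + "other"
            int numFeatures = 10000;      // assumed word-frequency feature space

            OnlineLogisticRegression learner =
                new OnlineLogisticRegression(numCategories, numFeatures, new L1())
                    .learningRate(20)
                    .lambda(1.0e-4);

            // Training: real documents get labels 0..2, the junk ones get OTHER.
            //   learner.train(label, featureVector);
            // Classification: "other" competes with the real classes on
            // equal terms.
            //   int predicted = learner.classifyFull(featureVector).maxValueIndex();
        }
    }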
Let's see if this approach makes sense:
I have the documents to classify in a Lucene index (Index A) and the
training set in another Lucene index (Index B).
With a VectorMapper I map term-frequency vectors of Index A to
term-frequency vectors of Index B. In this way the transformed vectors have
only the terms that appear in Index B.
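If it helps to make that concrete, a small sketch of the remapping step
(plain term-to-position maps stand in for the Lucene dictionaries; the names
are illustrative, not Mahout's API):

    import java.util.HashMap;
    import java.util.Map;

    public class TermSpaceRemap {

        // Re-index a term-frequency vector from Index A's term space into
        // Index B's term space; terms Index B has never seen are dropped.
        static Map<Integer, Double> remap(Map<String, Double> tfFromA,
                                          Map<String, Integer> dictionaryB) {
            Map<Integer, Double> vectorInB = new HashMap<>();
            for (Map.Entry<String, Double> e : tfFromA.entrySet()) {
                Integer position = dictionaryB.get(e.getKey());
                if (position != null) {
                    vectorInB.put(position, e.getValue());
                }
            }
            return vectorInB;
        }
    }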
Can't wait for that :)
Just bought PDF.
Tks,
- Eric
On 13/04/2011 06:57, Ted Dunning wrote:
Yes. That's the one.
The hard copy should be out before long. The final passes by the production
editors are happening now.
On Tue, Apr 12, 2011 at 9:19 PM, Eric Charles wrote:
You were talking about
Hi All,
I'm having trouble getting the 20News-Groups
(https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups,
and https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html)
example to run.
I've downloaded the data and tried to train the Naive Bayes classifier
but I ran the 'traincl
It takes a truly gargantuan amount of data to justify map-reducing
LSH. You can get very far with a plain single-machine implementation.
On Wed, Apr 13, 2011 at 5:57 AM, Sebastian Schelter wrote:
> They are using PLSI which we already tried to implement in
> https://issues.apache.org/jira/browse/MAHOUT-106.
They are using PLSI, which we already tried to implement in
https://issues.apache.org/jira/browse/MAHOUT-106. We didn't get it
scalable; as far as I remember the paper, they are doing a nasty trick
when sending data to the reducers in a certain step so that they only
have to load a certain portion of the data.
Hi guys,
I'm using SGD to classify a set of documents but I have a problem: there are
some documents that are not related to any of the categories and I want to
be able to identify them and exclude them from the classification. My idea
is to read the documents of the training set (that are current
One of the three approaches that they combine is latent semantic indexing --
that is what I was referring to.
On Wed, Apr 13, 2011 at 8:33 AM, Ted Dunning wrote:
> Sean,
>
> Do you mean LSI (latent semantic indexing)? Or LSH (locality sensitive
> hashing)?
>
> (are you a victim of aggressive error correction?)
Not for recommenders, but we have worked on using LSH for spotting
breaking news on Twitter.
Our experience is that it works well when the points are actually
close together, but you do need to tweak it (e.g. work out the number of
hash functions to use and the number of tables). There are also
tec
You can do LSH on real-valued vectors - the 1's and 0's are just the
+/- signs of projections onto randomly chosen hyperplanes.
Ullman's book is a great reference for this, and also goes over how
to do all the parameter choosing.
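A compact sketch of that scheme (hypothetical names; this is the standard
random-hyperplane construction, not code from Mahout):

    import java.util.Random;

    public class HyperplaneLsh {

        private final double[][] planes;  // numBits random hyperplanes

        HyperplaneLsh(int numBits, int dim, long seed) {  // numBits <= 64
            Random random = new Random(seed);
            planes = new double[numBits][dim];
            for (int b = 0; b < numBits; b++) {
                for (int d = 0; d < dim; d++) {
                    planes[b][d] = random.nextGaussian();
                }
            }
        }

        // Each bit of the signature is the +/- sign of the projection of v
        // onto one random hyperplane; similar vectors agree on most bits.
        long signature(double[] v) {
            long sig = 0;
            for (int b = 0; b < planes.length; b++) {
                double dot = 0;
                for (int d = 0; d < v.length; d++) {
                    dot += planes[b][d] * v[d];
                }
                if (dot >= 0) {
                    sig |= 1L << b;
                }
            }
            return sig;
        }
    }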
On Wed, Apr 13, 2011 at 12:43 AM, ke xie wrote:
> Ok, I would try to implement a non-distributed one.
Ok, I would try to implement a non-distributed one. Actually I have a
Python version now.
But I have a problem. When doing min-hash, the matrix should be either 1 or
0, and then you apply the hash functions. Then how about rating data? If the
matrix is filled with numbers 1-5, should we convert them use
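One common answer (an assumption on my part, not something settled in this
thread) is to threshold the ratings, so that min-hash approximates Jaccard
similarity over each user's "liked" set:

    public class RatingBinarizer {

        // Ratings at or above the threshold become 1 ("liked"), everything
        // else, including unrated cells (already 0 in the input), becomes 0.
        static int[] binarize(int[] ratings, int threshold) {
            int[] bits = new int[ratings.length];
            for (int i = 0; i < ratings.length; i++) {
                bits[i] = ratings[i] >= threshold ? 1 : 0;
            }
            return bits;
        }
    }

With 1-5 ratings, a threshold of 3 or 4 is a typical starting point.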
Sean,
Do you mean LSI (latent semantic indexing)? Or LSH (locality sensitive
hashing)?
(are you a victim of aggressive error correction?)
(or am I the victim of too little?)
On Wed, Apr 13, 2011 at 12:28 AM, Sean Owen wrote:
> This approach is really three approaches put together. Elements of two of
> the approaches exist in the project.
This approach is really three approaches put together. Elements of two of
the approaches exist in the project -- recommendations based on
co-occurrence, and based on clustering (though not MinHash). I don't believe
there's much proper LSI in the project at the moment?
I would steer you towards loo
Sure.
LSH is a fine candidate for parallelism and scaling.
I would recommend starting small and testing as you go rather than leaping
into a parallelized full-fledged implementation. Look for other open-source
implementations of LSH algorithms.
Be warned that the parameter selection for LSH can be tricky.
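For a rough feel of why: in the standard banding analysis (covered in
Ullman's book mentioned above), with b bands of r rows each, two items whose
similarity is s land in the same bucket in at least one band with
probability P = 1 - (1 - s^r)^b. That is an S-curve whose threshold sits
near (1/b)^(1/r), so b and r have to be chosen against the similarity level
you actually care about.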
Dear all:
I've read a paper from Google about their news recommender system.
They implemented an LSH algorithm to find the closest neighbours, and the
algorithm is fast for that.
Can we implement one and contribute into the mahout project? Any
suggestions?
The paper is here:
http://iws.seu