Hey,
I want to cluster a set of documents using a bag-of-words approach (e.g.
using K-means). However, my documents (since they are automatically
generated by aggregating text snippets) differ hugely in
document size.
This means some document vectors have 50 words with a
Hi all,
Has anybody compiled and installed Mahout with Hadoop 2.0? If so, what
changes did you make in Mahout so that it passes 100% of the unit tests and
works successfully with Hadoop 2.0?
Thanks
Christopher -
I had the same confusion with vectordump output on a Hadoop cluster. The
explanation is that it's not trying to write a file to your HDFS: -o writes
locally. So when I just named a file (it did not want to create a local
directory), it wound up in the /bin I was working out of.
Best,
No. There is very little demand for Mahout on Hadoop 2.0 so far and the
forward/backward incompatibility of 2.0 has made it difficult to motivate
moving to 2.0.
The bigtop guys built a maven profile for 0.23 some time ago. I don't know
the status of that.
I don't think that the differences are
I think it all minimally works on Hadoop 2.0.x, though I haven't tried
it recently -- it does require a recompile.
This is different from it working on MRv2 versus MRv1. I'm almost
certain it does not work on MRv2 and doubt it will.
The effort is not large, but it's subtle. A few hacks may fail
When I started looking at this I was a bit skeptical. As a search engine Solr
may be peerless, but as yet another NoSQL db?
However, getting further into this I see one very large benefit. It has one
feature that sets it completely apart from the typical NoSQL db. The type of
queries you do
I finally got some time to work on this and have a first cut at output to Solr
working on the github repo. It only works on 2-action input but I'll have that
cleaned up soon so it will work with one action. Solr indexing has not been
tested yet and the field names and/or types may need
Corrections inline
On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
I finally got some time to work on this and have a first cut at output to
Solr working on the github repo. It only works on 2-action input but I'll
have that cleaned up soon so it will work with one
Mahout 0.9 snapshot
RowSimilarityJob.java , sampleDown method
line 291 or 300
double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow)
/ observationsPerRow;
This returns either 0.0 or 1.0, not a fraction; it needs a (double) cast.
BR
Sam
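A minimal sketch of the behavior reported above, not the actual Mahout source. It assumes both maxObservationsPerRow and observationsPerRow are ints (which is what the reported 0.0 / 1.0 results imply); the class and method names are hypothetical and only serve the illustration.

```java
// Sketch only: variable names mirror the line quoted above; the int types,
// the class name, and the method names are assumptions for illustration.
final class SampleRateDemo {

    // Same expression as reported: Math.min on two ints returns an int, so the
    // division truncates before the result is widened to double.
    static double buggyRate(int maxObservationsPerRow, int observationsPerRow) {
        return Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    // Casting the numerator to double forces floating-point division.
    static double fixedRate(int maxObservationsPerRow, int observationsPerRow) {
        return (double) Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    public static void main(String[] args) {
        System.out.println(buggyRate(700, 1000)); // 0.0
        System.out.println(fixedRate(700, 1000)); // 0.7
        System.out.println(fixedRate(700, 500));  // 1.0
    }
}
```

With the cast, a row holding more observations than the cap gets a fractional rate, while a row at or below the cap still gets 1.0.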
Why do you think this?
On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote:
Mahout 0.9 snapshot
RowSimilarityJob.java , sampleDown method
line 291 or 300
double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow)
/ observationsPerRow;
return either 0.0
Say column a has 1000 entries and maxPref=700.
rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) /
observationsPerRow;
We get rowSampleRate = 0.0 (not 0.7).
Do we totally skip this column, or sample column entries with 0.7 probability
(roughly getting 700 entries)?
On Tue, Aug 13, 2013
Ouch.
Sorry... your original posting made it sound like you *wanted* it to be 0.0
or 1.0.
This is a bug. Can you file a JIRA?
On Tue, Aug 13, 2013 at 12:04 PM, sam wu swu5...@gmail.com wrote:
say column a has 1000 entries, maxPref=700
rowSampleRate = Math.min(maxObservationsPerRow,
Sorry for the phrasing.
I'll file a JIRA
Sam
On Tue, Aug 13, 2013 at 12:10 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Ouch.
Sorry... your original posting made it sound like you *wanted* it to be 0.0
or 1.0.
This is a bug. Can you file a JIRA?
On Tue, Aug 13, 2013 at 12:04 PM,
FindBugs was reporting it the whole time (see the Warnings tab on
https://builds.apache.org/job/Mahout-Quality/2194/findbugsResult/ and the
ICAST_IDIV_CAST_TO_DOUBLE
bug).
We should get the FindBugs warning count to 0.
On Tue, Aug 13, 2013 at 9:13 PM, sam wu swu5...@gmail.com wrote:
Sorry for the phrasing.
I'll file a
OK, single-action recs are working, so output to Solr works with only [B'B] and B.
On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Corrections inline
On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
I finally got some time to work on this and have a first
I recently asked the same core question on this list. I certainly won't
argue with the statistics of small numbers. But I will hazard a
prediction: the impetus for Mahout to support Hadoop 2 will appear about
the same time the elephant book gets updated for 2.0, provided Twister or
something