Clustering of heterogenous input data

2013-08-13 Thread Christian Sengstock
Hey, I want to cluster a set of documents using a bag-of-words approach (e.g. using K-means). However, my documents (since they are automatically generated by aggregating text snippets) show huge differences in document size. This means some document vectors have 50 words with a

Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Sergey Svinarchuk
Hi all, has anybody compiled and installed Mahout with Hadoop 2.0? If so, what changes did you make in Mahout so that it passes 100% of its unit tests and works successfully with Hadoop 2.0? Thanks

Re: Using CVB; LdaTopics confusion

2013-08-13 Thread Liz Merkhofer
Christopher - I had the same confusion with vectordump output on a hadoop cluster. The solution is that it's not trying to write a file to your hdfs: -o will go locally. So when I just named a file (it did not want to create a local directory), it wound up in the /bin I was working out of. Best,

Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Ted Dunning
No. There is very small demand for Mahout on Hadoop 2.0 so far and the forward/backward incompatibility of 2.0 has made it difficult to motivate moving to 2.0. The bigtop guys built a maven profile for 0.23 some time ago. I don't know the status of that. I don't think that the differences are

Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Sean Owen
I think it all minimally works on Hadoop 2.0.x, though I haven't tried it recently -- it does require a recompile. This is different from it working on MRv2 versus MRv1. I'm almost certain it does not work on MRv2 and doubt it will. The effort is not large, but it's subtle. A few hacks may fail

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
When I started looking at this I was a bit skeptical. As a Search engine Solr may be peerless, but as yet another NoSQL db? However getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db. The type of queries you do

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input but I'll have that cleaned up soon so it will work with one action. Solr indexing has not been tested yet and the field names and/or types may need

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
Corrections inline On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote: I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input but I'll have that cleaned up soon so it will work with one

RowSimilarityJob, sampleDown method problem

2013-08-13 Thread sam wu
Mahout 0.9 snapshot, RowSimilarityJob.java, sampleDown method, line 291 or 300: double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; returns either 0.0 or 1.0, not a fraction. It needs a (double) cast. BR Sam
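
The behaviour reported here can be reproduced in isolation. The following is a minimal sketch, not the Mahout source: it assumes both operands are declared as int (which the post implies but does not show), and the class and method names SampleRateDemo, buggyRate and fixedRate are hypothetical.

```java
// Minimal sketch of the reported integer-division problem.
// Assumption: both operands are int, as the post implies.
// SampleRateDemo, buggyRate and fixedRate are hypothetical names, not Mahout code.
final class SampleRateDemo {

    // Mirrors the expression quoted in the post: int / int is evaluated
    // first (truncating toward zero), and only the result is widened to double.
    static double buggyRate(int maxObservationsPerRow, int observationsPerRow) {
        return Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    // Casting one operand to double forces floating-point division.
    static double fixedRate(int maxObservationsPerRow, int observationsPerRow) {
        return (double) Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    public static void main(String[] args) {
        System.out.println(buggyRate(700, 1000)); // 0.0
        System.out.println(fixedRate(700, 1000)); // 0.7
        System.out.println(buggyRate(700, 500));  // 1.0 (row is already under the cap)
    }
}
```

With a cap of 700 and a row of 1000 observations, the uncast expression truncates 700 / 1000 to 0 before widening, while the cast version yields 0.7.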

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
Why do you think this? On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote: Mahout 0.9 snapshot RowSimilarityJob.java , sampleDown method line 291 or 300 double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; return either 0.0

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread sam wu
Say column a has 1000 entries and maxPref=700. With rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; we get rowSampleRate = 0.0 (not 0.7). Do we totally skip this column, or sample column entries with 0.7 probability (roughly getting 700 entries)? On Tue, Aug 13, 2013
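
The second option in the question above, keeping each entry independently with probability equal to the sample rate, could look roughly like this. It is an illustrative sketch only and makes no claim about how Mahout's sampleDown is actually written; DownSampleSketch, the List<Integer> entry type and the injected Random are all assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; not the Mahout implementation.
// DownSampleSketch and its signature are hypothetical.
final class DownSampleSketch {

    // Keeps each entry independently with probability
    // min(maxObservations, n) / n, so about maxObservations entries survive.
    static List<Integer> sampleDown(List<Integer> entries, int maxObservations, Random random) {
        List<Integer> kept = new ArrayList<>();
        if (entries.isEmpty()) {
            return kept; // nothing to sample, and avoids dividing by zero
        }
        // The cast makes this a floating-point division.
        double rate = (double) Math.min(maxObservations, entries.size()) / entries.size();
        for (Integer entry : entries) {
            if (random.nextDouble() < rate) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

For 1000 entries and a cap of 700 this keeps roughly 700 entries, with the exact count varying from run to run. With a rate truncated to 0.0, the same loop would drop every entry, which is the skipped-column case the question describes.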

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
Ouch. Sorry... your original posting made it sound like you *wanted* it to be 0.0 or 1.0. This is a bug. Can you file a JIRA? On Tue, Aug 13, 2013 at 12:04 PM, sam wu swu5...@gmail.com wrote: say column a has 1000 entries, maxPref=700 rowSampleRate = Math.min(maxObservationsPerRow,

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread sam wu
Sorry for the phrasing. I'll file a JIRA Sam On Tue, Aug 13, 2013 at 12:10 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ouch. Sorry... your original posting made it sound like you *wanted* it to be 0.0 or 1.0. This is a bug. Can you file a JIRA? On Tue, Aug 13, 2013 at 12:04 PM,

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Stevo Slavić
Findbugs was reporting it the whole time (see the Warnings tab on https://builds.apache.org/job/Mahout-Quality/2194/findbugsResult/ and the ICAST_IDIV_CAST_TO_DOUBLE bug). We should get findbugs to 0. On Tue, Aug 13, 2013 at 9:13 PM, sam wu swu5...@gmail.com wrote: Sorry for the phrasing. I'll file a

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
OK, single-action recs are working, so there is output to Solr with only [B'B] and B. On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote: Corrections inline On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote: I finally got some time to work on this and have a first

Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Carlos Mundi
I recently asked the same core question on this list. I certainly won't argue with the statistics of small numbers. But I will hazard a prediction: the impetus for Mahout to support Hadoop 2 will appear about the same time the elephant book gets updated for 2.0, provided Twister or something