Hi, RowSimilarityJob by itself does not do down-sampling.
The down-sampling is done by the ToItemVectorsMapper in the PreparePreferenceMatrixJob which is responsible for preparing the inputs (the matrix of interactions between users and items) for ItemSimilarityJob and RecommenderJob. As Sean noted, the option "maxPrefsPerUser" controls the sampling. By default, we use a 1000 samples per user. We could also move the sampling directly to RowSimilarityJob if people consider this more useful. Best, Sebastian On 18.06.2013 22:50, Ted Dunning wrote: > But RecommenderJob seems to call RowSimilarityJob first. That is where > sampling needs to be done. > > //calculate the co-occurrence matrix > ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{ > "--input", new Path(prepPath, > PreparePreferenceMatrixJob.RATING_MATRIX).toString(), > "--output", similarityMatrixPath.toString(), > "--numberOfColumns", String.valueOf(numberOfUsers), > "--similarityClassname", similarityClassname, > "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem), > "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE), > "--threshold", String.valueOf(threshold),Hi > "--tempDir", getTempPath().toString(), > }); > > // write out the similarity matrix if the user specified that behavior > if (hasOption("outputPathForSimilarityMatrix")) { > Path outputPathForSimilarityMatrix = new > Path(getOption("outputPathForSimilarityMatrix")); > > Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, > outputPathForSimilarityMatrix, > SequenceFileInputFormat.class, > ItemSimilarityJob.MostSimilarItemPairsMapper.class, > EntityEntityWritable.class, DoubleWritable.class, > ItemSimilarityJob.MostSimilarItemPairsReducer.class, > EntityEntityWritable.class, DoubleWritable.class, > TextOutputFormat.class); > > Configuration mostSimilarItemsConf = > outputSimilarityMatrix.getConfiguration(); > mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR, > new Path(prepPath, > PreparePreferenceMatrixJob.ITEMID_INDEX).toString()); > > mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, > maxSimilaritiesPerItem); > outputSimilarityMatrix.waitForCompletion(true); > } > } > > > > > On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen <sro...@gmail.com> wrote: > >> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up >> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps. >> >> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning <ted.dunn...@gmail.com> >> wrote: >>> Ahh... only effective in RecommenderJob. >> >