Re: Does RowSimilarity job support down-sampling
On 19.06.2013 01:29, Ted Dunning wrote:
> On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter wrote:
>
>> We could also move the sampling directly to RowSimilarityJob if people
>> consider this more useful.
>
> It will have a large effect on the time for the RowSimilarityJob for some
> data.

I put the sampling into PreparePreferenceMatrixJob because I considered it to be use-case specific for recommendations.

> Does anybody have an idea about how much of the total time is in
> RowSimilarityJob?

What do you mean by total time? Compared to the rest of the jobs in ItemSimilarityJob and RecommenderJob?

-sebastian
Re: Does RowSimilarity job support down-sampling
On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter wrote:

> We could also move the sampling directly to RowSimilarityJob if people
> consider this more useful.

It will have a large effect on the time for the RowSimilarityJob for some data.

Does anybody have an idea about how much of the total time is in RowSimilarityJob?
Re: Does RowSimilarity job support down-sampling
Hi,

RowSimilarityJob by itself does not do down-sampling. The down-sampling is done by the ToItemVectorsMapper in the PreparePreferenceMatrixJob, which is responsible for preparing the inputs (the matrix of interactions between users and items) for ItemSimilarityJob and RecommenderJob. As Sean noted, the option "maxPrefsPerUser" controls the sampling. By default, we use 1000 samples per user.

We could also move the sampling directly to RowSimilarityJob if people consider this more useful.

Best,
Sebastian

On 18.06.2013 22:50, Ted Dunning wrote:
> But RecommenderJob seems to call RowSimilarityJob first. That is where
> sampling needs to be done.
>
> //calculate the co-occurrence matrix
> ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
>     "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
>     "--output", similarityMatrixPath.toString(),
>     "--numberOfColumns", String.valueOf(numberOfUsers),
>     "--similarityClassname", similarityClassname,
>     "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
>     "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
>     "--threshold", String.valueOf(threshold),
>     "--tempDir", getTempPath().toString(),
> });
>
> // write out the similarity matrix if the user specified that behavior
> if (hasOption("outputPathForSimilarityMatrix")) {
>   Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));
>
>   Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
>       SequenceFileInputFormat.class,
>       ItemSimilarityJob.MostSimilarItemPairsMapper.class, EntityEntityWritable.class, DoubleWritable.class,
>       ItemSimilarityJob.MostSimilarItemPairsReducer.class, EntityEntityWritable.class, DoubleWritable.class,
>       TextOutputFormat.class);
>
>   Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
>   mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
>       new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
>   mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
>   outputSimilarityMatrix.waitForCompletion(true);
> }
> }
>
> On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen wrote:
>
>> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
>> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>>
>> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning wrote:
>>> Ahh... only effective in RecommenderJob.
Re: Does RowSimilarity job support down-sampling
But RecommenderJob seems to call RowSimilarityJob first. That is where sampling needs to be done.

    //calculate the co-occurrence matrix
    ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
        "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
        "--output", similarityMatrixPath.toString(),
        "--numberOfColumns", String.valueOf(numberOfUsers),
        "--similarityClassname", similarityClassname,
        "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
        "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
        "--threshold", String.valueOf(threshold),
        "--tempDir", getTempPath().toString(),
    });

    // write out the similarity matrix if the user specified that behavior
    if (hasOption("outputPathForSimilarityMatrix")) {
      Path outputPathForSimilarityMatrix = new Path(getOption("outputPathForSimilarityMatrix"));

      Job outputSimilarityMatrix = prepareJob(similarityMatrixPath, outputPathForSimilarityMatrix,
          SequenceFileInputFormat.class,
          ItemSimilarityJob.MostSimilarItemPairsMapper.class, EntityEntityWritable.class, DoubleWritable.class,
          ItemSimilarityJob.MostSimilarItemPairsReducer.class, EntityEntityWritable.class, DoubleWritable.class,
          TextOutputFormat.class);

      Configuration mostSimilarItemsConf = outputSimilarityMatrix.getConfiguration();
      mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
          new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
      mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM, maxSimilaritiesPerItem);
      outputSimilarityMatrix.waitForCompletion(true);
    }
    }

On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen wrote:

> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>
> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning wrote:
> > Ahh... only effective in RecommenderJob.
Re: Does RowSimilarity job support down-sampling
No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning wrote:
> Ahh... only effective in RecommenderJob.
Re: Does RowSimilarity job support down-sampling
Ahh... only effective in RecommenderJob.

On Tue, Jun 18, 2013 at 10:40 PM, Ted Dunning wrote:

> My recollection as well.
>
> I will read the code again. Didn't see where that happens.
>
> On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen wrote:
>
>> This is the "maxPrefsPerUser" option IIRC.
>>
>> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning wrote:
>> > I was reading the RowSimilarityJob and it doesn't appear that it does
>> > down-sampling on the original data to minimize the performance impact of
>> > perversely prolific users.
>> >
>> > The issue is that if a single user has 100,000 items in their history,
>> > we learn nothing more than if we picked 300 of those, while the former
>> > would result in processing 10 billion cooccurrences and the latter would
>> > result in roughly 100,000. This factor of about 100,000 is so large that
>> > it can make a big difference in performance.
>> >
>> > I had thought that the code had this down-sampling in place.
>> >
>> > If not, I can add row-based down-sampling quite easily.
Re: Does RowSimilarity job support down-sampling
My recollection as well.

I will read the code again. Didn't see where that happens.

On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen wrote:

> This is the "maxPrefsPerUser" option IIRC.
>
> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning wrote:
> > I was reading the RowSimilarityJob and it doesn't appear that it does
> > down-sampling on the original data to minimize the performance impact of
> > perversely prolific users.
> >
> > The issue is that if a single user has 100,000 items in their history,
> > we learn nothing more than if we picked 300 of those, while the former
> > would result in processing 10 billion cooccurrences and the latter would
> > result in roughly 100,000. This factor of about 100,000 is so large that
> > it can make a big difference in performance.
> >
> > I had thought that the code had this down-sampling in place.
> >
> > If not, I can add row-based down-sampling quite easily.
Re: Does RowSimilarity job support down-sampling
I think you can get what you need through the --maxPrefsPerUser flag. Any user with more preferences than that will keep only a random sample of that size.

On Jun 18, 2013, at 23:27, Ted Dunning wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
>
> The issue is that if a single user has 100,000 items in their history,
> we learn nothing more than if we picked 300 of those, while the former
> would result in processing 10 billion cooccurrences and the latter would
> result in roughly 100,000. This factor of about 100,000 is so large that
> it can make a big difference in performance.
>
> I had thought that the code had this down-sampling in place.
>
> If not, I can add row-based down-sampling quite easily.
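
A minimal sketch of the behavior described above, where a user's history is capped at a fixed number of randomly chosen preferences. It uses plain reservoir sampling purely as an illustration; this is an assumption and not necessarily how Mahout's ToItemVectorsMapper samples. The names PrefSampler, downSample and maxPrefs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout.
class PrefSampler {

    // Returns the history unchanged (as a copy) if it has at most maxPrefs entries,
    // otherwise a uniform random sample of exactly maxPrefs entries (reservoir sampling).
    static <T> List<T> downSample(List<T> prefs, int maxPrefs, Random rng) {
        if (maxPrefs < 0) {
            throw new IllegalArgumentException("maxPrefs must be non-negative");
        }
        if (prefs.size() <= maxPrefs) {
            return new ArrayList<>(prefs);
        }
        // Fill the reservoir with the first maxPrefs entries.
        List<T> reservoir = new ArrayList<>(prefs.subList(0, maxPrefs));
        // Entry i (0-based) replaces a random reservoir slot with probability maxPrefs / (i + 1).
        for (int i = maxPrefs; i < prefs.size(); i++) {
            int j = rng.nextInt(i + 1);
            if (j < maxPrefs) {
                reservoir.set(j, prefs.get(i));
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> history = new ArrayList<>();
        for (int item = 0; item < 100_000; item++) {
            history.add(item);
        }
        List<Integer> sampled = downSample(history, 300, new Random());
        System.out.println("kept " + sampled.size() + " of " + history.size() + " preferences");
    }
}
```

Users at or below the cap are left untouched, so only the unusually prolific users lose data.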
Re: Does RowSimilarity job support down-sampling
This is the "maxPrefsPerUser" option IIRC.

On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
>
> The issue is that if a single user has 100,000 items in their history,
> we learn nothing more than if we picked 300 of those, while the former
> would result in processing 10 billion cooccurrences and the latter would
> result in roughly 100,000. This factor of about 100,000 is so large that
> it can make a big difference in performance.
>
> I had thought that the code had this down-sampling in place.
>
> If not, I can add row-based down-sampling quite easily.
Does RowSimilarity job support down-sampling
I was reading the RowSimilarityJob and it doesn't appear that it does down-sampling on the original data to minimize the performance impact of perversely prolific users.

The issue is that if a single user has 100,000 items in their history, we learn nothing more than if we picked 300 of those, while the former would result in processing 10 billion cooccurrences and the latter would result in roughly 100,000. This factor of about 100,000 is so large that it can make a big difference in performance.

I had thought that the code had this down-sampling in place.

If not, I can add row-based down-sampling quite easily.
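
Working through the numbers: a user with n items in their history contributes on the order of n * n item pairs to the cooccurrence computation. The Java sketch below is illustrative arithmetic only, not Mahout code, and it assumes ordered pairs including self-pairs are counted; the class and method names are hypothetical.

```java
// Hypothetical illustration of the cooccurrence cost per user; not Mahout code.
class CooccurrenceCost {

    // Ordered item pairs (self-pairs included) generated by one user's history.
    static long pairs(long historySize) {
        return historySize * historySize;
    }

    public static void main(String[] args) {
        long full = pairs(100_000);  // 10,000,000,000 pairs for the full history
        long sampled = pairs(300);   // 90,000 pairs after sampling down to 300 items
        System.out.println("full history:    " + full);
        System.out.println("sampled history: " + sampled);
        System.out.println("ratio:           " + (full / sampled));
    }
}
```

Because the cost grows quadratically with history length, capping the history of a few prolific users removes most of the work they generate.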