Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sebastian Schelter

On 19.06.2013 01:29, Ted Dunning wrote:
> On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter  wrote:
> 
>> We could also move the sampling directly to RowSimilarityJob if people
>> consider this more useful.
> 
> It will have a large effect on the time for the RowSimilarityJob for some
> data.

I put the sampling into PreparePreferenceMatrixJob because I considered
it specific to the recommendation use case.

> Does anybody have an idea about how much of the total time is in
> RowSimilarityJob?

What do you mean by total time? Compared to the rest of the jobs in
ItemSimilarityJob and RecommenderJob?

-sebastian

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning

On Tue, Jun 18, 2013 at 11:01 PM, Sebastian Schelter  wrote:

> We could also move the sampling directly to RowSimilarityJob if people
> consider this more useful.
>

It will have a large effect on the time for the RowSimilarityJob for some
data.

Does anybody have an idea about how much of the total time is in
RowSimilarityJob?

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sebastian Schelter

Hi,

RowSimilarityJob by itself does not do down-sampling.

The down-sampling is done by the ToItemVectorsMapper in the
PreparePreferenceMatrixJob which is responsible for preparing the inputs
(the matrix of interactions between users and items) for
ItemSimilarityJob and RecommenderJob. As Sean noted, the option
"maxPrefsPerUser" controls the sampling. By default, we sample at most
1000 preferences per user.
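
To make the idea of capping a user's history concrete, here is a minimal
sketch of per-user down-sampling. It is not the ToItemVectorsMapper code:
the class PrefSampler, its sample method, and the use of reservoir sampling
are assumptions made for illustration only, and maxPrefs is assumed to be
non-negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of Mahout.
final class PrefSampler {

  // Returns at most maxPrefs elements drawn uniformly at random from prefs
  // (reservoir sampling). Histories at or under the cap are copied unchanged.
  static <T> List<T> sample(List<T> prefs, int maxPrefs, Random rng) {
    if (prefs.size() <= maxPrefs) {
      return new ArrayList<>(prefs);
    }
    // Start with the first maxPrefs elements, then let each later element
    // replace a random slot with probability maxPrefs / (i + 1).
    List<T> reservoir = new ArrayList<>(prefs.subList(0, maxPrefs));
    for (int i = maxPrefs; i < prefs.size(); i++) {
      int j = rng.nextInt(i + 1); // uniform in [0, i]
      if (j < maxPrefs) {
        reservoir.set(j, prefs.get(i));
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    // A "perversely prolific" user with 100,000 items in their history.
    List<Integer> history = new ArrayList<>();
    for (int item = 0; item < 100_000; item++) {
      history.add(item);
    }
    List<Integer> kept = sample(history, 1000, new Random(42));
    System.out.println("kept " + kept.size() + " of " + history.size());
  }
}
```

Users whose history is already within the cap pass through untouched, so
only the heavy tail of very active users is affected.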

We could also move the sampling directly to RowSimilarityJob if people
consider this more useful.

Best,
Sebastian


On 18.06.2013 22:50, Ted Dunning wrote:
> But RecommenderJob seems to call RowSimilarityJob first.  That is where
> sampling needs to be done.
> 
>   // calculate the co-occurrence matrix
>   ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
>       "--input",
>       new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
>       "--output", similarityMatrixPath.toString(),
>       "--numberOfColumns", String.valueOf(numberOfUsers),
>       "--similarityClassname", similarityClassname,
>       "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
>       "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
>       "--threshold", String.valueOf(threshold),
>       "--tempDir", getTempPath().toString(),
>   });
>
>   // write out the similarity matrix if the user specified that behavior
>   if (hasOption("outputPathForSimilarityMatrix")) {
>     Path outputPathForSimilarityMatrix =
>         new Path(getOption("outputPathForSimilarityMatrix"));
>
>     Job outputSimilarityMatrix = prepareJob(similarityMatrixPath,
>         outputPathForSimilarityMatrix,
>         SequenceFileInputFormat.class,
>         ItemSimilarityJob.MostSimilarItemPairsMapper.class,
>         EntityEntityWritable.class, DoubleWritable.class,
>         ItemSimilarityJob.MostSimilarItemPairsReducer.class,
>         EntityEntityWritable.class, DoubleWritable.class,
>         TextOutputFormat.class);
>
>     Configuration mostSimilarItemsConf =
>         outputSimilarityMatrix.getConfiguration();
>     mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
>         new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());
>
>     mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM,
>         maxSimilaritiesPerItem);
>     outputSimilarityMatrix.waitForCompletion(true);
>   }
> }
> 
> 
> 
> 
> On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen  wrote:
> 
>> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
>> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>>
>> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning 
>> wrote:
>>> Ahh... only effective in RecommenderJob.
>>
>

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning

But RecommenderJob seems to call RowSimilarityJob first.  That is where
sampling needs to be done.

  // calculate the co-occurrence matrix
  ToolRunner.run(getConf(), new RowSimilarityJob(), new String[]{
      "--input",
      new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
      "--output", similarityMatrixPath.toString(),
      "--numberOfColumns", String.valueOf(numberOfUsers),
      "--similarityClassname", similarityClassname,
      "--maxSimilaritiesPerRow", String.valueOf(maxSimilaritiesPerItem),
      "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
      "--threshold", String.valueOf(threshold),
      "--tempDir", getTempPath().toString(),
  });

  // write out the similarity matrix if the user specified that behavior
  if (hasOption("outputPathForSimilarityMatrix")) {
    Path outputPathForSimilarityMatrix =
        new Path(getOption("outputPathForSimilarityMatrix"));

    Job outputSimilarityMatrix = prepareJob(similarityMatrixPath,
        outputPathForSimilarityMatrix,
        SequenceFileInputFormat.class,
        ItemSimilarityJob.MostSimilarItemPairsMapper.class,
        EntityEntityWritable.class, DoubleWritable.class,
        ItemSimilarityJob.MostSimilarItemPairsReducer.class,
        EntityEntityWritable.class, DoubleWritable.class,
        TextOutputFormat.class);

    Configuration mostSimilarItemsConf =
        outputSimilarityMatrix.getConfiguration();
    mostSimilarItemsConf.set(ItemSimilarityJob.ITEM_ID_INDEX_PATH_STR,
        new Path(prepPath, PreparePreferenceMatrixJob.ITEMID_INDEX).toString());

    mostSimilarItemsConf.setInt(ItemSimilarityJob.MAX_SIMILARITIES_PER_ITEM,
        maxSimilaritiesPerItem);
    outputSimilarityMatrix.waitForCompletion(true);
  }
}




On Tue, Jun 18, 2013 at 10:47 PM, Sean Owen  wrote:

> No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
> setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.
>
> On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning 
> wrote:
> > Ahh... only effective in RecommenderJob.
>

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sean Owen

No, it's in ItemSimilarityJob -- I'm looking at it now. It ends up
setting ToItemVectorsMapper.SAMPLE_SIZE, if that helps.

On Tue, Jun 18, 2013 at 9:43 PM, Ted Dunning  wrote:
> Ahh... only effective in RecommenderJob.

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning

Ahh... only effective in RecommenderJob.




On Tue, Jun 18, 2013 at 10:40 PM, Ted Dunning  wrote:

> My recollection as well.
>
> I will read the code again.  Didn't see where that happens.
>
>
> On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen  wrote:
>
>> This is the "maxPrefsPerUser" option IIRC.
>>
>> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning 
>> wrote:
>> > I was reading the RowSimilarityJob and it doesn't appear that it does
>> > down-sampling on the original data to minimize the performance impact of
>> > perversely prolific users.
>> >
>> > The issue is that if a single user has 100,000 items in their history,
>> > we learn nothing more than if we picked 300 of those, while the former
>> > would result in processing 10 billion cooccurrences and the latter would
>> > result in about 100,000.  This factor of 100,000 is so large that it can
>> > make a big difference in performance.
>> >
>> > I had thought that the code had this down-sampling in place.
>> >
>> > If not, I can add row based down-sampling quite easily.
>>
>
>

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning

My recollection as well.

I will read the code again.  Didn't see where that happens.


On Tue, Jun 18, 2013 at 10:34 PM, Sean Owen  wrote:

> This is the "maxPrefsPerUser" option IIRC.
>
> On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning 
> wrote:
> > I was reading the RowSimilarityJob and it doesn't appear that it does
> > down-sampling on the original data to minimize the performance impact of
> > perversely prolific users.
> >
> > The issue is that if a single user has 100,000 items in their history, we
> > learn nothing more than if we picked 300 of those, while the former would
> > result in processing 10 billion cooccurrences and the latter would result
> > in about 100,000.  This factor of 100,000 is so large that it can make a
> > big difference in performance.
> >
> > I had thought that the code had this down-sampling in place.
> >
> > If not, I can add row based down-sampling quite easily.
>

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Dan Filimon

I think you can get what you need through the --maxPrefsPerUser flag.
For any user with more preferences than that, only a random sample of that
size is kept.



On Jun 18, 2013, at 23:27, Ted Dunning  wrote:

> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
> 
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those, while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in about 100,000.  This factor of 100,000 is so large that it can make a
> big difference in performance.
> 
> I had thought that the code had this down-sampling in place.
> 
> If not, I can add row based down-sampling quite easily.

Re: Does RowSimilarity job support down-sampling

2013-06-18 Thread Sean Owen

This is the "maxPrefsPerUser" option IIRC.

On Tue, Jun 18, 2013 at 9:27 PM, Ted Dunning  wrote:
> I was reading the RowSimilarityJob and it doesn't appear that it does
> down-sampling on the original data to minimize the performance impact of
> perversely prolific users.
>
> The issue is that if a single user has 100,000 items in their history, we
> learn nothing more than if we picked 300 of those, while the former would
> result in processing 10 billion cooccurrences and the latter would result
> in about 100,000.  This factor of 100,000 is so large that it can make a
> big difference in performance.
>
> I had thought that the code had this down-sampling in place.
>
> If not, I can add row based down-sampling quite easily.

Does RowSimilarity job support down-sampling

2013-06-18 Thread Ted Dunning

I was reading the RowSimilarityJob and it doesn't appear that it does
down-sampling on the original data to minimize the performance impact of
perversely prolific users.

The issue is that if a single user has 100,000 items in their history, we
learn nothing more than if we picked 300 of those, while the former would
result in processing 10 billion cooccurrences and the latter would result
in about 100,000.  This factor of 100,000 is so large that it can make a
big difference in performance.

I had thought that the code had this down-sampling in place.

If not, I can add row based down-sampling quite easily.
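
The cost argument above can be checked with a few lines of arithmetic. The
sketch below assumes that a user with n items contributes n * n item pairs
(ordered pairs, self-pairs included); the class name CooccurrenceCost is
made up for illustration and is not a Mahout class.

```java
// Illustrative arithmetic only; CooccurrenceCost is a hypothetical name.
final class CooccurrenceCost {

  // Item pairs contributed by one user with n items in their history,
  // counting ordered pairs and self-pairs: n * n.
  static long pairs(long n) {
    return n * n;
  }

  public static void main(String[] args) {
    long full = pairs(100_000); // unsampled history
    long sampled = pairs(300);  // history capped at 300 items
    System.out.println("full    = " + full);
    System.out.println("sampled = " + sampled);
    System.out.println("ratio   = " + (full / sampled));
  }
}
```

With these inputs the unsampled user yields 10,000,000,000 pairs and the
sampled one 90,000, so the work for that single user differs by about five
orders of magnitude.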
