Re: Streaming and incremental cooccurrence

2015-04-22 Thread Ted Dunning
On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel  wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.


Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.
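
To make the deterministic-sampling idea above concrete, here is a minimal sketch in plain Python. It is not Mahout code: the seed, the keep rate, and the names is_kept and IncrementalCooccurrence are all hypothetical, and a flat keep probability stands in for whatever sampling rule is really used. The point is only that a keep/discard decision derived from a hash of (seed, user, item) can be recomputed at deletion time, so a delete decrements exactly the counts the matching add incremented.

```python
import hashlib
from collections import defaultdict

SEED = b"cooccurrence-v1"  # hypothetical fixed seed; any stable byte string works


def is_kept(user, item, rate, seed=SEED):
    """Deterministic keep/discard decision for one observation."""
    digest = hashlib.sha256(seed + f"|{user}|{item}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return u < rate


class IncrementalCooccurrence:
    """Item-item cooccurrence counts maintained one observation at a time."""

    def __init__(self, rate, seed=SEED):
        self.rate = rate
        self.seed = seed
        self.history = defaultdict(set)  # user -> kept items
        self.counts = defaultdict(int)   # (item, item) -> cooccurrence count

    def _update(self, user, item, delta):
        for other in self.history[user]:
            if other != item:
                self.counts[(item, other)] += delta
                self.counts[(other, item)] += delta

    def add(self, user, item):
        # Sampled-out or repeated observations never touch the counts.
        if not is_kept(user, item, self.rate, self.seed):
            return False
        if item in self.history[user]:
            return False
        self._update(user, item, +1)
        self.history[user].add(item)
        return True

    def delete(self, user, item):
        # Re-derive the sampling decision: a discarded observation has
        # nothing to undo, a kept one is reversed exactly.
        if not is_kept(user, item, self.rate, self.seed):
            return False
        if item not in self.history[user]:
            return False
        self.history[user].discard(item)
        self._update(user, item, -1)
        return True
```

In this sketch a deleted pair leaves a zero-valued entry behind in counts, which is a small-scale version of the zero-row accumulation mentioned above.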


[jira] [Commented] (MAHOUT-1690) CLONE - Some vector dumper flags are expecting arguments.

2015-04-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508388#comment-14508388
 ] 

Hudson commented on MAHOUT-1690:


SUCCESS: Integrated in Mahout-Quality #3134 (See 
[https://builds.apache.org/job/Mahout-Quality/3134/])
MAHOUT-1690:CLONE - Some vector dumper flags are expecting arguments. This 
closes apache/mahout#122 (suneel.marthi: rev 
a3f78bde9bf87d3d37931f878015b490761e75ce)
* CHANGELOG
* integration/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java


> CLONE - Some vector dumper flags are expecting arguments.
> -
>
> Key: MAHOUT-1690
> URL: https://issues.apache.org/jira/browse/MAHOUT-1690
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Affects Versions: 0.10.0
> Reporter: Allen McIntosh
> Assignee: Suneel Marthi
> Priority: Minor
> Labels: integration
> Fix For: 0.10.1
>
>
> MAHOUT-993 seems to be back.  In particular, the --sortVectors/-sort option 
> insists on an argument, but anything you give (e.g. aardvark) is happily 
> accepted.  I did not check the other options.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1690) CLONE - Some vector dumper flags are expecting arguments.

2015-04-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508356#comment-14508356
 ] 

ASF GitHub Bot commented on MAHOUT-1690:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/122




Re: Streaming and incremental cooccurrence

2015-04-22 Thread Pat Ferrel
Currently maxPrefs is applied to input, both row and column (in hadoop and 
scala) and has a default of 500. maxSimilaritiesPerItem is for the cooccurrence 
matrix and is applied to rows. The default is 50. Similar down-sampling is done 
on row similarity.
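
As a rough illustration of the two kinds of limits described above, the following plain-Python sketch caps the number of interactions per user and per item, and keeps only the strongest entries in each row of a score matrix. It is an assumption-laden stand-in, not the Hadoop or Scala implementation: the helper names, the random sampling used to enforce the cap, and the seed are all hypothetical; only the defaults of 500 and 50 come from the text.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key_index, limit, rng):
    """Keep at most `limit` (user, item) pairs for each distinct key."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair[key_index]].append(pair)
    kept = []
    for key in sorted(groups):
        group = groups[key]
        kept.extend(group if len(group) <= limit else rng.sample(group, limit))
    return kept


def downsample(pairs, max_prefs=500, seed=0):
    """Apply the cap to rows (users) first, then to columns (items)."""
    rng = random.Random(seed)
    by_user = cap_per_key(pairs, 0, max_prefs, rng)
    return cap_per_key(by_user, 1, max_prefs, rng)


def top_k_per_row(scores, k=50):
    """Keep the k highest-scoring columns of each row, ties broken by name."""
    return {
        row: sorted(cols.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for row, cols in scores.items()
    }
```

Here downsample plays the role of maxPrefs on the input, and top_k_per_row plays the role of maxSimilaritiesPerItem on the cooccurrence matrix.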

For a new way to use threshold I was thinking of one that is relative to the 
data itself and would always produce the same number of items in the input but 
based only on a quality threshold, not row and column counts. From Sebastian’s 
paper this may not produce much benefit and the downside is that the input 
distribution parameters must be calculated before sparsification. This is 
avoided with a fixed threshold and/or row and column count downsampling.
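
One way to picture a data-relative threshold is a cutoff expressed as a number of standard deviations above the mean score. The sketch below is only an illustration under that assumption: the function names and the default of two sigmas are hypothetical, and nothing like this exists in the current code. It also shows the downside mentioned above, since the mean and standard deviation must be computed over all scores before anything can be dropped.

```python
from statistics import mean, pstdev


def sigma_threshold(scores, num_sigmas=2.0):
    """Cutoff = mean + num_sigmas * population standard deviation."""
    values = list(scores.values())
    return mean(values) + num_sigmas * pstdev(values)


def sparsify(scores, num_sigmas=2.0):
    """Keep only the pairs whose score reaches the data-relative cutoff."""
    cutoff = sigma_threshold(scores, num_sigmas)
    return {pair: s for pair, s in scores.items() if s >= cutoff}
```

Note that sigma_threshold needs at least one score; an empty input raises statistics.StatisticsError.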

BTW there is another half-way method to do part of this by juggling DStreams 
and RDDs. Trade-offs apply, of course. 

The idea would be to make Cooccurrence a streaming operation fed by an Update 
Period of micro-batches. Keeping the input as a DStream allows us to drop old 
data when new micro-batches come in, but the entire time window is potentially 
large, maybe months for long-lived items. The time window would be fed to 
Cooccurrence periodically.

The benefit is that the process never reads persisted data (a fairly 
time-consuming operation with micro-batches) but is passed new RDDs that have 
come from some streaming input (Kafka?).

The downside is that it still needs the entire time window’s worth of data for 
the calc. In Spark terms the input is taken from a DStream.
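
The windowed, non-incremental flow can be sketched without Spark at all. The plain-Python stand-in below is hypothetical throughout (class name, method names, and the use of integer timestamps are assumptions, and it does not use the DStream API): each incoming batch is appended to an in-memory window, batches that have aged out are dropped, and the cooccurrence counts are recomputed from the whole window on demand.

```python
from collections import Counter, deque
from itertools import combinations


class WindowedCooccurrence:
    """Holds a time window of micro-batches and recomputes counts on demand."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.batches = deque()  # (timestamp, [(user, item), ...]), oldest first

    def on_batch(self, timestamp, interactions):
        """Append a new micro-batch and drop batches older than the window."""
        self.batches.append((timestamp, list(interactions)))
        cutoff = timestamp - self.window_seconds
        while self.batches and self.batches[0][0] <= cutoff:
            self.batches.popleft()

    def recompute(self):
        """Full, non-incremental cooccurrence over everything in the window."""
        items_by_user = {}
        for _, interactions in self.batches:
            for user, item in interactions:
                items_by_user.setdefault(user, set()).add(item)
        counts = Counter()
        for items in items_by_user.values():
            for a, b in combinations(sorted(items), 2):
                counts[(a, b)] += 1
        return counts
```

Nothing is read back from storage, but recompute still walks the entire window each time, which is the trade-off described above.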

I think we have been talking about an idea that does an incremental 
approximation, then a refresh every so often to remove any approximation so in 
an ideal world we need both.

Streaming but non-incremental would be relatively easy and use current math 
code. Incremental would require in-memory data structures of custom design. 


On Apr 19, 2015, at 8:39 PM, Ted Dunning  wrote:

Inline

On Sun, Apr 19, 2015 at 11:05 AM, Pat Ferrel  wrote:

> Short answer, you are correct this is not a new filter.
> 
> The Hadoop MapReduce implements:
> * maxSimilaritiesPerItem
> * maxPrefs
> * minPrefsPerUser
> * threshold
> 
> Scala version:
> * maxSimilaritiesPerItem
> 

I think of this as "column-wise", but that may be bad terminology.


> * maxPrefs
> 

And I think of this as "row-wise" or "user limit".  I think it is the
interaction-cut from the paper.


> 
> The paper talks about an interaction-cut, and describes it with “There is
> no significant decrease in the error for incorporating more interactions
> from the ‘power users’ after that.” While I’d trust your reading better
> than mine I thought that meant downsampling overactive users.
> 

I agree.



> 
> However both the Hadoop MapReduce and the Scala version downsample both
> user and item interactions by maxPrefs. So you are correct, not a new thing.
> 
> The paper also talks about the threshold and we’ve talked on the list
> about how better to implement that. A fixed number is not very useful so a
> number of sigmas was proposed but is not yet implemented.
> 

I think that both  minPrefsPerUser and threshold have limited utility in
the current code.  Could be wrong about that.

With low quality association measures that suffer from low count problems
or simplistic user-based methods, minPrefsPerUser can be crucial.
Threshold can also be required for systems like that.

The Scala code doesn't have that problem since it doesn't support those
metrics.



Re: Welcome Anand Avati

2015-04-22 Thread Shannon Quinn

Welcome to the team, Anand!

On 4/22/15 2:55 PM, Pat Ferrel wrote:

Welcome Anand!

On Apr 22, 2015, at 11:29 AM, Andrew Palumbo  wrote:

Congratulations Anand, Welcome to the team!

On 04/22/2015 02:18 PM, Gokhan Capan wrote:

Welcome Anand!

Sent from my iPhone


On Apr 22, 2015, at 20:47, Dmitriy Lyubimov  wrote:

congrats and thank you!

-d

On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


Welcome to the team Anand; thanks for your contributions!


On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:

Thank you Suneel, I am thrilled to join the team!

I am a relative newbie to data mining and machine learning. I currently
work at Red Hat, but have joined grad school (in machine learning) starting
this fall.

I look forward to continuing my contributions, and thank you once again for
the opportunity.

Anand


On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:

In recognition of the contributions of Anand Avati to the Mahout project
over the past year, the PMC is pleased to announce that he has accepted our
invitation to join the Mahout project as a committer.

As is customary, I will leave it to Anand to provide a little bit of
background about himself.

Congratulations and Welcome!

-Suneel Marthi
On Behalf of Mahout PMC






Re: Welcome Anand Avati

2015-04-22 Thread Pat Ferrel
Welcome Anand!

On Apr 22, 2015, at 11:29 AM, Andrew Palumbo  wrote:

Congratulations Anand, Welcome to the team!

On 04/22/2015 02:18 PM, Gokhan Capan wrote:
> Welcome Anand!
> 
> Sent from my iPhone
> 
>> On Apr 22, 2015, at 20:47, Dmitriy Lyubimov  wrote:
>> 
>> congrats and thank you!
>> 
>> -d
>> 
>> On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
>> andrew.mussel...@gmail.com> wrote:
>> 
>>> Welcome to the team Anand; thanks for your contributions!
>>> 
>>>> On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:
>>>> 
>>>> Thank you Suneel, I am thrilled to join the team!
>>>> 
>>>> I am a relative newbie to data mining and machine learning. I currently
>>>> work at Red Hat, but have joined grad school (in machine learning)
>>>> starting this fall.
>>>> 
>>>> I look forward to continuing my contributions, and thank you once again
>>>> for the opportunity.
>>>> 
>>>> Anand
>>>> 
>>>>> On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:
>>>>> 
>>>>> In recognition of the contributions of Anand Avati to the Mahout
>>>>> project over the past year, the PMC is pleased to announce that he has
>>>>> accepted our invitation to join the Mahout project as a committer.
>>>>> 
>>>>> As is customary, I will leave it to Anand to provide a little bit of
>>>>> background about himself.
>>>>> 
>>>>> Congratulations and Welcome!
>>>>> 
>>>>> -Suneel Marthi
>>>>> On Behalf of Mahout PMC




Re: Welcome Anand Avati

2015-04-22 Thread Stevo Slavić
Congratulations and welcome Anand!
On Apr 22, 2015 8:30 PM, "Andrew Palumbo"  wrote:

> Congratulations Anand, Welcome to the team!
>
> On 04/22/2015 02:18 PM, Gokhan Capan wrote:
>
>> Welcome Anand!
>>
>> Sent from my iPhone
>>
>>  On Apr 22, 2015, at 20:47, Dmitriy Lyubimov  wrote:
>>>
>>> congrats and thank you!
>>>
>>> -d
>>>
>>> On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
>>> andrew.mussel...@gmail.com> wrote:
>>>
>>>> Welcome to the team Anand; thanks for your contributions!
>>>>
>>>> On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:
>>>>
>>>>> Thank you Suneel, I am thrilled to join the team!
>>>>>
>>>>> I am a relative newbie to data mining and machine learning. I currently
>>>>> work at Red Hat, but have joined grad school (in machine learning)
>>>>> starting this fall.
>>>>>
>>>>> I look forward to continuing my contributions, and thank you once again
>>>>> for the opportunity.
>>>>>
>>>>> Anand
>>>>>
>>>>> On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:
>>>>>
>>>>>> In recognition of the contributions of Anand Avati to the Mahout
>>>>>> project over the past year, the PMC is pleased to announce that he has
>>>>>> accepted our invitation to join the Mahout project as a committer.
>>>>>>
>>>>>> As is customary, I will leave it to Anand to provide a little bit of
>>>>>> background about himself.
>>>>>>
>>>>>> Congratulations and Welcome!
>>>>>>
>>>>>> -Suneel Marthi
>>>>>> On Behalf of Mahout PMC


Jenkins build is back to normal : Mahout-Examples-Cluster-Reuters-II #1166

2015-04-22 Thread Apache Jenkins Server
See 



Re: Welcome Anand Avati

2015-04-22 Thread Andrew Palumbo

Congratulations Anand, Welcome to the team!

On 04/22/2015 02:18 PM, Gokhan Capan wrote:

Welcome Anand!

Sent from my iPhone


On Apr 22, 2015, at 20:47, Dmitriy Lyubimov  wrote:

congrats and thank you!

-d

On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:


Welcome to the team Anand; thanks for your contributions!


On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:

Thank you Suneel, I am thrilled to join the team!

I am a relative newbie to data mining and machine learning. I currently
work at Red Hat, but have joined grad school (in machine learning) starting
this fall.

I look forward to continuing my contributions, and thank you once again for
the opportunity.

Anand


On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:

In recognition of the contributions of Anand Avati to the Mahout project
over the past year, the PMC is pleased to announce that he has accepted our
invitation to join the Mahout project as a committer.

As is customary, I will leave it to Anand to provide a little bit of
background about himself.

Congratulations and Welcome!

-Suneel Marthi
On Behalf of Mahout PMC




Re: Welcome Anand Avati

2015-04-22 Thread Gokhan Capan
Welcome Anand!

Sent from my iPhone

> On Apr 22, 2015, at 20:47, Dmitriy Lyubimov  wrote:
>
> congrats and thank you!
>
> -d
>
> On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
> andrew.mussel...@gmail.com> wrote:
>
>> Welcome to the team Anand; thanks for your contributions!
>>
>>> On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:
>>>
>>> Thank you Suneel, I am thrilled to join the team!
>>>
>>> I am a relative newbie to data mining and machine learning. I currently
>>> work at Red Hat, but have joined grad school (in machine learning)
>>> starting this fall.
>>>
>>> I look forward to continuing my contributions, and thank you once again
>>> for the opportunity.
>>>
>>> Anand
>>>
>>>> On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:
>>>>
>>>> In recognition of the contributions of Anand Avati to the Mahout
>>>> project over the past year, the PMC is pleased to announce that he has
>>>> accepted our invitation to join the Mahout project as a committer.
>>>>
>>>> As is customary, I will leave it to Anand to provide a little bit of
>>>> background about himself.
>>>>
>>>> Congratulations and Welcome!
>>>>
>>>> -Suneel Marthi
>>>> On Behalf of Mahout PMC


Re: Welcome Anand Avati

2015-04-22 Thread Dmitriy Lyubimov
congrats and thank you!

-d

On Wed, Apr 22, 2015 at 10:33 AM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Welcome to the team Anand; thanks for your contributions!
>
> On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:
>
> > Thank you Suneel, I am thrilled to join the team!
> >
> > I am a relative newbie to data mining and machine learning. I currently
> > work at Red Hat, but have joined grad school (in machine learning)
> > starting this fall.
> >
> > I look forward to continuing my contributions, and thank you once again
> > for the opportunity.
> >
> > Anand
> >
> > On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:
> >
> > > In recognition of the contributions of Anand Avati to the Mahout
> > > project over the past year, the PMC is pleased to announce that he has
> > > accepted our invitation to join the Mahout project as a committer.
> > >
> > > As is customary, I will leave it to Anand to provide a little bit of
> > > background about himself.
> > >
> > > Congratulations and Welcome!
> > >
> > > -Suneel Marthi
> > > On Behalf of Mahout PMC
> > >
> >
>


Re: Welcome Anand Avati

2015-04-22 Thread Andrew Musselman
Welcome to the team Anand; thanks for your contributions!

On Wed, Apr 22, 2015 at 10:29 AM, Anand Avati  wrote:

> Thank you Suneel, I am thrilled to join the team!
>
> I am a relative newbie to data mining and machine learning. I currently
> work at Red Hat, but have joined grad school (in machine learning) starting
> this fall.
>
> I look forward to continuing my contributions, and thank you once again for
> the opportunity.
>
> Anand
>
> On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:
>
> > In recognition of the contributions of Anand Avati to the Mahout project
> > over the past year, the PMC is pleased to announce that he has accepted
> > our invitation to join the Mahout project as a committer.
> >
> > As is customary, I will leave it to Anand to provide a little bit of
> > background about himself.
> >
> > Congratulations and Welcome!
> >
> > -Suneel Marthi
> > On Behalf of Mahout PMC
> >
>


Re: Welcome Anand Avati

2015-04-22 Thread Anand Avati
Thank you Suneel, I am thrilled to join the team!

I am a relative newbie to data mining and machine learning. I currently
work at Red Hat, but have joined grad school (in machine learning) starting
this fall.

I look forward to continuing my contributions, and thank you once again for
the opportunity.

Anand

On Wed, Apr 22, 2015, 08:08 Suneel Marthi  wrote:

> In recognition of the contributions of Anand Avati to the Mahout project
> over the past year, the PMC is pleased to announce that he has accepted our
> invitation to join the Mahout project as a committer.
>
> As is customary, I will leave it to Anand to provide a little bit of
> background about himself.
>
> Congratulations and Welcome!
>
> -Suneel Marthi
> On Behalf of Mahout PMC
>


Welcome Anand Avati

2015-04-22 Thread Suneel Marthi
In recognition of the contributions of Anand Avati to the Mahout project
over the past year, the PMC is pleased to announce that he has accepted our
invitation to join the Mahout project as a committer.

As is customary, I will leave it to Anand to provide a little bit of
background about himself.

Congratulations and Welcome!

-Suneel Marthi
On Behalf of Mahout PMC