Re: 0.2

2009-10-15 Thread Sean Owen
I suppose I have volunteered for the release. What does making the
release entail? I don't have any knowledge of this, or of MAHOUT-114,
or of what it means to sign these jars.

If info is available I can try to figure these out.

On Thu, Oct 15, 2009 at 10:19 AM, Grant Ingersoll  wrote:
> OK.  The Sparse vector improvements we have now are already a lot faster
> than what was in 0.1, so that is good.  I'd suggest that whoever is the
> Release Mgr. for this release takes care of the signing stuff.  I'll look at
> the Label (LLR) stuff by Monday.
>


Re: LDA for multi label classification was: Mahout Book

2009-10-15 Thread David Hall
Sorry, this slipped out of my inbox and I just found it!

On Thu, Oct 8, 2009 at 12:05 PM, Robin Anil  wrote:
> Posting to the dev list.
> Great paper, thanks! Looks like L-LDA could be used to create some
> interesting examples.

Thanks!

> The paper shows L-LDA could be used to create a word-tag model for accurate
> tag(s) prediction given a document of words. I will finish reading it and
> report how much work is needed to transform/build on top of the current LDA
> implementation to get L-LDA. Any thoughts?

Umm, cool! In the paper we used Gibbs sampling to do the inference,
and the implementation in Mahout uses variational inference (because
it distributes better). I don't see any obvious problems in terms of
math, and so the rest is just fitting it in the system.

I think a small amount of refactoring would be in order to make things
more generic, and then it shouldn't be too hard to plug in. I'll add
it to my list, but I'm swamped for quite some time.

-- David

> Robin
> On Thu, Oct 8, 2009 at 11:50 PM, David Hall  wrote:
>>
>> The short answer is, that it probably won't help all that much. Naive
>> Bayes is unreasonably good when you have enough data.
>>
>> The long answer is, I have a paper with Dan Ramage and Ramesh
>> Nallapati that talks about how to do it.
>>
>> www.aclweb.org/anthology-new/D/D09/D09-1026.pdf
>>
>> In some sense, "Labeled-LDA" is a kind of Naive Bayes where you can
>> have more than one class per document. If you have exactly one class
>> per document, then LDA reduces to Naive Bayes (or the unsupervised
>> variant of naive bayes which is basically k-means in multinomial
>> space). If instead you wanted to project W words to K topics, with K >
>> numWords, then there is something to do...
>>
>> That something is something like:
>>
>> 1) get p(topic|word,document) for each word in each document (which is
>> output by LDAInference). Those are your expected counts for each
>> topic.
>>
>> 2) For each class, do something like:
>> p(topic|class) \propto  \sum_{document with that class,word}
>> p(topic|word,document)
>>
>> Then just apply bayes rule to do classification:
>>
>> p(class|topics,document) \propto p(class) \prod p(topic|class,document)
>>
>> -- David
>>
>> On Thu, Oct 8, 2009 at 11:07 AM, Robin Anil  wrote:
>> > Thanks. Didn't see that, fixed it!
>> > I have a query:
>> > How is the LDA topic model used to improve a classifier, say Naive
>> > Bayes? If it's possible, then I would like to integrate it into Mahout.
>> > Given m classes and the associated documents, one can build m topic
>> > models, right? (A set of topics (words) under each label and the
>> > associated probability distribution of words.)
>> > How can I use that info to weight the most relevant topic of a class?
>> >
>> >
>>
>> >> LDA has two meanings: linear discriminant analysis and latent
>> >> dirichlet allocation. My code is the latter. The former is a kind of
>> >> classification. You say linear discriminant analysis in the outline.
>> >>
>>
>
>
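
The recipe quoted above (steps 1 and 2, followed by Bayes' rule) can be
sketched in a few lines. This is a minimal illustration, not Mahout code. It
assumes the per-document expected topic counts have already been obtained by
summing p(topic|word,document) over the words of each document. The function
names, the smoothing constant alpha, and the toy data are all hypothetical.
It also reads the final product as a multinomial Naive Bayes over expected
topic counts, which is only one possible interpretation of the formula.

```python
import math
from collections import defaultdict


def train(doc_topic_counts, doc_labels, num_topics, alpha=1.0):
    """Estimate log p(class) and log p(topic|class) from expected topic counts.

    doc_topic_counts: one list per document; entry k is the sum over the
        document's words of p(topic k | word, document).
    doc_labels: one set of class labels per document (several are allowed).
    alpha: additive smoothing constant (an assumption, not from the thread).
    """
    class_topic = defaultdict(lambda: [0.0] * num_topics)
    class_docs = defaultdict(int)
    for counts, labels in zip(doc_topic_counts, doc_labels):
        for label in labels:
            class_docs[label] += 1
            for k, c in enumerate(counts):
                class_topic[label][k] += c  # step 2: sum over documents

    total = sum(class_docs.values())
    log_prior = {c: math.log(n / total) for c, n in class_docs.items()}
    log_topic = {}
    for c, sums in class_topic.items():
        norm = sum(sums) + alpha * num_topics
        log_topic[c] = [math.log((s + alpha) / norm) for s in sums]
    return log_prior, log_topic


def classify(counts, log_prior, log_topic):
    """Return the class maximizing log p(class) + sum_k n_k log p(k|class)."""
    scores = {
        c: log_prior[c] + sum(n * lt for n, lt in zip(counts, log_topic[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy usage with two topics and two classes (made-up numbers).
docs = [[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0]]
labels = [{"sports"}, {"sports"}, {"politics"}, {"politics"}]
log_prior, log_topic = train(docs, labels, num_topics=2)
print(classify([7.0, 3.0], log_prior, log_topic))
```

Working in log space avoids underflow when the product runs over many words,
and a document carrying several labels simply contributes its counts to each
of its classes.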


[jira] Commented: (MAHOUT-165) Using better primitives hash for sparse vector for performance gains

2009-10-15 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766231#action_12766231
 ] 

Grant Ingersoll commented on MAHOUT-165:


Shashi's vectors are at:  
http://people.apache.org/~gsingers/mahout/vectors-test-mahout.gz.

> Using better primitives hash for sparse vector for performance gains
> 
>
> Key: MAHOUT-165
> URL: https://issues.apache.org/jira/browse/MAHOUT-165
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Affects Versions: 0.2
> Reporter: Shashikant Kore
> Assignee: Grant Ingersoll
> Fix For: 0.2
>
> Attachments: colt.jar, mahout-165-trove.patch, 
> MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
>
> In SparseVector, we need a primitive hash map for indices and values. The
> present implementation of this hash map is not as efficient as some of the
> other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, Colt's primitive
> hash map performs an order of magnitude better than OrderedIntDoubleMapping.
> For iteration it is 2x slower, though.
> Using Colt in SparseVector improved the performance of canopy generation. For
> an experimental dataset, the current implementation takes 50 minutes. Using
> Colt reduces this to 19-20 minutes, roughly a 60% reduction in running
> time.
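
For readers unfamiliar with the idea, a primitive int-to-double hash map can
be sketched with two flat arrays and linear probing. This is only an
illustration of the general technique, not Colt's implementation and not
Mahout code. The class name, the load factor of 0.5, and the use of -1 as the
"free slot" marker (which assumes non-negative vector indices) are choices
made for this sketch.

```python
from array import array


class IntDoubleOpenHashMap:
    """Open-addressing int -> double map backed by flat primitive arrays."""

    _FREE = -1  # marks an empty slot; keys are assumed to be non-negative

    def __init__(self, capacity=16):
        self._keys = array("i", [self._FREE] * capacity)
        self._values = array("d", [0.0] * capacity)
        self._size = 0

    def _slot(self, key):
        # Linear probing: walk forward until the key or a free slot is found.
        # The load factor stays at or below 0.5, so a free slot always exists.
        n = len(self._keys)
        i = key % n
        while self._keys[i] != self._FREE and self._keys[i] != key:
            i = (i + 1) % n
        return i

    def get(self, key, default=0.0):
        if key < 0:
            return default
        i = self._slot(key)
        return self._values[i] if self._keys[i] == key else default

    def set(self, key, value):
        if key < 0:
            raise ValueError("keys must be non-negative")
        if (self._size + 1) * 2 > len(self._keys):
            self._grow()
        i = self._slot(key)
        if self._keys[i] != key:
            self._keys[i] = key
            self._size += 1
        self._values[i] = value

    def _grow(self):
        # Double the table and re-insert every live entry.
        old_keys, old_values = self._keys, self._values
        capacity = len(old_keys) * 2
        self._keys = array("i", [self._FREE] * capacity)
        self._values = array("d", [0.0] * capacity)
        self._size = 0
        for k, v in zip(old_keys, old_values):
            if k != self._FREE:
                self.set(k, v)

    def __len__(self):
        return self._size

    def items(self):
        # Iteration has to scan every slot, including the empty ones.
        for k, v in zip(self._keys, self._values):
            if k != self._FREE:
                yield k, v


# Toy usage: a sparse vector with three non-zero entries.
vec = IntDoubleOpenHashMap()
vec.set(3, 1.5)
vec.set(1000, 2.0)
vec.set(42, -0.25)
print(len(vec), vec.get(1000), vec.get(7))
```

Get and set touch only a few adjacent array cells on average, while iteration
walks the whole table and yields entries in no particular order, which is the
kind of trade-off the issue description reports.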

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: 0.2

2009-10-15 Thread Grant Ingersoll
OK.  The Sparse vector improvements we have now are already a lot  
faster than what was in 0.1, so that is good.  I'd suggest that  
whoever is the Release Mgr. for this release takes care of the signing  
stuff.  I'll look at the Label (LLR) stuff by Monday.


On Oct 15, 2009, at 1:02 PM, Jeff Eastman wrote:

> I'd vote to delay 165 for 0.3 but do it in trunk asap after 0.2 so
> folks can get their hands on it.
>
> Sean Owen wrote:
>> It still sounds somewhat significant to me. Either it's rushed or
>> takes a while and both seem negative.
>
> +1 This is why
>
>> I think it is vital, at least, to put a schedule on this, or else we
>> are basically saying 0.2 is to not be released indefinitely, and
>> that's no good. Last time we said we'd finish up and release this was
>> 2 weeks ago, and there hasn't been progress on this issue.
>>
>> I'm starting to feel strongly enough to call for a vote?
>>
>> On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote:
>>> I don't think it is that big.  We can likely just make another
>>> implementation of Vector.  We don't have to convert everything to Colt.


Re: 0.2

2009-10-15 Thread Jeff Eastman
I'd vote to delay 165 for 0.3 but do it in trunk asap after 0.2 so folks 
can get their hands on it.


Sean Owen wrote:
> It still sounds somewhat significant to me. Either it's rushed or
> takes a while and both seem negative.

+1 This is why

> I think it is vital, at least, to put a schedule on this, or else we
> are basically saying 0.2 is to not be released indefinitely, and
> that's no good. Last time we said we'd finish up and release this was
> 2 weeks ago, and there hasn't been progress on this issue.
>
> I'm starting to feel strongly enough to call for a vote?
>
> On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote:
>> I don't think it is that big.  We can likely just make another
>> implementation of Vector.  We don't have to convert everything to Colt.


Re: 0.2

2009-10-15 Thread Jake Mannix
On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll wrote:

> On Oct 15, 2009, at 8:22 AM, Sean Owen wrote:
>
>> On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll wrote:
>>
>>>> MAHOUT-165  Using better primitives hash for sparse vector for
>>>> performance gains  Open  14/Oct/09
>>>>
>>>> Per discussion, move the remainder (migration to Colt or something) to
>>>> 0.3
>>>
>>> I will try to get to this, as I think it is important.
>>
>> I agree with Jeff that the migration to a new framework is a big
>> change and should be left to 0.3. (Vote?) There is a whole lot of
>> change already, more than might normally go into a point release.
>> Since you have another blocker below, and limited time, I say don't
>> kill yourself to work on this. It's going to be hard to get it done in
>> a weekend.
>
> I don't think it is that big.  We can likely just make another
> implementation of Vector.  We don't have to convert everything to Colt.

Ted's patch (since monkeyed with by you and myself) has the other
implementation of Vector, but testing showed it's slower?  This patch also
had a significant refactoring of the Vector hierarchy, so it's not just "a
new class".

I'm all for getting this in as soon as we can, because this issue (well,
finalizing on a linear api) pretty much blocks my donating decomposer to
Mahout, but it looks like you're the only one who feels strongly about
resolving M-165 for 0.2, Grant.

Can we not just have 0.3 in another 6-8 weeks or so which covers this?  What
Mahout user is getting blocked by having too-slow sparse vectors currently?

  -jake


[jira] Resolved: (MAHOUT-138) Convert main() methods to use Commons CLI for argument processing

2009-10-15 Thread Isabel Drost (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Isabel Drost resolved MAHOUT-138.
-

   Resolution: Fixed
Fix Version/s: 0.2 (was: 0.3)

The last check-in changed the remaining classes, so at least grep no longer
finds any usages of 'args\[' anywhere in our source code.

> Convert main() methods to use Commons CLI for argument processing
> -
>
> Key: MAHOUT-138
> URL: https://issues.apache.org/jira/browse/MAHOUT-138
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.2
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.2
>
> Attachments: MAHOUT-138.patch, MAHOUT-138_fuzzyKMeansJob.patch
>
>
> Commons CLI is in the classpath and makes it much easier to handle command
> line args, which are also more self-documenting when done right.  We should
> convert our main methods to use CLI.




Re: 0.2

2009-10-15 Thread Sean Owen
It still sounds somewhat significant to me. Either it's rushed or
takes a while and both seem negative.

I think it is vital, at least, to put a schedule on this, or else we
are basically saying 0.2 is to not be released indefinitely, and
that's no good. Last time we said we'd finish up and release this was
2 weeks ago, and there hasn't been progress on this issue.

I'm starting to feel strongly enough to call for a vote?

On Thu, Oct 15, 2009 at 6:47 AM, Grant Ingersoll  wrote:
> I don't think it is that big.  We can likely just make another
> implementation of Vector.  We don't have to convert everything to Colt.


Re: 0.2

2009-10-15 Thread Grant Ingersoll


On Oct 15, 2009, at 8:22 AM, Sean Owen wrote:

> On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll wrote:
>>> MAHOUT-165  Using better primitives hash for sparse vector for
>>> performance gains  Open  14/Oct/09
>>>
>>> Per discussion, move the remainder (migration to Colt or something)
>>> to 0.3
>>
>> I will try to get to this, as I think it is important.
>
> I agree with Jeff that the migration to a new framework is a big
> change and should be left to 0.3. (Vote?) There is a whole lot of
> change already, more than might normally go into a point release.
> Since you have another blocker below, and limited time, I say don't
> kill yourself to work on this. It's going to be hard to get it done in
> a weekend.

I don't think it is that big.  We can likely just make another
implementation of Vector.  We don't have to convert everything to Colt.

>>> MAHOUT-114  Release Process Needs to sign published dependencies such
>>> as Hadoop, etc.  Open  06/Apr/09
>>>
>>> Not clear on status here, mark as 0.3?
>>
>> This is a blocker for 0.2 and thus must be completed.  That being said, I
>> think Hadoop is now publishing to the Maven repo, so we may be able to
>> stop our own publishing of Hadoop.

Re: 0.2

2009-10-15 Thread Sean Owen
On Thu, Oct 15, 2009 at 4:57 AM, Grant Ingersoll  wrote:
>>        MAHOUT-165      Using better primitives hash for sparse vector for
>> performance gains                Open    14/Oct/09
>>
>> Per discussion, move the remainder (migration to Colt or something) to 0.3
>
> I will try to get to this, as I think it is important.

I agree with Jeff that the migration to a new framework is a big
change and should be left to 0.3. (Vote?) There is a whole lot of
change already, more than might normally go into a point release.
Since you have another blocker below, and limited time, I say don't
kill yourself to work on this. It's going to be hard to get it done in
a weekend.


>>        MAHOUT-114      Release Process Needs to sign published
>> dependencies such
>> as Hadoop, etc.          Open    06/Apr/09
>>
>> Not clear on status here, mark as 0.3?
>
> This is a blocker for 0.2 and thus must be completed.  That being said, I
> think Hadoop is now publishing to the Maven repo, so we may be able to stop
> our own publishing of Hadoop.
>
>


Re: 0.2

2009-10-15 Thread Grant Ingersoll


On Oct 15, 2009, at 7:21 AM, Sean Owen wrote:

> Here's what is marked 0.2 plus suggested actions. I am basically
> suggesting the things that are 'pretty ready' be submitted and
> published -- if they're 85% done, definitely good enough for an 0.2
> release, and worth getting them play-tested. (Or else, decide they
> need another month or two, and mark for 0.3) And then that takes care
> of just about everything for 0.2
>
> MAHOUT-163  Get (better) cluster labels using Log Likelihood Ratio
> Open  17/Sep/09
>
> No recent action here, but seemed ready enough to submit as of last
> patch. Do so or mark 0.3?

I will make sure to get this one in before the release.

> MAHOUT-171  Move deployment to repository.apache.org  Open  02/Oct/09
>
> Seems ready to submit?

+1. It would be great to have Maven snapshots available for nightly
builds.

> MAHOUT-185  Add mahout shell script for easy launching of various
> algorithms  Open  06/Oct/09
>
> Very new, sounds like something for 0.3

This is mostly for convenience.  Would be nice to have in 0.2, but not
a show stopper.

> MAHOUT-170  Enable Java compile optimize flag during build  Open  07/Oct/09
>
> Go ahead and submit? the original change seemed quite uncontroversial.
> Robin suggested a further change. Either submit or mark 0.3
>
> MAHOUT-186  Classifier PriorityQueue returns erroneous results
> Patch Available  08/Oct/09
>
> Two patches available. I would like my patch for this issue to get
> some feedback -- would prefer it be submitted or some even better
> hybrid of it and the first patch.
>
> MAHOUT-148  Convert Classification Algs to use richer Writable
> syntax  Patch Available  10/Oct/09
>
> Ready to submit?
>
> MAHOUT-157  Frequent Pattern Mining using Parallel FP-Growth
> Patch Available  13/Oct/09
>
> Seems like still work in progress. If it's 'good enough', submit and
> continue iterating. Or mark 0.3
>
> MAHOUT-165  Using better primitives hash for sparse vector for
> performance gains  Open  14/Oct/09
>
> Per discussion, move the remainder (migration to Colt or something)
> to 0.3

I will try to get to this, as I think it is important.

> MAHOUT-106  PLSI/EM in pig based on hofmann's ACM 04 paper.
> Patch Available  27/Aug/09
>
> This looks like something better tagged as 'unknown version'; don't
> understand the status

I had hoped to do this, but let's move it to 0.3

> MAHOUT-114  Release Process Needs to sign published dependencies such
> as Hadoop, etc.  Open  06/Apr/09
>
> Not clear on status here, mark as 0.3?

This is a blocker for 0.2 and thus must be completed.  That being
said, I think Hadoop is now publishing to the Maven repo, so we may be
able to stop our own publishing of Hadoop.



[jira] Commented: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-10-15 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12766030#action_12766030
 ] 

Isabel Drost commented on MAHOUT-157:
-

The patch looks good to me. Good work Robin.

> Frequent Pattern Mining using Parallel FP-Growth
> 
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
> Issue Type: New Feature
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.2
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
> MAHOUT-157-codecleanup-javadocs.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-Combinations-BSD-License.patch, 
> MAHOUT-157-CompactTransactionMapperFormat.patch, MAHOUT-157-final.patch, 
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-Oct-1.patch, 
> MAHOUT-157-Oct-10.pfpgrowth.patch, MAHOUT-157-Oct-8.pfpgrowth.patch, 
> MAHOUT-157-Oct-8.TestedMapReducePipeline.patch, 
> MAHOUT-157-Oct-9.StreamingDBRead-Inprogress.patch, 
> MAHOUT-157-September-10.patch, MAHOUT-157-September-18.patch, 
> MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf




Re: 0.2

2009-10-15 Thread Sean Owen
Here's what is marked 0.2 plus suggested actions. I am basically
suggesting the things that are 'pretty ready' be submitted and
published -- if they're 85% done, definitely good enough for an 0.2
release, and worth getting them play-tested. (Or else, decide they
need another month or two, and mark for 0.3) And then that takes care
of just about everything for 0.2


MAHOUT-163  Get (better) cluster labels using Log Likelihood Ratio
Open  17/Sep/09

No recent action here, but seemed ready enough to submit as of last
patch. Do so or mark 0.3?

MAHOUT-171  Move deployment to repository.apache.org  Open  02/Oct/09

Seems ready to submit?

MAHOUT-185  Add mahout shell script for easy launching of various
algorithms  Open  06/Oct/09

Very new, sounds like something for 0.3

MAHOUT-170  Enable Java compile optimize flag during build  Open  07/Oct/09

Go ahead and submit? the original change seemed quite uncontroversial.
Robin suggested a further change. Either submit or mark 0.3

MAHOUT-186  Classifier PriorityQueue returns erroneous results
Patch Available  08/Oct/09

Two patches available. I would like my patch for this issue to get
some feedback -- would prefer it be submitted or some even better
hybrid of it and the first patch.

MAHOUT-148  Convert Classification Algs to use richer Writable
syntax  Patch Available  10/Oct/09

Ready to submit?

MAHOUT-157  Frequent Pattern Mining using Parallel FP-Growth
Patch Available  13/Oct/09

Seems like still work in progress. If it's 'good enough', submit and
continue iterating. Or mark 0.3

MAHOUT-165  Using better primitives hash for sparse vector for
performance gains  Open  14/Oct/09

Per discussion, move the remainder (migration to Colt or something) to 0.3

MAHOUT-106  PLSI/EM in pig based on hofmann's ACM 04 paper.
Patch Available  27/Aug/09

This looks like something better tagged as 'unknown version'; don't
understand the status

MAHOUT-114  Release Process Needs to sign published dependencies such
as Hadoop, etc.  Open  06/Apr/09

Not clear on status here, mark as 0.3?

On Mon, Oct 12, 2009 at 11:05 AM, Sean Owen  wrote:
> I am ready too. Same question, what is left that must block 0.2 and what is
> the ETA looking like?
>
> On Oct 12, 2009 6:07 PM, "Robin Anil"  wrote:
>
> Everything looks good from my side. I will work on the launcher and tidying
> up Bayes classifier, the next couple of days. Any idea on a target date? If
> there is time, I would like to spend those precious amazon credits to
> register some performance numbers.
> Robin
>
> On Tue, Oct 6, 2009 at 5:53 PM, Isabel Drost  wrote: > On
> Tue, 6 Oct 2009 17:36...