Re: [jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2009-11-25 Thread Ted Dunning
Practically speaking, the huge advantage of the abstract class is that you
have lower update requirements and less duplicated code when augmenting the
interface.  Yes, you can do the dual thing, but the practical experience
with Hadoop and Lucene has been that using just an abstract class that is
named as if it were an interface works better in the long run.  The
update requirements become very onerous when you are dealing with more than
one package that has to be updated (and which can't for some reason be
updated simultaneously).

When adding methods, the standard practice is to add an implementation that
throws UnsupportedOperationException or something similar.  Yes, you can do
this with interface+abstract if *everybody* codes just the right way, but
with the abstract-only approach, there is one less thing for people to do
wrong.
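
A minimal sketch of the pattern described above, in Java. All class and method names here (AbstractVector, LegacyVector, DenseVector, dot) are hypothetical and chosen only for illustration; this is not Mahout's actual vector API. It shows a method being added to a published abstract class with a default body that throws UnsupportedOperationException, so a subclass written against the older release keeps compiling.

```java
// Hypothetical names throughout; this is not Mahout's actual Vector API.
abstract class AbstractVector {
    // Original contract that every implementation must supply.
    public abstract int size();

    public abstract double get(int index);

    // Method added in a later release. Existing subclasses still compile
    // and only fail if a caller actually invokes the new method on them.
    public double dot(AbstractVector other) {
        throw new UnsupportedOperationException(
            "dot is not supported by " + getClass().getSimpleName());
    }
}

// Written against the first release and never updated.
final class LegacyVector extends AbstractVector {
    private final double[] values;

    LegacyVector(double... values) {
        this.values = values.clone();
    }

    @Override
    public int size() {
        return values.length;
    }

    @Override
    public double get(int index) {
        return values[index];
    }
}

// Updated to supply a real implementation of the new method.
final class DenseVector extends AbstractVector {
    private final double[] values;

    DenseVector(double... values) {
        this.values = values.clone();
    }

    @Override
    public int size() {
        return values.length;
    }

    @Override
    public double get(int index) {
        return values[index];
    }

    @Override
    public double dot(AbstractVector other) {
        if (other.size() != size()) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < size(); i++) {
            sum += get(i) * other.get(i);
        }
        return sum;
    }
}

class AbstractClassDemo {
    public static void main(String[] args) {
        AbstractVector updated = new DenseVector(1.0, 2.0, 3.0);
        AbstractVector legacy = new LegacyVector(1.0, 2.0, 3.0);

        // The updated subclass can use the new method, even against a legacy argument.
        System.out.println(updated.dot(legacy));

        // The legacy subclass compiles unchanged but rejects the new method at call time.
        try {
            legacy.dot(updated);
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With an interface instead, adding dot would have broken the build of every implementor outside the package until each one was updated, which is the cost the message is pointing at.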

I took a long time to come around to this pattern of coding, but I finally
agree that publishing abstract classes really is better except where you
have to have an interface (for RPC or multiple inheritance).  It only takes
a little bit of outside coding to run into the problem and the social cost
can be enormous.

On Tue, Nov 24, 2009 at 1:09 PM, Sean Owen wrote:

> ...
> Abstract classes afford the possibility of adding methods plus
> implementation, without breaking anybody, so yeah I'm into abstract
> classes. But then that's no argument against an abstract class +
> interface, which would add a small bit of flexibility too.
>


[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782670#action_12782670
 ] 

Sean Owen commented on MAHOUT-204:
--

OK, I have another pretty big round of changes queued up, as per my last 
comments. I've also deleted the 'test' classes and demo code, since neither 
appears to be maintained and neither is a unit test.

Before I get into some fine-grained work, can anyone comment on what definitely 
isn't needed, so I don't bother with it? Otherwise I assume it's basically just 
linear algebra and matrices -- not stats stuff, etc.

> Better integration of Mahout matrix capabilities with Colt Matrix additions
> ---
>
> Key: MAHOUT-204
> URL: https://issues.apache.org/jira/browse/MAHOUT-204
> Project: Mahout
> Issue Type: Improvement
> Affects Versions: 0.3
> Reporter: Grant Ingersoll
> Fix For: 0.3
>
> Attachments: MAHOUT-204-author-cleanup.patch
>
>
> Per MAHOUT-165, we need to refactor the matrix package structures a bit to be 
> more coherent and clean.  For instance, there are two levels of matrix 
> packages now, so those should be rectified.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-11) Static fields used throughout clustering code (Canopy, K-Means).

2009-11-25 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782470#action_12782470
 ] 

Isabel Drost commented on MAHOUT-11:


Drew, go ahead then.

> Static fields used throughout clustering code (Canopy, K-Means).
> 
>
> Key: MAHOUT-11
> URL: https://issues.apache.org/jira/browse/MAHOUT-11
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.1
> Reporter: Dawid Weiss
> Fix For: 0.3
>
> Attachments: MAHOUT-11-kmeans-cleanup.patch, 
> MAHOUT-11-RandomSeedGenerator.patch, MAHOUT-11.patch
>
>
> I file this as a bug, even though I'm not 100% sure it is one. In the current 
> code the information is exchanged via static fields (for example, distance 
> measure and thresholds for Canopies are static fields). Is it always true in 
> Hadoop that one job runs inside one JVM with exclusive access? I haven't seen 
> it anywhere in Hadoop documentation and my impression was that everything 
> uses JobConf to pass configuration to jobs, but jobs are configured on a 
> per-object basis (a job is an object, a mapper is an object and everything 
> else is basically an object).
> If it's possible for two jobs to run in parallel inside one JVM then this is 
> a limitation and bug in our code that needs to be addressed.
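
The concern in the issue description can be illustrated with a small, self-contained Java sketch. The classes below (StaticCanopyConfig, CanopyJob) are hypothetical and only mimic the shape of the problem; they are not Mahout or Hadoop code, and the threshold values are made up. The sketch assumes two jobs are configured one after the other in the same JVM.

```java
// Hypothetical classes; they only mimic the shape of the problem
// and are not taken from Mahout or Hadoop.
final class StaticCanopyConfig {
    // One value shared by every job that runs in this JVM.
    static double t1;

    static void configure(double threshold) {
        t1 = threshold;
    }
}

final class CanopyJob {
    // Per-job state: each job object carries its own threshold.
    private final double t1;

    CanopyJob(double t1) {
        this.t1 = t1;
    }

    double threshold() {
        return t1;
    }
}

class StaticFieldDemo {
    public static void main(String[] args) {
        // Static style: configuring job B silently overwrites job A's setting.
        StaticCanopyConfig.configure(3.0); // intended for job A
        StaticCanopyConfig.configure(1.5); // intended for job B
        System.out.println("static field, as seen by job A: " + StaticCanopyConfig.t1);

        // Instance style: each job keeps the value it was configured with.
        CanopyJob jobA = new CanopyJob(3.0);
        CanopyJob jobB = new CanopyJob(1.5);
        System.out.println("instance field, job A: " + jobA.threshold());
        System.out.println("instance field, job B: " + jobB.threshold());
    }
}
```

Whether Hadoop ever runs two jobs in one JVM is exactly the open question in the report; the sketch only shows why the static version would be unsafe if it does.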




[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-25 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782466#action_12782466
 ] 

Jake Mannix commented on MAHOUT-204:


I think we should be pretty aggressive in removing code from this stuff.  There 
is some core stuff we want (linear stuff and collections, and morphisms which 
interact with them), and a ton of stuff we don't.  Maybe we should have 
separate jira tickets for each thing that could/should be removed?




[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782463#action_12782463
 ] 

Sean Owen commented on MAHOUT-204:
--

I've been attacking this all day. The changes are already big enough that I'm 
committing what I have, but I need to keep going. The major changes still 
left: replacing System.out calls with logging, and getting rid of all the type 
references that use the fully qualified package name.

There is lots of dead code and other practices that kind of concern me. If 
there are changes I think deserve discussion I'll surface them.

Note, I found some code in here that carries a different copyright: Copyright 
PIERSOL Engineering? See TestMatrix2D. It's commented out but I think it best 
to kill it. Along with the other commented out code actually.

Also class Gamma mentions it's a port of some code from 
http://www.sci.usq.edu.au/staff/leighb/graph/Top.html and a library called 
Cephes 2.2. Can't find these now. Should we be concerned?

Bottom line: there is a lot of work to be done on this code.




[jira] Commented: (MAHOUT-204) Better integration of Mahout matrix capabilities with Colt Matrix additions

2009-11-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782370#action_12782370
 ] 

Sean Owen commented on MAHOUT-204:
--

I'm still working on this. The reformatting was simple, but IntelliJ is having a 
field day with its inspections and I'm slogging through them all. I'm focusing 
mostly on style issues.




Re: SVM algo, code, etc.

2009-11-25 Thread Isabel Drost
On Fri Grant Ingersoll  wrote:
> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
> > Post a patch if you'd like to proceed, IMHO.
> +1

+1 from me as well. I would love to see solid svm support in Mahout.

Isabel