[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799638#action_12799638 ] Jake Mannix commented on MAHOUT-205: and since tests pass for me with this, I'll commit

[jira] Updated: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Mannix updated MAHOUT-205: --- Attachment: MAHOUT-205.patch up to date patch, with Robin's most recent Vectorizer commits merged in.

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799631#action_12799631 ] Robin Anil commented on MAHOUT-237: --- Ok. Done > Map/Reduce Implementation of Document Ve

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799625#action_12799625 ] Jake Mannix commented on MAHOUT-237: Looking at this a little: is there a reason why

[jira] Reopened: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil reopened MAHOUT-237: --- Reopening this to allow further review > Map/Reduce Implementation of Document Vectorizer > ---

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799560#action_12799560 ] Jake Mannix commented on MAHOUT-237: It appears that there is just a missing line above

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799557#action_12799557 ] Jake Mannix commented on MAHOUT-237: Given the following code in PartialVectorGenerator

Re: [jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Drew Farris
Hi Robin, I'm seeing some strangeness from this. I've got a directory with 100k documents. I build a sequence file using SequenceFilesFromDirectory, which emits 4 chunks for this particular dataset. I then dump each of the chunks using SequenceFileDumper. I only see 75,964 documents in the resulti

[jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-237. -- Resolution: Fixed > Map/Reduce Implementation of Document Vectorizer >

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799508#action_12799508 ] Sean Owen commented on MAHOUT-237: -- I'll commit -- still seeing some code inspection warni

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-01-12 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Attachment: DictionaryVectorizer.patch Uses StringReader. Removes unused imports and adds License hea

[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799295#action_12799295 ] Jake Mannix commented on MAHOUT-180: This is waiting on my MAHOUT-205 and MAHOUT-206 pa

[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module

2010-01-12 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799293#action_12799293 ] Jake Mannix commented on MAHOUT-205: It's not yet committed, because I was hoping at l

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Sean Owen
OK. The code throws up a number of warnings for me, like unused declarations and variables, missing copyrights, etc. Mind if I accept those before committing? On Tue, Jan 12, 2010 at 2:09 PM, Robin Anil wrote: > No, that's fixed. The StringReader change was to prevent encoding errors. This > patch wo

Re: JPPF

2010-01-12 Thread Sean Owen
Until this is reasonably stable code and getting traction I wouldn't think much about it. I doubt there will be a reason to seriously consider porting to something other than Hadoop anytime soon -- certainly we should get the Hadoop side of things in order first. On Tue, Jan 12, 2010 at 3:29 PM, G

JPPF

2010-01-12 Thread Grant Ingersoll
Thoughts on http://wiki.apache.org/incubator/JppfProposal? Seems like it might be useful. At some point, we may need APIs that go beyond M/R and provide more generalized distributed capabilities. -Grant

Re: Collections of primitives.

2010-01-12 Thread Benson Margulies
Dawid, Like I said, I'm not sure we're disagreeing. My focal goal is primitive collections, and I'm prepared to take my lumps with compatibility. Sun has made such a mess of the collections API that we seem forced to choose. --benson On Tue, Jan 12, 2010 at 9:28 AM, Dawid Weiss wrote: > Thanks

Re: Collections of primitives.

2010-01-12 Thread Dawid Weiss
Thanks for the clarification and understanding of my motives, Benson. I know Trove and I know other libraries of this type -- PCJ has been our favorite so far, but it's LGPL and our persistent attempts to ask Soren Bak to distribute that code under a different license have failed. Adapters are a

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Robin Anil
No, that's fixed. The StringReader change was to prevent encoding errors. This patch works just fine. In fact, I will post some numbers from running it against Wikipedia tonight. Robin On Tue, Jan 12, 2010 at 7:25 PM, Sean Owen wrote: > Sorry but isn't that the very problem you are trying to solve

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Sean Owen
Sorry, but isn't that the very problem you are trying to solve on this thread? Why do you want to commit this if it has this big memory problem? On Tue, Jan 12, 2010 at 12:49 PM, Robin Anil wrote: > https://issues.apache.org/jira/secure/attachment/12429906/DictionaryVectorizer.patch > > Haven't cha

[jira] Resolved: (MAHOUT-156) Documentation and Code cleanup for all Bayesian Classes

2010-01-12 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil resolved MAHOUT-156. --- Resolution: Duplicate > Documentation and Code cleanup for all Bayesian Classes > ---

[jira] Commented: (MAHOUT-156) Documentation and Code cleanup for all Bayesian Classes

2010-01-12 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799189#action_12799189 ] Robin Anil commented on MAHOUT-156: --- Oops. I guess I made a duplicate JIRA issue here. Ma

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Robin Anil
https://issues.apache.org/jira/secure/attachment/12429906/DictionaryVectorizer.patch Haven't changed the StringReader portion. The rest is OK to review. On Tue, Jan 12, 2010 at 4:47 PM, Sean Owen wrote: > > https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch > > Thi

Re: Collections of primitives.

2010-01-12 Thread Benson Margulies
Dawid, I find that I didn't quite answer all of your questions, and then again maybe I'm not in a position to. I started this by looking for some way to get the functionality of Trove without the GPL. When I discovered that Mahout had already absorbed Colt, I decided that the shortest path was to

[jira] Resolved: (MAHOUT-173) Implement clustering of massive-domain attributes

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-173. -- Resolution: Won't Fix > Implement clustering of massive-domain attributes > ---

Re: Collections of primitives.

2010-01-12 Thread Benson Margulies
Dawid, There is a model compromise out there: the Trove 'decorator' approach. I'm perfectly happy to follow that model to give people whatever value you can get from Java collection compatibility. I confess that I've been considering using it as an excuse to learn the CGM library and generate the

Collections of primitives.

2010-01-12 Thread Dawid Weiss
Hi guys, I see Benson working really hard on converting Colt primitive collections to Mahout -- this is a great effort, really, since no such library currently exists with an Apache or BSD license. I wanted to ask you if compatibility with Java Collections is something you consider crucial for a se

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

2010-01-12 Thread Vaijanath N. Rao (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799163#action_12799163 ] Vaijanath N. Rao commented on MAHOUT-173: - Hi Sean, This can be subsumed by Mahout

[jira] Commented: (MAHOUT-205) Pull Writable (and anything else hadoop dependent) out of the matrix module

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799157#action_12799157 ] Sean Owen commented on MAHOUT-205: -- Just checking, did this get committed? I know I had pr

[jira] Commented: (MAHOUT-180) port Hadoop-ified Lanczos SVD implementation from decomposer

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799155#action_12799155 ] Sean Owen commented on MAHOUT-180: -- Is this unblocked now that much of the Math stuff has

[jira] Commented: (MAHOUT-173) Implement clustering of massive-domain attributes

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799154#action_12799154 ] Sean Owen commented on MAHOUT-173: -- Just clarifying the status -- Vaijanath are you workin

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Sean Owen
https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch This one? It still seems to have the issues described in this thread. Where's the latest one? On Tue, Jan 12, 2010 at 9:08 AM, Robin Anil wrote: > Hi Sean, could you take a look at the patch and comment? > > Robin

[jira] Resolved: (MAHOUT-163) Get (better) cluster labels using Log Likelihood Ratio

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-163. -- Resolution: Fixed It sounds like this was committed, so resolving (?) > Get (better) cluster labels us

[jira] Commented: (MAHOUT-156) Documentation and Code cleanup for all Bayesian Classes

2010-01-12 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799151#action_12799151 ] Sean Owen commented on MAHOUT-156: -- Is this actually done, subsumed in other changes? or i

Re: SparseVectors writing out a lot of data

2010-01-12 Thread Robin Anil
Hi Sean, could you take a look at the patch and comment? Robin On Mon, Jan 11, 2010 at 10:39 PM, Sean Owen wrote: > If one needs a Reader based on the contents of a String, the > StringReader is a far better way of doing this. This also has > potential character set issues if the platform's def
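
Editor's note on the StringReader point quoted above: the patch code under discussion is not shown in this listing, so the following is only a minimal Java sketch of the general issue, assuming the original built a Reader by converting the String to bytes and decoding them again. The class and method names (StringReaderSketch, viaBytes, viaStringReader) are hypothetical and are not taken from the Mahout patch. The charsets are passed explicitly here to make the mismatch reproducible; with the no-argument `getBytes()` or `InputStreamReader(InputStream)` the platform default charset would be used instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class StringReaderSketch {

  /** Drains a Reader into a String, closing it afterwards. */
  static String readAll(Reader reader) {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[1024];
    try (Reader r = reader) {
      int n;
      while ((n = r.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }

  /**
   * Hypothetical "round trip through bytes": encode the text with one
   * charset and decode it with another. If the two differ, non-ASCII
   * characters can come back garbled.
   */
  static String viaBytes(String text, String encodeWith, String decodeWith) {
    byte[] bytes = text.getBytes(Charset.forName(encodeWith));
    return readAll(new InputStreamReader(
        new ByteArrayInputStream(bytes), Charset.forName(decodeWith)));
  }

  /** StringReader reads the characters directly; no charset is involved. */
  static String viaStringReader(String text) {
    return readAll(new StringReader(text));
  }

  public static void main(String[] args) {
    String text = "caf\u00e9";
    // StringReader preserves the text exactly.
    System.out.println(viaStringReader(text).equals(text));               // true
    // Mismatched encode/decode charsets corrupt the accented character.
    System.out.println(viaBytes(text, "UTF-8", "ISO-8859-1").equals(text)); // false
    // Matching charsets round-trip correctly, at the cost of an extra copy.
    System.out.println(viaBytes(text, "UTF-8", "UTF-8").equals(text));     // true
  }
}
```

Besides avoiding the charset dependency, the StringReader path skips the intermediate byte array, which may matter when many documents are tokenized.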