[jira] Commented: (MAHOUT-393) Distributed item similarity functions

2010-05-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865599#action_12865599 ] Sean Owen commented on MAHOUT-393: -- Unless, I missed something, and the unit tests d

[jira] Resolved: (MAHOUT-393) Distributed item similarity functions

2010-05-09 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-393. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Done, I committed with only

[jira] Commented: (MAHOUT-392) Test cases for logGamma, Distribution.normal and Distribution.beta, fix for Distribution.normal

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865429#action_12865429 ] Sean Owen commented on MAHOUT-392: -- I won't argue about it, since it's tiny

[jira] Updated: (MAHOUT-376) Implement Map-reduce version of stochastic SVD

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-376: - Issue Type: Improvement (was: Bug) Assignee: Ted Dunning Fix Version/s: 0.4

[jira] Commented: (MAHOUT-392) Test cases for logGamma, Distribution.normal and Distribution.beta, fix for Distribution.normal

2010-05-08 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865424#action_12865424 ] Sean Owen commented on MAHOUT-392: -- Tiny comments: - You could inline those b0, b1,

Re: meaning of isSequentialAccess?

2010-05-08 Thread Sean Owen
It returns 'true' since it can be iterated in order efficiently -- really it also implies that the iterators iterate in order. For purposes of serialization, that bit of information is unused since it is encoded with a dense representation. I can quantify the meaning of these flags more in the jav

Re: Build failed in Hudson: Mahout Trunk #615

2010-05-07 Thread Sean Owen
Something remains screwed up here and it's ultimately my fault. My utils/ is not building. I am on it though, will address this pronto. On Fri, May 7, 2010 at 10:43 AM, Apache Hudson Server wrote: > See >

Re: BUILD FAILURE

2010-05-06 Thread Sean Owen
Blast, I'll take a look. I knew it was too easy. I was not seeing such failures but from the stack trace maybe I can figure out what's up. On Thu, May 6, 2010 at 8:45 PM, Tamas Jambor wrote: > just updated the SVN to get Sean's implementation, but now it fails to build > the project. I haven't cha

[jira] Resolved: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-389. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Committed patch #3 with some

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864851#action_12864851 ] Sean Owen commented on MAHOUT-391: -- Hmm, I got similar results from a crude

[jira] Commented: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864812#action_12864812 ] Sean Owen commented on MAHOUT-389: -- I'm happy to commit this since it looks fi

[jira] Resolved: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-06 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-302. -- Resolution: Fixed > Change tests to use temp directories instead of output, testd

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864308#action_12864308 ] Sean Owen commented on MAHOUT-391: -- Oh I get it. My other outstanding patch for MA

[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864303#action_12864303 ] Sean Owen commented on MAHOUT-391: -- Bleh, that's not much of a difference at

[jira] Updated: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-391: - Attachment: MAHOUT-391.patch > Make vector more space efficient with variable-length encoding, et

[jira] Created: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Sean Owen (JIRA)
: Improvement Affects Versions: 0.3 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 0.4 There are a few things we can do to make Vector representations smaller on disk: - Use variable-length encoding for integer values like size

Re: [jira] Commented: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-05 Thread Sean Owen
I actually didn't intend to introduce variables used only once. I could have made an error but usually it was because some expression needed to be a Path, and was used at least twice. Sometimes I might have done it for parallel style consistency across several methods. So at least we agree there.

[jira] Updated: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-05-04 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-302: - Attachment: MAHOUT-302.patch Holy moly this took a lot of work. What I attempt to do is centralize all

[jira] Updated: (MAHOUT-389) UncenteredCosineSimilarity

2010-05-04 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-389: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Major) I should add you could emulate

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
It's the same approach to variable-length encoding, yes. Zig-zag is a trick to make negative numbers "compatible" with this encoding. Because two's-complement negative numbers start with a bunch of 1s their representation is terrible under this variable-length encoding -- always of maximum length.
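The zig-zag trick described above can be sketched as follows. This is a minimal illustration, not Mahout's or Protocol Buffers' actual code; the function names `zigzag_encode`, `zigzag_decode`, and `varint_len` are made up for this example, and a 32-bit integer width is assumed.

```python
def zigzag_encode(n: int, bits: int = 32) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    # n >> (bits - 1) is 0 for non-negative n and -1 (all ones) for negative n.
    return ((n << 1) ^ (n >> (bits - 1))) & ((1 << bits) - 1)


def zigzag_decode(z: int) -> int:
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_len(u: int) -> int:
    """Bytes needed to store unsigned u at 7 payload bits per byte."""
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length


# -1 as a 32-bit two's-complement pattern is 0xFFFFFFFF: maximum length.
plain_len = varint_len(-1 & 0xFFFFFFFF)
# After zig-zag, -1 becomes 1 and fits in a single byte.
zigzag_len = varint_len(zigzag_encode(-1))
```

The point of the mapping is only to move the sign into the lowest bit, so that the leading bits of small negative numbers are zeros rather than ones.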

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That much is expected right? Since it stores a 4-byte index along with each 8-byte double value, the sparse representation is bigger when over 8/(4+8) = 66% of the values are non-default / non-zero. But variable-encoding the index value trims a byte or more per element depending on your assumption
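The break-even arithmetic above can be checked with a small sketch. It assumes fixed-width fields only (a 4-byte index and an 8-byte double per stored element) and ignores any per-vector header; the helper names are hypothetical.

```python
INDEX_BYTES = 4   # fixed-width int index stored with each sparse element
VALUE_BYTES = 8   # double value


def dense_size(cardinality: int) -> int:
    """Dense form: one double per position, zero or not."""
    return cardinality * VALUE_BYTES


def sparse_size(non_zero: int) -> int:
    """Sparse form: (index, value) pair per non-default element."""
    return non_zero * (INDEX_BYTES + VALUE_BYTES)


def break_even_fraction() -> float:
    """Fraction of non-zero elements above which sparse is larger than dense."""
    return VALUE_BYTES / (INDEX_BYTES + VALUE_BYTES)


# For a vector of cardinality 1000:
# 600 non-zeros -> sparse is 7200 bytes vs 8000 dense (sparse wins)
# 700 non-zeros -> sparse is 8400 bytes vs 8000 dense (dense wins)
```

If the index is variable-length encoded instead, `INDEX_BYTES` shrinks for small indices and the break-even fraction moves up accordingly.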

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
That's the one! I actually didn't know this was how PBs did the variable length encoding but makes sense, it's about the most efficient thing I can imagine. Values up to 16,383 fit in two bytes, which is less than a 4-byte int and the 3 bytes or so it would take the other scheme. Could add up over th
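A minimal sketch of the base-128 variable-length encoding discussed here, for non-negative integers: 7 payload bits per byte, with the high bit set on every byte except the last. The function names are illustrative and this is not the code from any patch in the thread.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative int, least-significant 7-bit group first."""
    if value < 0:
        raise ValueError("only non-negative values are supported in this sketch")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)                      # final byte, continuation bit clear
    return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")
```

Two bytes carry 14 payload bits, which is where the 16,383 limit (2^14 - 1) comes from; 16,384 is the first value that needs a third byte.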

Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Sean Owen
What's the specific improvement idea? Size and speed improvements would be good. The Hadoop serialization mechanism is already pretty low-level, dealing directly in bytes (as opposed to fancier stuff like Avro). It's if anything fast and lean but quite manual. The latest Writable updates squeezed

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
As I said, "you can imagine how the rest goes" -- this is a taste of how you might distribute the key piece of the computation you asked about, and certainly does that correctly. It is not the whole algorithm of course -- up to you. On Sun, May 2, 2010 at 1:52 PM, Robin Anil wrote: > I don't think

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
How about this for the first phase? I think you can imagine how the rest goes, more later... Mapper 1A. map() input: One canopy map() output: canopy ID -> canopy Mapper 1B. (Has in memory all canopy IDs, read at startup) map() input: one point map() output: for each canopy ID, canopy ID -> point
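The two mappers above can be mimicked in plain Python to see what lands under each key. This is only a single-process sketch of the part of the message that is visible here: the function names, the `"canopy"` / `"point"` tags, and the `shuffle` helper (standing in for the framework's grouping by key) are all assumptions, and whatever comes after the map phase is not shown.

```python
from collections import defaultdict


def map_canopy(canopy_id, canopy):
    """Mapper 1A: one canopy in, canopy ID -> canopy out."""
    yield canopy_id, ("canopy", canopy)


def make_point_mapper(canopy_ids):
    """Mapper 1B: holds all canopy IDs in memory (read at startup)."""
    ids = list(canopy_ids)

    def map_point(point):
        # One point in; emit it once per canopy ID.
        for canopy_id in ids:
            yield canopy_id, ("point", point)

    return map_point


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


canopies = {1: (0.0, 0.0), 2: (5.0, 5.0)}
points = [(1.0, 1.0), (4.0, 4.0)]

map_point = make_point_mapper(canopies.keys())
pairs = [kv for cid, c in canopies.items() for kv in map_canopy(cid, c)]
pairs += [kv for p in points for kv in map_point(p)]
grouped = shuffle(pairs)
```

Note that only the canopy IDs are held in memory by the second mapper, not the canopies themselves; each key then receives its one canopy plus every point.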

Re: Canopy Clustering not scaling

2010-05-02 Thread Sean Owen
Not surprising indeed, that won't scale at some point. What is the stage that needs everything in memory? Maybe describing that helps imagine solutions. The typical reason for this, in my experience back in the day, was needing to look up data infrequently in a key-value way. "Side-loading" off HD

Re: Unsubscribe to MAHOUT

2010-04-30 Thread Sean Owen
http://www.apache.org/foundation/mailinglists.html "To get off a list, send a message to list-unsubscr...@apache.org" So you need to mail mahout-dev-unsubscr...@apache.org. There is nobody who can manually answer your request.

Re: Intermittant Test Failure: testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)

2010-04-29 Thread Sean Owen
I had taken on MAHOUT-302 which is basically about overhauling how temp data is handled for tests. I think we can indeed handle it more cleanly and in a way such that collisions never happen. I'm still in the middle of it. On Thu, Apr 29, 2010 at 11:38 PM, Ted Dunning wrote: > Any chance to use a

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
I could sure be wrong about this (or perhaps out of date). It makes sense in theory. But I can't find it in the JLS and in the bytecode I still see it calling Math.log(), calling StrictMath.log(), FWIW. I would actually believe a JIT would do something with this. But I still find myself always prog

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
You mean "sum * Math.log(sum)"? That's nice, I'll go with that. javac definitely isn't allowed to do that kind of transformation -- it actually can't do much of anything. ProGuard might -- it's actually a dynamite byte code optimizer and I've been itching to get it re-integrated into the build for

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
's a small optimization. Shall I commit something like that, but also cap the LLR at 0 anyhow? That fixes the original issue for sure. On Thu, Apr 29, 2010 at 5:28 PM, Sean Owen wrote: > Ah yeah that's it. > > So... is the better change to cap the result of logLikelihoodRatio() a

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
Ah yeah that's it. So... is the better change to cap the result of logLikelihoodRatio() at 0.0? On Thu, Apr 29, 2010 at 5:11 PM, Ted Dunning wrote: > I suspect round-off error.  In R I get this for the raw LLR: > >> llr(matrix(c(6,7567, 1924, 2426487), nrow=2)) > [1] 3.380607e-11 > > A slightly
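To make the round-off point concrete, here is a sketch of a log-likelihood ratio for a 2x2 contingency table using the entropy formulation, with the result capped at 0.0 as proposed. The helper names are illustrative and this is not necessarily how the Mahout method is written; the counts are the ones quoted from the R session above.

```python
import math


def x_log_x(x: float) -> float:
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized entropy: N ln N - sum(k ln k), where N = sum(counts)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def raw_llr(k11, k12, k21, k22) -> float:
    """LLR before any capping; can come out slightly negative from round-off."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + column_entropy - matrix_entropy)


def capped_llr(k11, k12, k21, k22) -> float:
    """Same quantity, never below 0.0."""
    return max(0.0, raw_llr(k11, k12, k21, k22))


# Counts from the thread: the true value is tiny, so the raw result is
# dominated by floating-point error in terms of magnitude ~1e7.
near_zero = capped_llr(6, 7567, 1924, 2426487)
```

Mathematically the quantity is non-negative, so a negative raw value can only be numerical noise; capping does not discard information.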

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
What about Shashikant's example? Unless my brain's not in gear, that seems like a legit example, but does indeed produce a negative LLR.

Re: Negative LLR Score

2010-04-29 Thread Sean Owen
(I can easily make the fix and add a test, but is the right thing to return 0, or instead proceed in the method with the value -sqrt(-llr) when llr is negative?) On Thu, Apr 29, 2010 at 12:44 PM, Shashikant Kore wrote: > Root LLR calculation has a minor bug. When LLR score is negative, > square r
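The two options in the question can be written side by side. This is just a sketch to compare behaviors, with hypothetical function names; it does not say which one was committed.

```python
import math


def root_llr_clamped(llr: float) -> float:
    """Option 1: treat a negative LLR as 0 before taking the root."""
    return math.sqrt(max(0.0, llr))


def root_llr_signed(llr: float) -> float:
    """Option 2: return -sqrt(-llr) when llr is negative, sqrt(llr) otherwise."""
    return math.copysign(math.sqrt(abs(llr)), llr)
```

Either way the square root is never taken of a negative number, which in Java would yield NaN from `Math.sqrt` (and in Python raises `ValueError` from `math.sqrt`).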

Re: Similarity Tests Failing since 939074?

2010-04-29 Thread Sean Owen
Sorry that's essentially an elaborate typo, which made something that Can't Possibly Change Behavior, Change Behavior. On Thu, Apr 29, 2010 at 4:12 AM, Jeff Eastman wrote: > Failed tests: >  testSimple(org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarityTest) >  testSimpleItem(

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
e original question. Sean On Wed, Apr 28, 2010 at 7:14 PM, Sean Owen wrote: > Actually scratch that patch I sent over. I see the trick now that > makes the existing approach quite good. I think I can make a version > that preserves that trick and still streamlines the processing. I will >

Re: [jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-28 Thread Sean Owen
ive Intelligence" by T. > Segaran. As far as I can judge (to be honest my mathematical knowledge > is kinda limited) there are no different interpretations of the rating > scale here as the values are fixed, so I thought that a centering of the > data would not be necessary. >

[jira] Resolved: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-305. -- Fix Version/s: 0.4 Resolution: Fixed I think with my latest commit this is substantially done

[jira] Resolved: (MAHOUT-385) Unify Vector Writables

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-385. -- Assignee: Sean Owen Resolution: Fixed I think this is uncontroversial enough to commit. It means

[jira] Updated: (MAHOUT-387) Cosine item similarity implementation

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-387: - Status: Resolved (was: Patch Available) Assignee: Sean Owen Fix Version/s: 0.3

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861381#action_12861381 ] Sean Owen commented on MAHOUT-305: -- I see, fair enough. Even for this simplistic ini

[jira] Assigned: (MAHOUT-302) Change tests to use temp directories instead of output, testdata

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned MAHOUT-302: Assignee: Sean Owen > Change tests to use temp directories instead of output, testd

[jira] Resolved: (MAHOUT-359) org.apache.mahout.cf.taste.hadoop.item.RecommenderJob for Boolean recommendation

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-359. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed

[jira] Resolved: (MAHOUT-354) make the output of RecommenderJob more readable

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-354. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed > make the output

[jira] Resolved: (MAHOUT-329) Implement some recommendation ideas used by the Netflix top teams to boost the recommenders package

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-329. -- Assignee: Robin Anil Resolution: Later Shelving this as GSoC projects are set and if something

[jira] Resolved: (MAHOUT-386) org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob breaks when no usersFile is supplied

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-386. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Yeah looks like the old

[jira] Commented: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-27 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861276#action_12861276 ] Sean Owen commented on MAHOUT-297: -- If I may nit-pick: new RandomAccessSparseVe

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861154#action_12861154 ] Sean Owen commented on MAHOUT-371: -- Your schedule maps it out well. In the next m

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861144#action_12861144 ] Sean Owen commented on MAHOUT-305: -- Ted says he likes LLR, and doesn't like thr

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861095#action_12861095 ] Sean Owen commented on MAHOUT-305: -- I'm about to commit another pass at this s

[jira] Commented: (MAHOUT-371) [GSoC] Proposal to implement Distributed SVD++ Recommender using Hadoop

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861087#action_12861087 ] Sean Owen commented on MAHOUT-371: -- Looks like this was accepted to GSoC, nice. Let

[jira] Updated: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-385: - Attachment: MAHOUT-385.patch > Unify Vector Writables > -- > >

[jira] Created: (MAHOUT-385) Unify Vector Writables

2010-04-26 Thread Sean Owen (JIRA)
Unify Vector Writables -- Key: MAHOUT-385 URL: https://issues.apache.org/jira/browse/MAHOUT-385 Project: Mahout Issue Type: Improvement Components: Math Affects Versions: 0.3 Reporter: Sean Owen

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860914#action_12860914 ] Sean Owen commented on MAHOUT-305: -- OK, I think I get the (item1,item2) -> (item

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-26 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860895#action_12860895 ] Sean Owen commented on MAHOUT-305: -- Most broadly, the input is item1->item2 pairs

Fwd: announcing new TLPs [was: ASF Board Meeting Summary - April 21, 2010 - new TLP reporting schedule?]

2010-04-26 Thread Sean Owen
Here's my suggested boilerplate -- see below and please suggest edits if desired. There's a 150 word limit. Apache Mahout provides scalable implementations of machine learning algorithms on top of Apache Hadoop. It offers collaborative filtering, clustering, classification algorithms and more. Beg

Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
Where though, I just deleted all the methods to try it and every test passes. On Sun, Apr 25, 2010 at 7:51 PM, Robin Anil wrote: > It's used in clustering to generate clusterid -> point id. Also to be used in > classification (by end of this summer) to keep class labels.

Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
I agree that it'd be good to kind of finalize the Vector stuff. I don't think it's reasonable for users to expect data output by 0.3 to be compatible with 0.4 though, so wouldn't worry about that. I think we're on the verge of wanting a proper serialization system like Avro for vectors here -- but

Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
Yes, I think if we can convince ourselves that there won't be that many different possibilities for representing a vector, then a simple boolean might unify everything. This approach doesn't 'scale' but I don't know there are other representations we must have. The issue of named vectors is intere

Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Sean Owen
PS let's see a patch to keep discussing, I'm seeing ideas on lots of good topics here and want to take the opportunity to strike while the iron is hot and continue overhauling this. But things like making everything a named vector is sort of stepping backwards to where we just agreed to move from

Re: Clean checkout Test broken

2010-04-25 Thread Sean Owen
I'm not seeing it in my client, hmm. While I'd tend to guess my change broke it, I don't see the direct link... this code writes TreeID -> MapredOutput in its test and then tries to read exactly that. I don't yet see how the SequenceFile.Reader expects anything related to VectorWritable nor why it

Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Sean Owen
NamedVectorWritable already extends VectorWritable, though honestly I don't like that and kept it to minimize disruption. Serialized vector formats aren't exactly "polymorphic". I can't read an X vector with the code intended to deserialize something that extends X. So, really the Writables shoul

Re: Mahout In Action

2010-04-23 Thread Sean Owen
ith the latest changes and so does Sean. There isn't much that got > affected by your latest commit though (it compiles). Though I haven't fully > tested the code with the dataset after the commit, something I plan to do > soon. >

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-23 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860284#action_12860284 ] Sean Owen commented on MAHOUT-305: -- What do you mean about the secondary sort and i

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-23 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860223#action_12860223 ] Sean Owen commented on MAHOUT-305: -- And now more thoughts: Yes all the code is che

[jira] Commented: (MAHOUT-305) Combine both cooccurrence-based CF M/R jobs

2010-04-23 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860221#action_12860221 ] Sean Owen commented on MAHOUT-305: -- First copying and pasting my comment from the mai

[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

2010-04-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859731#action_12859731 ] Sean Owen commented on MAHOUT-384: -- What do others think of 'outlier'

[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

2010-04-22 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859703#action_12859703 ] Sean Owen commented on MAHOUT-384: -- Let's also think about where it fits into th

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-21 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Status: Resolved (was: Patch Available) Fix Version/s: 0.4 (was: 0.3

[jira] Resolved: (MAHOUT-316) CardinalityException and IndexException should remove the default constructor, and always construct with arguments saying what the error was

2010-04-21 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-316. -- Assignee: Sean Owen Resolution: Fixed Good idea, I made this happen. > CardinalityException

Re: SnowballAnalyzer

2010-04-20 Thread Sean Owen
Yes, you can discover the available constructors and their parameters. But I don't think that it makes sense in general to just pass null / 0 to parameters or guess at dummy values. It'd be as likely to cause even subtler errors. I think what you have to do here is extend SnowballAnalyzer, where t

[jira] Resolved: (MAHOUT-336) Update Site for the Release

2010-04-20 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-336. -- Resolution: Fixed I think this is resolved? I committed all changes I had mentioned. > Update S

[jira] Commented: (MAHOUT-337) Don't serialize cached length squared in JSON vector representation

2010-04-20 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858826#action_12858826 ] Sean Owen commented on MAHOUT-337: -- May I resolve this as won't-fix? we stream

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-20 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858807#action_12858807 ] Sean Owen commented on MAHOUT-379: -- I'd like to commit this patch as it ad

[jira] Resolved: (MAHOUT-381) org.apache.mahout.cf.taste.hadoop.item is more misleading

2010-04-19 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-381. -- Fix Version/s: 0.3 Resolution: Not A Problem While I think the package name is fine, and do not

Re: AbstractVector.minus(Vector)

2010-04-19 Thread Sean Owen
On Mon, Apr 19, 2010 at 5:33 PM, Jake Mannix wrote: > result.times(-1.0) > with > result.assign(Functions.negate) Cool, good one. > The efficiency points are twofold: number of nonzero elements, and > the impl: you don't want to iterate over a vector of any type while > continually calling setQu

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-19 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Attachment: MAHOUT-379.patch Another update to this huge patch. As promised, *Writable no longer extend

[jira] Resolved: (MAHOUT-356) ClassNotFoundException: org.apache.mahout.math.function.IntDoubleProcedure

2010-04-19 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-356. -- Assignee: Sean Owen Resolution: Cannot Reproduce > ClassNotFoundExcept

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On Sun, Apr 18, 2010 at 11:16 PM, Jake Mannix wrote: > VectorWritable currently is a proper decorator, right?  It doesn't even > implement Vector at all. Yeah, the other *Writable classes should be as well. NamedVector should both be a Vector and decorate a Vector too. Its Writable also decorates
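The idea that NamedVector should both be a Vector and decorate a Vector can be sketched like this. The classes below are a toy Python stand-in, not Mahout's Java interfaces; the two-method `Vector` interface and the `DenseVector` implementation are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Vector(ABC):
    """Toy vector interface for this sketch."""

    @abstractmethod
    def size(self) -> int: ...

    @abstractmethod
    def get(self, index: int) -> float: ...


class DenseVector(Vector):
    def __init__(self, values):
        self._values = list(values)

    def size(self) -> int:
        return len(self._values)

    def get(self, index: int) -> float:
        return self._values[index]


class NamedVector(Vector):
    """Is a Vector, and wraps another Vector, adding only a name."""

    def __init__(self, delegate: Vector, name: str):
        self._delegate = delegate
        self.name = name

    def get_delegate(self) -> Vector:
        return self._delegate

    # Every Vector operation is forwarded unchanged to the wrapped vector.
    def size(self) -> int:
        return self._delegate.size()

    def get(self, index: int) -> float:
        return self._delegate.get(index)


named = NamedVector(DenseVector([1.0, 0.0, 2.5]), "doc-42")
```

With this shape the name is orthogonal to the storage choice: any representation can be wrapped, and code that only needs a Vector never has to know a name is attached.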

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I guess I'm suggesting the polymorphism pain need not be very painful. (No doubt it's all nicer with Avro, but that much can be separate.) VectorWritable is the one Writable used in all cases. We have *Writable decorators, corresponding to *Vector, in a similar hierarchy. We have NamedVector decor

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
> clustering.  You make a SequenceFile of any old key type, and > NamedVectorWritable as the value.  Now you can't use that file as input for > any DistributedRowMatrix operation, you have to do a full pass over the data > to peel off the names and spit out regular VectorWritables...

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
d be the Writable hierarchy with this NamedVector proposal? > >> > On Apr 18, 2010 11:05 AM, "Sean Owen" wrote: > > On > keeping 'name': sure, I ... > > On Sun, Apr 18, 2010 at 6:45 PM, Jake Mannix wrote: >> Ok this is a good con... >

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
On keeping 'name': sure, I don't mind being conservative. I would like to keep name in the form of NamedVector. As it happens, name is actually barely used right now -- if you can wade through the patch you can see there's just a few instances, the ones in mind now. Making NamedVector is, it seems

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
I mean wrapping Vector in a NamedVector. It seems like a good step forward, even as I agree that it probably isn't even needed. Since I'm the one ripping up the floor-boards here to do some plumbing, seems like it should fall on me to put things back into a similar working state with NamedVector. T

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-18 Thread Sean Owen
Yeah why don't I have a crack at this. The change as it stands is already too big for what it is (though I believe they're good changes.) Then we look at more changes, and sounds like there are several ideas for streamlining vectors, which is a great thing to think about at this early stage. On Su

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
At the moment I'm already overreaching on the way to fix MAHOUT-379 with this patch, as I've expanded to address some mildly related issues (equals, iterators). So I personally am not trying to change serialization formats in MAHOUT-379 / my current patch, no. The issue uncovered by removing name

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen
Yeah that's what I changed -- now the key is point.asFormatString(). And it almost works, except the serialized state in this format string includes lengthSquared, and a mismatch there before/after makes this fail. It may fail more significantly in the real world versus tests and we should be caut

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Attachment: MAHOUT-379.patch Hmm my second patch didn't attach > SequentialAccessSparseVecto

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Status: Patch Available (was: Open) Assignee: Sean Owen Here's another patch, which builds o

Re: mahout/solr integration

2010-04-16 Thread Sean Owen
On Fri, Apr 16, 2010 at 7:39 PM, Jake Mannix wrote: > I will start playing around with Anthony's github-based stuff, and > see where a patch can be made.  The question is where it would > go?  It's a fully functioning project already over on its own. I suppose that's my question too -- what is b

Re: mahout/solr integration

2010-04-16 Thread Sean Owen
Clojure isn't my cup of tea but that's not important. It's an interesting question, how much belongs under the Mahout tent? There's a tradeoff between excluding useful extensions to the project on the one hand, and becoming a spare parts bin of code of varying levels of maturity and support. I'm

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857779#action_12857779 ] Sean Owen commented on MAHOUT-379: -- Yup, this is why I said "pre-patch", i

[jira] Updated: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-379: - Attachment: MAHOUT-379.patch This is a pre-patch, per discussion on the mailing list. Is this too much

Re: Having some trouble with SequentialAccessSparseVector.DenseVector

2010-04-16 Thread Sean Owen
Actually it does all work. I wrote some tests that verify it. I think my first question about index and cur works out because both are set to 0 -- and 0 is correct as the starting value of an array offset and index. And in the other case I believe it's intended that the two values are the current a

Re: c# porting of mahout

2010-04-16 Thread Sean Owen
Lots of both -- I imagine it will be changing rapidly for the rest of the year. On Fri, Apr 16, 2010 at 10:48 AM, pedram salehpoor wrote: > For Hadoop I was thinking about making them assemblies usable for c#. > > But ever changing code is a problem. Do currently new features are added or > the n

Re: c# porting of mahout

2010-04-16 Thread Sean Owen
None that I'm aware of, and I might suggest it would be hard at the moment for several reasons: - The code is changing very rapidly - The code depends heavily on Java libraries, notably Hadoop, which makes porting difficult On Fri, Apr 16, 2010 at 10:31 AM, pedram salehpoor wrote: > Hi, > Is the

[jira] Resolved: (MAHOUT-380) IllegalArgumentException from AbstractJDBCDataModel constructor which is extended by AbstractBooleanPrefJDBCDataModel

2010-04-16 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-380. -- Assignee: Sean Owen Fix Version/s: 0.4 Resolution: Fixed Oops! fixed

Having some trouble with SequentialAccessSparseVector.DenseVector

2010-04-15 Thread Sean Owen
Along the way to a patch for MAHOUT-379, I'm having some trouble figuring out SequentialAccessSparseVector.DenseVector. I think it can be simplified, but unless I'm misunderstanding there are several bugs here. I'd like to find my mistake or else simplify/fix this along the way. get() uses offset

Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen
On Wed, Apr 14, 2010 at 3:28 PM, Jake Mannix wrote: > What is the transitivity problem?  If (a instanceof VClassA), (b instanceof > VClassB) and (c instanceof VClassC), if all three equals() methods compare > the same things (ie values, names, not implementation), then a.equals(b) && > b.equals(c)

[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856812#action_12856812 ] Sean Owen commented on MAHOUT-379: -- Yeah let's take some time to get this righ
