[jira] [Created] (MAHOUT-1148) QR Decomposition is too slow

2013-02-03 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1148: --- Summary: QR Decomposition is too slow Key: MAHOUT-1148 URL: https://issues.apache.org/jira/browse/MAHOUT-1148 Project: Mahout Issue Type: Bug

Re: 0.8?

2013-02-03 Thread Ted Dunning
Isabel and Zeno (who ported some of his code from > >> http://mymedialite.net/) in the next 2 weeks. We'll have another pass > >> over the new recommenders to finalize them for 0.8. > >> > >> Best, > >> Sebastian > >> > >> On 02.0

Re: Jenkins build is back to normal : Mahout-Quality #1851

2013-02-02 Thread Ted Dunning
Ahh.. compliment retracted then. Nice work Suneel! On Sat, Feb 2, 2013 at 3:57 AM, Grant Ingersoll wrote: > That was all Suneel, I just hit the commit button after > applying/reviewing. Keep in mind, I broke it in the first place! > > On Feb 2, 2013, at 3:56 AM, Ted Dunning wrote

Re: 0.8?

2013-02-02 Thread Ted Dunning
Sounds good to me. Dan should have the new clustering stuff inserted soon. That was all I was after. We should probably noodle a bit about how to update the MiA examples since that keeps coming up on the list. My first thought (from Ellen) is that asking Alex Ott to repeat his fabulous tech rev

Re: Jenkins build is back to normal : Mahout-Quality #1851

2013-02-02 Thread Ted Dunning
Nice work Grant! On Fri, Feb 1, 2013 at 2:54 PM, Apache Jenkins Server < jenk...@builds.apache.org> wrote: > See > >

Re: increase in warnings

2013-01-31 Thread Ted Dunning
There is not a JIRA. Feel free to file one. On Thu, Jan 31, 2013 at 2:55 PM, Suneel Marthi wrote: > If there is a JIRA for this, I can work on it. IntelliJ does highlight > most of these warnings that are being reported. > > > > ____ > Fr

Re: Out-of-core random forest implementation

2013-01-28 Thread Ted Dunning
> Do you have an alternative method? > > Andy > > > On 28 January 2013 16:42, Ted Dunning wrote: > > IF we have a step which permutes data (once) then I doubt that > > redistribution is necessary. At that point the randomness consists of > > building trees based o

Re: Out-of-core random forest implementation

2013-01-28 Thread Ted Dunning
IF we have a step which permutes data (once) then I doubt that redistribution is necessary. At that point the randomness consists of building trees based on different variable subsets and data subsets. The original random forests only split on variable subsets. How much this matters is an open q

Re: Out-of-core random forest implementation

2013-01-25 Thread Ted Dunning
Hey Andy, There are no plans for this. You are correct that multiple passes aren't too difficult, but they do go against the standard map-reduce paradigm a bit if you want to avoid iterative map-reduce. It definitely would be nice to have a really competitive random forest implementation that us

[jira] [Commented] (MAHOUT-1112) Migrate code from Lucene / Solr 3.6 to 4.0.0

2013-01-25 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13563048#comment-13563048 ] Ted Dunning commented on MAHOUT-1112: - Megan, Thanks so much for the feed

[jira] [Commented] (MAHOUT-1140) Uniform random sampling problem in RandomSeedGenerator.java

2013-01-20 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558433#comment-13558433 ] Ted Dunning commented on MAHOUT-1140: - Great. Can't hit this just now, b

[jira] [Commented] (MAHOUT-865) Refactor Sequential Clustering algorithms

2013-01-15 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554545#comment-13554545 ] Ted Dunning commented on MAHOUT-865: Yes. I think so. Keep in mind that

Re: Where to start refactoring?

2013-01-13 Thread Ted Dunning
That is a pity. Good use cases with realistic (not necessarily real) data would be very helpful. Probably much more impact than small code fixes. On Sun, Jan 13, 2013 at 5:54 PM, Florents Tselai wrote: > For now, I'm afraid no, I don't. > > On Mon, Jan 14, 2013 at 3:31 AM, T

Re: Where to start refactoring?

2013-01-13 Thread Ted Dunning
Do you have any sample data? On Sun, Jan 13, 2013 at 5:13 PM, Florents Tselai wrote: > Thanks for the reply! > > Yes, you're correct the data source is a smart-meter installed in each > building. > > On Mon, Jan 14, 2013 at 3:07 AM, Ted Dunning > wrote: > > >

Re: Where to start refactoring?

2013-01-13 Thread Ted Dunning
2013 at 3:41 PM, Florents Tselai wrote: > Real-time energy data, > Association mining is in fact the core analysis applied (but not the only > one for e.g. it could be classification as well). > > On Mon, Jan 14, 2013 at 1:34 AM, Ted Dunning > wrote: > > > Can you say mor

Re: Mahout TDD?

2013-01-13 Thread Ted Dunning
Not strictly, no. But most of the production code has reasonable levels of testing. On Sun, Jan 13, 2013 at 3:21 PM, Florents Tselai wrote: > Hello, > > is there any code any mahout that was developed following TDD principles? >

Re: Where to start refactoring?

2013-01-13 Thread Ted Dunning
Can you say more about what kind of data and what kind of analysis? It is usually best if the work you do is motivated by your needs. On Sun, Jan 13, 2013 at 3:18 PM, Florents Tselai wrote: > Hello, > > In the next weeks/months I'll be using mahout for analyzing some big data > for a start-up a

Re: scalding and mahout vector

2013-01-12 Thread Ted Dunning
This might be more appropriate on the Mahout list. I have copied that list in order to gain the largest audience for the answers. It is an absolute requirement in Mahout to have multiple vector implementations. It is also a requirement that the math library not depend on Hadoop. A third absolut

[jira] [Reopened] (MAHOUT-1139) Off by one error in LSMR

2013-01-11 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning reopened MAHOUT-1139: - Assignee: Ted Dunning The istop/stop stuff is ugly legacy. I will fix this by eliminating

[jira] [Resolved] (MAHOUT-1139) Off by one error in LSMR

2013-01-10 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1139. - Resolution: Fixed Added a test case and a resolution of the problem. See http://mail

[jira] [Created] (MAHOUT-1139) Off by one error in LSMR

2013-01-10 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1139: --- Summary: Off by one error in LSMR Key: MAHOUT-1139 URL: https://issues.apache.org/jira/browse/MAHOUT-1139 Project: Mahout Issue Type: Bug Affects Versions

[jira] [Resolved] (MAHOUT-1138) Clean up some findbugs warnings

2013-01-10 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1138. - Resolution: Fixed Committed small changes. Should knock down the score by a few percent

[jira] [Created] (MAHOUT-1138) Clean up some findbugs warnings

2013-01-10 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1138: --- Summary: Clean up some findbugs warnings Key: MAHOUT-1138 URL: https://issues.apache.org/jira/browse/MAHOUT-1138 Project: Mahout Issue Type: Bug Affects

[jira] [Updated] (MAHOUT-1136) Cannot import project into eclipse with m2e 1.2

2013-01-09 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1136: Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk

Re: Mahout-Quality Jenkins job gone?

2013-01-06 Thread Ted Dunning
Jenkins has very little history as well. On Sun, Jan 6, 2013 at 6:34 PM, Grant Ingersoll wrote: > Hmm, does seem to be gone... > > On Jan 6, 2013, at 3:30 PM, Ted Dunning wrote: > > > Has somebody deleted the Mahout-Quality Jenkins build? > > >

[jira] [Commented] (MAHOUT-1136) Cannot import project into eclipse with m2e 1.2

2013-01-06 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545458#comment-13545458 ] Ted Dunning commented on MAHOUT-1136: - This patch changes one pom. There

Re: Initial k-means based logistic regression experiments

2013-01-04 Thread Ted Dunning
Still haven't gotten into this. I will talk about it at our meeting. On Thu, Jan 3, 2013 at 2:13 PM, Dan Filimon wrote: > Ted, I have basic training code that seems to be working (i.e. > generating the models) that I haven't classified with. > > I just want to make sure it's going in the right di

Re: Performance of matrix/vector projections

2013-01-04 Thread Ted Dunning
Have you looked at IntelliJ's local history? On Fri, Jan 4, 2013 at 4:35 AM, Dan Filimon wrote: > The thing that's bothering me is that I had an older version that > behaved totally differently but I overwrote it and don't have it any > more. :( >

Re: clusterLogFactor in StreamingKMeans

2013-01-02 Thread Ted Dunning
Well, the point of clusterLogFactor was originally to be c in the expression c * k * log N. The idea was that this would normally be in the range from 1 to 3 and would be a fudge factor to increase k log N. This is useful where we know k, but not N. On the other hand, in some cases we can just pic
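
The expression c * k * log N from this message can be sketched roughly as below. This is only an illustration: the class and method names are hypothetical and are not Mahout API, the natural logarithm is an assumption (the message only says "log N"), and the lower bound of k is an added assumption so that tiny inputs still yield at least k centroids.

```java
// Hypothetical helper, not part of Mahout: estimates how many sketch
// centroids a streaming k-means pass might keep, using c * k * log N.
final class SketchSize {
    private SketchSize() {}

    /**
     * @param clusterLogFactor the fudge factor c (the thread suggests 1 to 3)
     * @param k                the number of final clusters wanted
     * @param n                the known or estimated number of input points
     */
    static int estimate(double clusterLogFactor, int k, long n) {
        if (clusterLogFactor <= 0 || k <= 0 || n < 1) {
            throw new IllegalArgumentException("need c > 0, k > 0, n >= 1");
        }
        // Natural log is an assumption; the message does not name a base.
        double raw = clusterLogFactor * k * Math.log((double) n);
        // Assumption: never return fewer than k centroids.
        return Math.max(k, (int) Math.ceil(raw));
    }
}
```

When N is not known in advance, a caller would have to pass a guess for n, which is presumably why a fudge factor in the range 1 to 3 is useful.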

Re: Should algorithms log progress?

2013-01-02 Thread Ted Dunning
without > > having 2 versions of the code (one that prints it out and one that > > doesn't). > > How could I do this without logging? > > > > On Wed, Jan 2, 2013 at 4:40 PM, Ted Dunning > wrote: > >> The normal answer is that we use sl

Re: SamplingLongPrimitiveIteratorTest fails

2013-01-02 Thread Ted Dunning
+1 on losing Uncommons Math. On Wed, Jan 2, 2013 at 6:10 AM, Sean Owen wrote: > Related idea: if we're now on Commons 3.1, I can back-port changes > from Myrrix to use Commons Math's Mersenne Twister RNG. I found it > faster and more thread-friendly, and would let us get rid of the > Uncommons M

[jira] [Commented] (MAHOUT-1135) Unify decorated vectors in DecoratedVector

2013-01-02 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542151#comment-13542151 ] Ted Dunning commented on MAHOUT-1135: - Is there any value to the super class?

[jira] [Commented] (MAHOUT-1135) Unify decorated vectors in DecoratedVector

2013-01-02 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542150#comment-13542150 ] Ted Dunning commented on MAHOUT-1135: - {quote} Also, the existing writables d

[jira] [Commented] (MAHOUT-1135) Unify decorated vectors in DecoratedVector

2013-01-02 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542147#comment-13542147 ] Ted Dunning commented on MAHOUT-1135: - I didn't really understand this at f

Re: Should algorithms log progress?

2013-01-02 Thread Ted Dunning
The normal answer is that we use slf4j. If you log at debug or info level, then your conditionals shouldn't be necessary. Returning the log as a stream is pretty unusual, but some high performance systems can't handle the overhead of even something like slf4j. Typically, this is because these sy

Re: How come math doesn't depend on core?

2013-01-01 Thread Ted Dunning
sn't depend on > Hadoop. > > On Tuesday, January 1, 2013, Ted Dunning wrote: > > > I would rather see Pair moved to math. > > > > On Tue, Jan 1, 2013 at 2:20 PM, Dan Filimon > > >wrote: > > > > > I fixed it by moving the classes I wanted

Re: How come math doesn't depend on core?

2013-01-01 Thread Ted Dunning
I would rather see Pair moved to math. On Tue, Jan 1, 2013 at 2:20 PM, Dan Filimon wrote: > I fixed it by moving the classes I wanted into core rather than math. > I moved WeightedVector and Centroid to core. > > I'm playing around with a patch for that vector refactoring peeve [1]. > > [1] https

Re: Build failed in Jenkins: Mahout-Quality #1800

2012-12-29 Thread Ted Dunning
This looks like it might be an environmental issue on solaris1. I have changed the build to only allow one of the ubuntu machines for this test, but that will take some time to run. I have also changed the acceptable scores for PMD and FindBugs so that the current scores are acceptable. I delete

Re: mahout-pmml

2012-12-26 Thread Ted Dunning
Marty, That sounds like a reasonable idea. IF integrated, this would need to be a separate module in any case so for now, it might be easiest for you to simply develop this module independently so that you don't have to wait for others to commit partial results. On Wed, Dec 26, 2012 at 6:52 PM

[jira] [Commented] (MAHOUT-1130) Wrong logic in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

2012-12-20 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537058#comment-13537058 ] Ted Dunning commented on MAHOUT-1130: - A quick look indicates that, yes,

[jira] [Commented] (MAHOUT-1130) Wrong logic in org.apache.mahout.clustering.kmeans.RandomSeedGenerator

2012-12-20 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537050#comment-13537050 ] Ted Dunning commented on MAHOUT-1130: - Andrey, I think you are correct, bu

Re: Clustering algos without hadoop

2012-12-15 Thread Ted Dunning
The new ThreadedStreamingKmeans does pretty well without Hadoop. See https://github.com/tdunning/knn for now. This is being brought into Mahout over the next few months. On Sat, Dec 15, 2012 at 12:18 AM, Florents Tselai wrote: > Hello, > > is there a list of the clustering algorithms that run

[jira] [Resolved] (MAHOUT-1127) OnlineLogisticRegression test is flaky (and wrong)

2012-12-14 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1127. - Resolution: Fixed The problem had to do with how the loop was accumulating results. I fixed

[jira] [Created] (MAHOUT-1127) OnlineLogisticRegression test is flaky (and wrong)

2012-12-14 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1127: --- Summary: OnlineLogisticRegression test is flaky (and wrong) Key: MAHOUT-1127 URL: https://issues.apache.org/jira/browse/MAHOUT-1127 Project: Mahout Issue Type

Re: Work this week

2012-12-11 Thread Ted Dunning
On Tue, Dec 11, 2012 at 3:10 AM, Dan Filimon wrote: > Hi Ted, > > I have a lot of work to do for school (multiple assignments with > pressing deadlines :) and I won't be able to do any meaningful work > this week. > School comes first. > So, I'll mostly just be reading the papers (streaming k-m

Re: coding in Mahout

2012-12-09 Thread Ted Dunning
quest by github! > >>>> > >>>> Btw for me the only package that is a little mess is > >>>> > >>> > https://github.com/apache/mahout/tree/trunk/math/src/main/java/org/apache/mahout/math > >>> , > >>>> is it really

Re: coding in Mahout

2012-12-08 Thread Ted Dunning
BSD license is fine. http://www.apache.org/legal/3party.html On Sat, Dec 8, 2012 at 2:13 PM, Simon Vocella wrote: > Mmm ok i'll try to see this part with PMML, one last question, I don't have > much knowledge of licenses, normally I'll search with google, if i use a > project like > > http://co

Re: build status?

2012-12-05 Thread Ted Dunning
See https://builds.apache.org/job/Mahout-Quality/ Many of the recent failures are due to Jenkins being pretty flaky lately. On Wed, Dec 5, 2012 at 5:44 PM, Pat Ferrel wrote: > I'm trying to merge the latest trunk from github. Is the Jenkins dashboard > still at https://builds.apache.org? If so

Re: coding in Mahout

2012-12-05 Thread Ted Dunning
Every other part of Mahout uses the math library. On Wed, Dec 5, 2012 at 7:12 PM, Simon Vocella wrote: > Btw for me the only package that is a little mess is > > https://github.com/apache/mahout/tree/trunk/math/src/main/java/org/apache/mahout/math > , > is it really used? >

Re: BFR clustering algorithm?

2012-12-04 Thread Ted Dunning
There are literally hundreds and hundreds of algorithms for k-means alone. That isn't even counting clustering that doesn't optimize k-means figure of merit. On Tue, Dec 4, 2012 at 5:05 PM, Dan Filimon wrote: > On Tue, Dec 4, 2012 at 10:00 AM, Ted Dunning > wrote: > > I

Re: BFR clustering algorithm?

2012-12-04 Thread Ted Dunning
I didn't know about BFR at the time and I always tend to choose simplicity in any case. The theoretical bounds for streaming k-means are also persuasive. The other strong-ish candidate is k-means++, but it doesn't have the required sketch architecture in the form that they have analyzed. BFR is

Re: Build failed in Jenkins: Mahout-Quality #1769

2012-12-01 Thread Ted Dunning
I will have a fix for this by the time I land. The issue is test non determinism. I will increase the number of passes (decreases failure rate to about 1%) and also allow the test to pass on up to two failures as long as we get a success eventually. This will still be fast with 99% success

Re: Streaming KMeans 20newsgroups clustering

2012-11-29 Thread Ted Dunning
ing k-means in R with the projected 50-dimensional vectors gets me > > the following sizes for the 20 clusters: > > > > K-means clustering with 20 clusters of sizes 140, 1195, 228, 3081, > > 2162, 462, 31, 329, 14, 936, 2602, 32, 32, 587, 105, 1662, 2124, 66, > > 78, 296

Re: Link to mahout build status

2012-11-27 Thread Ted Dunning
On Tue, Nov 27, 2012 at 4:27 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > That's a good URL :-) Thanks! > How would one fix the link on the home page? > Not sure. I tried to edit it with the Apache content management service, but it didn't work. I thought we had converted the

Re: Link to mahout build status

2012-11-27 Thread Ted Dunning
Try https://builds.apache.org//job/Mahout-Quality/ instead. On Tue, Nov 27, 2012 at 4:00 PM, Marty Kube < martyk...@beavercreekconsulting.com> wrote: > Hey, > I've been working through get an mahout development environment set up. > Sometimes things don't work out for me, so my first question is

Re: Streaming KMeans 20newsgroups clustering

2012-11-27 Thread Ted Dunning
Wrong in the sense of clustering is hard to define. Certainly a wide range of cluster sizes looks dubious, but not definitive. Next easy steps include cosine normalizing the vectors and doing semi-supervised clustering. Clustering the 50d data in R might also be useful. Normalizing is a single

Re: Streaming KMeans 20newsgroups clustering

2012-11-27 Thread Ted Dunning
Dan, Cool results. The headers can be useful. This is a problem where clustering doesn't actually necessarily work. We need to assess what alternative clustering algorithms would be able to do here. It is also possible that the down projection is not working as expected. On Tue, Nov 27, 2012

Re: coding in Mahout

2012-11-18 Thread Ted Dunning
Sounds fantastic. File a JIRA with suggested improvements. Go for it! On Sun, Nov 18, 2012 at 8:04 AM, Simon Vocella wrote: > Hi Grant, > > Ok maybe i can start to code cleanup and refactoring some parts in Mahout > to became more confident with the code. > I have spent many years to do refact

Re: Build failed in Jenkins: Mahout-Quality #1749

2012-11-16 Thread Ted Dunning
Anybody have an idea here? The stack trace is this: Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 137.174 sec <<< FAILURE! testDistributedLanczosSolverEVJCLI(org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolverCLI) Time elapsed: 94.772 sec <<< ERROR! java.lan

[jira] [Commented] (MAHOUT-1117) Vectors are not hashable

2012-11-16 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498966#comment-13498966 ] Ted Dunning commented on MAHOUT-1117: - The current hash code is fine for vec

[jira] [Commented] (MAHOUT-1116) WeightedVectors do not implement equals()

2012-11-16 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498956#comment-13498956 ] Ted Dunning commented on MAHOUT-1116: - Breaking out the comparison into a sepa

Re: BallKMeans: all points in a cluster are considered when updating the center

2012-11-16 Thread Ted Dunning
It is forgotten. I was experimenting with different trimFractions and ultimately wound up not trimming in my experiments. The problem here is that ball k-means gives pretty strong probabilistic guarantees for well separated clusters and good seeds if you only include points much closer than the n

Re: FastProjectionSearchTest.testEpsilon is flaky

2012-11-15 Thread Ted Dunning
So the idea here is that if you do a nearest neighbor search for k items, then you want a few things to be true: 1) you want sufficient overlap between the k approximately nearest neighbors and the k truly nearest neighbors. Sufficient overlap in different applications differs, but 50% or more se

Re: [jira] [Updated] (MAHOUT-1115) [PATCH] Add values() method to FastByIDMap

2012-11-15 Thread Ted Dunning
Julien, At first glance, this looks like a fine addition. One question I have, however, is whether you have any test cases for this. On Thu, Nov 15, 2012 at 8:28 AM, Julien Aymé (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/MAHOUT-1115?page=com.atlassian.jira.plugin.system.i

[jira] [Commented] (MAHOUT-1112) Migrate code from Lucene / Solr 3.6 to 4.0.0

2012-11-15 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498082#comment-13498082 ] Ted Dunning commented on MAHOUT-1112: - Hadoop is in the mix as well. Sent fro

[jira] [Commented] (MAHOUT-1113) Need test case to demonstrate simple use of SGD

2012-11-13 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496818#comment-13496818 ] Ted Dunning commented on MAHOUT-1113: - Lance I don't understand your comme

[jira] [Resolved] (MAHOUT-1114) Some delegating vectors have subtle clone bug

2012-11-13 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1114. - Resolution: Fixed Committed fix. > Some delegating vectors have subtle cl

[jira] [Created] (MAHOUT-1114) Some delegating vectors have subtle clone bug

2012-11-13 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1114: --- Summary: Some delegating vectors have subtle clone bug Key: MAHOUT-1114 URL: https://issues.apache.org/jira/browse/MAHOUT-1114 Project: Mahout Issue Type

[jira] [Resolved] (MAHOUT-1107) OnlineLogisticRegression doesn't seem to work for some people/problems

2012-11-13 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1107. - Resolution: Not A Problem > OnlineLogisticRegression doesn't seem to work for som

[jira] [Resolved] (MAHOUT-1113) Need test case to demonstrate simple use of SGD

2012-11-13 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning resolved MAHOUT-1113. - Resolution: Fixed Committed test case that was developed under MAHOUT-1107

[jira] [Created] (MAHOUT-1113) Need test case to demonstrate simple use of SGD

2012-11-13 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1113: --- Summary: Need test case to demonstrate simple use of SGD Key: MAHOUT-1113 URL: https://issues.apache.org/jira/browse/MAHOUT-1113 Project: Mahout Issue Type

[jira] [Commented] (MAHOUT-1112) Migrate code from Lucene / Solr 3.6 to 4.0.0

2012-11-12 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495723#comment-13495723 ] Ted Dunning commented on MAHOUT-1112: - {quote} i found the dependency need v

[jira] [Commented] (MAHOUT-1112) Migrate code from Lucene / Solr 3.6 to 4.0.0

2012-11-12 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495656#comment-13495656 ] Ted Dunning commented on MAHOUT-1112: - What is the point of the javaee

Re: Elkan implementation for KMeans

2012-11-12 Thread Ted Dunning
Factor of 2 is good. The important aspect of this optimization is that it works with any real metric. The streaming k-means stuff is only good (at this point) for L_2 metric. In the other direction, I don't think that the triangle inequality will help streaming k-means because search for centroi

Re: Lucene 4.0.0 patch for mahout available

2012-11-12 Thread Ted Dunning
Grant, What do you think of this patch? On Mon, Nov 12, 2012 at 3:55 AM, Andrew Janowczyk < andrew.janowc...@searchbox.com> wrote: > All, > > Just wanted to drop a quick note and point out that I just submitted a > working patch for the current Mahout trunk which migrates it from > lucene/solr 3

Re: Lucene 4.0.0 patch for mahout available

2012-11-12 Thread Ted Dunning
Andrew, This is an excellent approach. On Mon, Nov 12, 2012 at 3:55 AM, Andrew Janowczyk < andrew.janowc...@searchbox.com> wrote: > All, > > Just wanted to drop a quick note and point out that I just submitted a > working patch for the current Mahout trunk which migrates it from > lucene/solr 3.

Re: 0.8-SNAPSHOT

2012-11-07 Thread Ted Dunning
The artifacts have to be pushed to the repo and I don't know either which job might be doing that or exactly how to do it. It should be a standard maven lifecycle target, but there are keys and such to worry about. On Wed, Nov 7, 2012 at 11:39 PM, Sebastian Schelter wrote: > Hi, > > I built an

Re: Vectorization, dictionary size, OpenObjectIntHashMap and OOM

2012-11-07 Thread Ted Dunning
Well, there is considerable redundancy in the list of words that could result in massive compression. This is roughly what Lucene is doing. Storing each string incurs substantial overhead that dwarfs even the original size of the strings (overhead is about 50 bytes per string, the average word le

Re: Regarding contribution of an algorithm

2012-11-05 Thread Ted Dunning
First step is to file a jira. Then document how well the algorithm scales. Sent from my iPhone On Oct 31, 2012, at 7:48 AM, p.shail...@iitg.ernet.in wrote: > > Sir, > > > I have made an Apriori Algorithm in association rule mining using > map-reduce (hadoop) framework and, as currently aprio

Re: StreamingKMeansTest for weights in corners

2012-11-05 Thread Ted Dunning
The assert should clearly be parameterized. On Mon, Nov 5, 2012 at 8:46 AM, Dan Filimon wrote: > Oops, I'm sorry for that last e-mail. > > I realized I was playing around with the test and changed the number of > points being generated from 10 to 2 but didn't update the assert. > >

Re: Welford-style update of a vector?

2012-11-05 Thread Ted Dunning
It is just a choice of who to please and whether they are close enough to throw rocks. As of now, there are lots of users of the Mahout math library who would be confused and who know where we live. On Mon, Nov 5, 2012 at 8:28 AM, Dan Filimon wrote: > > I don't know that zipwith is a more commo

Re: Welford-style update of a vector?

2012-11-05 Thread Ted Dunning
On Mon, Nov 5, 2012 at 4:44 AM, Dan Filimon wrote: > > Ted told me that Mahout Centroids [1] are Weighted vectors that > additionally perform a Welford-style update of a vector. > I think that there may be an older Centroid definition that is different from this. > So, in the code, for an exist
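
For readers unfamiliar with the term, a Welford-style update folds a new weighted point into a running mean without keeping the earlier points around. A minimal sketch follows, using plain arrays rather than Mahout's Vector classes; the class name and methods are hypothetical and are not the Centroid implementation discussed in the thread.

```java
// Hypothetical illustration of a Welford-style weighted mean update.
// It does not use or reproduce Mahout's Centroid or WeightedVector classes.
final class RunningCentroid {
    private final double[] mean;
    private double weight; // total weight folded in so far

    RunningCentroid(int dimension) {
        this.mean = new double[dimension];
    }

    /** Folds point x with weight w into the running mean. */
    void update(double[] x, double w) {
        if (x.length != mean.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        if (w <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        weight += w;
        // Move the mean toward x by the share of the total weight that x carries.
        double fraction = w / weight;
        for (int i = 0; i < mean.length; i++) {
            mean[i] += fraction * (x[i] - mean[i]);
        }
    }

    double[] mean() {
        return mean.clone();
    }

    double weight() {
        return weight;
    }
}
```

The update touches each coordinate once and keeps only the mean and the total weight, which is what makes it suitable for streaming use.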

Re: Commits on GitHub

2012-11-04 Thread Ted Dunning
It would be better if you produce diffs and attach them to issues filed on the apache issue tracker. See https://issues.apache.org/jira/browse/MAHOUT Using a github fork to produce the diffs is very easy and works well. Not quite as easy as pull requests. On Sun, Nov 4, 2012 at 12:32 PM, Flore

Re: Build failed in Jenkins: Mahout-Quality #1729

2012-11-02 Thread Ted Dunning
This is spurious. On Fri, Nov 2, 2012 at 5:16 PM, Apache Jenkins Server < jenk...@builds.apache.org> wrote: > See > > -- > Started by timer > Building remotely on ubuntu4 in workspace < > https://builds.a

[jira] [Commented] (MAHOUT-1099) Multiple slf4j bindings

2012-11-01 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488797#comment-13488797 ] Ted Dunning commented on MAHOUT-1099: - This sounds like a good move in any case

Re: Regarding contribution of an algorithm

2012-10-31 Thread Ted Dunning
How scalable is this algorithm? Remember that Mahout is all about scalable machine learning. On Wed, Oct 31, 2012 at 7:48 AM, wrote: > > Sir, > > > I have made an Apriori Algorithm in association rule mining using > map-reduce (hadoop) framework and, as currently apriori algorithm is not > in m

[jira] [Comment Edited] (MAHOUT-1107) OnlineLogisticRegression doesn't seem to work for some people/problems

2012-10-30 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487396#comment-13487396 ] Ted Dunning edited comment on MAHOUT-1107 at 10/31/12 12:2

[jira] [Updated] (MAHOUT-1107) OnlineLogisticRegression doesn't seem to work for some people/problems

2012-10-30 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1107: Attachment: olr.png Here is a plot of convergence versus number of passes on the data that Rajesh

[jira] [Updated] (MAHOUT-1107) OnlineLogisticRegression doesn't seem to work for some people/problems

2012-10-30 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Dunning updated MAHOUT-1107: Attachment: MAHOUT-1107-Test_case_for_OnlineLogisticRegression.patch Here is a test case that

[jira] [Created] (MAHOUT-1107) OnlineLogisticRegression doesn't seem to work for some people/problems

2012-10-30 Thread Ted Dunning (JIRA)
Ted Dunning created MAHOUT-1107: --- Summary: OnlineLogisticRegression doesn't seem to work for some people/problems Key: MAHOUT-1107 URL: https://issues.apache.org/jira/browse/MAHOUT-1107 Project: M

Re: svn commit: r1402553 - /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java

2012-10-26 Thread Ted Dunning
be a bunch of recalculation anyway. I think the > speed / elegance benefit probably trumps precision issues. > > At least -- stare decisis, that's how it had always been anyway, this > was just fixing round-off errors. Which is I suppose exactly what you > mean. > > On F

Fwd: svn commit: r1402553 - /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/measures/EuclideanDistanceSimilarity.java

2012-10-26 Thread Ted Dunning
I am not sure if this matters in this context, but using this formula will lose precision for very near points. That can affect ordering in the limit. By lose precision, I mean it can degrade to 7-8 sig figs instead of 16 or so. I doubt this matters, but I wouldn't know if it does. -- F
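
The precision concern can be illustrated with a small sketch. It assumes that the formula in question is the expanded form |a|^2 + |b|^2 - 2 a.b (the snippet does not show the commit itself), and the class and method names are hypothetical. For points that nearly coincide, the expanded form subtracts large, nearly equal numbers, while the direct form squares small differences.

```java
// Hypothetical comparison of two ways to compute squared Euclidean distance.
// The "expanded" form is an assumption about the formula under discussion.
final class DistancePrecision {
    private DistancePrecision() {}

    /** Sums (a[i] - b[i])^2 directly. */
    static double direct(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Uses |a|^2 + |b|^2 - 2 a.b, which can cancel badly for near points. */
    static double expanded(double[] a, double[] b) {
        double aa = 0.0;
        double bb = 0.0;
        double ab = 0.0;
        for (int i = 0; i < a.length; i++) {
            aa += a[i] * a[i];
            bb += b[i] * b[i];
            ab += a[i] * b[i];
        }
        return aa + bb - 2.0 * ab;
    }
}
```

Comparing the two on points such as (1, 1) and (1 + 1e-8, 1) shows the effect: the direct form stays close to the true value of about 1e-16, whereas the expanded form can only resolve differences on the order of the rounding error of the squared norms.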

Re: classify vs. classifyFull in AbstractVectorClassifier

2012-10-25 Thread Ted Dunning
){ > return classifyFull(instance); > } > > I'm not necessarily pushing for this, I'm just generating discussion. > > -Timothy Mann > > On Tue, Oct 23, 2012 at 12:33 PM, Ted Dunning > wrote: > > > Classification *is* regression. You can always ask the re

Re: Streaming k-means as a MapReduce

2012-10-24 Thread Ted Dunning
On Wed, Oct 24, 2012 at 2:32 AM, Dan Filimon wrote: > So, the way I'm thinking of doing it is: > 1. refactor (and prettify?) the existing code to support adding new > points directly (both weighted and unweighted) and update the tests; > 2. write the wrapper CentroidWritable class and the MapReduc

Re: classify vs. classifyFull in AbstractVectorClassifier

2012-10-23 Thread Ted Dunning
't break older code, but it also wouldn't resolve > strange use of classifier. > > -Timothy Mann > > On Tue, Oct 23, 2012 at 5:32 AM, Grant Ingersoll >wrote: > > > > > On Oct 22, 2012, at 12:20 AM, Ted Dunning wrote: > > > > > Yes. > >

Re: classify vs. classifyFull in AbstractVectorClassifier

2012-10-22 Thread Ted Dunning
UOE also sounds like a good idea. Lately I have adjusted my default method template in IntelliJ to just throw UOE in order to increase the likelihood I remember to adjust things. Sent from my iPhone On Oct 22, 2012, at 12:59 PM, Timothy Mann wrote: > I also plan on adding javadoc comments

Re: classify vs. classifyFull in AbstractVectorClassifier

2012-10-22 Thread Ted Dunning
adding javadoc comments to methods where classify throws an > UnsupportedOperationException to indicate this instead of allowing default > copying of the superclass javadoc comment (which does not indicate that the > method is unsupported). > > Any other ideas? > > -Timothy Ma

Re: classify vs. classifyFull in AbstractVectorClassifier

2012-10-21 Thread Ted Dunning
Yes. It seems stupid in retrospect. Changing these things is very painful, however, because we have no idea how many people will be affected. On Sun, Oct 21, 2012 at 9:16 PM, Timothy Mann wrote: > It seems strange to me that the classify method declared in > AbstractVectorClassifier returns a v

Re: Code Cleanup and Documentation

2012-10-20 Thread Ted Dunning
Yes. Just create JIRA's and have at it. All help is welcome for cleanups. On Sat, Oct 20, 2012 at 4:46 PM, Timothy Mann wrote: > Hi, > > I'm interested in cleaning up code and adding documentation related to the > classifier packages. What is the best way to provide patches, since these > aren'

Re: Mahout Bachelor's Project

2012-10-16 Thread Ted Dunning
Sounds like a great idea. I will follow up on this off-list. On Tue, Oct 16, 2012 at 10:50 AM, Dan Filimon wrote: > Ted, could we possibly set up some sort of weekly 1-on-1s to discuss > goals/milestones? Or some way to ensure progress is being made? Should > I include my official supervisor as
