Nearly done splitting the code up, but I'm not sure what the costs should ideally be.
Robin, you proposed: cost of iteration + cost of lookup + cost of update (if its not in-place) This sounds like it's per element, rather than for the entire vector. Also, are we just going to assume that there will be as many updates as non-zeros or as large as the array is? How does the cost of an update factor in? And, is "in-place" just for DenseVectors? On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil <robina...@apache.org> wrote: > Depends on the speed of copying a 1M double array v/s doing a calloc + > copying 1000 non zeros (Assuming java is doing that underneath). > > ------ > Robin Anil > > > > On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <dangeorge.fili...@gmail.com > >wrote: > > > Right, but is clone() generally slower than assigning? That strikes me as > > odd; doesn't Java optimize copying the internal structures (there are > > arrays underneath after all)? > > > > > > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <robina...@apache.org> > wrote: > > > >> Seems like for dense clone is slower than like().assign I need to test > it > >> with different sizes to be sure. I kept it from the old behavior. > >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <dangeorge.fili...@gmail.com> > >> wrote: > >> > >>> Okay, so I should split it further into smaller sub-cases that handle > >>> each Vector type. I tried making so that these match the cases in the > >>> document to the extent possible. > >>> You're right. It is ugly and I need to split it up. > >>> > >>> I removed the OrderedIntDoubleMapping (but with another if...) but one > >>> more thing that helped is making the assign() only go through the > non-zeros > >>> when setting up a new copy. > >>> > >>> About the copying, I saw that the optimized copy doesn't always call > >>> clone clone on the Vector. Why is this? > >>> > >>> Thanks for the test and the feedback! :) > >>> > >>> > >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <robina...@apache.org > >wrote: > >>> > >>>> Yes its reaching dev. I checked out your code to take a look. I > >>>> understand you are trying to still mix isSequentialAccess etc in the > >>>> validity calculation. The whole point of this architecture is to > completely > >>>> unroll things to be more modular instead of creating a giant spaghetti > >>>> > >>>> Here is an example of what I am trying to say. > >>>> > >>>> This is your current implementation of AssignIterateOneLookupOther. > >>>> There are so many conditionals > >>>> > >>>> > >>>> 1. @Override > >>>> 2. public Vector assign(Vector x, Vector y, > >>>> DoubleDoubleFunction f) { > >>>> 3. return !swap ? assignInner(x, y, f) : assignInner(y, x, > f); > >>>> 4. } > >>>> 5. > >>>> 6. public Vector assignInner(Vector x, Vector y, > >>>> DoubleDoubleFunction f) { > >>>> 7. Iterator<Vector.Element> xi = x.iterateNonZero(); > >>>> 8. Vector.Element xe; > >>>> 9. Vector.Element ye; > >>>> 10. OrderedIntDoubleMapping updates = > newOrderedIntDoubleMapping(); > >>>> 11. while (xi.hasNext()) { > >>>> 12. xe = xi.next(); > >>>> 13. ye = y.getElement(xe.index()); > >>>> 14. if (!swap) { > >>>> 15. xe.set(f.apply(xe.get(), ye.get())); > >>>> 16. } else { > >>>> 17. if (ye.get() != 0.0 || y.isAddConstantTime()) { > >>>> 18. ye.set(f.apply(ye.get(), xe.get())); > >>>> 19. } else { > >>>> 20. updates.set(xe.index(), f.apply(ye.get(), > >>>> xe.get())); > >>>> 21. } > >>>> 22. } > >>>> 23. } > >>>> 24. if (swap && !y.isAddConstantTime()) { > >>>> 25. y.mergeUpdates(updates); > >>>> 26. } > >>>> 27. return swap ? y : x; > >>>> 28. } > >>>> 29. } > >>>> > >>>> Split this into. > >>>> > >>>> IterateThisLookupThatInplaceUpdate > >>>> IterateThatLookupThisInplaceUpdate > >>>> IterateThisLookupThatMergeUpdate > >>>> IterateThatLookupThisMergeUpdate > >>>> > >>>> Removing the conditionals itself will speedup the operations you see > >>>> regression on. > >>>> > >>>> Also note that getElement() is not constant Add time for RASV. In > >>>> earlier version we tried at best to set in place. That reduced the > extra > >>>> hashkey computation. getElement in AbstractVector is really optimized > for > >>>> DenseVector and it does indicate as the cause of regression you are > seeing > >>>> in RASV. Another one is the unused new OrderedIntDoubleMapping(). > Removing > >>>> all these will bring back the loss. > >>>> > >>>> Now the cost for each can be computed as: cost of iteration + cost of > >>>> lookup + cost of update (if its not in-place) > >>>> > >>>> See the attached file for a mock test. You will have to rework the > >>>> expectations but the framework should be understandable. > >>>> > >>>> > >>>> ------ > >>>> Robin Anil > >>>> > >>>> > >>>> > >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon < > >>>> dangeorge.fili...@gmail.com> wrote: > >>>> > >>>>> Here's [1] a link to the "design document" of the new vector > >>>>> operations so I can lay out the ideas behind what I'm doing more > clearly. > >>>>> > >>>>> I'd like anyone who can to have a look and comment. > >>>>> > >>>>> Ted, > >>>>> I know you'd like me to get back to working on clustering, but > >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans. > >>>>> This is because of centroid.update() for one (it doesn't have > >>>>> "interesting" properties that are exploited in the current) and also > >>>>> because we're using we're just doing lots of Vector operations: dot > >>>>> products for projections, hash codes for unique sets of candidates, > >>>>> equality checks etc. > >>>>> > >>>>> Here's a snapshot of the hotspots for one run of BallKMeans (20 > >>>>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions > and > >>>>> ~100 nonzero values): > >>>>> Name Time (ms) org.apache.mahout.math.AbstractVector.dot(Vector) > >>>>> 104069 org.apache.mahout.math.DelegatingVector.hashCode() 30159 > org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator, > >>>>> Iterator, DoubleDoubleFunction) 13720 > >>>>> > org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping) > >>>>> 2911 java.lang.ClassLoader.loadClass(String, boolean) 2620 > sun.misc.Launcher$AppClassLoader.loadClass(String, > >>>>> boolean) 2620 > >>>>> > >>>>> > >>>>> You can see that the most time is spent doing Vector operations. > >>>>> Combined, the first three rows account for 91% of the runtime. > >>>>> > >>>>> Thanks for your help! > >>>>> > >>>>> PS: Is this reaching dev@? > >>>>> > >>>>> [1] > >>>>> > https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit# > >>>>> > >>>>> > >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <robina...@apache.org > >wrote: > >>>>> > >>>>>> Few more brain dump ideas > >>>>>> > >>>>>> For any operation (assign, aggregate) you are trying to do. > >>>>>> Compute set of "feasible" solutions: Based on all the properties of > >>>>>> functions > >>>>>> Choose the solution that minimises the cost. > >>>>>> > >>>>>> For this you may need to refactor all the special kinds of > iterations > >>>>>> and aggregations subclassing from AssignOperation and > AggregateOperation. > >>>>>> Each operation would return the cost given two input vectors (based > on > >>>>>> their cost properties) > >>>>>> > >>>>>> > >>>>>> > >>>>>> ------ > >>>>>> Robin Anil > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <robina...@apache.org > >wrote: > >>>>>> > >>>>>>> Strike that. I think the code is very confusing. You will have > >>>>>>> better luck if you create method for the three implementations. It > will > >>>>>>> make things explicit and easy to debug. > >>>>>>> > >>>>>>> public final double getIterationCost() { > >>>>>>> return one of {cardinality, 1.2*numNonDefault, numNonDefault} > >>>>>>> } > >>>>>>> > >>>>>>> public final double getLookupCost() { > >>>>>>> return one of {1, 1.5(lets say), log_2(numNonDefault)} > >>>>>>> } > >>>>>>> > >>>>>>> All your logic then becomes a minimization of the total cost. > >>>>>>> > >>>>>>> This is more extendable than current isDense(), isSequential(), > >>>>>>> isSparse(), isRandom() and doesn't assume what the underlying > >>>>>>> implementation is. > >>>>>>> > >>>>>>> > >>>>>>> ------ > >>>>>>> Robin Anil > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <robina...@apache.org > >wrote: > >>>>>>> > >>>>>>>> Also the calculation for cost is wrong. RASV is HashMap based its > >>>>>>>> cost O(n) not O(log n) > >>>>>>>> > >>>>>>>> ------ > >>>>>>>> Robin Anil > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <robina...@apache.org > >wrote: > >>>>>>>> > >>>>>>>>> This is an automatically generated e-mail. To reply, visit: > >>>>>>>>> https://reviews.apache.org/r/10669/ > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> None > >>>>>>>>> > >>>>>>>>> 55 > >>>>>>>>> > >>>>>>>>> boolean commutativeAndAssociative = > aggregator.isCommutative() && aggregator.isAssociative(); > >>>>>>>>> > >>>>>>>>> This should be a method in AbstractFunction interface > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> 942 > >>>>>>>>> > >>>>>>>>> // Make all the non-zero values 0. > >>>>>>>>> > >>>>>>>>> vector.clear() > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> 952 > >>>>>>>>> > >>>>>>>>> while (it.hasNext()) { > >>>>>>>>> > >>>>>>>>> There is no guarantee of ordered iteration. The updates is > unsorted. merge typically implies that input is sorted > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> 979 > >>>>>>>>> > >>>>>>>>> element.set(values[index]); > >>>>>>>>> > >>>>>>>>> What if there are values that are not in the existing data. I > dont this is correct > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927, > 'expand_offset': 2} > >>>>>>>>> > >>>>>>>>> 999 > >>>>>>>>> > >>>>>>>>> invalidateCachedLength(); > >>>>>>>>> > >>>>>>>>> move invalidateCachedLength to the set method. Instead of > dispersing it everywhere. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> None > >>>>>>>>> > >>>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> > thisIterator, Iterator<Element> thatIterator,', 'line': 1070} > >>>>>>>>> > >>>>>>>>> 1014 > >>>>>>>>> > >>>>>>>>> protected Vector assignSkipZerosIterateOne(Iterator<Element> > thisIterator, Vector other, > >>>>>>>>> > >>>>>>>>> be consistent, this that or thisIterator that Iterator. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> math/src/main/java/org/apache/mahout/math/AbstractVector.java< > https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> None > >>>>>>>>> > >>>>>>>>> {'text': ' private Vector assignIterateBoth(Iterator<Element> > thisIterator, Iterator<Element> thatIterator,', 'line': 1070} > >>>>>>>>> > >>>>>>>>> 1070 > >>>>>>>>> > >>>>>>>>> private Vector assignIterateBoth(Iterator<Element> > thisIterator, Iterator<Element> thatIterator, > >>>>>>>>> > >>>>>>>>> This expects the data to to be sorted. indicate that in a > comment. Similarly comment other places > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> > math/src/main/java/org/apache/mahout/math/function/Functions.java< > https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900> > (Diff > >>>>>>>>> revision 2) > >>>>>>>>> > >>>>>>>>> None > >>>>>>>>> > >>>>>>>>> {'text': ' public boolean isLikeLeftMult() {', 'line': 892} > >>>>>>>>> > >>>>>>>>> 896 > >>>>>>>>> > >>>>>>>>> /** > >>>>>>>>> > >>>>>>>>> Move this comments to the base class > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> - Robin > >>>>>>>>> > >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote: > >>>>>>>>> Review request for mahout, Ted Dunning, Sebastian Schelter, and > >>>>>>>>> Robin Anil. > >>>>>>>>> By Dan Filimon. > >>>>>>>>> > >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.* > >>>>>>>>> Description > >>>>>>>>> > >>>>>>>>> This patch contains code cleaning up AbstractVector and making > the operations as fast as possible while still having a high level > interface. > >>>>>>>>> > >>>>>>>>> The main changes are in AbstractVector as well as new methods in > DoubleDoubleFunction. > >>>>>>>>> > >>>>>>>>> Testing > >>>>>>>>> > >>>>>>>>> The vectors test pass but it's likely that the patch in it's > current state is broken as other, unrelated tests (BallKMeans...) are > failing. > >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all the core > tests. Anyone? > >>>>>>>>> > >>>>>>>>> I can't seem to find the bug, so _please_ have a closer look. > It's still a work in progress. > >>>>>>>>> > >>>>>>>>> The benchmarks seem comparable (although there are some jarring > diferences – Minkowski distance seems a lot slower in new-dan-1 than > old-trunk-2). It may be however that this is just variance due to the load > of the machine at the time. I'm having trouble interpreting the benchmarks > in general, so anyone who could give me a hand is more than welcome. > >>>>>>>>> > >>>>>>>>> Diffs > >>>>>>>>> > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/AbstractMatrix.java > >>>>>>>>> (e12aa38) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/AbstractVector.java > >>>>>>>>> (090aa7a) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/Centroid.java > >>>>>>>>> (0c42196) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/ConstantVector.java > >>>>>>>>> (51d67d4) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/DelegatingVector.java > >>>>>>>>> (12220d4) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/DenseVector.java > >>>>>>>>> (41c356b) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java > >>>>>>>>> (094003b) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/MatrixSlice.java > >>>>>>>>> (7f79c96) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/MatrixVectorView.java > >>>>>>>>> (af70727) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/NamedVector.java > >>>>>>>>> (4b7a41d) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java > >>>>>>>>> (650d82d) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/PermutedVectorView.java > >>>>>>>>> (d1ea93a) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java > >>>>>>>>> (6f85692) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java > >>>>>>>>> (21982f9) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/Vector.java > >>>>>>>>> (2f8b417) > >>>>>>>>> - math/src/main/java/org/apache/mahout/math/VectorView.java > >>>>>>>>> (add2a60) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/WeightedVector.java > >>>>>>>>> (06fbd05) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java > >>>>>>>>> (82b350a) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/function/Functions.java > >>>>>>>>> (eb2e42f) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/function/PlusMult.java > >>>>>>>>> (60587b1) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/function/TimesFunction.java > >>>>>>>>> (8ab0bb1) > >>>>>>>>> - > math/src/main/java/org/apache/mahout/math/jet/math/Constants.java > >>>>>>>>> (53535d6) > >>>>>>>>> - > math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java > >>>>>>>>> (2b11199) > >>>>>>>>> - math/src/test/java/org/apache/mahout/math/VectorTest.java > >>>>>>>>> (d6d554b) > >>>>>>>>> > >>>>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > > >