Nearly done splitting the code up, but I'm not sure what the costs should
ideally be.

Robin, you proposed: cost of iteration + cost of lookup + cost of update
(if its not in-place)

This sounds like it's per element, rather than for the entire vector. Also,
are we just going to assume that there will be as many updates as non-zeros
or as large as the array is? How does the cost of an update factor in?

And, is "in-place" just for DenseVectors?


On Thu, Apr 25, 2013 at 8:04 PM, Robin Anil <robina...@apache.org> wrote:

> Depends on the speed of copying a 1M double array v/s doing a calloc +
> copying 1000 non zeros (Assuming java is doing that underneath).
>
> ------
> Robin Anil
>
>
>
> On Thu, Apr 25, 2013 at 2:18 AM, Dan Filimon <dangeorge.fili...@gmail.com
> >wrote:
>
> > Right, but is clone() generally slower than assigning? That strikes me as
> > odd; doesn't Java optimize copying the internal structures (there are
> > arrays underneath after all)?
> >
> >
> > On Thu, Apr 25, 2013 at 10:14 AM, Robin Anil <robina...@apache.org>
> wrote:
> >
> >> Seems like for dense clone is slower than like().assign I need to test
> it
> >> with different sizes to be sure. I kept it from the old behavior.
> >> On Apr 25, 2013 2:12 AM, "Dan Filimon" <dangeorge.fili...@gmail.com>
> >> wrote:
> >>
> >>> Okay, so I should split it further into smaller sub-cases that handle
> >>> each Vector type. I tried making so that these match the cases in the
> >>> document to the extent possible.
> >>> You're right. It is ugly and I need to split it up.
> >>>
> >>> I removed the OrderedIntDoubleMapping (but with another if...) but one
> >>> more thing that helped is making the assign() only go through the
> non-zeros
> >>> when setting up a new copy.
> >>>
> >>> About the copying, I saw that the optimized copy doesn't always call
> >>> clone clone on the Vector. Why is this?
> >>>
> >>> Thanks for the test and the feedback! :)
> >>>
> >>>
> >>> On Thu, Apr 25, 2013 at 8:30 AM, Robin Anil <robina...@apache.org
> >wrote:
> >>>
> >>>> Yes its reaching dev. I checked out your code to take a look. I
> >>>> understand you are trying to still mix isSequentialAccess etc in the
> >>>> validity calculation. The whole point of this architecture is to
> completely
> >>>> unroll things to be more modular instead of creating a giant spaghetti
> >>>>
> >>>> Here is an example of what I am trying to say.
> >>>>
> >>>> This is your current implementation of AssignIterateOneLookupOther.
> >>>> There are so many conditionals
> >>>>
> >>>>
> >>>>    1. @Override
> >>>>    2.     public Vector assign(Vector x, Vector y,
> >>>>    DoubleDoubleFunction f) {
> >>>>    3.       return !swap ? assignInner(x, y, f) : assignInner(y, x,
> f);
> >>>>    4.     }
> >>>>    5.
> >>>>    6.     public Vector assignInner(Vector x, Vector y,
> >>>>    DoubleDoubleFunction f) {
> >>>>    7.       Iterator<Vector.Element> xi = x.iterateNonZero();
> >>>>    8.       Vector.Element xe;
> >>>>    9.       Vector.Element ye;
> >>>>    10.       OrderedIntDoubleMapping updates =
> newOrderedIntDoubleMapping();
> >>>>    11.       while (xi.hasNext()) {
> >>>>    12.         xe = xi.next();
> >>>>    13.         ye = y.getElement(xe.index());
> >>>>    14.         if (!swap) {
> >>>>    15.           xe.set(f.apply(xe.get(), ye.get()));
> >>>>    16.         } else {
> >>>>    17.           if (ye.get() != 0.0 || y.isAddConstantTime()) {
> >>>>    18.             ye.set(f.apply(ye.get(), xe.get()));
> >>>>    19.           } else {
> >>>>    20.             updates.set(xe.index(), f.apply(ye.get(),
> >>>>    xe.get()));
> >>>>    21.           }
> >>>>    22.         }
> >>>>    23.       }
> >>>>    24.       if (swap && !y.isAddConstantTime()) {
> >>>>    25.         y.mergeUpdates(updates);
> >>>>    26.       }
> >>>>    27.       return swap ? y : x;
> >>>>    28.     }
> >>>>    29.   }
> >>>>
> >>>> Split this into.
> >>>>
> >>>> IterateThisLookupThatInplaceUpdate
> >>>> IterateThatLookupThisInplaceUpdate
> >>>> IterateThisLookupThatMergeUpdate
> >>>> IterateThatLookupThisMergeUpdate
> >>>>
> >>>> Removing the conditionals itself will speedup the operations you see
> >>>> regression on.
> >>>>
> >>>> Also note that getElement() is not constant Add time for RASV. In
> >>>> earlier version we tried at best to set in place. That reduced the
> extra
> >>>> hashkey computation. getElement in AbstractVector is really optimized
> for
> >>>> DenseVector and it does indicate as the cause of regression you are
> seeing
> >>>> in RASV. Another one is the unused new OrderedIntDoubleMapping().
> Removing
> >>>> all these will bring back the loss.
> >>>>
> >>>> Now the cost for each can be computed as: cost of iteration + cost of
> >>>> lookup + cost of update (if its not in-place)
> >>>>
> >>>> See the attached file for a mock test. You will have to rework the
> >>>> expectations but the framework should be understandable.
> >>>>
> >>>>
> >>>> ------
> >>>> Robin Anil
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Apr 23, 2013 at 4:09 AM, Dan Filimon <
> >>>> dangeorge.fili...@gmail.com> wrote:
> >>>>
> >>>>> Here's [1] a link to the "design document" of the new vector
> >>>>> operations so I can lay out the ideas behind what I'm doing more
> clearly.
> >>>>>
> >>>>> I'd like anyone who can to have a look and comment.
> >>>>>
> >>>>> Ted,
> >>>>> I know you'd like me to get back to working on clustering, but
> >>>>> currently, BallKMeans is roughly 2x as slow as Mahout's KMeans.
> >>>>> This is because of centroid.update() for one (it doesn't have
> >>>>> "interesting" properties that are exploited in the current) and also
> >>>>> because we're using we're just doing lots of Vector operations: dot
> >>>>> products for projections, hash codes for unique sets of candidates,
> >>>>> equality checks etc.
> >>>>>
> >>>>> Here's a snapshot of the hotspots for one run of BallKMeans (20
> >>>>> newsgroups, unprojected, i.e 20K sparse vectors with 200K dimensions
> and
> >>>>> ~100 nonzero values):
> >>>>>   Name Time (ms)  org.apache.mahout.math.AbstractVector.dot(Vector)
> >>>>> 104069  org.apache.mahout.math.DelegatingVector.hashCode() 30159
>  org.apache.mahout.math.AbstractVector.assignIterateBoth(Iterator,
> >>>>> Iterator, DoubleDoubleFunction) 13720
> >>>>>
> org.apache.mahout.math.OrderedIntDoubleMapping.merge(OrderedIntDoubleMapping)
> >>>>> 2911  java.lang.ClassLoader.loadClass(String, boolean) 2620
>  sun.misc.Launcher$AppClassLoader.loadClass(String,
> >>>>> boolean) 2620
> >>>>>
> >>>>>
> >>>>> You can see that the most time is spent doing Vector operations.
> >>>>> Combined, the first three rows account for 91% of the runtime.
> >>>>>
> >>>>> Thanks for your help!
> >>>>>
> >>>>> PS: Is this reaching dev@?
> >>>>>
> >>>>> [1]
> >>>>>
> https://docs.google.com/document/d/1g1PjUuvjyh2LBdq2_rKLIcUiDbeOORA1sCJiSsz-JVU/edit#
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 23, 2013 at 7:53 AM, Robin Anil <robina...@apache.org
> >wrote:
> >>>>>
> >>>>>> Few more brain dump ideas
> >>>>>>
> >>>>>> For any operation (assign, aggregate) you are trying to do.
> >>>>>> Compute set of "feasible" solutions: Based on all the properties of
> >>>>>> functions
> >>>>>> Choose the solution that minimises the cost.
> >>>>>>
> >>>>>> For this you may need to refactor all the special kinds of
> iterations
> >>>>>> and aggregations subclassing from AssignOperation and
> AggregateOperation.
> >>>>>> Each operation would return the cost given two input vectors (based
> on
> >>>>>> their cost properties)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ------
> >>>>>> Robin Anil
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, Apr 22, 2013 at 8:46 PM, Robin Anil <robina...@apache.org
> >wrote:
> >>>>>>
> >>>>>>> Strike that. I think the code is very confusing. You will have
> >>>>>>> better luck if you create method for the three implementations. It
> will
> >>>>>>> make things explicit and easy to debug.
> >>>>>>>
> >>>>>>> public final double getIterationCost() {
> >>>>>>>  return one of {cardinality, 1.2*numNonDefault, numNonDefault}
> >>>>>>> }
> >>>>>>>
> >>>>>>> public final double getLookupCost() {
> >>>>>>>  return one of {1, 1.5(lets say), log_2(numNonDefault)}
> >>>>>>> }
> >>>>>>>
> >>>>>>> All your logic then becomes a minimization of the total cost.
> >>>>>>>
> >>>>>>> This is more extendable than current isDense(), isSequential(),
> >>>>>>> isSparse(), isRandom() and doesn't assume what the underlying
> >>>>>>> implementation is.
> >>>>>>>
> >>>>>>>
> >>>>>>> ------
> >>>>>>> Robin Anil
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Apr 22, 2013 at 8:09 PM, Robin Anil <robina...@apache.org
> >wrote:
> >>>>>>>
> >>>>>>>>  Also the calculation for cost is wrong. RASV is HashMap based its
> >>>>>>>> cost O(n) not O(log n)
> >>>>>>>>
> >>>>>>>> ------
> >>>>>>>> Robin Anil
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Apr 22, 2013 at 4:43 PM, Robin Anil <robina...@apache.org
> >wrote:
> >>>>>>>>
> >>>>>>>>>    This is an automatically generated e-mail. To reply, visit:
> >>>>>>>>> https://reviews.apache.org/r/10669/
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line56>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> None
> >>>>>>>>>
> >>>>>>>>>    55
> >>>>>>>>>
> >>>>>>>>>     boolean commutativeAndAssociative =
> aggregator.isCommutative() && aggregator.isAssociative();
> >>>>>>>>>
> >>>>>>>>>   This should be a method in AbstractFunction interface
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1037>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>>    942
> >>>>>>>>>
> >>>>>>>>>       // Make all the non-zero values 0.
> >>>>>>>>>
> >>>>>>>>>   vector.clear()
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1047>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>>    952
> >>>>>>>>>
> >>>>>>>>>       while (it.hasNext()) {
> >>>>>>>>>
> >>>>>>>>>   There is no guarantee of ordered iteration. The updates is
> unsorted. merge typically implies that input is sorted
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1074>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>>    979
> >>>>>>>>>
> >>>>>>>>>         element.set(values[index]);
> >>>>>>>>>
> >>>>>>>>>   What if there are values that are not in the existing data. I
> dont this is correct
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1100>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 726,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>> {'text': 'public int getNumNonZeroElements() {', 'line': 927,
> 'expand_offset': 2}
> >>>>>>>>>
> >>>>>>>>>    999
> >>>>>>>>>
> >>>>>>>>>     invalidateCachedLength();
> >>>>>>>>>
> >>>>>>>>>   move invalidateCachedLength to the set method. Instead of
> dispersing it everywhere.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1115>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> None
> >>>>>>>>>
> >>>>>>>>> {'text': '  private Vector assignIterateBoth(Iterator<Element>
> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
> >>>>>>>>>
> >>>>>>>>>     1014
> >>>>>>>>>
> >>>>>>>>>   protected Vector assignSkipZerosIterateOne(Iterator<Element>
> thisIterator, Vector other,
> >>>>>>>>>
> >>>>>>>>>   be consistent, this that or thisIterator that Iterator.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>    math/src/main/java/org/apache/mahout/math/AbstractVector.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283185#file283185line1171>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> None
> >>>>>>>>>
> >>>>>>>>> {'text': '  private Vector assignIterateBoth(Iterator<Element>
> thisIterator, Iterator<Element> thatIterator,', 'line': 1070}
> >>>>>>>>>
> >>>>>>>>>     1070
> >>>>>>>>>
> >>>>>>>>>   private Vector assignIterateBoth(Iterator<Element>
> thisIterator, Iterator<Element> thatIterator,
> >>>>>>>>>
> >>>>>>>>>   This expects the data to to be sorted. indicate that in a
> comment. Similarly comment other places
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> math/src/main/java/org/apache/mahout/math/function/Functions.java<
> https://reviews.apache.org/r/10669/diff/2/?file=283202#file283202line900>
> (Diff
> >>>>>>>>> revision 2)
> >>>>>>>>>
> >>>>>>>>> None
> >>>>>>>>>
> >>>>>>>>> {'text': '      public boolean isLikeLeftMult() {', 'line': 892}
> >>>>>>>>>
> >>>>>>>>>    896
> >>>>>>>>>
> >>>>>>>>>       /**
> >>>>>>>>>
> >>>>>>>>>   Move this comments to the base class
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> - Robin
> >>>>>>>>>
> >>>>>>>>> On April 22nd, 2013, 8:45 p.m., Dan Filimon wrote:
> >>>>>>>>>   Review request for mahout, Ted Dunning, Sebastian Schelter, and
> >>>>>>>>> Robin Anil.
> >>>>>>>>> By Dan Filimon.
> >>>>>>>>>
> >>>>>>>>> *Updated April 22, 2013, 8:45 p.m.*
> >>>>>>>>> Description
> >>>>>>>>>
> >>>>>>>>> This patch contains code cleaning up AbstractVector and making
> the operations as fast as possible while still having a high level
> interface.
> >>>>>>>>>
> >>>>>>>>> The main changes are in AbstractVector as well as new methods in
> DoubleDoubleFunction.
> >>>>>>>>>
> >>>>>>>>>   Testing
> >>>>>>>>>
> >>>>>>>>> The vectors test pass but it's likely that the patch in it's
> current state is broken as other, unrelated tests (BallKMeans...) are
> failing.
> >>>>>>>>> Also, my Hadoop conf is broken so I didn't run all the core
> tests. Anyone?
> >>>>>>>>>
> >>>>>>>>> I can't seem to find the bug, so _please_ have a closer look.
> It's still a work in progress.
> >>>>>>>>>
> >>>>>>>>> The benchmarks seem comparable (although there are some jarring
> diferences – Minkowski distance seems a lot slower in new-dan-1 than
> old-trunk-2). It may be however that this is just variance due to the load
> of the machine at the time. I'm having trouble interpreting the benchmarks
> in general, so anyone who could give me a hand is more than welcome.
> >>>>>>>>>
> >>>>>>>>>   Diffs
> >>>>>>>>>
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
> >>>>>>>>>    (e12aa38)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/AbstractVector.java
> >>>>>>>>>    (090aa7a)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/Centroid.java
> >>>>>>>>>    (0c42196)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/ConstantVector.java
> >>>>>>>>>    (51d67d4)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/DelegatingVector.java
> >>>>>>>>>    (12220d4)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/DenseVector.java
> >>>>>>>>>    (41c356b)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/FileBasedSparseBinaryMatrix.java
> >>>>>>>>>    (094003b)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/MatrixSlice.java
> >>>>>>>>>    (7f79c96)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/MatrixVectorView.java
> >>>>>>>>>    (af70727)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/NamedVector.java
> >>>>>>>>>    (4b7a41d)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/OrderedIntDoubleMapping.java
> >>>>>>>>>    (650d82d)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
> >>>>>>>>>    (d1ea93a)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
> >>>>>>>>>    (6f85692)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
> >>>>>>>>>    (21982f9)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/Vector.java
> >>>>>>>>>    (2f8b417)
> >>>>>>>>>    - math/src/main/java/org/apache/mahout/math/VectorView.java
> >>>>>>>>>    (add2a60)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/WeightedVector.java
> >>>>>>>>>    (06fbd05)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/function/DoubleDoubleFunction.java
> >>>>>>>>>    (82b350a)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/function/Functions.java
> >>>>>>>>>    (eb2e42f)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/function/PlusMult.java
> >>>>>>>>>    (60587b1)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/function/TimesFunction.java
> >>>>>>>>>    (8ab0bb1)
> >>>>>>>>>    -
> math/src/main/java/org/apache/mahout/math/jet/math/Constants.java
> >>>>>>>>>    (53535d6)
> >>>>>>>>>    -
> math/src/test/java/org/apache/mahout/math/AbstractVectorTest.java
> >>>>>>>>>    (2b11199)
> >>>>>>>>>    - math/src/test/java/org/apache/mahout/math/VectorTest.java
> >>>>>>>>>    (d6d554b)
> >>>>>>>>>
> >>>>>>>>> View Diff <https://reviews.apache.org/r/10669/diff/>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >
>

Reply via email to