[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836598#action_12836598
 ] 

Robin Anil commented on MAHOUT-300:
---

We should be multiplying by sparsity instead of cardinality to calculate 
the speed in MB/s for Sparse and Seq, and by cardinality for the dense vector.
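
A minimal sketch of that calculation. The class and method names are hypothetical, and the figure of 12 bytes per stored element (a 4-byte int index plus an 8-byte double value) is an assumption made for illustration, not something the benchmark code is confirmed to use.

```java
// Hypothetical helper, not part of Mahout: converts a benchmark speed in
// vectors per second into a data rate in MB/s.
final class ThroughputSketch {
  // Assumption: every stored element costs a 4-byte index plus an 8-byte value.
  static final int BYTES_PER_ELEMENT = 12;

  /** Number of elements actually stored, and therefore touched, per vector. */
  static int storedElements(boolean dense, int cardinality, int sparsity) {
    // A dense vector stores every position; the sparse ones store only non-zeros.
    return dense ? cardinality : sparsity;
  }

  /** Data rate in MB/s (1 MB = 1e6 bytes) for a speed given in vectors/sec. */
  static double megabytesPerSecond(double vectorsPerSecond, boolean dense,
                                   int cardinality, int sparsity) {
    long bytesPerVector =
        (long) storedElements(dense, cardinality, sparsity) * BYTES_PER_ELEMENT;
    return vectorsPerSecond * bytesPerVector / 1e6;
  }
}
```

With these assumptions, 1,000 sparse vectors per second at sparsity 500 is rated at 6 MB/s, whereas multiplying by a cardinality of 50,000 would report 600 MB/s for the same work.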

 Solve performance issues with Vector Implementations
 

 Key: MAHOUT-300
 URL: https://issues.apache.org/jira/browse/MAHOUT-300
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.3
Reporter: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch, 
 MAHOUT-300.patch, MAHOUT-300.patch, MAHOUT-300.patch


 AbstractVector operations like times
   public Vector times(double x) {
     Vector result = clone();
     Iterator<Element> iter = iterateNonZero();
     while (iter.hasNext()) {
       Element element = iter.next();
       int index = element.index();
       result.setQuick(index, element.get() * x);
     }
     return result;
   }
 should be implemented as follows
   public Vector times(double x) {
     Vector result = clone();
     Iterator<Element> iter = result.iterateNonZero();
     while (iter.hasNext()) {
       Element element = iter.next();
       element.set(element.get() * x);
     }
     return result;
   }

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836624#action_12836624
 ] 

Robin Anil commented on MAHOUT-300:
---

I think the irregularity is due to the sparse vector generation process, where 
duplicate index values can be generated, leaving some vectors much sparser 
than the requested sparsity value:

{code}
  Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
  int[] indexes = new int[sparsity];
  double[] values = new double[sparsity];
  for (int j = 0; j < sparsity; j++) {
    double value = r.nextGaussian();
    int index = sparsity < cardinality ? r.nextInt(cardinality) : j;
    v.set(index, value);
    indexes[j] = index;
    values[j] = value;
  }
{code}

Instead, I suggest this:

{code}
  Vector v = new SequentialAccessSparseVector(cardinality, sparsity); // sparsity!
  boolean[] featureSpace = new boolean[cardinality];
  int[] indexes = new int[sparsity];
  double[] values = new double[sparsity];
  int j = 0;
  while (j < sparsity) {
    double value = r.nextGaussian();
    int index = r.nextInt(cardinality);
    if (!featureSpace[index]) {
      featureSpace[index] = true;
      indexes[j] = index;
      values[j++] = value;
      v.set(index, value);
    }
  }
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836630#action_12836630
 ] 

Robin Anil commented on MAHOUT-300:
---

Ted, your loop structure seems to be about 150 MB/s slower than the null-based 
impl. Does it need more loops before optimisations kick in?




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836633#action_12836633
 ] 

Sean Owen commented on MAHOUT-300:
--

Tiny comment -- will probably be wise to use BitSet rather than boolean[], as 
booleans are stored as full 32 bit value (!). A 32x reduction in memory is 
non-trivial with cardinalities in the millions.
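
A minimal sketch of the unique-index generation with a BitSet as the "seen" set. The class and method names are hypothetical. As a side note, on HotSpot JVMs a boolean[] typically takes one byte per element rather than four, so the saving is probably nearer 8x than 32x, which still matters at cardinalities in the millions.

```java
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper for illustration; not taken from the attached patches.
final class UniqueIndexSketch {
  /** Draws `sparsity` distinct indexes in [0, cardinality). */
  static int[] distinctIndexes(int cardinality, int sparsity, Random r) {
    if (sparsity > cardinality) {
      // Without this guard the loop below could never finish.
      throw new IllegalArgumentException("sparsity must not exceed cardinality");
    }
    BitSet seen = new BitSet(cardinality); // one bit per possible index
    int[] indexes = new int[sparsity];
    int j = 0;
    while (j < sparsity) {
      int index = r.nextInt(cardinality);
      if (!seen.get(index)) { // skip duplicates so every slot gets a new index
        seen.set(index);
        indexes[j++] = index;
      }
    }
    return indexes;
  }
}
```

Rejection sampling like this slows down as sparsity approaches cardinality, because most draws are then duplicates; for the sparse cases benchmarked here that should not be a concern.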




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836649#action_12836649
 ] 

Robin Anil commented on MAHOUT-300:
---

On dense data (1000, 1000):

{noformat}
DotProduct

BenchMarks        nCalls  sum        min       max       mean        stdDev      Speed           Rate
DenseVector       2       0.042869s  0.0010ms  2.717ms   0.002143ms  0.027798ms  466537.6 /sec   5598.451 MB/s
RandSparseVector  2       1.139837s  0.046ms   21.51ms   0.056991ms  0.194404ms  17546.367 /sec  210.55641 MB/s
SeqSparseVector   2       0.293336s  0.01ms    3.156ms   0.014666ms  0.053138ms  68181.195 /sec  818.17444 MB/s
Dense.dot(Rand)   2       0.882977s  0.03ms    25.346ms  0.044148ms  0.30642ms   22650.646 /sec  271.80777 MB/s
Dense.dot(Seq)    2       0.452817s  0.011ms   26.567ms  0.02264ms   0.255753ms  44167.953 /sec  530.01544 MB/s
Rand.dot(Dense)   2       1.330815s  0.049ms   14.738ms  0.06654ms   0.212913ms  15028.385 /sec  180.34062 MB/s
Rand.dot(Seq)     2       0.843993s  0.027ms   53.265ms  0.042199ms  0.446643ms  23696.877 /sec  284.36255 MB/s
Seq.dot(Dense)    2       0.931822s  0.036ms   9.44ms    0.046591ms  0.131948ms  21463.326 /sec  257.55994 MB/s
Seq.dot(Rand)     2       1.093099s  0.049ms   4.017ms   0.054654ms  0.054681ms  18296.604 /sec  219.55927 MB/s
{noformat}

On sparse data (1000, 300):
Don't compare the MB/s; look at the units/s instead.


{noformat}
DotProduct

BenchMarks        nCalls  sum        min       max       mean        stdDev      Speed
DenseVector       2       0.048355s  0.0010ms  6.525ms   0.002417ms  0.062427ms  413607.7 /sec
RandSparseVector  2       0.569326s  0.018ms   33.768ms  0.028466ms  0.302488ms  35129.258 /sec
SeqSparseVector   2       0.338478s  0.011ms   3.936ms   0.016923ms  0.059426ms  59088.03 /sec
Dense.dot(Rand)   2       0.408213s  0.012ms   26.649ms  0.02041ms   0.237577ms  48994.03 /sec
Dense.dot(Seq)    2       0.205143s  0.0040ms  27.028ms  0.010257ms  0.222142ms  97492.96 /sec
Rand.dot(Dense)   2       0.469473s  0.017ms   3.969ms   0.023473ms  0.05819ms   42600.957 /sec
Rand.dot(Seq)     2       0.242953s  0.01ms    3.042ms   0.012147ms  0.026846ms  82320.45 /sec
Seq.dot(Dense)    2       0.291587s  0.011ms   4.704ms   0.014579ms  0.06257ms   68590.164 /sec
Seq.dot(Rand)     2       0.362947s  0.014ms   7.04ms    0.018147ms  0.06777ms   [truncated]

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836679#action_12836679
 ] 

Robin Anil commented on MAHOUT-300:
---

I found the anomaly Jake was talking about. It was due to too many instanceof 
checks in dot in AbstractVector. I moved that code out and split it into smaller 
checks in the overridden dot of each of the impls. The numbers just doubled, 
confirming my suspicion that instanceof is heavyweight.
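
A minimal sketch of that restructuring, using hypothetical stripped-down types rather than the real Mahout classes: the abstract base keeps a generic fallback with no type checks, and each implementation's dot carries only the check it needs. Cardinality checks are omitted for brevity.

```java
import java.util.Arrays;

// Hypothetical stand-ins for the vector classes discussed in this thread.
interface Vec {
  int size();
  double get(int index);
  double dot(Vec other);
}

abstract class AbstractVec implements Vec {
  // Generic fallback: an indexed loop with no instanceof checks at all.
  @Override
  public double dot(Vec other) {
    double sum = 0.0;
    for (int i = 0; i < size(); i++) {
      sum += get(i) * other.get(i);
    }
    return sum;
  }
}

final class DenseVec extends AbstractVec {
  final double[] values;

  DenseVec(double[] values) { this.values = values; }

  @Override public int size() { return values.length; }
  @Override public double get(int index) { return values[index]; }

  // The single check that matters for this implementation lives here.
  @Override
  public double dot(Vec other) {
    if (other instanceof DenseVec) {
      double[] o = ((DenseVec) other).values;
      double sum = 0.0;
      for (int i = 0; i < values.length; i++) {
        sum += values[i] * o[i];
      }
      return sum;
    }
    return super.dot(other);
  }
}

final class SeqSparseVec extends AbstractVec {
  final int size;
  final int[] indexes;   // sorted ascending
  final double[] values; // values[k] belongs to indexes[k]

  SeqSparseVec(int size, int[] indexes, double[] values) {
    this.size = size;
    this.indexes = indexes;
    this.values = values;
  }

  @Override public int size() { return size; }

  @Override public double get(int index) {
    int pos = Arrays.binarySearch(indexes, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  // No type check needed: only the stored entries are visited.
  @Override
  public double dot(Vec other) {
    double sum = 0.0;
    for (int k = 0; k < indexes.length; k++) {
      sum += values[k] * other.get(indexes[k]);
    }
    return sum;
  }
}
```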




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836706#action_12836706
 ] 

Jake Mannix commented on MAHOUT-300:


The sparse data is odd... (-vs 50 -sp 5000) (running with 1000, 300 is 
really not very sparse at all...)  

I haven't applied any newer patches (just the one I submitted most recently), 
but I have svn upped.

These results are counterintuitive.

{code}
DotProduct

BenchMarks                          nCalls  sumTime    minTime  maxTime   meanTime    stdDevTime  Speed           Rate
DenseVector                         2500    3.660321s  1.31ms   10.149ms  1.464128ms  0.329025ms  683.0002 /sec   4098.001 MB/s
RandomAccessSparseVector            2500    1.481516s  0.459ms  36.691ms  0.592606ms  0.852156ms  1687.4606 /sec  10124.764 MB/s
SequentialAccessSparseVector        2500    0.448737s  0.102ms  4.552ms   0.179494ms  0.234261ms  5571.192 /sec   33427.152 MB/s
Dense.dot(RandomAccess)             2500    2.098937s  0.716ms  5.437ms   0.839574ms  0.179854ms  1191.0791 /sec  7146.4746 MB/s
Dense.dot(SequentialAccess)         2500    0.856259s  0.24ms   11.856ms  0.342503ms  0.286798ms  2919.6772 /sec  17518.062 MB/s
RandomAcces.dot(Dense)              2500    2.277742s  0.776ms  8.059ms   0.911096ms  0.268853ms  1097.5781 /sec  6585.4688 MB/s
RandomAccess.dot(SequentialAccess)  2500    0.607507s  0.18ms   4.509ms   0.243002ms  0.115022ms  4115.1787 /sec  24691.072 MB/s
SequentialAccess.dot(Dense)         2500    1.341608s  0.442ms  2.136ms   0.536643ms  0.171088ms  1863.4355 /sec  11180.613 MB/s
SequentialAccess.dot(RandomAccess)  2500    0.741622s  0.209ms  2.031ms   0.296648ms  0.115263ms  3370.9895 /sec  20225.936 MB/s
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836713#action_12836713
 ] 

Robin Anil commented on MAHOUT-300:
---

Can I commit the latest, if you don't have any changes pending on your end? 
Whatever the case, we need to ensure correctness and proceed with 0.3. We are much 
better in terms of perf now than at the beginning of this issue.




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836815#action_12836815
 ] 

Jake Mannix commented on MAHOUT-300:


With these opts: -vs 50 -sp 500 -nv 50 -l 500 -no 10

Dot product looks more sensible.  

Executive summary: fastest is  SequentialAccess.dot(Dense), clocking in at 
69,246 units/s, which is as expected. 

Leaderboard for dotProduct:
{code}
Seq.dot(Den) :  69,246 units/s
Seq.dot(Seq) :  63,958 units/s
Seq.dot(Rnd) :  49,638 units/s
Rnd.dot(Seq) :  39,019 units/s
Den.dot(Seq) :  30,337 units/s
Rnd.dot(Rnd) :  5,320 units/s
Den.dot(Rnd) :  5,177 units/s
Rnd.dot(Den) :  5,101 units/s
Den.dot(Den) :  516 units/s
{code}

{code}
INFO: DotProduct DenseVector 
sum = 48.442942s;
min = 1.554ms;
max = 32.55ms;
mean = 1.937717ms;
stdDev = 0.55081ms; 
Speed: 516.07104 UnitsProcessed/sec 3.0964262 MBytes/sec   

INFO: DotProduct RandSparseVector 
sum = 4.69924s;
min = 0.116ms;
max = 24.211ms;
mean = 0.187969ms;
stdDev = 0.343685ms; 
Speed: 5320.0093 UnitsProcessed/sec 31.920053 MBytes/sec 
  
INFO: DotProduct SeqSparseVector 
sum = 0.390877s;
min = 0.012ms;
max = 2.698ms;
mean = 0.015635ms;
stdDev = 0.037619ms; 
Speed: 63958.742 UnitsProcessed/sec 383.7524 MBytes/sec   

INFO: DotProduct Dense.dot(Rand) 
sum = 4.828592s;
min = 0.137ms;
max = 4.09ms;
mean = 0.193143ms;
stdDev = 0.052169ms; 
Speed: 5177.4927 UnitsProcessed/sec 31.064955 MBytes/sec   

INFO: DotProduct Dense.dot(Seq) 
sum = 0.823286s;
min = 0.0ms;
max = 4.606ms;
mean = 0.032931ms;
stdDev = 0.03774ms; 
Speed: 30366.117 UnitsProcessed/sec 182.1967 MBytes/sec   

INFO: DotProduct Rand.dot(Dense) 
sum = 4.900044s;
min = 0.14ms;
max = 3.969ms;
mean = 0.196001ms;
stdDev = 0.056772ms; 
Speed: 5101.995 UnitsProcessed/sec 30.61197 MBytes/sec
   
INFO: DotProduct Rand.dot(Seq) 
sum = 0.640713s;
min = 0.0ms;
max = 2.253ms;
mean = 0.025628ms;
stdDev = 0.041805ms; 
Speed: 39019.027 UnitsProcessed/sec 234.11417 MBytes/sec 
  
INFO: DotProduct Seq.dot(Dense) 
sum = 0.361031s;
min = 0.0ms;
max = 4.63ms;
mean = 0.014441ms;
stdDev = 0.040413ms; 
Speed: 69246.13 UnitsProcessed/sec 415.47675 MBytes/sec   

INFO: DotProduct Seq.dot(Rand) 
sum = 0.503642s;
min = 0.0090ms;
max = 5.203ms;
mean = 0.020145ms;
stdDev = 0.05134ms; 
Speed: 49638.434 UnitsProcessed/sec 297.8306 MBytes/sec   
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836817#action_12836817
 ] 

Ted Dunning commented on MAHOUT-300:



These are getting respectable!

As a quick hack, the fact that dot is commutative should make it possible to 
get identical results for dense.dot(seq) as for seq.dot(dense).  Likewise for 
dense.dot(rand).

A similar, but less dramatic win might come from rnd.dot(seq) being redone as 
seq.dot(rnd).
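
A minimal sketch of that hack, with hypothetical types rather than the Mahout classes: the dense side hands the call to the sparse operand, so both orders run the same loop over stored entries only. The sparse dot never delegates back, so the call cannot recurse forever.

```java
import java.util.Arrays;

// Hypothetical minimal types for illustration only.
interface SimpleVector {
  double get(int index);
  double dot(SimpleVector other);
}

final class SeqVector implements SimpleVector {
  private final int[] indexes;   // sorted ascending
  private final double[] values; // values[k] belongs to indexes[k]

  SeqVector(int[] indexes, double[] values) {
    this.indexes = indexes;
    this.values = values;
  }

  @Override public double get(int index) {
    int pos = Arrays.binarySearch(indexes, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  // Visits only the stored entries; never calls back into other.dot().
  @Override public double dot(SimpleVector other) {
    double sum = 0.0;
    for (int k = 0; k < indexes.length; k++) {
      sum += values[k] * other.get(indexes[k]);
    }
    return sum;
  }
}

final class DenseArrayVector implements SimpleVector {
  private final double[] values;

  DenseArrayVector(double[] values) { this.values = values; }

  @Override public double get(int index) { return values[index]; }

  @Override public double dot(SimpleVector other) {
    if (other instanceof SeqVector) {
      // Dot is commutative, so let the sparse side drive the loop.
      return other.dot(this);
    }
    double sum = 0.0;
    for (int i = 0; i < values.length; i++) {
      sum += values[i] * other.get(i);
    }
    return sum;
  }
}
```

The same swap would apply to rnd.dot(seq), at the cost of one type check per call.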




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836818#action_12836818
 ] 

Jake Mannix commented on MAHOUT-300:


agreed, Ted.  

I'm liking that we're getting 60-70k units/s on Seq.dot(Den) and Seq.dot(Seq), 
with vectors with 500 nonzero elements.  

Since a dot requires a multiply and an add per nonzero element, this is doing 
60 mflops on my laptop in my IDE, with the browser running, etc.  Not bad.
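
The arithmetic behind that figure, as a small sketch (the class and method names are made up for illustration): dots per second, times non-zero elements per vector, times two floating-point operations per element.

```java
// Hypothetical helper that spells out the MFLOPS estimate.
final class FlopsSketch {
  /** One multiply and one add per non-zero element. */
  static final int FLOPS_PER_ELEMENT = 2;

  static double megaflops(double dotsPerSecond, int nonZeroElements) {
    return dotsPerSecond * nonZeroElements * FLOPS_PER_ELEMENT / 1e6;
  }
}
```

For example, `FlopsSketch.megaflops(60000, 500)` gives 60.0, and the 69,246 units/s reported for Seq.dot(Den) works out to roughly 69 MFLOPS.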




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836819#action_12836819
 ] 

Robin Anil commented on MAHOUT-300:
---

Seq.rand and rand.seq should get the same perf level now with an instanceof 
removed.




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836826#action_12836826
 ] 

Jake Mannix commented on MAHOUT-300:


And now my run (of three comments ago) is finally done; dot product is 
removed since it's already been reported.

This properly demonstrates how slow it is to build up a SeqAcc vector 
incrementally, since it's not random-access, among other things.
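
A rough sketch of why incremental construction is costly, assuming a sequential-access vector keeps its entries in index-sorted parallel arrays. That layout, and every name below, is an assumption for illustration rather than the confirmed Mahout implementation. Each set() at a new index has to shift all later entries one slot to the right.

```java
import java.util.Arrays;

// Hypothetical sorted-array sparse vector; counts how many entries get shifted.
final class SeqAccessSketch {
  private int[] indexes = new int[4];
  private double[] values = new double[4];
  private int count = 0;
  long elementsMoved = 0; // total entries shifted by all insertions so far

  void set(int index, double value) {
    int pos = Arrays.binarySearch(indexes, 0, count, index);
    if (pos >= 0) {          // index already stored: overwrite in place
      values[pos] = value;
      return;
    }
    int insertAt = -(pos + 1);
    if (count == indexes.length) { // grow both arrays when full
      indexes = Arrays.copyOf(indexes, count * 2);
      values = Arrays.copyOf(values, count * 2);
    }
    int tail = count - insertAt;   // entries that must move right by one slot
    System.arraycopy(indexes, insertAt, indexes, insertAt + 1, tail);
    System.arraycopy(values, insertAt, values, insertAt + 1, tail);
    elementsMoved += tail;
    indexes[insertAt] = index;
    values[insertAt] = value;
    count++;
  }

  double get(int index) {
    int pos = Arrays.binarySearch(indexes, 0, count, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  int storedEntries() { return count; }
}
```

Inserting in ascending index order moves nothing, while random-order insertion moves about half of the stored entries per call on average, which would be consistent with the slow incremental-creation figures for the sequential vector.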

{code}
INFO: 
BenchMarks: DenseVector, RandSparseVector, SeqSparseVector

Clone
                  nCalls  sum          min       max          mean         stdDev       Speed           Rate
DenseVector       25000   222.552872s  4.598ms   265.445ms    8.902114ms   11.676773ms  112.33286 /sec  0.6739971 MB/s
RandSparseVector  25000   34.923269s   0.446ms   184.352ms    1.39693ms    4.533406ms   715.8551 /sec   4.2951303 MB/s
SeqSparseVector   25000   34.251326s   0.4ms     182.734ms    1.370053ms   5.002041ms   729.89874 /sec  4.379392 MB/s

Create (copy)
                  nCalls  sum          min       max          mean         stdDev       Speed           Rate
DenseVector       25000   209.506424s  1.427ms   11802.223ms  8.380256ms   27.862112ms  119.32809 /sec  0.7159685 MB/s
RandSparseVector  25000   1.371177s    0.0050ms  21.322ms     0.054847ms   0.324031ms   18232.512 /sec  109.395065 MB/s
SeqSparseVector   25000   0.667553s    0.021ms   10.036ms     0.026702ms   0.130493ms   37450.207 /sec  224.70125 MB/s

Create (incrementally)
                  nCalls  sum          min       max          mean         stdDev       Speed           Rate
DenseVector       25000   0.570172s    0.0ms     4.148ms      0.022806ms   0.060237ms   43846.414 /sec  263.0785 MB/s
RandSparseVector  25000   0.755783s    0.0ms     23.108ms     0.030231ms   0.196128ms   33078.277 /sec  198.46967 MB/s
SeqSparseVector   25000   3.969259s    0.093ms   13.452ms     0.15877ms    0.192234ms   6298.405 /sec   37.79043 MB/s

org.apache.mahout.common.distance.CosineDistanceMeasure
                  nCalls  sum          min       max          mean         stdDev       Speed           Rate
DenseVector       25000   500.69893s   16.147ms  163.619ms    20.027957ms  4.146275ms   49.930202 /sec  0.2995812 MB/s
RandSparseVector  25000   29.026116s   0.896ms   10.819ms     1.161044ms   0.345399ms   861.29333 /sec  5.16776 MB/s
SeqSparseVector   25000   3.367885s    0.086ms   11.731ms     0.134715ms   0.092807ms   7423.056 /sec   44.538334 MB/s

org.apache.mahout.common.distance.EuclideanDistanceMeasure
                  nCalls  sum          min       max          mean         stdDev       Speed
DenseVector       25000   501.080023s  17.011ms  120.138ms    20.0432ms    4.410452ms   49.89223 /sec
RandSparseVector  25000   26.812884s   0.924ms   9.692ms      1.072515ms   0.262769ms   932.3876 /sec
SeqSparseVector   25000   3.649897s    0.086ms   13.113ms     0.145995ms   0.192273ms   [truncated]

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836839#action_12836839
 ] 

Robin Anil commented on MAHOUT-300:
---

{noformat}
seq.seq     = 46,855
rand.seq    = 37,397
seq.dense   = 36,460
seq.rand    = 34,348
dense.seq   = 25,453
rand.rand   = 5,436
dense.rand  = 5,303
rand.dense  = 4,754
dense.dense = 477
{noformat}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836848#action_12836848
 ] 

Robin Anil commented on MAHOUT-300:
---

{noformat}
rand.rand   = 14,435
dense.rand  = 9,172
rand.dense  = 10,578
dense.dense = 477
{noformat}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-22 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836909#action_12836909
 ] 

Jake Mannix commented on MAHOUT-300:


New benchmark additions:

{code}INFO:
BenchMarks  DenseVector       RandSparseVector  SeqSparseVector   Dense.fn(Rand)    Dense.fn(Seq)     Rand.fn(Dense)    Rand.fn(Seq)      Seq.fn(Dense)     Seq.fn(Rand)

Clone
  nCalls    25000             25000             25000
  sum       222.609888s       0.427272s         32.833216s
  min       4.509ms           0.0030ms          0.381ms
  max       205.425ms         17.397ms          164.729ms
  mean      8.904395ms        0.01709ms         1.313328ms
  stdDev    11.839592ms       0.256237ms        4.730696ms
  Speed     112.30409 /sec    58510.74 /sec     761.424 /sec
  Rate      0.6738245 MB/s    351.06442 MB/s    4.568544 MB/s

Create (copy)
  nCalls    25000             25000             25000
  sum       153.385135s       1.316737s         0.654021s
  min       1.291ms           0.0080ms          0.0ms
  max       149.59ms          18.778ms          8.555ms
  mean      6.135405ms        0.052669ms        0.02616ms
  stdDev    9.730283ms        0.276396ms        0.116822ms
  Speed     162.9884 /sec     18986.328 /sec    38225.074 /sec
  Rate      0.9779304 MB/s    113.91796 MB/s    229.35042 MB/s

Create (incrementally)
  nCalls    25000             25000             25000
  sum       0.556807s         1.914268s         4.109328s
  min       0.0ms             0.02ms            0.093ms
  max       2.523ms           184.955ms         16.624ms
  mean      0.022272ms        0.07657ms         0.164373ms
  stdDev    0.038841ms        1.192837ms        0.214126ms
  Speed     44898.863 /sec    13059.822 /sec    6083.72 /sec
  Rate      269.39316 MB/s    78.35893 MB/s     36.50232 MB/s

DotProduct
  nCalls    25000             25000             25000             25000             25000             25000             25000             25000             25000
  sum       48.730579s        1.214007s         0.421372s         2.091561s         0.883674s         2.110771s         0.571964s         0.370673s         0.624421s
  min       1.581ms           0.0040ms          0.0ms             0.036ms           0.0ms             0.033ms           0.018ms           0.0ms             0.019ms
  max       14.217ms          26.558ms          2.628ms           9.386ms           8.269ms           8.159ms           1.525ms           1.674ms           7.62ms
  mean      1.949223ms        0.04856ms         0.016854ms        0.083662ms        0.035346ms        0.08443ms         0.022878ms        0.014826ms        0.024976ms
  stdDev    0.342952ms        0.216698ms        0.028979ms        0.070128ms        0.065883ms        0.064003ms        0.026759ms        0.034967ms        0.059001ms
  Speed     513.0249 /sec     20592.96 /sec     59330.0 /sec

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836370#action_12836370
 ] 

Robin Anil commented on MAHOUT-300:
---

OK. I changed maxValue and maxValueIndex as per your comments. The only 
difference in behaviour is that, for random access vectors, when the maximum is 
zero it could return the index of any zero-valued element, since the entries 
are not ordered.

I have modified dot and minus, two functions frequently used in distance 
measures, and optimised them as follows

{code}
  public double dot(Vector x) {
    if (size() != x.size()) {
      throw new CardinalityException(size(), x.size());
    }
    double result = 0;
    if (this instanceof SequentialAccessSparseVector
        && x instanceof SequentialAccessSparseVector) {
      // For sparse SeqAccVectors. do dot product without lookup in a linear fashion
      Iterator<Element> myIter = iterateNonZero();
      Iterator<Element> otherIter = x.iterateNonZero();
      Element myCurrent = null;
      Element otherCurrent = null;
      while (myIter.hasNext() && otherIter.hasNext()) {
        if (myCurrent == null) myCurrent = myIter.next();
        if (otherCurrent == null) otherCurrent = otherIter.next();

        int myIndex = myCurrent.index();
        int otherIndex = otherCurrent.index();

        if (myIndex < otherIndex) {
          // due to the sparseness skipping occurs more hence checked before equality
          myCurrent = null;
        } else if (myIndex > otherIndex) {
          otherCurrent = null;
        } else { // both are equal
          result += myCurrent.get() * otherCurrent.get();
          myCurrent = null;
          otherCurrent = null;
        }
      }
      return result;
    } else if ((this instanceof RandomAccessSparseVector || this instanceof DenseVector)
        && (x instanceof SequentialAccessSparseVector || x instanceof RandomAccessSparseVector)) {
      // Try to get the speed boost associated fast/normal seq access on x and quick lookup on this
      Iterator<Element> iter = x.iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        result += element.get() * getQuick(element.index());
      }
      return result;
    } else { // TODO: can optimize more based on the numDefaultElements in the vectors
      Iterator<Element> iter = iterateNonZero();
      while (iter.hasNext()) {
        Element element = iter.next();
        result += element.get() * x.getQuick(element.index());
      }
      return result;
    }
  }

  public Vector minus(Vector x) {
    if (size() != x.size()) {
      throw new CardinalityException();
    }
    if (x instanceof RandomAccessSparseVector || x instanceof DenseVector) {
      Vector result = x.clone();
      Iterator<Element> iter = iterateNonZero();
      while (iter.hasNext()) {
        Element e = iter.next();
        result.setQuick(e.index(), result.getQuick(e.index()) - e.get());
      }
      return result;
    } else { // TODO: check the numNonDefault elements to further optimize
      Vector result = clone();
      Iterator<Element> iter = x.iterateNonZero();
      while (iter.hasNext()) {
        Element e = iter.next();
        result.setQuick(e.index(), getQuick(e.index()) - e.get());
      }
      return result;
    }
  }

{code}
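
For reference, the linear merge used in the sequential-access branch can be sketched outside Mahout. This is a minimal illustration, not the patch itself: it assumes each sparse vector is given as two parallel arrays (indices sorted in ascending order, plus values), and the class and method names are made up for the example. Because it walks array positions rather than testing hasNext(), the last element fetched on each side is still compared before the loop ends.

```java
final class SparseDot {
  /**
   * Dot product of two sparse vectors stored as parallel arrays.
   * Assumption: aIdx and bIdx are sorted in ascending order and hold no duplicates.
   */
  static double dot(int[] aIdx, double[] aVal, int[] bIdx, double[] bVal) {
    double result = 0.0;
    int i = 0;
    int j = 0;
    while (i < aIdx.length && j < bIdx.length) {
      if (aIdx[i] < bIdx[j]) {
        i++;                          // a is behind: skip its entry
      } else if (aIdx[i] > bIdx[j]) {
        j++;                          // b is behind: skip its entry
      } else {
        result += aVal[i] * bVal[j];  // same index: contributes to the dot product
        i++;
        j++;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] aIdx = {5};
    double[] aVal = {2.0};
    int[] bIdx = {3, 5};
    double[] bVal = {7.0, 4.0};
    System.out.println(dot(aIdx, aVal, bIdx, bVal)); // prints 8.0
  }
}
```

The cost is proportional to the number of stored entries in the two vectors, independent of their cardinality.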

Based on all these optimisations, here is the before and after picture. Note: 
these are same-implementation benchmarks, i.e. seq.dot(seq) etc.
{noformat}
BenchMarks    DenseVector               RandomAccessSparseVector  SequentialAccessSparseVector

DotProduct
  nCalls      20000                     20000                     20000
  sumTime     0.132436s                 1.354725s                 1.78453s
  minTime     0.0050ms                  0.053ms                   0.083ms
  maxTime     2.996ms                   54.293ms                  8.921ms
  meanTime    0.006621ms                0.067736ms                0.089226ms
  stdDevTime  0.029368ms                0.417954ms                0.078909ms
  Speed       151016.34 /sec            14763.144 /sec            11207.433 /sec
  Rate        1812.1962 MB/s            177.15773 MB/s            134.4892 MB/s

DotProduct
  nCalls      20000                     20000                     20000
  sumTime     0.127648s

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836385#action_12836385
 ] 

Jake Mannix commented on MAHOUT-300:


This output is on the Reuters collection again, or on the dense data in the 
VectorBenchmarks code?  The latter is artificially favoring dense data...




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836411#action_12836411
 ] 

Robin Anil commented on MAHOUT-300:
---

It's on the artificial VectorBenchmarks. On Reuters, I see similar performance 
gains in runtime. To put them into quantifiable values, it is just a matter of 
adding the following combinations to the vector benchmarks:
seq.fn(sparse) sparse.fn(seq) seq.fn(dense) sparse.fn(dense) dense.fn(seq) 
dense.fn(sparse)




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836412#action_12836412
 ] 

Robin Anil commented on MAHOUT-300:
---

Also, please review this and confirm it's fit to commit. I don't want to block 
0.3. I can continue exploring other changes in 0.4.




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836433#action_12836433
 ] 

Ted Dunning commented on MAHOUT-300:


I think that this is a cleaner style for the merge loop.  In particular, the 
average inner loop is much tighter.  The trick is that either iterator can take 
many steps per outer iteration and that whenever either iterator is stepping, 
you only check that iterator, its index and a constant.  In sparse vectors, 
this is a big win.  Even in fairly dense vectors there are just a few extra 
tests which the compiler may well be able to eliminate with common 
sub-expression analysis.  This form has the added benefit of having a simple 
correctness argument.

{noformat}
  Iterator<Element> myIter = iterateNonZero();
  Iterator<Element> otherIter = x.iterateNonZero();
  while (myIter.hasNext() && otherIter.hasNext()) {
    // loop invariant: neither entry is up to date

    // scan to end or equality
    while (myCurrent.index() != otherCurrent.index() &&
      ((myIter.hasNext() && myCurrent.index() < otherCurrent.index()) ||
      (otherIter.hasNext() && otherCurrent.index() < myCurrent.index()))) {
      // invariant: both entries are current

      // catch up my side
      while (myIter.hasNext() && myCurrent.index() < otherCurrent.index()) {
        myCurrent = myIter.next();
      }

      // catch up other side
      while (otherIter.hasNext() && otherCurrent.index() < myCurrent.index()) {
        otherCurrent = otherIter.next();
      }

      // invariant: both entries are current
    }
    // exit: (both entries are current AND equal index) OR one side ran out early

    if (myCurrent.index() == otherCurrent.index()) {
      // if equal, use it
      result += myCurrent.get() * otherCurrent.get();
    }
    // invariant: neither entry is up to date or we will exit loop
  }
{noformat}
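
The catch-up idea can also be tried in isolation. The following is only a sketch under assumptions, not the loop above verbatim: the vectors are modelled as parallel arrays with ascending indices, positions replace the iterators, and the names are invented for illustration. While one side is catching up, only that side's position and index are examined, against a fixed target.

```java
final class CatchUpDot {
  /**
   * Merge-style dot product in which each side may take many steps per outer iteration.
   * Assumption: aIdx and bIdx are sorted in ascending order and hold no duplicates.
   */
  static double dot(int[] aIdx, double[] aVal, int[] bIdx, double[] bVal) {
    double result = 0.0;
    int i = 0;
    int j = 0;
    while (i < aIdx.length && j < bIdx.length) {
      // catch up side a: compare only a's index against a constant target
      int target = bIdx[j];
      while (i < aIdx.length && aIdx[i] < target) {
        i++;
      }
      if (i == aIdx.length) {
        break; // side a ran out early
      }
      // catch up side b against a's current index
      target = aIdx[i];
      while (j < bIdx.length && bIdx[j] < target) {
        j++;
      }
      if (j == bIdx.length) {
        break; // side b ran out early
      }
      // both entries are current; use them only if the indices match
      if (aIdx[i] == bIdx[j]) {
        result += aVal[i] * bVal[j];
        i++;
        j++;
      }
    }
    return result;
  }
}
```

Each outer iteration either consumes a matching pair or leaves side a strictly behind side b, so the next catch-up step always advances and the loop terminates.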




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836449#action_12836449
 ] 

Jake Mannix commented on MAHOUT-300:


Running this on my laptop, with numNonzeroElements = 1000, vector cardinality 
100,000, numVectors = 100, numLoops = 100 (requires -Xmx1g).
{code}
INFO:
BenchMarks    DenseVector               RandomAccessSparseVector  SequentialAccessSparseVector

Clone
  nCalls      10000                     10000                     10000
  sumTime     12.240903s                2.40168s                  2.1353s
  minTime     0.425ms                   0.065ms                   0.065ms
  maxTime     96.625ms                  30.835ms                  25.169ms
  meanTime    1.22409ms                 0.240168ms                0.21353ms
  stdDevTime  3.994235ms                1.468017ms                1.271389ms
  Speed       816.93317 /sec            4163.752 /sec             4683.1826 /sec
  Rate        980.3198 MB/s             4996.503 MB/s             5619.8193 MB/s

Create (copy)
  nCalls      10000                     10000                     10000
  sumTime     13.525425s                1.855115s                 0.662937s
  minTime     0.206ms                   0.088ms                   0.015ms
  maxTime     99.047ms                  25.277ms                  26.974ms
  meanTime    1.352542ms                0.185511ms                0.066293ms
  stdDevTime  5.680836ms                0.718719ms                0.364231ms
  Speed       739.34827 /sec            5390.5015 /sec            15084.389 /sec
  Rate        887.21796 MB/s            6468.6016 MB/s            18101.268 MB/s

Create (incrementally)
  nCalls      10000                     10000                     10000
  sumTime     0.145204s                 0.533273s                 4.398924s
  minTime     0.0ms                     0.014ms                   0.338ms
  maxTime     2.874ms                   2.713ms                   22.045ms
  meanTime    0.01452ms                 0.053327ms                0.439892ms
  stdDevTime  0.037578ms                0.034625ms                0.29153ms
  Speed       68868.625 /sec            18752.121 /sec            2273.2832 /sec
  Rate        82642.35 MB/s             22502.547 MB/s            2727.94 MB/s

DotProduct
  nCalls      10000                     10000                     10000
  sumTime     3.094423s                 1.218703s                 0.378118s
  minTime     0.237ms                   0.07ms                    0.025ms
  maxTime     5.995ms                   20.012ms                  9.925ms
  meanTime    0.309442ms                0.12187ms                 0.037811ms
  stdDevTime  0.095079ms                0.288768ms                0.1183ms
  Speed       3231.62 /sec              8205.444 /sec             26446.77 /sec
  Rate        3877.9443 MB/s            9846.534 MB/s             31736.123 MB/s

org.apache.mahout.common.distance.CosineDistanceMeasure

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836452#action_12836452
 ] 

Ted Dunning commented on MAHOUT-300:


Huh some of those times are a little surprising.

For DotProduct and CosineDistanceMeasure, SequentialAccessSparseVector is 3x 
faster than RandomAccessSparseVector and 8x faster than DenseVector.  There the 
world is good.

But for SquaredEuclideanDistanceMeasure and TanimotoDistanceMeasure, there is 
little difference while for ManhattanDistanceMeasure, 
SequentialAccessSparseVector is slower than RandomAccessSparseVector.

Is it just that for these last 3 distances the sequentiality has not been taken 
into account?

{noformat}
              DenseVector               RandomAccessSparseVector  SequentialAccessSparseVector

DotProduct
  Rate        3877.9443 MB/s            9846.534 MB/s             31736.123 MB/s

org.apache.mahout.common.distance.CosineDistanceMeasure
  Speed       1690.1599 /sec            3366.8774 /sec            12309.282 /sec

org.apache.mahout.common.distance.EuclideanDistanceMeasure
  Speed       2913.8206 /sec            5868.9404 /sec            8209.688 /sec

org.apache.mahout.common.distance.ManhattanDistanceMeasure
  Speed       867.9127 /sec             2435.4307 /sec            1048.7443 /sec

org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
  Speed       3387.1472 /sec            7091.4087 /sec            8785.509 /sec

org.apache.mahout.common.distance.TanimotoDistanceMeasure
  Speed       1803.4031 /sec            3873.8967 /sec            6844.7017 /sec
{noformat}





[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836457#action_12836457
 ] 

Jake Mannix commented on MAHOUT-300:


Interestingly, for SquaredEuclideanDistanceMeasure (and 
EuclideanDistanceMeasure), since vectors are caching their lengthSquareds, the 
only computation going on in the distance measure is a dot() :

{code}
// if this and v has a cached lengthSquared, dot product is quickest way to compute this.
if (lengthSquared >= 0 && v instanceof AbstractVector
    && ((AbstractVector) v).lengthSquared >= 0) {
  return lengthSquared + v.getLengthSquared() - 2 * this.dot(v);
}
{code}

So the time should be nearly exactly the same for all three, in the case where 
it's cached.
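
The identity behind that shortcut is |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, so with both squared lengths cached the remaining work is one dot product. A minimal sketch of the idea follows, assuming a toy dense vector class; CachedNormVector and its methods are hypothetical and are not Mahout's AbstractVector.

```java
final class CachedNormVector {
  private final double[] values;
  // Negative means "not computed yet", mirroring the lengthSquared >= 0 check quoted above.
  private double lengthSquared = -1.0;

  CachedNormVector(double... values) {
    this.values = values.clone();
  }

  double dot(CachedNormVector other) {
    if (values.length != other.values.length) {
      throw new IllegalArgumentException("cardinality mismatch");
    }
    double result = 0.0;
    for (int i = 0; i < values.length; i++) {
      result += values[i] * other.values[i];
    }
    return result;
  }

  double getLengthSquared() {
    if (lengthSquared < 0.0) {
      lengthSquared = dot(this); // computed once, then reused
    }
    return lengthSquared;
  }

  /** Squared Euclidean distance via |a|^2 + |b|^2 - 2 a.b. */
  double getDistanceSquared(CachedNormVector v) {
    return getLengthSquared() + v.getLengthSquared() - 2.0 * dot(v);
  }
}
```

For a = (1, 2, 3) and b = (4, 6, 3) this gives 14 + 61 - 2 * 25 = 25, the same as summing the squared differences (9 + 16 + 0). For nearly identical vectors the subtraction can lose precision and yield a tiny negative value, so a caller taking a square root may want to clamp at zero.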




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836460#action_12836460
 ] 

Jake Mannix commented on MAHOUT-300:


Another run, even more sparse: cardinality: 500,000, density: 100, only 50 
vectors this time, because I was running out of memory, 100 loops:

{code}INFO:
BenchMarks    DenseVector               RandomAccessSparseVector  SequentialAccessSparseVector

Clone
  nCalls      5000                      5000                      5000
  sumTime     41.899077s                6.736452s                 6.268572s
  minTime     4.593ms                   0.453ms                   0.415ms
  maxTime     186.272ms                 183.524ms                 187.254ms
  meanTime    8.379815ms                1.34729ms                 1.253714ms
  stdDevTime  11.260108ms               4.647302ms                4.992136ms
  Speed       119.334366 /sec           742.23047 /sec            797.6298 /sec
  Rate        -514.0831 MB/s            -3197.471 MB/s            -3436.127 MB/s

Create (copy)
  nCalls      5000                      5000                      5000
  sumTime     33.213833s                0.14521s                  0.035935s
  minTime     1.643ms                   0.011ms                   0.0040ms
  maxTime     139.441ms                 18.174ms                  1.431ms
  meanTime    6.642766ms                0.029042ms                0.007187ms
  stdDevTime  11.24469ms                0.349996ms                0.030313ms
  Speed       150.53969 /sec            34432.89 /sec             139140.11 /sec
  Rate        -648.5132 MB/s            -148334.2 MB/s            -599404.75 MB/s

Create (incrementally)
  nCalls      5000                      5000                      5000
  sumTime     0.04538s                  0.035935s                 0.089474s
  minTime     0.0ms                     0.0ms                     0.0ms
  maxTime     0.172ms                   1.201ms                   0.612ms
  meanTime    0.009076ms                0.007187ms                0.017894ms
  stdDevTime  0.006338ms                0.023144ms                0.016724ms
  Speed       110180.695 /sec           139140.11 /sec            55882.156 /sec
  Rate        -474649.8 MB/s            -599404.75 MB/s           -240735.95 MB/s

DotProduct
  nCalls      5000                      5000                      5000
  sumTime     5.036773s                 0.510658s                 0.063884s
  minTime     0.848ms                   0.047ms                   0.0020ms
  maxTime     7.558ms                   8.919ms                   0.325ms
  meanTime    1.007354ms                0.102131ms                0.012776ms
  stdDevTime  0.203069ms                0.19957ms                 0.024669ms
  Speed       992.6991 /sec             9791.289 /sec             78266.86 /sec
  Rate        -4276.47 MB/s             -42180.11 MB/s            -337167.5 MB/s

org.apache.mahout.common.distance.CosineDistanceMeasure

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-21 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836468#action_12836468
 ] 

Jake Mannix commented on MAHOUT-300:


Ok, went away... probably a case of pebkac 

{code}
org.apache.mahout.benchmark.VectorBenchmarks -vs 5 -sp 1000 -nv 50 -l 100 
-no 5
Feb 21, 2010 2:43:06 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (copy) DenseVector 
 
nCalls = 5000;
sumTime = 3.585996s;
minTime = 0.12ms;
maxTime = 54.331ms;
meanTime = 0.717199ms;
stdDevTime = 3.823725ms; 
Speed: 1394.3127 UnitsProcessed/sec 836.58765 MBytes/sec
   
Feb 21, 2010 2:43:07 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (copy) RandomAccessSparseVector 
 
nCalls = 5000;
sumTime = 0.953957s;
minTime = 0.115ms;
maxTime = 97.032ms;
meanTime = 0.190791ms;
stdDevTime = 1.489048ms; 
Speed: 5241.326 UnitsProcessed/sec 3144.796 MBytes/sec  
 
Feb 21, 2010 2:43:08 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (copy) SequentialAccessSparseVector 
 
nCalls = 5000;
sumTime = 0.278149s;
minTime = 0.0090ms;
maxTime = 8.032ms;
meanTime = 0.055629ms;
stdDevTime = 0.149567ms; 
Speed: 17975.977 UnitsProcessed/sec 10785.587 MBytes/sec
   
Feb 21, 2010 2:43:11 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (incrementally) DenseVector 
 
nCalls = 5000;
sumTime = 0.059921s;
minTime = 0.0040ms;
maxTime = 0.251ms;
meanTime = 0.011984ms;
stdDevTime = 0.009846ms; 
Speed: 83443.195 UnitsProcessed/sec 50065.92 MBytes/sec 
  
Feb 21, 2010 2:43:12 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (incrementally) RandomAccessSparseVector 
 
nCalls = 5000;
sumTime = 0.268465s;
minTime = 0.021ms;
maxTime = 3.14ms;
meanTime = 0.053693ms;
stdDevTime = 0.047671ms; 
Speed: 18624.402 UnitsProcessed/sec 11174.642 MBytes/sec
   
Feb 21, 2010 2:43:15 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Create (incrementally) SequentialAccessSparseVector 
 
nCalls = 5000;
sumTime = 2.172348s;
minTime = 0.356ms;
maxTime = 6.754ms;
meanTime = 0.434469ms;
stdDevTime = 0.198236ms; 
Speed: 2301.657 UnitsProcessed/sec 1380.9943 MBytes/sec 
  
Feb 21, 2010 2:43:18 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Clone DenseVector 
 
nCalls = 5000;
sumTime = 3.192205s;
minTime = 0.211ms;
maxTime = 517.68ms;
meanTime = 0.638441ms;
stdDevTime = 7.573187ms; 
Speed: 1566.3154 UnitsProcessed/sec 939.78925 MBytes/sec
   
Feb 21, 2010 2:43:18 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Clone RandomAccessSparseVector 
 
nCalls = 5000;
sumTime = 0.521343s;
minTime = 0.048ms;
maxTime = 22.281ms;
meanTime = 0.104268ms;
stdDevTime = 0.693174ms; 
Speed: 9590.614 UnitsProcessed/sec 5754.369 MBytes/sec  
 
Feb 21, 2010 2:43:19 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Clone SequentialAccessSparseVector 
 
nCalls = 5000;
sumTime = 0.56215s;
minTime = 0.016ms;
maxTime = 20.333ms;
meanTime = 0.11243ms;
stdDevTime = 0.70505ms; 
Speed: 8894.423 UnitsProcessed/sec 5336.654 MBytes/sec  
 
Feb 21, 2010 2:43:20 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: DotProduct DenseVector 
sum = -5423.620877902339  
nCalls = 5000;
sumTime = 1.07412s;
minTime = 0.137ms;
maxTime = 13.189ms;
meanTime = 0.214824ms;
stdDevTime = 0.228647ms; 
Speed: 4654.9736 UnitsProcessed/sec 2792.9841 MBytes/sec
   
Feb 21, 2010 2:43:21 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: DotProduct RandomAccessSparseVector 
sum = -5564.910676620138  
nCalls = 5000;
sumTime = 0.890045s;
minTime = 0.051ms;
maxTime = 316.842ms;
meanTime = 0.178009ms;
stdDevTime = 4.501319ms; 
Speed: 5617.6934 UnitsProcessed/sec 3370.6162 MBytes/sec
   
Feb 21, 2010 2:43:21 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: DotProduct SequentialAccessSparseVector 
sum = -5233.180796033902  
nCalls = 5000;
sumTime = 0.215025s;
minTime = 0.026ms;
maxTime = 1.744ms;
meanTime = 0.043005ms;
stdDevTime = 0.084628ms; 
Speed: 23253.111 UnitsProcessed/sec 13951.867 MBytes/sec
   
Feb 21, 2010 2:43:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: org.apache.mahout.common.distance.CosineDistanceMeasure DenseVector 
minDistance = 4475.855202291762  
nCalls = 5000;
sumTime = 7.735503s;
minTime = 1.368ms;
maxTime = 9.961ms;
meanTime = 1.5471ms;
stdDevTime = 0.198466ms; 
Speed: 646.37036 UnitsProcessed/sec 387.82224 MBytes/sec
   
Feb 21, 2010 2:43:35 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: org.apache.mahout.common.distance.CosineDistanceMeasure 
RandomAccessSparseVector 
minDistance = 4476.713744856023  
nCalls = 5000;
sumTime = 6.548142s;
minTime = 1.175ms;
maxTime = 16.329ms;
meanTime = 1.309628ms;
stdDevTime = 

[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836163#action_12836163
 ] 

Sean Owen commented on MAHOUT-300:
--

Tiny stuff -- in things like dotSelf(), you don't need to call element.get() 
twice. Save its value in a local. Seems trivial until you consider a 
million-element vector dotting itself in a tight loop. Saving billions of 
method calls adds up.

maxValue() -- don't set hasNoElements every loop. Set it once according to the 
initial value of hasNext().

Otherwise looks good and free to commit, especially if it's for 0.3.
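
The first point can be illustrated with a small stand-alone sketch. It is not Mahout code: the Element class below is a hypothetical stand-in that counts its reads, used only to show that saving the value in a local halves the number of get() calls in a dotSelf-style loop.

```java
import java.util.List;

final class DotSelfSketch {
  /** Hypothetical stand-in for a vector element; counts how often get() is called. */
  static final class Element {
    private final double value;
    int reads = 0;

    Element(double value) {
      this.value = value;
    }

    double get() {
      reads++;
      return value;
    }
  }

  /** Sum of squares, reading each element exactly once. */
  static double dotSelf(List<Element> elements) {
    double result = 0.0;
    for (Element element : elements) {
      double value = element.get(); // one call, reused for both factors
      result += value * value;
    }
    return result;
  }
}
```

Whether the JIT would inline a duplicated call anyway depends on the iterator implementation, so the local is the safe choice in a tight loop.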




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-20 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836167#action_12836167
 ] 

Robin Anil commented on MAHOUT-300:
---

I removed the hasNoElements check as per Sean's and Ted's comments. The 
current fix is as follows. See the maxValueIndex implementation. I don't know 
what to do in the edge case of a vector being negative-valued and sparse. We 
could return -1 or the first index holding 0.

{code}
  public double maxValue() {
    double result = Double.NEGATIVE_INFINITY;
    Iterator<Element> iter = this.iterateNonZero();
    while (iter.hasNext()) {
      Element element = iter.next();
      result = Math.max(result, element.get());
    }
    if (getNumNondefaultElements() < size()) return Math.max(result, 0.0);
    return result;
  }

  public int maxValueIndex() {
    int result = -1;
    double max = Double.NEGATIVE_INFINITY;
    Iterator<Element> iter = this.iterateNonZero();
    while (iter.hasNext()) {
      Element element = iter.next();
      double tmp = element.get();
      if (tmp > max) {
        max = tmp;
        result = element.index();
      }
    }
    // if the maxElement is negative and the vector is sparse then any
    // unfilled element(0.0) could be the maxValue hence return -1;
    if (getNumNondefaultElements() < size() && max < 0.0) {
      return -1;
    }
    return result;
  }
{code}




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-20 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836169#action_12836169
 ] 

Robin Anil commented on MAHOUT-300:
---

An issue I found here is with empty dense vectors.

iterateNonZero() is optimised to iterate over non-zero elements only, hence 
the result is still -INF.
getNumNondefaultElements() == size, hence it returns -INF instead of zero. 

I guess I will have to go with a boolean hasNoElements/checkedNoElements based 
solution.
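
The failure described above can be reproduced with a small stand-alone 
sketch. It does not use the Mahout classes: a plain double[] stands in for a 
dense vector, and the class and method names are hypothetical. It assumes, as 
the comment says, that the non-zero iteration skips zeros while the dense 
vector reports its non-default count as its full size.

```java
final class DenseAllZeroSketch {

  // Mimics the posted maxValue() logic on a dense vector (illustrative only).
  static double maxValueAsPosted(double[] dense) {
    double result = Double.NEGATIVE_INFINITY;
    for (double v : dense) {
      // Stand-in for iterateNonZero(): zero entries are never visited.
      if (v != 0.0) {
        result = Math.max(result, v);
      }
    }
    // Assumption: a dense vector reports size() as its non-default count,
    // so the "some element is an implicit zero" guard never fires.
    int numNondefault = dense.length;
    if (numNondefault < dense.length) {
      return Math.max(result, 0.0);
    }
    return result;
  }
}
```

With an all-zero input no element is visited and the guard is skipped, so the 
method returns -INF even though the true maximum is 0.0.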




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-20 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836238#action_12836238
 ] 

Ted Dunning commented on MAHOUT-300:


{quote}
I dont know what to do in the edge case of vector being negative valued and 
sparse. We could return -1 or first index of 0;
{quote}

The rule of thumb is that this should return the same as if you copied the 
vector into any other implementation (such as DenseVector) and did the same 
operation.  Thus 0 is the correct answer.

It may be that someday we will need maxNonZero, but we can do that when it 
comes up.
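
One way to read this rule of thumb, as a sketch only: the sparse 
maxValueIndex should agree with a scan over a dense copy. The code below is 
not the Mahout API. A TreeMap of stored entries plus a size stands in for a 
sparse vector, and all names are hypothetical. It also assumes "0 is the 
correct answer" means the maximum value is 0.0, so the index returned is that 
of the first zero element.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseMaxIndexSketch {

  // Index of the largest element, counting every unstored index as 0.0.
  static int maxValueIndex(TreeMap<Integer, Double> stored, int size) {
    int result = -1;
    double max = Double.NEGATIVE_INFINITY;
    for (Map.Entry<Integer, Double> e : stored.entrySet()) {
      if (e.getValue() > max) {
        max = e.getValue();
        result = e.getKey();
      }
    }
    // Some index is unstored and no stored value exceeds 0.0:
    // the first zero element is the answer a dense scan would give.
    if (stored.size() < size && max <= 0.0) {
      for (int i = 0; i < size; i++) {
        if (stored.getOrDefault(i, 0.0) == 0.0) {
          return i;
        }
      }
    }
    return result;
  }

  // Reference behaviour: copy into a dense array and scan every index.
  static int denseMaxValueIndex(TreeMap<Integer, Double> stored, int size) {
    double[] dense = new double[size];
    for (Map.Entry<Integer, Double> e : stored.entrySet()) {
      dense[e.getKey()] = e.getValue();
    }
    int result = -1;
    double max = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < size; i++) {
      if (dense[i] > max) {
        max = dense[i];
        result = i;
      }
    }
    return result;
  }
}
```

For stored entries {1: -2.0, 3: -5.0} and size 5, both methods return index 0, 
the first implicit zero, rather than -1.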

{quote}
An issue i found here was for empty dense vectors

IterateNonZero() optimises for iterating over non zero elements hence result is 
still -INF
getNumNonDefaultElements() == size hence it returns -INF instead of zero.

I guess i will have to go with a bool hasNoElements/checkedNoElements based 
solution
{quote}

Actually, if size() == 0, I am happy with the result being ill-defined.  
Probably the best course would be to simply throw an IllegalArgumentException 
or something similar to signal that asking for the max of a zero sized vector 
doesn't make a lot of sense. 
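
A minimal sketch of that suggestion, using a plain double[] in place of a 
Mahout vector (the class name and the exception message are hypothetical, and 
IllegalArgumentException is just the candidate named above, not a settled 
choice):

```java
final class EmptyVectorCheckSketch {

  // Rejects zero-sized input up front instead of returning -INF.
  static double maxValue(double[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException(
          "maxValue() is undefined for a vector of size 0");
    }
    double result = Double.NEGATIVE_INFINITY;
    for (double v : values) {
      result = Math.max(result, v);
    }
    return result;
  }
}
```

Failing fast keeps -INF from leaking into later arithmetic, at the cost of 
callers having to guard against empty vectors themselves.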




[jira] Commented: (MAHOUT-300) Solve performance issues with Vector Implementations

2010-02-19 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12836053#action_12836053
 ] 

Ted Dunning commented on MAHOUT-300:


I think that the min and max functions need to check to see if the number of 
non-zero elements is < size.  If so, then the max element is Math.max(0, max) 
and the min element is Math.min(0, min).  This should subsume the case where 
there are no elements so I would just replace hasNoElements with a counter that 
you can then use to compare to the size.
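
The counter idea can be sketched roughly as follows. This is not the Mahout 
code: a Map of stored entries plus a size stands in for a sparse vector, and 
the class and method names are hypothetical.

```java
import java.util.Map;

final class CountedMinMaxSketch {

  // Returns {min, max}, counting every unstored index as 0.0.
  static double[] minMax(Map<Integer, Double> stored, int size) {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    int seen = 0; // replaces the hasNoElements flag
    for (double v : stored.values()) {
      min = Math.min(min, v);
      max = Math.max(max, v);
      seen++;
    }
    // Fewer elements visited than size: at least one element is an
    // implicit 0.0, which also covers the "no elements at all" case.
    if (seen < size) {
      min = Math.min(0.0, min);
      max = Math.max(0.0, max);
    }
    return new double[] {min, max};
  }
}
```

With no stored entries and size 3 this yields {0.0, 0.0}; with entries 
{0: -3.0, 2: -1.0} and size 4 it yields {-3.0, 0.0}.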
