issues found

2009-08-31 Thread Ted Dunning
I just did an exercise of implementing a faster sparse vector.  In the
process, I uncovered a bunch of warts.  I will be filing some Jiras and
patches as soon as I can get to them, but here is a preview:

a) most serialization code for vectors would write vectors with null names
out as if they had empty-string names.  This causes grief in tests and seems
wrong.

b) the Vector/SparseVector hierarchy was oddly split out.  I added a
HashVector and moved the current SparseVector into IntDoubleMappingVector
with SparseVector remaining as an abstract class.  This unfortunately caused
lots of upheaval even up into the Vector class.  I have yet to sort this out
cleanly.

c) the squared distance functions were defined in multiple places.  I
centralized these into SquaredEuclideanDistance as a static function.  These
definitions also were wrong and would ignore any components that had
opposite sign and equal magnitude.  In fixing this, I went ahead and wrote
an implementation that makes use of any sparsity present.  To do that, I
added a method to Vector so that I could tell if a vector is sparse.  This
subsumes all of the distance optimizations that we have discussed.  It also
makes it very easy for new code to use these optimizations without knowing
about them.
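
For illustration, here is a minimal sketch of the sparsity-aware computation
(the Vector shown is a stand-in, not the committed API; isSparse and the
non-zero iterator are the two hooks it relies on):

import java.util.Iterator;

// Hypothetical stand-ins for the real interfaces; only the shape matters.
interface Vector {
  int size();
  double get(int index);
  boolean isSparse();                  // the sparsity test mentioned above
  Iterator<Element> nonZeroIterator();

  interface Element {
    int index();
    double value();
  }
}

final class SquaredEuclideanDistance {
  private SquaredEuclideanDistance() {}

  // |u - v|^2 = |u|^2 + |v|^2 - 2 u.v; only the dot product has to walk
  // the sparser vector, and no matching components are ever skipped.
  static double distanceSquared(Vector u, Vector v) {
    if (u.isSparse() || v.isSparse()) {
      Vector sparse = u.isSparse() ? u : v;
      Vector other = (sparse == u) ? v : u;
      double dot = 0.0;
      Iterator<Vector.Element> it = sparse.nonZeroIterator();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        dot += e.value() * other.get(e.index());
      }
      return normSquared(u) + normSquared(v) - 2.0 * dot;
    }
    double sum = 0.0;                  // dense fallback
    for (int i = 0; i < u.size(); i++) {
      double d = u.get(i) - v.get(i);
      sum += d * d;
    }
    return sum;
  }

  private static double normSquared(Vector x) {
    double sum = 0.0;
    Iterator<Vector.Element> it = x.nonZeroIterator();
    while (it.hasNext()) {
      double value = it.next().value();
      sum += value * value;
    }
    return sum;
  }
}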

d) the nonZeroIterator had to be substantially refactored to work in an
abstract class without knowing too much about the internals of everything.

e) in order to make sure that all metrics worked reasonably with all types,
I have heavily refactored the testing structure for metrics.  I will be
doing more of this as well.  The goal is to test all vector types against
all distance metrics to make sure that they give the same results as
DenseVector.  Then, I will build tests for DenseVector to verify that it
produces the correct result.  This is important because so many vectors have
special code to help metric computation.
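
Concretely, the cross-check boils down to something like this (VectorFactory
and DistanceMeasure are placeholder shapes, not the actual test classes):

import java.util.List;
import java.util.Random;

final class MetricCrossCheck {
  // Placeholder shapes; the real types live in the Mahout tree.
  interface Vector { int size(); double get(int i); }
  interface VectorFactory { Vector copyOf(double[] data); }
  interface DistanceMeasure { double distance(Vector a, Vector b); }

  // Every (vector type, metric) pair must agree with DenseVector,
  // represented here by a plain array-backed vector.
  static void crossCheck(List<VectorFactory> types,
                         List<DistanceMeasure> metrics) {
    Random rng = new Random(42);
    double[] a = randomData(rng, 100);
    double[] b = randomData(rng, 100);
    for (DistanceMeasure m : metrics) {
      double expected = m.distance(dense(a), dense(b));
      for (VectorFactory type : types) {
        double actual = m.distance(type.copyOf(a), type.copyOf(b));
        if (Math.abs(actual - expected) > 1.0e-9) {
          throw new AssertionError(m + " disagrees with the dense result");
        }
      }
    }
  }

  private static double[] randomData(Random rng, int n) {
    double[] data = new double[n];
    for (int i = 0; i < n; i++) {
      // keep some zeros so the sparse code paths get exercised too
      data[i] = (i % 3 == 0) ? 0.0 : rng.nextDouble();
    }
    return data;
  }

  private static Vector dense(final double[] data) {
    return new Vector() {
      public int size() { return data.length; }
      public double get(int i) { return data[i]; }
    };
  }
}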

f) I have uncovered some strangeness in the ARFF I/O code that I introduced
with the SparseVector abstraction.  The old code will work as it did, but it
won't understand the new HashVector.


Soo

I will be trying to break this down into pieces as small as I can, but the
total will be a bunch of interdependent patches.  If anybody can help me
apply these as quickly as possible, we should have minimal problems with it
all.  If it drags out, it will get hard to keep rebasing the patch sequence
all the time.



-- 
Ted Dunning, CTO
DeepDyve


[jira] Updated: (MAHOUT-145) PartialData mapreduce Random Forests

2009-08-31 Thread Deneche A. Hakim (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-145:


Attachment: partial_August_31.patch

* Corrected some bugs in the new code when testing in a pseudo-distributed 
cluster

 PartialData mapreduce Random Forests
 

 Key: MAHOUT-145
 URL: https://issues.apache.org/jira/browse/MAHOUT-145
 Project: Mahout
  Issue Type: New Feature
  Components: Classification
Reporter: Deneche A. Hakim
Priority: Minor
 Attachments: partial_August_10.patch, partial_August_13.patch, 
 partial_August_15.patch, partial_August_17.patch, partial_August_19.patch, 
 partial_August_2.patch, partial_August_24.patch, partial_August_27.patch, 
 partial_August_31.patch, partial_August_9.patch


 This implementation is based on a suggestion by Ted:
 modify the original algorithm to build multiple trees for different portions 
 of the data. That loses some of the solidity of the original method, but 
 could actually do better if the splits exposed non-stationary behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: issues found

2009-08-31 Thread Grant Ingersoll



On Aug 31, 2009, at 2:55 AM, Ted Dunning wrote:


I just did an exercise of implementing a faster sparse vector.  In the
process, I uncovered a bunch of warts.  I will be filing some Jiras and
patches as soon as I can get to them, but here is a preview:

a) most serialization code for vectors would write vectors with null names
out as if they had empty-string names.  This causes grief in tests and seems
wrong.

b) the Vector/SparseVector hierarchy was oddly split out.  I added a
HashVector and moved the current SparseVector into IntDoubleMappingVector
with SparseVector remaining as an abstract class.  This unfortunately caused
lots of upheaval even up into the Vector class.  I have yet to sort this out
cleanly.

c) the squared distance functions were defined in multiple places.  I
centralized these into SquaredEuclideanDistance as a static function.  These
definitions also were wrong and would ignore any components that had
opposite sign and equal magnitude.  In fixing this, I went ahead and wrote
an implementation that makes use of any sparsity present.  To do that, I
added a method to Vector so that I could tell if a vector is sparse.  This
subsumes all of the distance optimizations that we have discussed.  It also
makes it very easy for new code to use these optimizations without knowing
about them.

d) the nonZeroIterator had to be substantially refactored to work in an
abstract class without knowing too much about the internals of everything.

e) in order to make sure that all metrics worked reasonably with all types,
I have heavily refactored the testing structure for metrics.  I will be
doing more of this as well.  The goal is to test all vector types against
all distance metrics to make sure that they give the same results as
DenseVector.  Then, I will build tests for DenseVector to verify that it
produces the correct result.  This is important because so many vectors have
special code to help metric computation.

f) I have uncovered some strangeness in the ARFF I/O code that I introduced
with the SparseVector abstraction.  The old code will work as it did, but it
won't understand the new HashVector.


Soo

I will be trying to break this down into pieces as small as I can, but the
total will be a bunch of interdependent patches.  If anybody can help me
apply these as quickly as possible, we should have minimal problems with it
all.  If it drags out, it will get hard to keep rebasing the patch sequence
all the time.



These all sound reasonable.  If you're happy w/ the changes, just put up a
big patch on an issue, give it a few days to percolate and then commit.










--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



[jira] Updated: (MAHOUT-157) Frequent Pattern Mining using Parallel FP-Growth

2009-08-31 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-157:
--

Attachment: MAHOUT-157-August-31.patch

Performance Improvements in sequential version

 Frequent Pattern Mining using Parallel FP-Growth
 

 Key: MAHOUT-157
 URL: https://issues.apache.org/jira/browse/MAHOUT-157
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
Reporter: Robin Anil
 Fix For: 0.2

 Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch, 
 MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch, 
 MAHOUT-157-Combinations-BSD-License.patch, 
 MAHOUT-157-Combinations-BSD-License.patch, 
 MAHOUT-157-inProgress-August-5.patch


 Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Request for brainstorming: how to represent item-item diffs efficiently

2009-08-31 Thread Ted Dunning
On Mon, Aug 31, 2009 at 8:29 AM, Sean Owen sro...@gmail.com wrote:

 And I think it is possible to do away with the two-level data structure
 organized by row, then column.  I think.  Again, I am thinking of ways to
 access efficiently by row without necessarily paying the overhead of
 storing by row.


If this is for off-line storage, then you may not need to store by row ...
just store in a combined store with an xor of row and column and sort it all
out later.
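
For instance, a minimal sketch of a combined key, assuming non-negative
32-bit row and column ids (packing rather than a plain xor, since packing
keeps both ids recoverable):

// Pack the row and column ids into one long so a single flat map or a
// sorted long[] can replace the two-level row-then-column structure.
final class PairKey {
  private PairKey() {}

  static long key(int row, int column) {
    return ((long) row << 32) | (column & 0xFFFFFFFFL);
  }

  static int row(long key) {
    return (int) (key >>> 32);
  }

  static int column(long key) {
    return (int) key;
  }
}

Sorting the packed keys groups entries by row and then by column, so
row-ordered access falls out of a single sort after the fact.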

-- 
Ted Dunning, CTO
DeepDyve


Re: About Java Code Performance Profiling

2009-08-31 Thread Thilo Goetz
Robin Anil wrote:
 Please suggest some good performance profilers for Java.
 There is the Eclipse Test and Performance Platform, which for some reason
 is screwed up with Eclipse Galileo.
 Then there is JProbe, which has great reviews, but it's not free.
 YourKit also has great reviews, and there is an option to get a free
 license (for the open-source community) provided we reference YourKit on
 our website:
 http://www.yourkit.com/purchase/index.jsp

You can get a YourKit license for work on your Apache
project.  If I remember right, your PMC chair should
contact them with a list of committers (with email
addresses) who would like to have a license.  Each
committer on the list will then receive their license
key individually.  I'm pretty sure there's no requirement
to list them on your home page (which I'm not sure the
ASF would allow, anyway).

I know of more than one Apache project that's using
YourKit, so this should not be a problem.

--Thilo

 
 Discussions and Suggestions welcome
 
 
 Robin
 


Re: Request for brainstorming: how to represent item-item diffs efficiently

2009-08-31 Thread Ted Dunning
Not really.  With off-line use, you can amortize the cost of sorting across
all items.  This allows very tight encodings in some cases.  In on-line
(real-time, actually) use you have the problem that you can't amortize
anything because you have a worst-case constraint.

Essentially, this is the difference between hadoop and OLTP relational
databases.

On Mon, Aug 31, 2009 at 10:34 AM, Sean Owen sro...@gmail.com wrote:

 This is definitely for on-line use. Though I imagine strategies for
 efficient storage pertain well to both the off- and on-line case.




-- 
Ted Dunning, CTO
DeepDyve


[jira] Created: (MAHOUT-168) Need integer compression routines

2009-08-31 Thread Ted Dunning (JIRA)
Need integer compression routines
-

 Key: MAHOUT-168
 URL: https://issues.apache.org/jira/browse/MAHOUT-168
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Ted Dunning


A selection of these algorithms would be very nice to have:

www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Request for brainstorming: how to represent item-item diffs efficiently

2009-08-31 Thread Ted Dunning
Ignoring the diff for the moment, you will probably win with some coding for
the actual counts.  My vision would be a row-coded complex matrix.  Each row
would have a list of id's, a list of counts, and a list of diffs.

If you have your ids coded so that the most commonly rated items come first,
then simple integer compression should help on the list of id's.  The counts
are already heavily skewed to small numbers so a bit-variable representation
should win big on that side.  The diffs are probably pretty degenerate as
well.  If you look at the distribution of values you will either have a
small number of very common values (that can be encoded with small integers)
or a pretty tight distribution (that can be encoded by picking ranges at the
level of accuracy you want, encoding the range as an integer and then
Huffman coding the integers).

The canonical reference for compressing integers is Williams and Zobel,
http://www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

I can try to work up an example implementation for these codes, but my time
this week is kind of tight(er).
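
As a taste, though, here is a hedged sketch of the byte-variable code in the
style of that paper; the class and method names are made up for illustration:

import java.io.ByteArrayOutputStream;

// Seven payload bits per byte, least-significant bits first; the high
// bit marks the final byte of each value.  Small, common values cost a
// single byte, which is why skewed counts compress so well.
final class VariableByte {
  private VariableByte() {}

  static void encode(int value, ByteArrayOutputStream out) {
    // assumes value >= 0
    while (value >= 0x80) {
      out.write(value & 0x7F);       // more bytes follow
      value >>>= 7;
    }
    out.write(value | 0x80);         // high bit set: last byte
  }

  static int decode(byte[] in, int[] pos) {
    int value = 0;
    int shift = 0;
    while (true) {
      int b = in[pos[0]++] & 0xFF;
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) != 0) {
        return value;                // terminator byte seen
      }
      shift += 7;
    }
  }
}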

On Mon, Aug 31, 2009 at 11:01 AM, Sean Owen sro...@gmail.com wrote:

 I believe I will proceed with investigating
 a structure that maps a long (really two ints together) to a count and to a
 total diff.




-- 
Ted Dunning, CTO
DeepDyve


Re: About Java Code Performance Profiling

2009-08-31 Thread Sean Owen
I use JProfiler 5 just because I am used to it and ended up buying a
license. Works well. I imagine free solutions work fine too.

On Aug 31, 2009 4:39 PM, Robin Anil robin.a...@gmail.com wrote:

Please suggest some good performance profilers for Java.
There is the Eclipse Test and Performance Platform, which for some reason
is screwed up with Eclipse Galileo.
Then there is JProbe, which has great reviews, but it's not free.
YourKit also has great reviews, and there is an option to get a free
license (for the open-source community) provided we reference YourKit on
our website:
http://www.yourkit.com/purchase/index.jsp

Discussions and Suggestions welcome


Robin


Re: About Java Code Performance Profiling

2009-08-31 Thread Grant Ingersoll
YourKit has an open source license option that asks you to mail them and
tell them what project you work on.  I've used both YourKit and JProfiler
and both are quite good.  I've also used the NetBeans one, which can now
plug into IntelliJ too, but I don't think it is as good as YourKit.


Over time, we will want to develop some performance benchmarking tools,
too, similar to Lucene's benchmark contrib, as having some standard ways to
talk about performance and share settings is crucial for this stuff in open
source.


On Aug 31, 2009, at 8:39 AM, Robin Anil wrote:


Please suggest some good performance profilers for Java.
There is the Eclipse Test and Performance Platform, which for some reason
is screwed up with Eclipse Galileo.
Then there is JProbe, which has great reviews, but it's not free.
YourKit also has great reviews, and there is an option to get a free
license (for the open-source community) provided we reference YourKit on
our website:
http://www.yourkit.com/purchase/index.jsp

Discussions and Suggestions welcome


Robin




[jira] Commented: (MAHOUT-168) Need integer compression routines

2009-08-31 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749703#action_12749703
 ] 

Grant Ingersoll commented on MAHOUT-168:


Can we leverage some of Lucene's capabilities here?  It doesn't have all the 
algorithms mentioned, but does have delta encodings.

 Need integer compression routines
 -

 Key: MAHOUT-168
 URL: https://issues.apache.org/jira/browse/MAHOUT-168
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Ted Dunning

 A selection of these algorithms would be very nice to have:
 www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAHOUT-168) Need integer compression routines

2009-08-31 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning reassigned MAHOUT-168:
--

Assignee: Ted Dunning

 Need integer compression routines
 -

 Key: MAHOUT-168
 URL: https://issues.apache.org/jira/browse/MAHOUT-168
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Ted Dunning
Assignee: Ted Dunning

 A selection of these algorithms would be very nice to have:
 www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-168) Need integer compression routines

2009-08-31 Thread Ted Dunning (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Dunning updated MAHOUT-168:
---

Status: Patch Available  (was: Open)

First version.  100% test coverage except for BitInputStream at 95%.

Needs review for API and obvious speed improvements.



 Need integer compression routines
 -

 Key: MAHOUT-168
 URL: https://issues.apache.org/jira/browse/MAHOUT-168
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Ted Dunning
Assignee: Ted Dunning

 A selection of these algorithms would be very nice to have:
 www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-168) Need integer compression routines

2009-08-31 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749721#action_12749721
 ] 

Ted Dunning commented on MAHOUT-168:



I considered that, but it seemed easier to write a version from scratch.
Somebody should look at the Lucene code to see if they did anything
earth-shakingly clever for speed.

A delta code, by the way, includes gamma and unary codes, so it is complete
enough to be interesting.

It should be very little work to add Golomb and byte-variable codes to this.
Of the two, only the byte-variable code seems like a good candidate, because
it can be made faster than these other forms pretty easily.
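
For anyone unfamiliar with the gamma code, a toy rendering with bit strings
(sidestepping the patch's bit stream classes, whose API isn't shown here):

// Elias gamma: a value n >= 1 is written as floor(log2 n) zeros (a
// unary length prefix) followed by the binary form of n itself.
final class EliasGamma {
  private EliasGamma() {}

  static String encode(int n) {      // requires n >= 1
    String binary = Integer.toBinaryString(n);
    StringBuilder bits = new StringBuilder();
    for (int i = 1; i < binary.length(); i++) {
      bits.append('0');              // one zero per bit after the first
    }
    return bits.append(binary).toString();
  }

  static int decode(String bits) {
    int zeros = bits.indexOf('1');   // length of the unary prefix
    return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
  }
}

// encode(1) = "1", encode(2) = "010", encode(5) = "00101"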

 

 Need integer compression routines
 -

 Key: MAHOUT-168
 URL: https://issues.apache.org/jira/browse/MAHOUT-168
 Project: Mahout
  Issue Type: Improvement
  Components: Matrix
Reporter: Ted Dunning
Assignee: Ted Dunning

 A selection of these algorithms would be very nice to have:
 www.cs.rmit.edu.au/~jz/fulltext/compjour99.pdf

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.