Speed up Frequent Compile

2010-02-05 Thread Robin Anil
When developing mahout core/util/examples we don't need to generate math
often and don't need to tar/gzip/bzip2 the jar files. We are mostly concerned
with the job file / jar file.
Can't there be another target, like "develop", which does just this? (Waiting 2-3 mins
for a 2-line change is frustrating.)

Robin


[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-237:
--

Attachment: MAHOUT-237-tfidf.patch

4 Main Entry points
DocumentProcessor - does SequenceFile => StringTuple (later replaced by
StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => Tf Vector
PartialVectorMerger - merges partial vectors based on their doc id. Does
optional normalizing (used by both DictionaryVectorizer (no normalizing) and
TFIDFConverter (optional normalizing))
TfidfConverter - converts a tf vector to a tfidf vector with optional normalizing

An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o
reuters-vectors -w (tfidf|tf) --norm 2
(--norm works only when tfidf is enabled, not with tf)

 Map/Reduce Implementation of Document Vectorizer
 

 Key: MAHOUT-237
 URL: https://issues.apache.org/jira/browse/MAHOUT-237
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, 
 SparseVector-VIntWritable.patch


 Current Vectorizer uses a Lucene Index to convert documents into SparseVectors.
 Ted is working on a Hash based Vectorizer which can map features into Vectors
 of fixed size and sum them up to get the document Vector.
 This is a pure bag-of-words based Vectorizer written in Map/Reduce.
 The input documents are in a SequenceFile<Text, Text>, with key = docid, value =
 content.
 First: Map/Reduce over the document collection and generate the feature counts.
 Second: a sequential pass reads the output of the map/reduce and converts it
 to SequenceFile<Text, LongWritable> where key = feature, value = unique id.
 This second stage should create shards of features of a given split size.
 Third: Map/Reduce over the document collection, using each shard, and create
 partial SparseVectors (containing only the features of the given shard).
 Fourth: Map/Reduce over the partial shards, group by docid, and create the full
 document Vector.
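
 As a rough illustration of the first pass (not the attached patch's code; the class name TermCountMapper and the whitespace tokenization are made up for the sketch), a feature-counting mapper over the SequenceFile<Text, Text> input could look like the following, with a summing reducer producing the global feature counts:
{code}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical first-pass mapper: emits (term, 1) for every token in a document.
public class TermCountMapper extends Mapper<Text, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text term = new Text();

  @Override
  protected void map(Text docId, Text content, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokenizer = new StringTokenizer(content.toString());
    while (tokenizer.hasMoreTokens()) {
      term.set(tokenizer.nextToken());
      context.write(term, ONE); // a LongWritable-summing reducer yields the feature counts
    }
  }
}
{code}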

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Mahout 0.3 Plan and other changes

2010-02-05 Thread Robin Anil
I am committing the first level of changes so that Drew can work on it. I have
updated the patch on the issue as a reference. Ted, please take a look when
you get time. The names will change correspondingly.

What I have right now is

4 Main Entry points
DocumentProcessor - does SequenceFile => StringTuple (later replaced by
StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - StringTuple of documents => Tf Vector
PartialVectorMerger - merges partial vectors based on their doc id. Does
optional normalizing (used by both DictionaryVectorizer (no normalizing) and
TFIDFConverter (optional normalizing))
TfidfConverter - converts a tf vector to a tfidf vector with optional
normalizing

An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o
reuters-vectors -w (tfidf|tf) --norm 2 (--norm works only with tfidf for now)

Robin


On Fri, Feb 5, 2010 at 12:46 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Drew has an early code drop that should be posted shortly.  He has a
 generic
 AvroWritable that can serialize anything with an appropriate schema.  That
 changes your names and philosophy a bit.

 Regarding n-grams, I think that will be best combined with a non-dictionary
 based vectorizer because of the large implied vocabulary that would
 otherwise result.  Also, in many cases vectorization and n-gram generation
 is best done in the learning algorithm itself to avoid moving massive
 amounts of data.  As such, vectorization will probably need to be a library
 rather than a map-reduce program.


 On Thu, Feb 4, 2010 at 7:49 PM, Robin Anil robin.a...@gmail.com wrote:

  Lets break it down into milestones. See if you agree on the
 following(even
  ClassNames ?)
 
  On Fri, Feb 5, 2010 at 12:27 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
 
   These are good questions.  I see the best course as answering these
 kinds
   of
   questions in phases.
  
   First, the only thing that is working right now is the current text =>
   vector stuff.  We should continue to refine this with alternative forms
  of
   vectorization (random indexing, stochastic projection as well as the
   current
   dictionary approach).
  
   The input to all these vectorization jobs is the StructuredDocumentWritable
   format, which you and Drew will work on (Avro based).
  
   To create the StructuredDocumentWritable format we have to write MapReduces
   which will convert:
   a) SequenceFile => SingleField token array using Analyzer.
      I am going with simple Document => StructuredDocumentWritable (encapsulating StringTuple) in M1.
      Change it to StructuredDocumentWritable( in M2
   b) Lucene Repo => StructuredDocumentWritable   M2
   c) Structured XML => StructuredDocumentWritable   M2
   d) Other Formats/DataSources (RDBMS) => StructuredDocumentWritable   M3
 
  Jobs using StructuredDocumentWritable:
  a) DictionaryVectorizer - makes VectorWritable   M1
  b) nGram Generator - makes ngrams -
     1) Appends to the dictionary - creates partial vectors - merges with
        vectors from DictionaryVectorizer to create ngram based vectors   M1
     2) Appends to other vectorizers (random indexing, stochastic)   M1? or M2
  c) Random Indexing Job - makes VectorWritable   M1? or M2
  d) StochasticProjection Job - makes VectorWritable   M1? or M2
 
 
  How does this sound ? Feel free to edit/reorder them
 
 
 
  A second step is to be able to store and represent more general documents
   similar to what is possible with Lucene.  This is critically important
  for
   some of the things that I want to do where I need to store and
 segregate
   title, publisher, authors, abstracts and body text (and many other
   characteristics ... we probably have 100 of them).  It is also
  critically
   important if we want to embrace the dualism between recommendation and
   search.  Representing documents can be done without discarding the
  simpler
   approach we have now and it can be done in advance of good
 vectorization
  of
   these complex documents.
  
   A third step is to define advanced vectorization for complex documents.
   As
   an interim step, we can simply vectorize using the dictionary and
   alternative vectorizers that we have now, but applied to a single field
  of
   the document.  Shortly, though, we should be able to define cross
   occurrence
   features for a multi-field vectorization.
  
   The only dependencies here are that the third step depends on the first
  and
   second.
  
   You have been working on the Dictionary vectorizer.  I did a bit of
 work
  on
   stochastic projection with some cooccurrence.
  
   In parallel Drew and I have been working on building an Avro document
   schema.  This is driving forward on step 2.  I think that will actually
   bear
   some fruit quickly.  Once that is done, we should merge capabilities.
  I
  am
  

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-237:
--

Status: Patch Available  (was: Reopened)

Working implementation of DictionaryVectorizer with tf and tfidf weighting and
normalization.

 Map/Reduce Implementation of Document Vectorizer
 

 Key: MAHOUT-237
 URL: https://issues.apache.org/jira/browse/MAHOUT-237
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, 
 SparseVector-VIntWritable.patch


 Current Vectorizer uses a Lucene Index to convert documents into SparseVectors.
 Ted is working on a Hash based Vectorizer which can map features into Vectors
 of fixed size and sum them up to get the document Vector.
 This is a pure bag-of-words based Vectorizer written in Map/Reduce.
 The input documents are in a SequenceFile<Text, Text>, with key = docid, value =
 content.
 First: Map/Reduce over the document collection and generate the feature counts.
 Second: a sequential pass reads the output of the map/reduce and converts it
 to SequenceFile<Text, LongWritable> where key = feature, value = unique id.
 This second stage should create shards of features of a given split size.
 Third: Map/Reduce over the document collection, using each shard, and create
 partial SparseVectors (containing only the features of the given shard).
 Fourth: Map/Reduce over the partial shards, group by docid, and create the full
 document Vector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-05 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-237:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

 Map/Reduce Implementation of Document Vectorizer
 

 Key: MAHOUT-237
 URL: https://issues.apache.org/jira/browse/MAHOUT-237
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, DictionaryVectorizer.patch, 
 DictionaryVectorizer.patch, MAHOUT-237-tfidf.patch, MAHOUT-237-tfidf.patch, 
 SparseVector-VIntWritable.patch


 Current Vectorizer uses a Lucene Index to convert documents into SparseVectors.
 Ted is working on a Hash based Vectorizer which can map features into Vectors
 of fixed size and sum them up to get the document Vector.
 This is a pure bag-of-words based Vectorizer written in Map/Reduce.
 The input documents are in a SequenceFile<Text, Text>, with key = docid, value =
 content.
 First: Map/Reduce over the document collection and generate the feature counts.
 Second: a sequential pass reads the output of the map/reduce and converts it
 to SequenceFile<Text, LongWritable> where key = feature, value = unique id.
 This second stage should create shards of features of a given split size.
 Third: Map/Reduce over the document collection, using each shard, and create
 partial SparseVectors (containing only the features of the given shard).
 Fourth: Map/Reduce over the partial shards, group by docid, and create the full
 document Vector.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-220) Mahout Bayes Code cleanup

2010-02-05 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-220.
---

Resolution: Fixed

Committed. 

 Mahout Bayes Code cleanup
 -

 Key: MAHOUT-220
 URL: https://issues.apache.org/jira/browse/MAHOUT-220
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-BAYES.patch, MAHOUT-BAYES.patch


 Following isabel's checkstyle, I am adding a whole slew of code cleanup with 
 the following exceptions
 1.  Line length used is 120 instead of 80. 
 2.  static final log is kept as is. not LOG. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (MAHOUT-221) Implementation of FP-Bonsai Pruning for fast pattern mining

2010-02-05 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil resolved MAHOUT-221.
---

Resolution: Fixed

Committed 

 Implementation of FP-Bonsai Pruning for fast pattern mining
 ---

 Key: MAHOUT-221
 URL: https://issues.apache.org/jira/browse/MAHOUT-221
 Project: Mahout
  Issue Type: New Feature
  Components: Frequent Itemset/Association Rule Mining
Affects Versions: 0.2
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-FPGROWTH.patch, MAHOUT-FPGROWTH.patch


 FP-Bonsai is a method to prune long-chained FP-Trees for faster growth.
 http://win.ua.ac.be/~adrem/bibrem/pubs/fpbonsai.pdf
 This implementation also adds a transaction preprocessing map/reduce job
 which converts a list of transactions {1, 2, 4, 5}, {1, 2, 3}, {1, 2} into a
 tree structure and thus saves space during the fpgrowth map/reduce.
 The tree formed from the above is:
 (1,3) -> (2,3) -> (4,1) -> (5,1)
               \-> (3,1)
 For typical datasets this improves the storage space by a great amount and thus
 saves time during shuffle and sort.
 Also added a reducer to PFPGrowth (not part of the original paper) which does
 this compression and saves on space.
 This patch also adds an example transaction dataset generator for the flickr and
 delicious data sets:
 https://www.uni-koblenz.de/FB4/Institutes/IFI/AGStaab/Research/DataSets/PINTSExperimentsDataSets/
 Both of them are gigabytes of tag data, where (date, userid, itemid, tag) records are given.
 The example maker creates a transaction based on all the unique tags a user
 has tagged on an item.
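
 As a rough illustration of this compression idea (plain Java, not the actual PFPGrowth code; the class and method names below are made up, and items are assumed to already be in a consistent sort order):
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: insert transactions into a shared prefix tree so common
// prefixes like {1, 2} are stored (and shuffled) only once, as (item, count) nodes.
public class PrefixTreeSketch {

  static class Node {
    int count;
    final Map<Integer, Node> children = new HashMap<Integer, Node>();
  }

  private final Node root = new Node();

  public void insert(int[] transaction) {
    Node current = root;
    for (int item : transaction) {
      Node child = current.children.get(item);
      if (child == null) {
        child = new Node();
        current.children.put(item, child);
      }
      child.count++;          // each node holds an (item, count) pair
      current = child;
    }
  }

  public static void main(String[] args) {
    PrefixTreeSketch tree = new PrefixTreeSketch();
    tree.insert(new int[] {1, 2, 4, 5});
    tree.insert(new int[] {1, 2, 3});
    tree.insert(new int[] {1, 2});
    // yields (1,3) -> (2,3) with children (4,1) -> (5,1) and (3,1), as in the example above
  }
}
{code}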
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-02-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830056#action_12830056
 ] 

Robin Anil commented on MAHOUT-153:
---

Any progress on this? Will it be ready soon, or should it be pushed to the 0.4
release?

 Implement kmeans++ for initial cluster selection in kmeans
 --

 Key: MAHOUT-153
 URL: https://issues.apache.org/jira/browse/MAHOUT-153
 Project: Mahout
  Issue Type: New Feature
  Components: Clustering
Affects Versions: 0.2
 Environment: OS Independent
Reporter: Panagiotis Papadimitriou
 Fix For: 0.3

   Original Estimate: 336h
  Remaining Estimate: 336h

 The current implementation of k-means includes the following algorithms for 
 initial cluster selection (seed selection): 1) random selection of k points, 
 2) use of canopy clusters.
 I plan to implement k-means++. The details of the algorithm are available 
 here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
 Design Outline: I will create an abstract class SeedGenerator and a subclass 
 KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
 become a subclass of SeedGenerator.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Release thinking

2010-02-05 Thread Robin Anil
Reviving this thread. Copy-pasting the whole thing as we move forward.

Current Snapshot

Key         Summary                                                             Status
 MAHOUT-221  Implementation of FP-Bonsai Pruning for fast pattern mining        Done
 MAHOUT-227  Parallel SVM                                                        In Progress
 MAHOUT-240  Parallel version of Perceptron                                      Little Progress
 MAHOUT-241  Example for perceptron                                              Little Progress
 MAHOUT-185  Add mahout shell script for easy launching of various algorithms   In Progress
 MAHOUT-153  Implement kmeans++ for initial cluster selection in kmeans          Little Progress (there is discussion, but no patch yet)
 MAHOUT-232  Implementation of sequential SVM solver based on Pegasos            In Progress
 MAHOUT-228  Need sequential logistic regression implementation using SGD techniques   In Progress
 MAHOUT-263  Matrix interface should extend Iterable<Vector> for better integration with distributed storage   Done
 MAHOUT-237  Map/Reduce Implementation of Document Vectorizer                    Done
 MAHOUT-220  Mahout Bayes Code cleanup                                           Done
 MAHOUT-265  Error with creating MVC from Lucene Index or Arff                   Done
 MAHOUT-215  Provide jars with mahout release.                                   Done
 MAHOUT-209  Add aggregate() methods for Vector                                  Done
 MAHOUT-231  Upgrade QM reports to use Clover 2.6                                Little Progress (not really required for the release; developer thing)
 MAHOUT-106  PLSI/EM in pig based on Hofmann's ACM 04 paper.                     In Progress
 MAHOUT-155  ARFF VectorIterable                                                 Little Progress
 MAHOUT-214  Implement Stacked RBM                                               Little Progress




[jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12830077#action_12830077
 ] 

Robin Anil commented on MAHOUT-185:
---

I like the script as I am running k-means these days :)
{code}
if [ $COMMAND = vectordump ] ; then
  CLASS=org.apache.mahout.utils.vectors.VectorDumper
elif [ $COMMAND = clusterdump ] ; then
  CLASS=org.apache.mahout.utils.clustering.ClusterDumper
elif [ $COMMAND = seqdump ] ; then
  CLASS=org.apache.mahout.utils.SequenceFileDumper
elif [ $COMMAND = kmeans ] ; then
  CLASS=org.apache.mahout.clustering.kmeans.KMeansDriver
elif [ $COMMAND = canopy ] ; then
  CLASS=org.apache.mahout.clustering.canopy.CanopyDriver
elif [ $COMMAND = lucenevector ]; then
  CLASS=org.apache.mahout.utils.vectors.lucene.Driver
elif [ $COMMAND = seqdirectory ]; then
  CLASS=org.apache.mahout.text.SequenceFilesFromDirectory
elif [ $COMMAND = seqwiki ]; then
  CLASS=org.apache.mahout.text.WikipediaToSequenceFile
{code}

If we go like this we might have too many options. Any way to streamline this?

One thought I have is to have package-level Main classes in core, like
org.apache.mahout.Clustering.java, which internally call the different main
functions.
Similarly, in examples and util we can keep one entry class each: Examples.java
and Util.java.

So with this limited set we can keep a global conf object which implements Tool,
and the fs object which is the default filesystem as specified by the conf.
This way each algorithm can request a conf object (which copies everything Tool
has set).
How does that sound? I can whip up all the main classes tonight.
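
A rough sketch of what such a package-level entry class could look like (purely illustrative; the command names and the reflection-based dispatch are just one way to do it, not a committed design):
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical package-level entry point: dispatches a command name to the
// main() of the corresponding driver class via reflection.
public final class Clustering {

  private static final Map<String, Class<?>> DRIVERS = new HashMap<String, Class<?>>();
  static {
    DRIVERS.put("kmeans", org.apache.mahout.clustering.kmeans.KMeansDriver.class);
    DRIVERS.put("canopy", org.apache.mahout.clustering.canopy.CanopyDriver.class);
  }

  private Clustering() { }

  public static void main(String[] args) throws Exception {
    if (args.length == 0 || !DRIVERS.containsKey(args[0])) {
      System.err.println("Usage: Clustering " + DRIVERS.keySet() + " [OPTIONS]");
      return;
    }
    String[] rest = new String[args.length - 1];
    System.arraycopy(args, 1, rest, 0, rest.length);
    // hand the remaining arguments to the chosen driver's main()
    DRIVERS.get(args[0]).getMethod("main", String[].class).invoke(null, (Object) rest);
  }
}
{code}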











 Add mahout shell script for easy launching of various algorithms
 

 Key: MAHOUT-185
 URL: https://issues.apache.org/jira/browse/MAHOUT-185
 Project: Mahout
  Issue Type: New Feature
Affects Versions: 0.2
 Environment: linux, bash
Reporter: Robin Anil
 Fix For: 0.3

 Attachments: MAHOUT-185.patch


 Currently, each algorithm has a different point of entry, and as it is, it's too
 complicated to understand and launch each one.  A mahout shell script needs
 to be made in the bin directory which does something like the following:
 mahout classify -algorithm bayes [OPTIONS]
 mahout cluster -algorithm canopy  [OPTIONS]
 mahout fpm -algorithm pfpgrowth [OPTIONS]
 mahout taste -algorithm slopeone [OPTIONS] 
 mahout misc -algorithm createVectorsFromText [OPTIONS]
 mahout examples WikipediaExample

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Grant Ingersoll
One thought on these lines is that we should start the process to be a TLP, 
then we could have a subproject explicitly dedicated to C++ (or any other 
language) and there wouldn't necessarily need to be a 1-1 port.

-Grant

On Feb 5, 2010, at 12:56 AM, Kay Kay wrote:

 If there were an effort to write in C++, it would definitely be useful, and
 to exploit the maximum advantages, porting would be more beneficial over time
 compared to a wrapper, even if it were to apply only to a subset of the algorithms
 supported by Mahout.  A wrapper would serve the syntactic purpose, but when it
 comes to profiling / performance extraction it would be a huge distraction.
 
 But, as has been pointed out earlier, the algorithms depend on the M-R framework
 very much, and hence the success of this effort would also be tied to the
 Hadoop C/C++ port's maturity as well. Something worth noting before venturing
 along these lines.
 
 
 
 On 02/04/2010 09:22 AM, Atul Kulkarni wrote:
 Hey guys,
 
 My 1 cent...
 
 I would be really happy to contribute to this task of enabling use of Mahout
 via C++ (Wrapper / Port either way). I have some experience with C++ and
 have been wanting to use mahout via C++ (as that is my comfort zone compared
 to Java.).
 
 I think port will give the code directly in the hands of the C++ developers,
 which sounds really exciting to me as a C++ developer. But I also understand
 the concern of maintaining two different code bases for the same task, and
 hence also like the idea of writing wrappers. So I am divided on the two
 options, either works for me.
 
 Regards,
 Atul.
 
 On Thu, Feb 4, 2010 at 10:54 AM, Robin Anilrobin.a...@gmail.com  wrote:
 
   
  Hi Israel. I think it's a wonderful idea to have ports of Mahout; it tells us
  that we have a great platform that people really want to use. The only
  concern is that Hadoop is still in Java and they are not going with C++. They
  work around it by using native libraries to execute cpu-intensive tasks like
  sorting and compressing, the reason being that Java is much easier to manage
  in such a distributed system (I guess a lot of people may differ in opinion).
  
  Regardless, I guess wrappers could be made to ease execution of Mahout
  algorithms from any language. If that's a solution you like, then folks here
  can concentrate on improving just one code base.
 
 Robin
 
 On Thu, Feb 4, 2010 at 10:08 PM, Israel Ekpoisraele...@gmail.com  wrote:
 
 
 Hey guys,
 
 First of all I would like to start by thanking all the commiters and
 contributors for all their hard work so far on this project.
 
 Most importantly, I want to thank the Apache Mahout community for
   
 bringing
 
 this very promising project to where it is now.
 
 It's pretty amazing to see what the project has accomplished in a short
 span
 of 2 years.
 
 I strongly believe that Apache Mahout is really going to change things
 around for the data mining and machine learning community the same way
 Apache Lucene and Apache Solr is taking over this sector as we speak.
 
  Currently Apache Mahout is only available in Java, and there are a lot of
  tools in Mahout that are very useful, and a lot of people (students,
  instructors, researchers and computer scientists) are using them daily.
 
  I think it would be nice if all of these tools in Mahout were also available
  in C++ so that users that already have systems written in C++ can plug in and
  integrate Mahout a lot more easily with their existing or planned C++ systems.
 
 If we have the C++ port up and running possibly more members of the data
 mining and machine learning community could get involved and ideas could
   
 be
 
 shuffled in both directions (Java and C++ port)
 
 I will volunteer to spearhead this porting effort to get things started.
 
 I am sending this message to all members of the Apache Mahout community
   
 on
 
 what you think can should be done to get this porting effort up and
 running.
 
  Thanks in advance for your constructive and anticipated responses.
 
 Sincerely,
 Israel Ekpo
 
 --
 Good Enough is not good enough.
 To give anything less than your best is to sacrifice the gift.
 Quality First. Measure Twice. Cut Once.
 http://www.israelekpo.com/
 
   
 
 
 
   
 



Re: Release thinking

2010-02-05 Thread Ted Dunning
I just marked the 0.1 and 0.2 releases as released (about time).  This makes
the JIRA road map feature more usable.

See here for the live version of this summary:
https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel

On Fri, Feb 5, 2010 at 3:16 AM, Robin Anil robin.a...@gmail.com wrote:

 Reviving this thread. Copy paste the whole thing as we move forward

 Current Snapshot

 Key Summary
  MAHOUT-221  Implementation of FP-Bonsai Pruning for fast pattern
 mining
 Done
  MAHOUT-227  Parallel SVM   In Progress
  MAHOUT-240  Parallel version of Perceptron   Little Progress
  MAHOUT-241  Example for perceptron Little Progress
  MAHOUT-185  Add mahout shell script for easy launching of various
  algorithms   In Progress
  MAHOUT-153  Implement kmeans++ for initial cluster selection in
  kmeansLittle Progress  (There is discussion, but no patch yet)
  MAHOUT-232  Implementation of sequential SVM solver based on Pegasos
In
  Progress
  MAHOUT-228  Need sequential logistic regression implementation using
  SGD techniques In Progress

 MAHOUT-263  Matrix interface should extend IterableVector for better
  integration with distributed storage   Done
  MAHOUT-237  Map/Reduce Implementation of Document Vectorizer   Done
  MAHOUT-220  Mahout Bayes Code cleanup Done

 MAHOUT-265  Error with creating MVC from Lucene Index or Arff Done
  MAHOUT-215  Provide jars with mahout release. Done
  MAHOUT-209  Add aggregate() methods for Vector Done
  MAHOUT-231  Upgrade QM reports to use Clover 2.6Little Progress
 Not
  that required in the release(developer thing)
   MAHOUT-106  PLSI/EM in pig based on hofmann's ACM 04 paper.In
  Progress
  MAHOUT-155  ARFF VectorIterable  Little Progress
  MAHOUT-214  Implement Stacked RBM Little Progress
 
 




-- 
Ted Dunning, CTO
DeepDyve


Re: [jira] Commented: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-02-05 Thread Ted Dunning
Surely there is a clever way to use annotations for this.  Not that I know
what it might be.

On Fri, Feb 5, 2010 at 4:05 AM, Robin Anil (JIRA) j...@apache.org wrote:

 If we go like this we might have too many options. Any way to streamline
 this ?

 One thought i have is to have package level Main classes in Core like
 org.apache.mahout.Clustering.java which internally calls the different main
 functions ?




-- 
Ted Dunning, CTO
DeepDyve


Re: Release thinking

2010-02-05 Thread Robin Anil
Yum Yum.

0.1 59 issues
0.2 66 issues
0.3 91 issues - 13 left





On Fri, Feb 5, 2010 at 9:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I just marked the 0.1 and 0.2 releases as released (about time).  This
 makes
 the JIRA road map feature more usable.

 See here for the live version of this summary:

 https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel

 On Fri, Feb 5, 2010 at 3:16 AM, Robin Anil robin.a...@gmail.com wrote:

  Reviving this thread. Copy paste the whole thing as we move forward
 
  Current Snapshot
 
  Key Summary
   MAHOUT-221  Implementation of FP-Bonsai Pruning for fast pattern
  mining
  Done
   MAHOUT-227  Parallel SVM   In Progress
   MAHOUT-240  Parallel version of Perceptron   Little Progress
   MAHOUT-241  Example for perceptron Little Progress
   MAHOUT-185  Add mahout shell script for easy launching of various
   algorithms   In Progress
   MAHOUT-153  Implement kmeans++ for initial cluster selection in
   kmeansLittle Progress  (There is discussion, but no patch yet)
   MAHOUT-232  Implementation of sequential SVM solver based on
 Pegasos
 In
   Progress
   MAHOUT-228  Need sequential logistic regression implementation
 using
   SGD techniques In Progress
 
  MAHOUT-263  Matrix interface should extend IterableVector for
 better
   integration with distributed storage   Done
   MAHOUT-237  Map/Reduce Implementation of Document Vectorizer   Done
   MAHOUT-220  Mahout Bayes Code cleanup Done
 
  MAHOUT-265  Error with creating MVC from Lucene Index or Arff
 Done
   MAHOUT-215  Provide jars with mahout release. Done
   MAHOUT-209  Add aggregate() methods for Vector Done
   MAHOUT-231  Upgrade QM reports to use Clover 2.6Little Progress
  Not
   that required in the release(developer thing)
MAHOUT-106  PLSI/EM in pig based on hofmann's ACM 04 paper.In
   Progress
   MAHOUT-155  ARFF VectorIterable  Little Progress
   MAHOUT-214  Implement Stacked RBM Little Progress
  
  
 



 --
 Ted Dunning, CTO
 DeepDyve



[jira] Created: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)
Use avro for serialization of structured documents.
---

 Key: MAHOUT-274
 URL: https://issues.apache.org/jira/browse/MAHOUT-274
 Project: Mahout
  Issue Type: Improvement
Reporter: Drew Farris
Priority: Minor


Explore the intersection between Writables and Avro to see how serialization 
can be improved within Mahout. 

An intermediate goal is to provide a structured document format that can be
serialized using Avro as an Input/OutputFormat and Writable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-05 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: mahout-avro-examples.tar.gz

Very rudimentary exploration of using avro to produce writables.

Uses the avro specific java class generation facility to produce a structured 
document class which is wrapped in a generic writable container for 
serialization.
 
* classes in o.a.m.avro are produced from the schema in
src/main/schemata/o../a../m../avro/AvroDocument.avsc using
o.a.m.avro.util.AvroDocumentCompiler
* provides a generic avro Writable implementation in 
o.a.m.avro.mapred.SpecificAvroWritable
* see the test in src/test/java o.a.m.avro.mapred.SpecificAvroWritableTest to 
see how this can be used 

'mvn clean install' will run the whole shebang.

 Use avro for serialization of structured documents.
 ---

 Key: MAHOUT-274
 URL: https://issues.apache.org/jira/browse/MAHOUT-274
 Project: Mahout
  Issue Type: Improvement
Reporter: Drew Farris
Priority: Minor
 Attachments: mahout-avro-examples.tar.gz


 Explore the intersection between Writables and Avro to see how serialization 
 can be improved within Mahout. 
 An intermediate goal is to provide a structured document format that can be
 serialized using Avro as an Input/OutputFormat and Writable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:17 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I just marked the 0.1 and 0.2 releases as released (about time).  This makes
 the JIRA road map feature more usable.

 See here for the live version of this summary:
 https://issues.apache.org/jira/browse/MAHOUT?report=com.atlassian.jira.plugin.system.project:roadmap-panel


Very nice, thanks Ted.


Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote:
 When developing mahout core/util/examples we dont need to generate math
 often and dont need to tar gzip bzip2 the jar files. We are mostly concerned
 with the job file/ jar file.
 Cant there be another target like develop which does this. (waiting 2-3 mins
 for a 2 line change is frustrating)

Indeed.

Robin, how are you doing your builds? I could have sworn I eliminated
the building of tar, gzip, bzip2 files unless the -Prelease flag is
specified.


Re: Speed up Frequent Compile

2010-02-05 Thread Ted Dunning
I usually do an initial compilation using mvn package.  Then, during
development I use IntelliJ's incremental compilation which generally only
takes a few seconds.  Since that compilation doesn't handle things like
copying resources, I get caught out and surprised now and again, but this
works almost all the time.

On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com wrote:

 When developing mahout core/util/examples we dont need to generate math
 often and dont need to tar gzip bzip2 the jar files. We are mostly
 concerned
 with the job file/ jar file.
 Cant there be another target like develop which does this. (waiting 2-3
 mins
 for a 2 line change is frustrating)

 Robin




-- 
Ted Dunning, CTO
DeepDyve


Re: Release thinking

2010-02-05 Thread Ted Dunning
Makes a lot of sense.  Drew?

On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote:

 So are we really planning on all this structured document stuff and Avro
 for
 0.3?  Can we just try and finish up what was already scoped for 0.3 and
 have
 a quick turnaround for getting things which have only been really started
 worked on in the past week or so for 0.4 sometime next month?




-- 
Ted Dunning, CTO
DeepDyve


Re: Release thinking

2010-02-05 Thread Jake Mannix
On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote:

 So are we really planning on all this structured document stuff and Avro
 for 0.3?  Can we just try and finish up what was already scoped for 0.3 and
 have a quick turnaround for getting things which have only been really
 started worked on in the past week or so for 0.4 sometime next month?


Which is not to say that we shouldn't continue work on them, let's keep the
patches going and up to date, let's just not worry about holding up 0.3
until they're fully tested and checked in.

  -jake


Re: Release thinking

2010-02-05 Thread Drew Farris
Sounds great to me.

On Fri, Feb 5, 2010 at 11:50 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 Makes a lot of sense.  Drew?

 On Fri, Feb 5, 2010 at 8:48 AM, Jake Mannix jake.man...@gmail.com wrote:

 So are we really planning on all this structured document stuff and Avro
 for
 0.3?  Can we just try and finish up what was already scoped for 0.3 and
 have
 a quick turnaround for getting things which have only been really started
 worked on in the past week or so for 0.4 sometime next month?




 --
 Ted Dunning, CTO
 DeepDyve



Re: Release thinking

2010-02-05 Thread Drew Farris
On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix jake.man...@gmail.com wrote:


 Which is not to say that we shouldn't continue work on them, let's keep the
 patches going and up to date, let's just not worry about holding up 0.3
 until they're fully tested and checked in.

Yes absolutely. I'm also interested in hearing Robin's thoughts on how
far the current document vectorizer, n-gram work should go for 0.3

Drew


Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
mvn install to generate the job: around 2-3 mins; it generates the bz2, zip and
gz archives.
mvn compile otherwise: 33 secs, of which 15 secs are spent compiling math.


On Fri, Feb 5, 2010 at 10:18 PM, Drew Farris drew.far...@gmail.com wrote:

 On Fri, Feb 5, 2010 at 3:27 AM, Robin Anil robin.a...@gmail.com wrote:
  When developing mahout core/util/examples we dont need to generate math
  often and dont need to tar gzip bzip2 the jar files. We are mostly
 concerned
  with the job file/ jar file.
  Cant there be another target like develop which does this. (waiting 2-3
 mins
  for a 2 line change is frustrating)

 Indeed.

 Robin, how are you doing your builds? I could have sworn I eliminated
 the building of tar, gzip, bzip2 files unless the -Prelease flag is
 specified.



Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
Yes, for editing I use Eclipse in the same fashion. If I want to try out a
job and see how it performs on Hadoop, I need the job compiled fast.

On another note, I think there will be a lot of dead code in the job (with
all the jar files bundled). Is there an optimiser for that, i.e. to remove
classes which Mahout never uses, even indirectly?

I see that loading the jar takes 10-20 seconds when initializing a mapper or reducer.
It doesn't affect long-running jobs, but a 20 sec overhead for processing a 64MB
chunk sucks.

On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I usually do an initial compilation using mvn package.  Then, during
 development I use IntelliJ's incremental compilation which generally only
 takes a few seconds.  Since that compilation doesn't handle things like
 copying resources, I get caught out and surprised now and again, but this
 works almost all the time.

 On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com wrote:

  When developing mahout core/util/examples we dont need to generate math
  often and dont need to tar gzip bzip2 the jar files. We are mostly
  concerned
  with the job file/ jar file.
  Cant there be another target like develop which does this. (waiting 2-3
  mins
  for a 2 line change is frustrating)
 
  Robin
 



 --
 Ted Dunning, CTO
 DeepDyve



Re: Release thinking

2010-02-05 Thread Robin Anil
I just updated it here.

http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html

Let's rename/refactor the classes and get the basic Avro thing in for 0.3, so
that people who use it get a smooth upgrade to 0.4.

Robin

On Fri, Feb 5, 2010 at 10:32 PM, Drew Farris drew.far...@gmail.com wrote:

 On Fri, Feb 5, 2010 at 11:53 AM, Jake Mannix jake.man...@gmail.com
 wrote:

 
  Which is not to say that we shouldn't continue work on them, let's keep
 the
  patches going and up to date, let's just not worry about holding up 0.3
  until they're fully tested and checked in.

 Yes absolutely. I'm also interested in hearing Robin's thoughts on how
 far the current document vectorizer, n-gram work should go for 0.3

 Drew



[jira] Updated: (MAHOUT-272) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-272:
-

Resolution: Fixed
  Assignee: Drew Farris
Status: Resolved  (was: Patch Available)

 Add licenses for 3rd party jars to mahout binary release and remove 
 additional unused dependencies.
 ---

 Key: MAHOUT-272
 URL: https://issues.apache.org/jira/browse/MAHOUT-272
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Drew Farris
Assignee: Drew Farris
 Fix For: 0.3

 Attachments: MAHOUT-272.patch


 The binary release produced by MAHOUT-215 includes some 3rd party jars that 
 require licenses and other 3rd party jars (xpp3 + xstream) that are not 
 required at all (eclipse core, a transitive dependency of hadoop, jfreechart 
 a transitive dependency of watchmaker-swing).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Speed up Frequent Compile

2010-02-05 Thread Drew Farris
So, I'm running: mvn -o install -DskipTests=true at project root (in mahout)

Comment out or remove the maven-assembly-plugin definition in
core/pom.xml -- it reduced my core build time from 26s to 6s -- I can
submit a patch for this.

Mahout math is still 17s here due to code generation. I'm wondering if
there's a way to modify the generation plugin so that it doesn't
re-generate if there are no changes to the templates. You can remove
the plugin definition from math/pom.xml and it doesn't seem to break
anything unless you're doing a clean. Brings down math compilation to
3s without it. Total compile time is 22s.

re: the job, I'll have to look into that further later.

On Fri, Feb 5, 2010 at 12:06 PM, Robin Anil robin.a...@gmail.com wrote:
 Yes for editing i use eclipse in the same fashion. If i want to try out a
 job and see how it performs on hadoop I need job compiled fast.

 On another note. I think there will be a lot of dead code in the job(with
 all the jar files bundles) Is there an optimiser for that i.e to remove
 classes which mahout never use indirectly

 I see loading jar takes 10-20 seconds when initializing mapper or reducer.
 It doesnt affect long running jobs but 20 sec overhead for processing 64MB
 chunk sucks

 On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I usually do an initial compilation using mvn package.  Then, during
 development I use IntelliJ's incremental compilation which generally only
 takes a few seconds.  Since that compilation doesn't handle things like
 copying resources, I get caught out and surprised now and again, but this
 works almost all the time.

 On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com wrote:

  When developing mahout core/util/examples we dont need to generate math
  often and dont need to tar gzip bzip2 the jar files. We are mostly
  concerned
  with the job file/ jar file.
  Cant there be another target like develop which does this. (waiting 2-3
  mins
  for a 2 line change is frustrating)
 
  Robin
 



 --
 Ted Dunning, CTO
 DeepDyve




Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Thanks everyone for your responses so far.

The Apache Hadoop dependency was something I thought about initially but I
still went ahead to ask the question anyways.

At this time, it would be a better use of resources and time to come up with
a wrapper or HTTP server/client set up of some sort.

My reasoning behind this is the Hadoop dependency and the volatile nature of
the API, as pointed out by Sean and Robin.

Thanks again for all your responses.

On Thu, Feb 4, 2010 at 12:22 PM, Atul Kulkarni atulskulka...@gmail.comwrote:

 Hey guys,

 My 1 cent...

 I would be really happy to contribute to this task of enabling use of
 Mahout
 via C++ (Wrapper / Port either way). I have some experience with C++ and
 have been wanting to use mahout via C++ (as that is my comfort zone
 compared
 to Java.).

 I think port will give the code directly in the hands of the C++
 developers,
 which sounds really exciting to me as a C++ developer. But I also
 understand
 the concern of maintaining two different code bases for the same task, and
 hence also like the idea of writing wrappers. So I am divided on the two
 options, either works for me.

 Regards,
 Atul.

 On Thu, Feb 4, 2010 at 10:54 AM, Robin Anil robin.a...@gmail.com wrote:

  Hi Israel. I think its a wonderful idea to have ports of mahout, it tells
  us
  that we have a great platform with people really want to use. The only
  concern is Hadoop is still in Java and they are not going with C++. They
  work around it by using native libraries to execute cpu intensive tasks
  like
  sorting and compressing. The reason being that Java is much easier to
  manage
  in such a distributed system(i guess lot of people may differ in
 opinion).
 
  Regardless, I guess wrappers could be made to ease execution of mahout
  algorithms from any language. If thats a solution you like then folks
 here
  can concentrate on improving just one code base.
 
  Robin
 
  On Thu, Feb 4, 2010 at 10:08 PM, Israel Ekpo israele...@gmail.com
 wrote:
 
   Hey guys,
  
   First of all I would like to start by thanking all the commiters and
   contributors for all their hard work so far on this project.
  
   Most importantly, I want to thank the Apache Mahout community for
  bringing
   this very promising project to where it is now.
  
   It's pretty amazing to see what the project has accomplished in a short
   span
   of 2 years.
  
   I strongly believe that Apache Mahout is really going to change things
   around for the data mining and machine learning community the same way
   Apache Lucene and Apache Solr is taking over this sector as we speak.
  
   Currently Apache Mahout is only available in Java and there are a lot
 of
   tools in Mahout that is very useful and a lot of people (students,
   instructors, researchers and computer scientists are using it daily).
  
   I think it would be nice if all of these tools in Mahout were also
   available
   in C++ so that users that already have systems written in C++ can plug
 in
   an
   integrate Mahout a lot easier with their existing or planned C++
 systems.
  
   If we have the C++ port up and running possibly more members of the
 data
   mining and machine learning community could get involved and ideas
 could
  be
   shuffled in both directions (Java and C++ port)
  
   I will volunteer to spearhead this porting effort to get things
 started.
  
   I am sending this message to all members of the Apache Mahout community
  on
   what you think can should be done to get this porting effort up and
   running.
  
   Thanks in advance for you constructive and anticipated responses.
  
   Sincerely,
   Israel Ekpo
  
   --
   Good Enough is not good enough.
   To give anything less than your best is to sacrifice the gift.
   Quality First. Measure Twice. Cut Once.
   http://www.israelekpo.com/
  
 



 --
 Regards,
 Atul Kulkarni
 www.d.umn.edu/~kulka053 http://www.d.umn.edu/%7Ekulka053




-- 
Good Enough is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Yes, the codegen could drop a timestamp file. It's a fair amount of
work, and if we're killing this code for HPCC I'm dubious.

If I could make the split work I could do this next.


On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com wrote:
 So, I'm running: mvn -o install -DskipTests=true at project root (in mahout)

 Comment out or remove the maven-assembly-plugin definition in
 core/pom.xml -- it reduced my core build time from 26s to 6s -- I can
 submit a patch for this.

 Mahout math is still 17s here due to code generation. I'm wondering if
 there's a way to modify the generation plugin to that it doesn't
 re-generate if there are no changes to the templates. You can remove
 the plugin definition from math/pom.xml and it doesn't seem to break
 anything unless you're doing a clean. Brings down math compilation to
 3s without it. Total compile time is 22s.

 re: the job, I'll have to look into that further later.

 On Fri, Feb 5, 2010 at 12:06 PM, Robin Anil robin.a...@gmail.com wrote:
 Yes for editing i use eclipse in the same fashion. If i want to try out a
 job and see how it performs on hadoop I need job compiled fast.

 On another note. I think there will be a lot of dead code in the job(with
 all the jar files bundles) Is there an optimiser for that i.e to remove
 classes which mahout never use indirectly

 I see loading jar takes 10-20 seconds when initializing mapper or reducer.
 It doesnt affect long running jobs but 20 sec overhead for processing 64MB
 chunk sucks

 On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I usually do an initial compilation using mvn package.  Then, during
 development I use IntelliJ's incremental compilation which generally only
 takes a few seconds.  Since that compilation doesn't handle things like
 copying resources, I get caught out and surprised now and again, but this
 works almost all the time.

 On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com wrote:

  When developing mahout core/util/examples we dont need to generate math
  often and dont need to tar gzip bzip2 the jar files. We are mostly
  concerned
  with the job file/ jar file.
  Cant there be another target like develop which does this. (waiting 2-3
  mins
  for a 2 line change is frustrating)
 
  Robin
 



 --
 Ted Dunning, CTO
 DeepDyve





Re: Proposing a C++ Port for Apache Mahout

2010-02-05 Thread Israel Ekpo
Grant,

Would the TLP be Mahout or under a different name?

I also like the idea that it does not necessarily have to be a 1:1 port.

Kay Kay,

I've changed my mind (about going the wrapper route); I think it would be nice to
explore the possibilities with just a subset of the algorithms.

That would be a good place to start.

I will be in touch

On Feb 5, 2010, at 03:23 PM, Grant Ingersoll wrote:

One thought on these lines is that we should start the process to be a TLP,
then we could have
a subproject explicitly dedicated to C++ (or any other language) and there
wouldn't necessarily
need to be a 1-1 port.

-Grant

On Feb 5, 2010, at 12:56 AM, Kay Kay wrote:

If there were an effort to write in C++ , it would definitely be useful and
to exploit
the maximum advantages, porting would be more beneficial over time compared
to the wrapper,
even if it were to apply to a subset of algorithms supported by Mahout.
Wrapper, would serve
the syntactic purpose, but when it comes to profiling / performance
extraction would be a
huge distraction then.

 But, as been pointed earlier - the algorithm depends on the M-R framework
very much and
hence , the success of this effort would also be tied to the Hadoop C/C++
port's maturity
as well. Something worth noting before venturing along these lines.


On Fri, Feb 5, 2010 at 3:41 PM, Israel Ekpo israele...@gmail.com wrote:

 Thanks everyone for your responses so far.

 The Apache Hadoop dependency was something I thought about initially but I
 still went ahead to ask the question anyways.

 At this time, it would be a better use of resources and time to come up
 with a wrapper or HTTP server/client set up of some sort.

 My reasoning behind this is because of the Hadoop dependency and the
 volatile nature of the API as pointed out by Sean and Robin

 Thanks again for all your responses.


 On Thu, Feb 4, 2010 at 12:22 PM, Atul Kulkarni atulskulka...@gmail.comwrote:

 Hey guys,

 My 1 cent...

 I would be really happy to contribute to this task of enabling use of
 Mahout
 via C++ (Wrapper / Port either way). I have some experience with C++ and
 have been wanting to use mahout via C++ (as that is my comfort zone
 compared
 to Java.).

 I think port will give the code directly in the hands of the C++
 developers,
 which sounds really exciting to me as a C++ developer. But I also
 understand
 the concern of maintaining two different code bases for the same task, and
 hence also like the idea of writing wrappers. So I am divided on the two
 options, either works for me.

 Regards,
 Atul.

 On Thu, Feb 4, 2010 at 10:54 AM, Robin Anil robin.a...@gmail.com wrote:

  Hi Israel. I think its a wonderful idea to have ports of mahout, it
 tells
  us
  that we have a great platform with people really want to use. The only
  concern is Hadoop is still in Java and they are not going with C++. They
  work around it by using native libraries to execute cpu intensive tasks
  like
  sorting and compressing. The reason being that Java is much easier to
  manage
  in such a distributed system(i guess lot of people may differ in
 opinion).
 
  Regardless, I guess wrappers could be made to ease execution of mahout
  algorithms from any language. If thats a solution you like then folks
 here
  can concentrate on improving just one code base.
 
  Robin
 
  On Thu, Feb 4, 2010 at 10:08 PM, Israel Ekpo israele...@gmail.com
 wrote:
 
   Hey guys,
  
   First of all I would like to start by thanking all the commiters and
   contributors for all their hard work so far on this project.
  
   Most importantly, I want to thank the Apache Mahout community for
  bringing
   this very promising project to where it is now.
  
   It's pretty amazing to see what the project has accomplished in a
 short
   span
   of 2 years.
  
   I strongly believe that Apache Mahout is really going to change things
   around for the data mining and machine learning community the same way
   Apache Lucene and Apache Solr is taking over this sector as we speak.
  
   Currently Apache Mahout is only available in Java and there are a lot
 of
   tools in Mahout that is very useful and a lot of people (students,
   instructors, researchers and computer scientists are using it daily).
  
   I think it would be nice if all of these tools in Mahout were also
   available
   in C++ so that users that already have systems written in C++ can plug
 in
   an
   integrate Mahout a lot easier with their existing or planned C++
 systems.
  
   If we have the C++ port up and running possibly more members of the
 data
   mining and machine learning community could get involved and ideas
 could
  be
   shuffled in both directions (Java and C++ port)
  
   I will volunteer to spearhead this porting effort to get things
 started.
  
   I am sending this message to all members of the Apache Mahout
 community
  on
   what you think can should be done to get this porting effort up and
   running.
  
   Thanks in advance for you constructive and anticipated 

Re: Speed up Frequent Compile

2010-02-05 Thread Robin Anil
It's just meant to be a dev-only hack :)


On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies bimargul...@gmail.comwrote:

 Yes, the codegen could drop a timestamp file. It's a fair amount of
 work, and if we're killing this code for HPCC I'm dubious.

 If I could make the split work I could do this next.


 On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com
 wrote:
  So, I'm running: mvn -o install -DskipTests=true at project root (in
 mahout)
 
  Comment out or remove the maven-assembly-plugin definition in
  core/pom.xml -- it reduced my core build time from 26s to 6s -- I can
  submit a patch for this.
 
  Mahout math is still 17s here due to code generation. I'm wondering if
  there's a way to modify the generation plugin to that it doesn't
  re-generate if there are no changes to the templates. You can remove
  the plugin definition from math/pom.xml and it doesn't seem to break
  anything unless you're doing a clean. Brings down math compilation to
  3s without it. Total compile time is 22s.
 
  re: the job, I'll have to look into that further later.
 
  On Fri, Feb 5, 2010 at 12:06 PM, Robin Anil robin.a...@gmail.com
 wrote:
  Yes for editing i use eclipse in the same fashion. If i want to try out
 a
  job and see how it performs on hadoop I need job compiled fast.
 
  On another note. I think there will be a lot of dead code in the
 job(with
  all the jar files bundles) Is there an optimiser for that i.e to remove
  classes which mahout never use indirectly
 
  I see loading jar takes 10-20 seconds when initializing mapper or
 reducer.
  It doesnt affect long running jobs but 20 sec overhead for processing
 64MB
  chunk sucks
 
  On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  I usually do an initial compilation using mvn package.  Then, during
  development I use IntelliJ's incremental compilation which generally
 only
  takes a few seconds.  Since that compilation doesn't handle things like
  copying resources, I get caught out and surprised now and again, but
 this
  works almost all the time.
 
  On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com
 wrote:
 
   When developing mahout core/util/examples we dont need to generate
 math
   often and dont need to tar gzip bzip2 the jar files. We are mostly
   concerned
   with the job file/ jar file.
   Cant there be another target like develop which does this. (waiting
 2-3
   mins
   for a 2 line change is frustrating)
  
   Robin
  
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 
 
 



Re: Speed up Frequent Compile

2010-02-05 Thread Benson Margulies
Then we could make a profile that turns off the code gen and turns on
the build helper to add the generated source dir instead.

On Fri, Feb 5, 2010 at 4:49 PM, Robin Anil robin.a...@gmail.com wrote:
 Its just meant to be a dev only hack :)


 On Sat, Feb 6, 2010 at 3:09 AM, Benson Margulies bimargul...@gmail.comwrote:

 Yes, the codegen could drop a timestamp file. It's a fair amount of
 work, and if we're killing this code for HPCC I'm dubious.

 If I could make the split work I could do this next.


 On Fri, Feb 5, 2010 at 12:19 PM, Drew Farris drew.far...@gmail.com
 wrote:
  So, I'm running: mvn -o install -DskipTests=true at project root (in
 mahout)
 
  Comment out or remove the maven-assembly-plugin definition in
  core/pom.xml -- it reduced my core build time from 26s to 6s -- I can
  submit a patch for this.
 
  Mahout math is still 17s here due to code generation. I'm wondering if
  there's a way to modify the generation plugin to that it doesn't
  re-generate if there are no changes to the templates. You can remove
  the plugin definition from math/pom.xml and it doesn't seem to break
  anything unless you're doing a clean. Brings down math compilation to
  3s without it. Total compile time is 22s.
 
  re: the job, I'll have to look into that further later.
 
  On Fri, Feb 5, 2010 at 12:06 PM, Robin Anil robin.a...@gmail.com
 wrote:
  Yes for editing i use eclipse in the same fashion. If i want to try out
 a
  job and see how it performs on hadoop I need job compiled fast.
 
  On another note. I think there will be a lot of dead code in the
 job(with
  all the jar files bundles) Is there an optimiser for that i.e to remove
  classes which mahout never use indirectly
 
  I see loading jar takes 10-20 seconds when initializing mapper or
 reducer.
  It doesnt affect long running jobs but 20 sec overhead for processing
 64MB
  chunk sucks
 
  On Fri, Feb 5, 2010 at 10:19 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  I usually do an initial compilation using mvn package.  Then, during
  development I use IntelliJ's incremental compilation which generally
 only
  takes a few seconds.  Since that compilation doesn't handle things like
  copying resources, I get caught out and surprised now and again, but
 this
  works almost all the time.
 
  On Fri, Feb 5, 2010 at 12:27 AM, Robin Anil robin.a...@gmail.com
 wrote:
 
   When developing mahout core/util/examples we dont need to generate
 math
   often and dont need to tar gzip bzip2 the jar files. We are mostly
   concerned
   with the job file/ jar file.
   Cant there be another target like develop which does this. (waiting
 2-3
   mins
   for a 2 line change is frustrating)
  
   Robin
  
 
 
 
  --
  Ted Dunning, CTO
  DeepDyve
 
 
 




Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-05 Thread Jeff Eastman

Jeff Eastman wrote:

Jeff Eastman wrote:

Jeff Eastman wrote:

Ted Dunning wrote:
This could also be caused if the prior is very diffuse.  This makes the
probability that a point will go to any new cluster quite low.  You can
compensate somewhat for this with different values of alpha.
  
Could you elaborate more on the function of alpha in the algorithm? 
Now I can answer my own question. Alpha_0 determines the probability a 
point will go into an empty cluster (ok, almost Ted's exact words).  
During the first iteration, the total counts of all prior clusters are 
zero. Thus the Beta calculation that drives the Dirichlet distribution 
that determines the mixture probabilities degenerates to beta = rBeta(1, 
alpha_0). Clusters that end up with points for the next iteration will 
overwhelm the small constants (alpha_0, 1) and subsequent new mixture 
probabilities will derive from beta ~=  rBeta(count, total) which is the 
current implementation. All empty clusters will subsequently be driven 
by beta ~= rBeta(1, total) as alpha_0 is insignificant and count is 0.


The current implementation ends up using beta = rBeta(alpha_0/k, 
alpha_0) as initial values during all iterations because the counts are 
all initialized to alpha_0/k. Close but no cigar.


Jeff

(nothing new below)
Looking at the current implementation, it is only used to initialize 
the totalCount values (to alpha/k) when sampling from the prior. 
AFAICT it is not used anywhere else. Its current role is pretty 
minimal and I wonder if something fell through the cracks during all 
of the refactoring from the R prototype.
Well, I looked over the R code and alpha_0 does appear to be used in 
two places, not one:


- in state initialization beta = rbeta(K, 1, alpha_0) [K is the 
number of models]
- during state update beta[k] = rbeta(1, 1 + counts[k], alpha_0 + 
N-counts[k]) [N is the cardinality of the sample vector and counts 
corresponds to totalCounts in the implementation]


The value of beta[k] is then used in the Dirichlet distribution 
calculation which results in the mixture probabilities pi[i], for the 
iteration:


    other = 1                  # product accumulator
    for (k in 1:K) {
      pi[k] = beta[k] * other  # beta_k * prod_{n<k} (1 - beta_n)
      other = other * (1 - beta[k])
    }
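
Written out, the stick-breaking construction the R prototype implements is
$\beta_k \sim \mathrm{Beta}(1 + \mathrm{counts}_k,\ \alpha_0 + N - \mathrm{counts}_k)$ during the
state update, and the mixture probabilities are $\pi_k = \beta_k \prod_{n<k} (1 - \beta_n)$.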

Alpha_0 does not appear to ever be added to the total counts nor is 
it divided by K as in the implementation so it looks like something 
did get lost in the refactoring. In the implementation, 
UncommonDistributions.rDirichlet(Vector alpha) is passed the 
totalCounts to compute the mixture probabilities and the rBeta 
arguments do not use alpha_0 as in R. There are other differences, however,
and rDirichlet looks like:


  public static Vector rDirichlet(Vector alpha) {
    Vector r = alpha.like();
    double total = alpha.zSum();
    double remainder = 1;
    for (int i = 0; i < r.size(); i++) {
      double a = alpha.get(i);
      total -= a;
      double beta = rBeta(a, Math.max(0, total));
      double p = beta * remainder;
      r.set(i, p);
      remainder -= p;
    }
    return r;
  }




Hi Ted,

I made the following changes, which still seem to work. I added 
alpha_0 as an argument to rDirichlet and included it in the beta 
calculation. I also removed the alpha_0/k totalCount initialization. 
This now corresponds, I think, to the R code above and degenerates to 
the same initial beta arguments during initialization when totalCounts 
are 0. Could you please look this over and see if you agree?


Thanks,
Jeff

  /**
   * Sample from a Dirichlet distribution, returning a vector of
   * probabilities using a stick-breaking algorithm
   *
   * @param totalCounts an unnormalized count Vector
   * @param alpha_0 a double
   * @return a Vector of probabilities
   */
  public static Vector rDirichlet(Vector totalCounts, double alpha_0) {
    Vector result = totalCounts.like();
    double total = totalCounts.zSum();
    double other = 1.0;
    for (int i = 0; i < result.size(); i++) {
      double count = totalCounts.get(i);
      total -= count;
      double beta = rBeta(1 + count, Math.max(0, alpha_0 + total));
      double pi = beta * other;
      result.set(i, pi);
      other *= 1 - beta;
    }
    return result;
  }
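
A quick way to exercise the modified method (a hypothetical snippet, assuming the
two-argument rDirichlet above is applied and the usual org.apache.mahout.math
Vector classes; it is not part of the patch):

  import org.apache.mahout.clustering.dirichlet.UncommonDistributions;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Vector;

  public class RDirichletExample {
    public static void main(String[] args) {
      // 10 model slots with zero counts, i.e. the state before the first iteration
      Vector totalCounts = new DenseVector(10);
      // with all counts at zero every draw reduces to rBeta(1, alpha_0)
      Vector pi = UncommonDistributions.rDirichlet(totalCounts, 1.0);
      System.out.println(pi); // mixture probabilities sampled from the prior
    }
  }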