[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-274:
--

Affects Version/s: 0.4
Fix Version/s: 0.4
 Assignee: Drew Farris

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.4
>Reporter: Drew Farris
>Assignee: Drew Farris
>Priority: Minor
> Fix For: 0.4
>
> Attachments: mahout-avro-examples.tar.gz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: (was: mahout-avro-examples.tar.bz)

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.gz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 




[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Comment: was deleted

(was: re-added latest tarball with proper extension.)

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.gz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 




[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: (was: mahout-colloc.tar.gz)

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.gz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 




[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: mahout-avro-examples.tar.gz

(this is really the right tarball this time, honest)

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.gz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 




[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: mahout-colloc.tar.gz

re-added latest tarball with proper extension.

> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.bz, 
> mahout-avro-examples.tar.gz, mahout-colloc.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate goal is to provide a structured document format that can be 
> serialized using Avro as an Input/OutputFormat and Writable 




[jira] Updated: (MAHOUT-274) Use avro for serialization of structured documents.

2010-02-15 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-274:
---

Attachment: mahout-avro-examples.tar.bz

Status update w/ new tarball which contains a maven project (mvn clean install 
should do the trick) 

README.txt included, relevant portions included below:

Provided are two different versions of AvroInputFormat/AvroOutputFormat that 
are compatible with the mapred (pre 0.20) and mapreduce (0.20+) apis. They are 
based on code provided as part of MAPREDUCE-815 and other patches. Also 
provided are backports of the SerializationBase/AvroSerialization classes from 
the current hadoop-core trunk.

When writing a job using the pre 0.20 apis:

Add serializations:

{code}
conf.setStrings("io.serializations",
new String[] {
  WritableSerialization.class.getName(), 
  AvroSpecificSerialization.class.getName(), 
  AvroReflectSerialization.class.getName(),
  AvroGenericSerialization.class.getName()
});
{code}

Setup input and output formats:

{code}
conf.setInputFormat(AvroInputFormat.class);
conf.setOutputFormat(AvroOutputFormat.class);

AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
{code}

AvroInputFormat provides the specified class as the key and a LongWritable file 
offset as the value.
AvroOutputFormat expects the specified class as the key and expects a 
NullWritable as a value.

If an avro serializable class is passed between the map and reduce phases it is 
necessary to set the following:

{code}
AvroComparator.setSchema(AvroDocument._SCHEMA);
conf.setClass("mapred.output.key.comparator.class", 
  AvroComparator.class, RawComparator.class);
{code}

So far I've been using avro 'specific' serialization, which compiles an avro 
schema into a Java class; see 
src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is currently 
compiled into classes o.a.m.avro.document (AvroDocument|AvroField) using 
o.a.m.avro.util.AvroDocumentCompiler (eventually to be replaced by a maven 
plugin; generated sources are currently checked in).

Helper classes for AvroDocument and AvroField include 
o.a.m.avro.document.Avro(Document|Field)Builder and 
o.a.m.avro(Document|Field)Reader. This seems to work ok here, but I'm not 
certain that this is the best pattern to use, especially when there are many 
pre-existing classes (as there are in the case of vector).

Avro also provides reflection-based serialization and schema-based 
serialization. Both should be supported by the infrastructure that has been 
backported here, but that's something else to explore.
 
Examples:

These are quick and dirty and need much cleanup work before they can be taken 
out to the dance.

see o.a.m.avro.text, o.a.m.avro.text.mapred and o.a.m.avro.text.mapreduce:

* AvroDocumentsFromDirectory: quick and dirty port of 
SequenceFilesFromDirectory to use AvroDocuments. Writes a file containing 
documents in avro format; each file's contents are stored in a single field 
named 'content', in the originalText portion of that field.
* AvroDocumentsDumper: dumps an avro documents file to standard output.
* AvroDocumentsWordCount: perform a wordcount on an avro document input file.
* AvroDocumentProcessor: tokenizes the text found in the input document file, 
reads from the originalText of the field named content and writes original 
document+tokens to output file.
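
For anyone reading without the tarball, the tokenize-and-count step behind AvroDocumentProcessor and AvroDocumentsWordCount can be sketched in plain Java. This is only an illustration and not the code from the tarball: the class name WordCountSketch, the method countWords, and the tokenization rule (lowercase, then split on anything that is not a letter or digit) are all assumptions, and Hadoop and Avro are left out entirely.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of the mahout-avro-examples tarball.
class WordCountSketch {

  // Counts tokens in text such as the originalText of the 'content' field.
  // Assumed tokenization: lowercase, then split on runs of non-letter, non-digit characters.
  static Map<String, Integer> countWords(String originalText) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String token : originalText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (token.isEmpty()) {
        continue; // a leading delimiter produces one empty token
      }
      counts.merge(token, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Prints the per-token counts in alphabetical order.
    System.out.println(countWords("The cat, the hat."));
  }
}
```

In a real job the same counting would presumably sit inside a mapper/reducer pair, with the text pulled from each AvroDocument key.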

Running the examples:

(haven't tested with the hadoop driver yet)

{code}
mvn exec:java \
  -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
  -Dexec.args="--parent /home/drew/mahout/20news-18828 \
  --outputDir /home/drew/mahout/20news-18828-example \
  --charset UTF-8"

mvn exec:java \
  -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
  -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'

mvn exec:java -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
  -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-0' > foobar.txt
{code}

The Wikipedia stuff is in there, but isn't working yet. Many thanks (and 
apologies) to Robin for the starting point for much of this code, and for my 
hacking it to pieces so badly. 


> Use avro for serialization of structured documents.
> ---
>
> Key: MAHOUT-274
> URL: https://issues.apache.org/jira/browse/MAHOUT-274
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Drew Farris
>Priority: Minor
> Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
>
>
> Explore the intersection between Writables and Avro to see how serialization 
> can be improved within Mahout. 
> An intermediate g

[jira] Updated: (MAHOUT-292) Classifier Test Data and Self Tests

2010-02-15 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-292:
--

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed

> Classifier Test Data and Self Tests
> ---
>
> Key: MAHOUT-292
> URL: https://issues.apache.org/jira/browse/MAHOUT-292
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-292.patch
>
>
> Until now there was no way to test whether the quality of classification 
> suffered due to a code change. 
> Added classifier data with 3 labels (mahout, lucene and spamassasin), with 4 
> long sentences under each label. 
> Added a SelfTest which trains Bayes and CBayes models, classifies the 
> training dataset, and checks the accuracy and confusion matrix.




Re: Mass Code Cleanup

2010-02-15 Thread Robin Anil
I picked out the formatter issues and committed the rest. I will follow up
with smaller patches if anything looks horribly machine-formatted. So far,
not much does.

Robin

On Mon, Feb 15, 2010 at 8:18 PM, Robin Anil  wrote:

> SGD, kmeans++ and pegasus seem fine. Isabel, can you check with the latest
> trunk whether the perceptron is alright?
> I don't see any other open issues that require patch testing as extensive
> as these do.
>
> Robin
>
>
> On Mon, Feb 15, 2010 at 8:10 PM, Drew Farris wrote:
>
>> On Mon, Feb 15, 2010 at 1:09 AM, Robin Anil  wrote:
>> > If it's A, I have a few patches ready to commit, like the static qualifier
>> > fix. I really need you guys to be on board on this. We just can't leave it
>> > at this discussion.
>> >
>> > If it's B, I will do the revert, but I would have to patch some commits.
>> >
>> > If A sounds reasonable: it's easier to go forward than to go back. I will
>> > not be making any more changes at this scale, except a bunch of classes
>> > from time to time.
>>
>> I think A sounds reasonable, given a patch for MAHOUT-291 that isn't
>> as extensive, but I can't really comment on the potential for breaking
>> other patches here. I would say that the people with that sort of time
>> invested really should have the final say.
>>
>> Would it make sense for those with outstanding patches to apply 291
>> and then attempt to apply their patches to determine the extent of
>> breakage? To be honest, anyone can do it really. If someone wants to
>> post some jira issue references for patches that need to be tested I
>> can mess around with trying to apply them this evening.
>>
>> Drew
>>
>
>


[jira] Commented: (MAHOUT-291) Mahout Code Cleanup

2010-02-15 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833905#action_12833905
 ] 

Robin Anil commented on MAHOUT-291:
---

Picked out the formatter errors. Committing the rest of the fix.

> Mahout Code Cleanup
> ---
>
> Key: MAHOUT-291
> URL: https://issues.apache.org/jira/browse/MAHOUT-291
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Clustering, Collaborative Filtering, 
> Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils, 
> Website
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-291.patch
>
>
> Code Cleanup
> Organize imports
> Remove space in blank lines
> make local variables final




[jira] Commented: (MAHOUT-291) Mahout Code Cleanup

2010-02-15 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833879#action_12833879
 ] 

Benson Margulies commented on MAHOUT-291:
-

Robin's going to have to pick through the diffs and find all the places where 
the formatter splatted and put them back.



> Mahout Code Cleanup
> ---
>
> Key: MAHOUT-291
> URL: https://issues.apache.org/jira/browse/MAHOUT-291
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Clustering, Collaborative Filtering, 
> Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils, 
> Website
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-291.patch
>
>
> Code Cleanup
> Organize imports
> Remove space in blank lines
> make local variables final




[jira] Created: (MAHOUT-293) Add more tunable parameters to PFPGrowth implementation

2010-02-15 Thread Robin Anil (JIRA)
Add more tunable parameters to PFPGrowth implementation
---

 Key: MAHOUT-293
 URL: https://issues.apache.org/jira/browse/MAHOUT-293
 Project: Mahout
  Issue Type: Improvement
  Components: Frequent Itemset/Association Rule Mining
Affects Versions: 0.4
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.4


Objective is to add more tunable parameters to the PFPGrowth algorithm.

From Neal on the Mahout user list:

I often use Christian Borgelt's itemset implementations for playing
with data.  He's implemented a nice set of switches, see below.
Setting a minimum support threshold and minimum itemset size are both
convenient and tend to make the algorithm run a bit faster.

http://www.borgelt.net/software.html

ne...@nrichter-laptop:~$ fpgrowth_fim
usage: fpgrowth_fim [options] infile outfile
find frequent item sets with the fpgrowth algorithm
version 1.13 (2008.05.02)(c) 2004-2008   Christian Borgelt
-m#  minimal number of items per item set (default: 1)
-n#  maximal number of items per item set (default: no limit)
-s#  minimal support of an item set (default: 10%)
(positive: percentage, negative: absolute number)
-d#  minimal binary logarithm of support quotient (default: none)
-p#  output format for the item set support (default: "%.1f")
-a   print absolute support (number of transactions)
-g   write output in scanable form (quote certain characters)
-q#  sort items w.r.t. their frequency (default: -2)
(1: ascending, -1: descending, 0: do not sort,
 2: ascending, -2: descending w.r.t. transaction size sum)
-u   use alternative tree projection method
-z   do not prune tree projections to bonsai
-j   use quicksort to sort the transactions (default: heapsort)
-i#  ignore records starting with a character in the given string
-b/f/r#  blank characters, field and record separators
(default: " \t\r", " \t", "\n")
infile   file to read transactions from
outfile  file to write frequent item se
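
To make the idea concrete, here is a rough sketch of how two such parameters (minimum support and minimum/maximum itemset size) could be applied to candidate itemsets. It is not PFPGrowth code: the class ItemsetFilterSketch and its methods are hypothetical, and rounding a percentage up to a whole transaction count is an assumption. Only the sign convention for the support value (positive: percentage, negative: absolute number) is taken from the fpgrowth_fim usage text above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; neither PFPGrowth nor fpgrowth_fim is known to work exactly this way.
class ItemsetFilterSketch {

  // Converts a support setting into an absolute transaction count.
  // Positive values are read as a percentage of all transactions,
  // negative values as an absolute number (sign convention from the usage text).
  // Rounding up is an assumption made for this sketch.
  static long absoluteSupport(double support, long numTransactions) {
    if (support < 0) {
      return (long) Math.ceil(-support);
    }
    return (long) Math.ceil(support * numTransactions / 100.0);
  }

  // Returns true if the itemset passes the size and support limits.
  // maxItems <= 0 is treated as "no limit".
  static boolean keep(Set<String> itemset, long count, int minItems, int maxItems, long minSupport) {
    int size = itemset.size();
    if (size < minItems) {
      return false;
    }
    if (maxItems > 0 && size > maxItems) {
      return false;
    }
    return count >= minSupport;
  }

  public static void main(String[] args) {
    long minSupport = absoluteSupport(10, 200); // 10% of 200 transactions
    Set<String> itemset = new HashSet<>(Arrays.asList("milk", "bread"));
    System.out.println(keep(itemset, 25, 2, 0, minSupport));
  }
}
```

Pruning by size and support before emitting results is also where the speedup Neal mentions would plausibly come from, since fewer candidates survive each step.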




[jira] Updated: (MAHOUT-292) Classifier Test Data and Self Tests

2010-02-15 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-292:
--

Status: Patch Available  (was: Open)

Patch ready to go in

> Classifier Test Data and Self Tests
> ---
>
> Key: MAHOUT-292
> URL: https://issues.apache.org/jira/browse/MAHOUT-292
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-292.patch
>
>
> Until now there was no way to test whether the quality of classification 
> suffered due to a code change. 
> Added classifier data with 3 labels (mahout, lucene and spamassasin), with 4 
> long sentences under each label. 
> Added a SelfTest which trains Bayes and CBayes models, classifies the 
> training dataset, and checks the accuracy and confusion matrix.




[jira] Updated: (MAHOUT-292) Classifier Test Data and Self Tests

2010-02-15 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-292:
--

Attachment: MAHOUT-292.patch

> Classifier Test Data and Self Tests
> ---
>
> Key: MAHOUT-292
> URL: https://issues.apache.org/jira/browse/MAHOUT-292
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-292.patch
>
>
> Until now there was no way to test whether the quality of classification 
> suffered due to a code change. 
> Added classifier data with 3 labels (mahout, lucene and spamassasin), with 4 
> long sentences under each label. 
> Added a SelfTest which trains Bayes and CBayes models, classifies the 
> training dataset, and checks the accuracy and confusion matrix.




[jira] Created: (MAHOUT-292) Classifier Test Data and Self Tests

2010-02-15 Thread Robin Anil (JIRA)
Classifier Test Data and Self Tests
---

 Key: MAHOUT-292
 URL: https://issues.apache.org/jira/browse/MAHOUT-292
 Project: Mahout
  Issue Type: Improvement
Affects Versions: 0.3
Reporter: Robin Anil
Assignee: Robin Anil
 Fix For: 0.3


Until now there was no way to test whether the quality of classification 
suffered due to a code change. 

Added classifier data with 3 labels (mahout, lucene and spamassasin), with 4 
long sentences under each label. 

Added a SelfTest which trains Bayes and CBayes models, classifies the 
training dataset, and checks the accuracy and confusion matrix.
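
As a rough illustration of what such a self test checks, the sketch below builds a confusion matrix from actual and predicted labels and derives the accuracy from it. It is not the SelfTest from the patch: the class ConfusionMatrixSketch, its methods, and the sample predictions are hypothetical. Only the three label names come from the description above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the SelfTest attached to this issue.
class ConfusionMatrixSketch {

  // Rows are actual labels, columns are predicted labels.
  // Every label in 'actual' and 'predicted' must appear in 'labels'.
  static int[][] confusion(List<String> labels, List<String> actual, List<String> predicted) {
    int[][] matrix = new int[labels.size()][labels.size()];
    for (int i = 0; i < actual.size(); i++) {
      int row = labels.indexOf(actual.get(i));
      int col = labels.indexOf(predicted.get(i));
      matrix[row][col]++;
    }
    return matrix;
  }

  // Accuracy is the diagonal sum divided by the total number of samples.
  static double accuracy(int[][] matrix) {
    int correct = 0;
    int total = 0;
    for (int r = 0; r < matrix.length; r++) {
      for (int c = 0; c < matrix[r].length; c++) {
        total += matrix[r][c];
        if (r == c) {
          correct += matrix[r][c];
        }
      }
    }
    return total == 0 ? 0.0 : (double) correct / total;
  }

  public static void main(String[] args) {
    List<String> labels = Arrays.asList("mahout", "lucene", "spamassasin");
    List<String> actual = Arrays.asList("mahout", "mahout", "lucene", "spamassasin");
    List<String> predicted = Arrays.asList("mahout", "lucene", "lucene", "spamassasin");
    int[][] matrix = confusion(labels, actual, predicted);
    System.out.println(Arrays.deepToString(matrix));
    System.out.println(accuracy(matrix));
  }
}
```

A regression check could then assert that the accuracy on the training data stays above some agreed threshold after a code change.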






Re: Mass Code Cleanup

2010-02-15 Thread Robin Anil
SGD, kmeans++ and pegasus seem fine. Isabel, can you check with the latest
trunk whether the perceptron is alright?
I don't see any other open issues that require patch testing as extensive
as these do.

Robin


On Mon, Feb 15, 2010 at 8:10 PM, Drew Farris  wrote:

> On Mon, Feb 15, 2010 at 1:09 AM, Robin Anil  wrote:
> > If it's A, I have a few patches ready to commit, like the static qualifier
> > fix. I really need you guys to be on board on this. We just can't leave it
> > at this discussion.
> >
> > If it's B, I will do the revert, but I would have to patch some commits.
> >
> > If A sounds reasonable: it's easier to go forward than to go back. I will
> > not be making any more changes at this scale, except a bunch of classes
> > from time to time.
>
> I think A sounds reasonable, given a patch for MAHOUT-291 that isn't
> as extensive, but I can't really comment on the potential for breaking
> other patches here. I would say that the people with that sort of time
> invested really should have the final say.
>
> Would it make sense for those with outstanding patches to apply 291
> and then attempt to apply their patches to determine the extent of
> breakage? To be honest, anyone can do it really. If someone wants to
> post some jira issue references for patches that need to be tested I
> can mess around with trying to apply them this evening.
>
> Drew
>


[jira] Commented: (MAHOUT-291) Mahout Code Cleanup

2010-02-15 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833810#action_12833810
 ] 

Robin Anil commented on MAHOUT-291:
---

The last one was done by me, not the formatter. The removed lines are the ones 
created by the formatter. 

About the first one: there are only a couple of places with such a problem 
(that's when the number of characters treads close to the limit).


I do agree the options looked much better before. The same formatter produces 
the above one at 80 columns and the below one at 120 columns. 

> Mahout Code Cleanup
> ---
>
> Key: MAHOUT-291
> URL: https://issues.apache.org/jira/browse/MAHOUT-291
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Clustering, Collaborative Filtering, 
> Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils, 
> Website
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-291.patch
>
>
> Code Cleanup
> Organize imports
> Remove space in blank lines
> make local variables final




Re: Mass Code Cleanup

2010-02-15 Thread Drew Farris
On Mon, Feb 15, 2010 at 1:09 AM, Robin Anil  wrote:
> If it's A, I have a few patches ready to commit, like the static qualifier
> fix. I really need you guys to be on board on this. We just can't leave it at
> this discussion.
>
> If it's B, I will do the revert, but I would have to patch some commits.
>
> If A sounds reasonable: it's easier to go forward than to go back. I will not
> be making any more changes at this scale, except a bunch of classes from time
> to time.

I think A sounds reasonable, given a patch for MAHOUT-291 that isn't
as extensive, but I can't really comment on the potential for breaking
other patches here. I would say that the people with that sort of time
invested really should have the final say.

Would it make sense for those with outstanding patches to apply 291
and then attempt to apply their patches to determine the extent of
breakage? To be honest, anyone can do it really. If someone wants to
post some jira issue references for patches that need to be tested I
can mess around with trying to apply them this evening.

Drew


[jira] Commented: (MAHOUT-291) Mahout Code Cleanup

2010-02-15 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833806#action_12833806
 ] 

Drew Farris commented on MAHOUT-291:


Thanks very much Robin for posting a patch to review. Things like 
KMeansClusterer.log.debug -> log.debug look great here, and I'm ok with the 
whitespace oriented changes for the most part, but there are some cases where 
auto-code formatting is really making a hash of things:

e.g:

{code}
System.out.println("Generating " + num + " samples m=[" + mx + ", " + my + "] sd=[" + sdx + ", " + sdy + ']');
{code}

gets transformed to:

{code}
System.out.println("Generating " + num + " samples m=[" + mx + ", " + my + "] sd=[" + sdx + ", " + sdy
+ ']');
{code}

which, despite the 120-character line length rule, seems a little too strict IMHO. 
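
One way to sidestep the awkward wrap, offered here as an assumption rather than anything in the patch, is to build the message with String.format so the line stays short and the formatter has nothing to break. The class name SampleMessageSketch, the helper name describe, and the parameter types (int for num, double for the rest) are guesses; the variable names come from the snippet above.

```java
import java.util.Locale;

// Hypothetical sketch; the types of num, mx, my, sdx and sdy are assumed.
class SampleMessageSketch {

  // Builds the same text as the concatenation above.
  // %s on a double uses Double.toString, matching string concatenation.
  static String describe(int num, double mx, double my, double sdx, double sdy) {
    return String.format(Locale.ROOT,
        "Generating %d samples m=[%s, %s] sd=[%s, %s]", num, mx, my, sdx, sdy);
  }

  public static void main(String[] args) {
    System.out.println(describe(10, 1.0, 2.0, 0.5, 0.25));
  }
}
```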

Also, a nicely formatted OptionBuilder is turned into something nasty and 
unreadable.

{code}
-Option clustersOpt = obuilder
-.withLongName("clusters")
-.withRequired(true)
-.withArgument(abuilder.withName("clusters").withMinimum(1).withMaximum(1).create())
-.withDescription(
-  "The input centroids, as Vectors.  Must be a SequenceFile of Writable, Cluster/Canopy.  "
-  + "If k is also specified, then a random set of vectors will be selected and written out to this path first")
+Option clustersOpt = obuilder.withLongName("clusters").withRequired(true).withArgument(
+  abuilder.withName("clusters").withMinimum(1).withMaximum(1).create()).withDescription(
+  "The input centroids, as Vectors.  Must be a SequenceFile of Writable, Cluster/Canopy.  "
+  + "If k is also specified, then a random set of vectors will be selected and"
+  + "written out to this path first")
 .withShortName("c").create();
{code}

And things like the following, but honestly which of these is the greater sin? 
(From LDAInference)

{code}
-double t = f
-   * (-1 / 12.0 + f
-  * (1 / 120.0 + f
- * (-1 / 252.0 + f
- * (1 / 240.0 + f
-* (-1 / 132.0 + f
-    * (691 / 32760.0 + f
-   * (-1 / 12.0 + f * 3617.0 / 8160.0)));
+double t = f * (-1 / 12.0 + f * (1 / 120.0 + f * (-1 / 252.0 
++ f * (1 / 240.0 + f * (-1 / 132.0 + f * (691 / 32760.0 + f * (-1 / 12.0 + f * 3617.0 / 8160.0)));
{code}
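
A third option that avoids arguing over line breaks: the nested expression is a polynomial in Horner form, so the coefficients could live in an array and be folded in a loop. This is only a sketch and not what LDAInference does; the class HornerSketch and the method seriesTerm are hypothetical names, and the eight coefficients are copied from the diff above.

```java
// Hypothetical sketch; not the code in LDAInference.
class HornerSketch {

  // Coefficients in the order they appear in the nested expression, outermost first.
  private static final double[] COEFFS = {
    -1 / 12.0, 1 / 120.0, -1 / 252.0, 1 / 240.0,
    -1 / 132.0, 691 / 32760.0, -1 / 12.0, 3617.0 / 8160.0
  };

  // Evaluates f * (c0 + f * (c1 + ... + f * (c6 + f * c7))) from the inside out.
  static double seriesTerm(double f) {
    double acc = 0.0;
    for (int i = COEFFS.length - 1; i >= 0; i--) {
      acc = f * (COEFFS[i] + acc);
    }
    return acc;
  }

  public static void main(String[] args) {
    System.out.println(seriesTerm(0.5));
  }
}
```

The loop performs the same multiplications and additions as the hand-nested version, up to floating-point rounding in the last term, and leaves the formatter nothing to mangle.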

What's the best way to proceed from here given this?

> Mahout Code Cleanup
> ---
>
> Key: MAHOUT-291
> URL: https://issues.apache.org/jira/browse/MAHOUT-291
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Clustering, Collaborative Filtering, 
> Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils, 
> Website
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-291.patch
>
>
> Code Cleanup
> Organize imports
> Remove space in blank lines
> make local variables final




Re: Mass Code Cleanup

2010-02-15 Thread Robin Anil
In a previous jira issue I had aligned the CS with the lucene style. It's
that version which is checked in. Could you try now with ti
Robin


On Mon, Feb 15, 2010 at 7:44 PM, Benson Margulies wrote:

> To answer a question of Robin's:
>
> Some months ago, I started to make arrangements to include cs in our
> build. However, I discovered an aspect of 'Lucene style' that was, at
> the time, 100%-incompatible with cs. There was no option to cs to
> align it.
>
> So, the first step here is to agree to a style that cs can, in fact, check.
>


Re: Mass Code Cleanup

2010-02-15 Thread Benson Margulies
To answer a question of Robin's:

Some months ago, I started to make arrangements to include cs in our
build. However, I discovered an aspect of 'Lucene style' that was, at
the time, 100%-incompatible with cs. There was no option to cs to
align it.

So, the first step here is to agree to a style that cs can, in fact, check.


[jira] Updated: (MAHOUT-291) Mahout Code Cleanup

2010-02-15 Thread Robin Anil (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-291:
--

Attachment: MAHOUT-291.patch

Remove static qualifiers. Fix most of the 120+ line issues 

> Mahout Code Cleanup
> ---
>
> Key: MAHOUT-291
> URL: https://issues.apache.org/jira/browse/MAHOUT-291
> Project: Mahout
>  Issue Type: Improvement
>  Components: Classification, Clustering, Collaborative Filtering, 
> Frequent Itemset/Association Rule Mining, Genetic Algorithms, Math, Utils, 
> Website
>Affects Versions: 0.3
>Reporter: Robin Anil
>Assignee: Robin Anil
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-291.patch
>
>
> Code Cleanup
> Organize imports
> Remove space in blank lines
> make local variables final




Re: Mass Code Cleanup

2010-02-15 Thread Drew Farris
On Mon, Feb 15, 2010 at 1:09 AM, Robin Anil  wrote:
> If it's A, I have a few patches ready to commit, like the static qualifier
> fix. I really need you guys to be on board on this. We just can't leave it at
> this discussion.

Is there a patch on JIRA for these? It would be easier to review and
vote on if there is.

Drew


Re: Mahout as TLP

2010-02-15 Thread Jeff Eastman

+1 on Isabel's comments.


Isabel Drost wrote:
> On Sat Grant Ingersoll  wrote:
> > > I don't see any harm in getting 0.3 out first if that makes folks
> > > more comfortable.
> >
> > Yeah, this feels better to me the more I think about it.
>
> +1 from me as well: I really like the idea of Mahout becoming a TLP -
> even before a 1.0 release is available.
>
> However I think it makes sense to sort out the 0.3 release first. If I
> am counting correctly, that would make for three reasons for press
> releases: A new release, Mahout becoming a TLP and later on a 1.0
> release. ;)
>
> Isabel




Re: Mahout as TLP

2010-02-15 Thread Robin Anil
+1


Re: Mahout as TLP

2010-02-15 Thread Isabel Drost
On Sat Grant Ingersoll  wrote:
> > I don't see any harm in getting 0.3 out first if that makes folks
> > more comfortable.
> 
> Yeah, this feels better to me the more I think about it.

+1 from me as well: I really like the idea of Mahout becoming a TLP -
even before a 1.0 release is available.

However I think it makes sense to sort out the 0.3 release first. If I
am counting correctly, that would make for three reasons for press
releases: A new release, Mahout becoming a TLP and later on a 1.0
release. ;)

Isabel