Re: Categorization stuff

2009-07-17 Thread Isabel Drost
On Friday 17 July 2009 16:17:50 Grant Ingersoll wrote: > Also, do we have any tools for setting up training/test sets for > Wikipedia examples? Seems like a generally useful thing to have. > Take annotated data and automatically split, no? That certainly would be useful at tuning/evaluation time

Re: Checkstyle

2009-07-17 Thread Benson Margulies
Thanks. Let me see if I can make cs go along. I might make a patch for all the copyrights first just to reduce the noise level. On Fri, Jul 17, 2009 at 12:26 PM, Sean Owen wrote: > I think the Lucene conventions are Sun conventions plus these > indentation rules -- here's the IntelliJ file. It's

Re: Checkstyle

2009-07-17 Thread Sean Owen
I think the Lucene conventions are Sun conventions plus these indentation rules -- here's the IntelliJ file. It's pretty much human readable. All four stanzas are really the same thing.

Re: Checkstyle

2009-07-17 Thread Benson Margulies
Let me explain where I'm arriving from. At CXF and XmlSchema (at Apache), plus on some projects I cloned from them, there is: a checkstyle.xml with checkstyle rules, a pmd XML file with PMD rules, and a set of Eclipse configuration settings. They all agree. If you format the code with Eclipse (fo

[jira] Commented: (MAHOUT-147) Wikipedia Example improvements

2009-07-17 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732578#action_12732578 ] Grant Ingersoll commented on MAHOUT-147: I think, since we are using ',' as the del

[jira] Commented: (MAHOUT-147) Wikipedia Example improvements

2009-07-17 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732572#action_12732572 ] Grant Ingersoll commented on MAHOUT-147: There is a small bug in BayesTFIDFMapper t

[jira] Created: (MAHOUT-147) Wikipedia Example improvements

2009-07-17 Thread Grant Ingersoll (JIRA)
Wikipedia Example improvements -- Key: MAHOUT-147 URL: https://issues.apache.org/jira/browse/MAHOUT-147 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Grant Inge

Re: Categorization stuff

2009-07-17 Thread Grant Ingersoll
Also, do we have any tools for setting up training/test sets for Wikipedia examples? Seems like a generally useful thing to have. Take annotated data and automatically split, no? -Grant On Jul 17, 2009, at 8:32 AM, Grant Ingersoll wrote: On Jul 17, 2009, at 5:06 AM, Robin Anil wrote:
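The "take annotated data and automatically split" idea in the message above can be sketched in a few lines of Java. This is only an illustration, not an existing Mahout tool: the class name TrainTestSplit, the method split, and the testFraction and seed parameters are all hypothetical. It does a plain seeded shuffle and cut, and makes no attempt to keep the label distribution the same in both sets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Mahout: splits annotated examples
// into a training set and a test set.
class TrainTestSplit {

    /**
     * Returns a two-element list: index 0 is the training set,
     * index 1 is the test set. The input list is not modified.
     */
    static <T> List<List<T>> split(List<T> examples, double testFraction, long seed) {
        if (testFraction < 0.0 || testFraction > 1.0) {
            throw new IllegalArgumentException("testFraction must be in [0, 1]");
        }
        // Shuffle a copy with a fixed seed so the split is reproducible.
        List<T> shuffled = new ArrayList<>(examples);
        Collections.shuffle(shuffled, new Random(seed));

        // The first testSize shuffled items become the test set.
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<T> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<T> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));

        List<List<T>> result = new ArrayList<>();
        result.add(train);
        result.add(test);
        return result;
    }

    public static void main(String[] args) {
        // Stand-ins for annotated documents.
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i);
        }
        List<List<String>> parts = split(docs, 0.2, 42L);
        System.out.println("train: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

For documents that carry several categories each, as the Wikipedia data discussed in this thread does, a per-label (stratified) split would probably be a better fit than this single shuffle.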

[jira] Updated: (MAHOUT-146) Make Wikipedia Example Classifier more generic

2009-07-17 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-146: --- Attachment: MAHOUT-146.patch Patch to make this happen, plus removes one of the redundant dri

Re: Checkstyle

2009-07-17 Thread Sean Owen
On a related note, I get this error now when building from head: > mvn install ... Caused by: org.apache.maven.artifact.resolver.ArtifactNotFoundException: Unable to download the artifact from any repository org.apache.mahout:mahout-parent:pom:0.2-SNAPSHOT from the specified remote repositori

Re: Checkstyle

2009-07-17 Thread Benson Margulies
If you run mvn -Psourcecheck you can see the disagreements for yourself. On Jul 17, 2009, at 8:42 AM, Sean Owen wrote: Ditto, so Benson, I wonder if you can confirm whether these checkstyle rules appear to be at odds with the current formatting, since in theory I already committed a big-bang c

Re: Checkstyle

2009-07-17 Thread Sean Owen
Ditto, so Benson, I wonder if you can confirm whether these checkstyle rules appear to be at odds with the current formatting, since in theory I already committed a big-bang change to fix all the formatting. If there's much disagreement then we should investigate why. On Fri, Jul 17, 2009 at 1:40

Re: Checkstyle

2009-07-17 Thread Grant Ingersoll
On Jul 17, 2009, at 8:36 AM, Sean Owen wrote: Yeah I think the point would be to make sure this happens automatically. I too am wary of maintaining 2, 3, 4 style configurations for the project. But while I use an IDE almost exclusively (IntelliJ), I am not sure I agree that the project should

Re: Checkstyle

2009-07-17 Thread Sean Owen
Yeah I think the point would be to make sure this happens automatically. I too am wary of maintaining 2, 3, 4 style configurations for the project. But while I use an IDE almost exclusively (IntelliJ), I am not sure I agree that the project should assume an IDE, and therefore, there is some point

Re: Categorization stuff

2009-07-17 Thread Grant Ingersoll
On Jul 17, 2009, at 5:06 AM, Robin Anil wrote: The reason I used countries was that I couldn't think of some other larger group of labels. Also, Wikipedia has over 100K categories, and a document has multiple categories too. So finding non-overlapping sets of documents wasn't easy (which makes it

Re: Checkstyle

2009-07-17 Thread Grant Ingersoll
On Jul 17, 2009, at 7:31 AM, Benson Margulies wrote: So, would you smile on a patch that whomped all the indents to be acceptable to this and automated setting eclipse settings? I don't know about IntelliJ. FWIW, I see no point in a tool that can't be replicated in an IDE. Or is this some

Re: Checkstyle

2009-07-17 Thread Sean Owen
Sounds like you're doing as much as checkstyle can here. Are you saying the resulting configuration warns about indentation a lot? I wonder if you can illustrate the nature of the warnings then? Is it "right" or is it apparently flagging things that look correctly formatted according to our rules?

Re: Checkstyle

2009-07-17 Thread Benson Margulies
See http://checkstyle.sourceforge.net/config_misc.html under Indentation. Checkstyle does not have a lot of knobs here. I turned one and reduced the number of complaints. It's not obvious how to turn the other two and match up with what's out there. Could I interest you in having a look? Maybe I'm

Re: Checkstyle

2009-07-17 Thread Sean Owen
I am for it. In theory I already whomped the code base to update the formatting. If your patch produces a lot more formatting changes, then maybe there is a mismatch between the rules we implemented. Let's discuss before re-whomping. It's possible I got the rules wrong. What's it look like, what's

[jira] Commented: (MAHOUT-146) Make Wikipedia Example Classifier more generic

2009-07-17 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732489#action_12732489 ] Grant Ingersoll commented on MAHOUT-146: Yeah, it's really just a matter of renamin

Re: Checkstyle

2009-07-17 Thread Benson Margulies
So, would you smile on a patch that whomped all the indents to be acceptable to this and automated setting eclipse settings? I don't know about IntelliJ. On Thu, Jul 16, 2009 at 10:05 PM, Ted Dunning wrote: > Very cool. > > I would love to have consistency on this and be able to avoid messing up t

[jira] Commented: (MAHOUT-146) Make Wikipedia Example Classifier more generic

2009-07-17 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732435#action_12732435 ] Robin Anil commented on MAHOUT-146: --- It's already generic to some extent. Check out Mahout

Re: Categorization stuff

2009-07-17 Thread Robin Anil
> > the reason i used countries was i couldn't think of some other larger group > of labels. > Also wikipedia has over 100K categories, A document has multiple categories > too. So finding a non overlapped sets of documents wasn't easy(Which makes > it easy to differentiate them).First thing I coul

[jira] Commented: (MAHOUT-144) Some maven refactoring and prep for enforcing code style

2009-07-17 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732399#action_12732399 ] Sean Owen commented on MAHOUT-144: -- +1 to the checkstyle config and all those related chan