Re: GSOC 2010 is here

2010-02-02 Thread Isabel Drost
On Mon Robin Anil robin.a...@gmail.com wrote: 2. UIMA Integration with Mahout? (Maybe a good project if UIMA folks are taking in GSOC students) I guess one could easily split this one in two: a) Using UIMA (whole pipeline or just the analysers if that is possible) for data pre-processing

[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-02 Thread Robin Anil (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robin Anil updated MAHOUT-237: -- Attachment: MAHOUT-237-tfidf.patch Added IDF job which takes a sequence file of doc-id=Vector.

[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-02 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828763#action_12828763 ] Ted Dunning commented on MAHOUT-237: {quote} Seems like the Text field Vector Class

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-02 Thread Jake Mannix
You volunteering to port to avro, Ted? Awesome! :) -jake On Feb 2, 2010 1:10 PM, Ted Dunning (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828763#action_12828763 ]

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-02 Thread Drew Farris
I'm going to get back to it eventually, honest! On Tue, Feb 2, 2010 at 4:13 PM, Jake Mannix jake.man...@gmail.com wrote: You volunteering to port to avro, Ted?  Awesome! :)  -jake

[jira] Created: (MAHOUT-272) Add licences for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-02 Thread Drew Farris (JIRA)
Add licences for 3rd party jars to mahout binary release and remove additional unused dependencies. --- Key: MAHOUT-272 URL:

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

2010-02-02 Thread Ted Dunning
:-) On Tue, Feb 2, 2010 at 1:13 PM, Jake Mannix jake.man...@gmail.com wrote: You volunteering to port to avro, Ted? Awesome! :) -- Ted Dunning, CTO DeepDyve

[jira] Updated: (MAHOUT-272) Add licences for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-02 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-272: --- Attachment: MAHOUT-272.patch * Added exclusion for eclipse core to hadoop dependency in

[jira] Updated: (MAHOUT-272) Add licences for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-02 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-272: --- Status: Patch Available (was: Open) Add licences for 3rd party jars to mahout binary release and

[jira] Updated: (MAHOUT-272) Add licenses for 3rd party jars to mahout binary release and remove additional unused dependencies.

2010-02-02 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-272: --- Summary: Add licenses for 3rd party jars to mahout binary release and remove additional unused

[Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-02 Thread Jeff Eastman
Just noticed this didn't go to the list. ---BeginMessage--- Hi Jerry, I'm not sure why Dirichlet is doing that with this dataset and have not been able to get better results than you. I have gotten excellent results using it with other models on other datasets, so I'm pretty confident in the

[jira] Updated: (MAHOUT-242) LLR Collocation Identifier

2010-02-02 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-242: --- Attachment: MAHOUT-242.patch Updated patch, removed pom modifications checked in as a part of

Re: [Fwd: Re: Dirichlet Processing Clustering - Synthetic Control Data]

2010-02-02 Thread Ted Dunning
This could also happen if the prior is very diffuse. That makes the probability that a point will go to any new cluster quite low. You can compensate somewhat for this with different values of alpha. I have had some half-formed thoughts about how to improve the mixing and currently think that