Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Dawid Weiss
Congratulations, Benson! D. On Wed, Jan 13, 2010 at 9:28 PM, Grant Ingersoll gsing...@apache.org wrote: The Lucene PMC is pleased to welcome the addition of Benson Marguiles as a committer on Mahout.  I hope you'll join me in offering Benson a warm welcome. Benson, Lucene tradition is that

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Let's do this, guys: I have finished the implementation of basic data structures. I will try to merge this code with Carrot2, replacing PCJ; this should give me an additional level of confidence that everything is working fine. I plan to have this step done by Friday. Then, I will make this code

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Isabel Drost
On Wed Grant Ingersoll gsing...@apache.org wrote: The Lucene PMC is pleased to welcome the addition of Benson Marguiles as a committer on Mahout. Welcome Benson - thanks to all the great work you have done so far for the mahout-math stuff. Looking forward to working together with you. Isabel

Re: Fisheye?

2010-01-14 Thread Isabel Drost
On Wed Benson Margulies bimargul...@gmail.com wrote: Are we set up? If we are, then at least I am not aware of it. Isabel

Re: [math] no-such-integer value

2010-01-14 Thread Isabel Drost
On Mon Grant Ingersoll gsing...@apache.org wrote: I'm sensing a theme. I think for this stuff we should prune fairly aggressively, then add back in places once we have a need. +1 Isabel

[jira] Resolved: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim resolved MAHOUT-216. - Resolution: Fixed Done. Improve the results of MAHOUT-145 by uniformly distributing

[jira] Updated: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim updated MAHOUT-245: Attachment: mahout-245.patch Better handling of Categorical attributes when building

[jira] Commented: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156 ] Deneche A. Hakim commented on MAHOUT-245: - I modified the code to not select

[jira] Updated: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim updated MAHOUT-245: Status: Patch Available (was: Open) Better handling of Categorical attributes when

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Cool. Now I know what c1 types are for. Thanks for the tip! On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore shashik...@gmail.com wrote: On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning ted.dunn...@gmail.com wrote: Large instances may give you more cost effective throughput. Even if they are

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Thanks, Olivier! Could you file a JIRA issue on that? There are a couple of places where the old API is used. On Thu, Jan 14, 2010 at 4:09 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2010/1/13 Robin Anil robin.a...@gmail.com: I have fired up a small instance of EC2 (single node for the

[jira] Assigned: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-244: --- Assignee: Drew Farris Add root log-likelihood method to LogLikehood class.

[jira] Updated: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-244: Resolution: Fixed Status: Resolved (was: Patch Available) Patch applies cleanly and looks

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
On the question of analyzer quality (assuming speed could be circumvented by adding more machines): Wikipedia data is in wikitext format, so there are many {{Title}}, [[Link|LinkText]], and some HTML tags. Should I be writing my own stream-based analyzer, or will some regex rules to filter them do?
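The regex option raised in this message could look roughly like the sketch below. It is only an illustration in Python, not code from the thread or from Mahout: the patterns and the function name `strip_wikitext` are assumptions, and they handle only simple templates, one-pipe links, and plain HTML tags. Real wikitext (tables, multi-part file links, references) would need more rules or a dedicated tokenizer.

```python
import re

# Hypothetical patterns, chosen for illustration only.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                     # innermost {{Title}}
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")    # [[Link|LinkText]] or [[Link]]
HTML_TAG = re.compile(r"<[^>]+>")                            # <b>, </b>, <br/>


def strip_wikitext(text: str) -> str:
    """Remove simple wikitext markup and keep the visible text."""
    # Drop innermost templates repeatedly so that simple nesting collapses.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub(" ", text)
    # Keep the link text (or the target when there is no pipe).
    text = LINK.sub(r"\1", text)
    # Drop HTML tags but keep the text between them.
    text = HTML_TAG.sub(" ", text)
    # Normalize whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


cleaned = strip_wikitext("{{Title}} See [[Apache Mahout|Mahout]] and [[Lucene]] <b>now</b>")
```

Links with more than one pipe are deliberately left untouched here, which is one reason a stream-based analyzer may still be the better choice.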

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil robin.a...@gmail.com: On the question of analyzer quality (assuming speed could be circumvented by adding more machines): Wikipedia data is in wikitext format, so there are many {{Title}}, [[Link|LinkText]], and some HTML tags. Should I be writing my own stream-based analyzer

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Oh, as a side note to Benson -- your effort on porting these COLT collections is appreciated from more than one angle, regardless of the HPPC discussion. We have been using COLT's math/matrix packages in C2 and had a long-open issue of getting rid of the LGPL parts. Solved now, thanks!

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Grant Ingersoll
On Jan 14, 2010, at 6:11 AM, Robin Anil wrote: On the question of analyzer quality (assuming speed could be circumvented by adding more machines): Wikipedia data is in wikitext format, so there are many {{Title}}, [[Link|LinkText]], and some HTML tags. There is a Wikipedia Tokenizer in Lucene

Re: forrest 1, benson 0

2010-01-14 Thread Grant Ingersoll
http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/ On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote: SnowLeopard creates links named 1.5 which point to 1.6. There is this extremely finicky procedure for installing an actual 1.5, but it doesn't

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Retrying with C1.medium. It's almost 4 times faster at 2.4 times the cost. Robin On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore shashik...@gmail.com wrote: On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning ted.dunn...@gmail.com wrote: Large instances may give you more cost effective throughput.
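The two figures in this message can be turned into a relative cost per job. The sketch below assumes the same job runs on both instance types and that billing is proportional to run time; the function name is hypothetical, and the inputs are the approximate numbers quoted above rather than measured values.

```python
def relative_cost_per_job(speedup: float, price_ratio: float) -> float:
    """Cost of one job on the new instance relative to the old one.

    Run time shrinks by `speedup` while the hourly price grows by
    `price_ratio`, so the cost per job scales by price_ratio / speedup.
    """
    return price_ratio / speedup


# Approximate figures from the message: ~4x faster at 2.4x the hourly cost.
ratio = relative_cost_per_job(speedup=4.0, price_ratio=2.4)
```

Under those assumptions the ratio is about 0.6, meaning each job costs roughly 40% less on the faster instance despite the higher hourly price.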

Re: Fisheye?

2010-01-14 Thread Benson Margulies
I'll open the ticket. On Thu, Jan 14, 2010 at 3:31 AM, Isabel Drost isa...@apache.org wrote: On Wed Benson Margulies bimargul...@gmail.com wrote: Are we set up? If we are, then at least I am not aware of it. Isabel

Re: forrest 1, benson 0

2010-01-14 Thread Benson Margulies
Been there, done that. Believe me, running forrest in a Linux VM is less work. On Thu, Jan 14, 2010 at 6:57 AM, Grant Ingersoll gsing...@apache.org wrote: http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/ On Jan 13, 2010, at 6:47 PM, Benson

Re: Collections of primitives.

2010-01-14 Thread Benson Margulies
Hey, credit where credit is due. I did \not/ do the initial license-sorting port. I've been cleaning up and filling in after that (aside from one little licensing problem that was not related to LGPL). I think that Jake and Sean get the credit for the heavy lifting. On Thu, Jan 14, 2010 at 6:52

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Ha! In that case fanfares to all of you. D. On Thu, Jan 14, 2010 at 1:09 PM, Benson Margulies bimargul...@gmail.com wrote: Hey, credit where credit is due. I did \not/ do the initial license-sorting port. I've been cleaning up and filling in after that (aside from one little licensing problem

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800188#action_12800188 ] Sean Owen commented on MAHOUT-206: -- You want me to try this and commit it? Separate and

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Some issues I am encountering. I use a chunk of the dictionary on every map/reduce pass to create partial vectors. - If I do the vectorization in the reducer, a lot of data (the entire dataset) gets thrown around the network during shuffle. - If I do the vectorization in the mapper, the
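The chunked approach described in this message can be pictured with the small in-memory sketch below. It is not the DictionaryVectorizer code and involves no Hadoop: the function names, the dictionary chunks, and the sample document are all hypothetical, and it only shows how term-frequency vectors built against separate dictionary chunks can be merged into one sparse vector.

```python
from collections import Counter
from typing import Dict, Iterable


def partial_vector(tokens: Iterable[str], dictionary_chunk: Dict[str, int]) -> Dict[int, int]:
    """Term-frequency vector restricted to the terms in one dictionary chunk."""
    counts = Counter(t for t in tokens if t in dictionary_chunk)
    return {dictionary_chunk[term]: tf for term, tf in counts.items()}


def merge_partial_vectors(partials: Iterable[Dict[int, int]]) -> Dict[int, int]:
    """Sum sparse partial vectors (index -> frequency) into one vector."""
    merged: Dict[int, int] = {}
    for partial in partials:
        for index, tf in partial.items():
            merged[index] = merged.get(index, 0) + tf
    return merged


# Hypothetical dictionary split into two chunks (term -> vector index).
chunks = [{"mahout": 0, "lucene": 1}, {"hadoop": 2, "vector": 3}]
document = ["mahout", "hadoop", "mahout", "vector", "unknown"]

# One pass per chunk, then a merge; terms outside every chunk are dropped.
full_vector = merge_partial_vectors(partial_vector(document, c) for c in chunks)
```

Only one chunk needs to be in memory per pass, which is the trade-off against the extra passes over the data that the thread goes on to discuss.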

[jira] Commented: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800196#action_12800196 ] Drew Farris commented on MAHOUT-244: Thanks Isabel Add root log-likelihood method to

[jira] Created: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
upgrade to new lucene TokenStream API to cleanup deprecation API Key: MAHOUT-246 URL: https://issues.apache.org/jira/browse/MAHOUT-246 Project: Mahout Issue Type: Improvement

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Attachment: remove-lucene-deprecation-warnings.diff upgrade to new lucene TokenStream API to

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Status: Patch Available (was: Open) upgrade to new lucene TokenStream API to cleanup

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil robin.a...@gmail.com: Some issues I am encountering. I use a chunk of the dictionary on every map/reduce pass to create partial vectors.   - If I do the vectorization in the reducer, a lot of data (the entire dataset) gets thrown around the network during shuffle.   -

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil robin.a...@gmail.com: Thanks, Olivier! Could you file a JIRA issue on that? There are a couple of places where the old API is used. I think I tracked them all down, patch available here: https://issues.apache.org/jira/browse/MAHOUT-246 , all tests pass in maven. -- Olivier

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
many map/reduce passes for partial vectorization Have you tried setting the Reducer as a Combiner too? I am not sure that will work either, as the Combiner only gets a small chunk. The cost of loading the dictionary into memory outweighs the cost of reading the chunk in the combiner.

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Summary: upgrade to new lucene TokenStream API to cleanup deprecation warnings (was: upgrade

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Priority: Minor (was: Major) upgrade to new lucene TokenStream API to cleanup deprecation

[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2010-01-14 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800248#action_12800248 ] Ted Dunning commented on MAHOUT-228: We need a few things: - a few functions should

[jira] Created: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Tolga Oral (JIRA)
GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences - Key: MAHOUT-247 URL:

[jira] Updated: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-247: - Due Date: 21/Jan/10 Fix Version/s: 0.3 Assignee: Sean Owen Good one. I will review

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800268#action_12800268 ] Jake Mannix commented on MAHOUT-206: If you (or anyone else) wants to try this out for

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800271#action_12800271 ] Sean Owen commented on MAHOUT-206: -- It all passes for me. I had to delete another instance

[jira] Resolved: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-206. -- Resolution: Fixed Separate and clearly label different SparseVector implementations

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Jason Rutherglen
Congrats Benson! Basis primarily uses a JNI wrapper to integrate with Lucene? I'm indexing using Hadoop and it'd be great if it were all in Java... So yeah, we shall see. :) Jason On Wed, Jan 13, 2010 at 7:33 PM, Benson Margulies bimargul...@gmail.com wrote: I'm a somewhat grizzled software

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Benson Margulies
Ah, well, a longer story. We sell segmenters and lemmatizers that plug into Lucene. Until recently, JNI all the way down. We've delivered a new version to a customer that does some European languages entirely in Java, and we expect to be able to do this for many more languages this year. On Thu,

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Olivier Grisel
2010/1/14 Benson Margulies bimargul...@gmail.com: If there's one NLP thing I know something about, now, it is named entity extraction with averaged perceptrons and passive-aggressive training. This has the advantage of being mathematically trivial unless you want to prove that it works, which

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Benson Margulies
That paper's new to me. Our work combines Miller and Guinness on perceptrons and entities with Cramer on passive-aggressive, plus some secret sauce. I'm not in a position to open source it just now, but that may change. On Thu, Jan 14, 2010 at 2:12 PM, Olivier Grisel olivier.gri...@ensta.org wrote:

Fwd: New MEAP: Mahout in Action

2010-01-14 Thread Sean Owen
Manning was suggesting we post a cover/link to Mahout in Action MEAP, like the Lucene Java page has such a link to Lucene in Action MEAP: http://lucene.apache.org/java/docs/index.html I think it's a good idea since it points people to more resources, and I want feedback on the text via MEAP. But

Re: Fwd: New MEAP: Mahout in Action

2010-01-14 Thread Jake Mannix
+1 We should definitely have it posted. -jake On Jan 14, 2010 4:25 PM, Sean Owen sro...@gmail.com wrote: Manning was suggesting we post a cover/link to Mahout in Action MEAP, like the Lucene Java page has such a link to Lucene in Action MEAP: http://lucene.apache.org/java/docs/index.html I

[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMapT

2010-01-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Attachment: MAHOUT-248.patch Next collections expansion kit: OpenObjectWhateverHashMapT

[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMapT

2010-01-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Status: Patch Available (was: Open) I am not going to let this sit, since it's just more

Re: New MEAP: Mahout in Action

2010-01-14 Thread Grant Ingersoll
+1. (BTW, great read so far, I've got 3 more chapters to go in the first 6!) We should state up front, just like in Lucene land, that anyone who has a book on Mahout is welcome to link it on the page. The more books on Mahout the merrier! On Jan 14, 2010, at 7:25 PM, Sean Owen wrote:

Re: New MEAP: Mahout in Action

2010-01-14 Thread Otis Gospodnetic
+1 Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Grant Ingersoll gsing...@apache.org To: mahout-dev@lucene.apache.org Sent: Thu, January 14, 2010 9:14:09 PM Subject: Re: New MEAP: Mahout in Action +1. (BTW, great read so far, I've

Re: New MEAP: Mahout in Action

2010-01-14 Thread Ted Dunning
+1 (belatedly) On Thu, Jan 14, 2010 at 6:34 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: +1 Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Grant Ingersoll gsing...@apache.org To: mahout-dev@lucene.apache.org Sent:

Re: New MEAP: Mahout in Action

2010-01-14 Thread Drew Farris
+1 as well. On Thu, Jan 14, 2010 at 9:41 PM, Ted Dunning ted.dunn...@gmail.com wrote: +1 (belatedly)

Re: New MEAP: Mahout in Action

2010-01-14 Thread Isabel Drost
On 15.01.2010 Grant Ingersoll wrote: (BTW, great read so far, I've got 3 more chapters to go in the first 6!) Can second that: Great book indeed. We should state up front, just like in Lucene land, that anyone who has a book on Mahout is welcome to link it on the page. The more books on