Re: New MEAP: Mahout in Action

2010-01-14 Thread Isabel Drost
On 15.01.2010 Grant Ingersoll wrote: > (BTW, great read so far, I've got 3 more chapters to go in the first > 6!) Can second that: Great book indeed. > We should state up front, just like in Lucene land, that anyone who has a > book on Mahout is welcome to link it on the page. The more books

Re: New MEAP: Mahout in Action

2010-01-14 Thread Drew Farris
+1 as well. On Thu, Jan 14, 2010 at 9:41 PM, Ted Dunning wrote: > +1 (belatedly) > >

Re: New MEAP: Mahout in Action

2010-01-14 Thread Ted Dunning
+1 (belatedly) On Thu, Jan 14, 2010 at 6:34 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > +1 > > Otis > -- > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch > > > > - Original Message > > From: Grant Ingersoll > > To: mahout-dev@lucene.apache.org > > Sent: Thu,

Re: New MEAP: Mahout in Action

2010-01-14 Thread Otis Gospodnetic
+1 Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message > From: Grant Ingersoll > To: mahout-dev@lucene.apache.org > Sent: Thu, January 14, 2010 9:14:09 PM > Subject: Re: New MEAP: Mahout in Action > > +1. (BTW, great read so far, I've got 3 more cha

Re: New MEAP: Mahout in Action

2010-01-14 Thread Grant Ingersoll
+1. (BTW, great read so far, I've got 3 more chapters to go in the first 6!) We should state up front, just like in Lucene land, that anyone who has a book on Mahout is welcome to link it on the page. The more books on Mahout the merrier! On Jan 14, 2010, at 7:25 PM, Sean Owen wrote: > Manni

[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap

2010-01-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Status: Patch Available (was: Open) I am not going to let this sit, since it's just more o

[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap

2010-01-14 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Attachment: MAHOUT-248.patch > Next collections expansion kit: OpenObjectWhateverHashMap >

[jira] Created: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap

2010-01-14 Thread Benson Margulies (JIRA)
Next collections expansion kit: OpenObjectWhateverHashMap Key: MAHOUT-248 URL: https://issues.apache.org/jira/browse/MAHOUT-248 Project: Mahout Issue Type: Improvement Co

[jira] Resolved: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-247. -- Resolution: Fixed Tentatively resolving this after applying a fix to the most obvious case where this

[jira] Commented: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800466#action_12800466 ] Sean Owen commented on MAHOUT-247: -- This is actually very messy to really fix -- LongPrimi

Re: Fwd: New MEAP: Mahout in Action

2010-01-14 Thread Jake Mannix
+1 We should definitely have it posted. -jake On Jan 14, 2010 4:25 PM, "Sean Owen" wrote: Manning was suggesting we post a cover/link to Mahout in Action MEAP, like the Lucene Java page has such a link to Lucene in Action MEAP: http://lucene.apache.org/java/docs/index.html I think it's a go

Fwd: New MEAP: Mahout in Action

2010-01-14 Thread Sean Owen
Manning was suggesting we post a cover/link to Mahout in Action MEAP, like the Lucene Java page has such a link to Lucene in Action MEAP: http://lucene.apache.org/java/docs/index.html I think it's a good idea since it points people to more resources, and I want feedback on the text via MEAP. But I

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Benson Margulies
That paper's new to me. Our work combines Miller & Guinness on perceptrons and entities with Crammer on passive-aggressive, plus some secret sauce. I'm not in a position to open source it just now, but that may change. On Thu, Jan 14, 2010 at 2:12 PM, Olivier Grisel wrote: > 2010/1/14 Benson Margul

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Olivier Grisel
2010/1/14 Benson Margulies : > > If there's one NLP thing I know something about, now, it is named > entity extraction with averaged perceptrons and passive-aggressive > training. This has the advantage of being mathematically trivial > unless you want to prove that it works, which is as about as u

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Benson Margulies
Ah, well, a longer story. We sell segmenters and lemmatizers that plug into Lucene. Until recently, JNI all the way down. We've delivered a new version to a customer that does some European languages entirely in Java, and we expect to be able to do this for many more languages this year. On Thu, Jan

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Jason Rutherglen
Congrats Benson! Basis primarily uses a JNI wrapper to integrate with Lucene? I'm indexing using Hadoop and it'd be great if it were all in Java... So yeah, "We shall see". :) Jason On Wed, Jan 13, 2010 at 7:33 PM, Benson Margulies wrote: > I'm a somewhat grizzled software guy. My background i

[jira] Resolved: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved MAHOUT-206. -- Resolution: Fixed > Separate and clearly label different SparseVector implementations > ---

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800273#action_12800273 ] Jake Mannix commented on MAHOUT-206: bq. I had to delete another instance of TestVector

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800271#action_12800271 ] Sean Owen commented on MAHOUT-206: -- It all passes for me. I had to delete another instance

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Jake Mannix (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800268#action_12800268 ] Jake Mannix commented on MAHOUT-206: If you (or anyone else) wants to try this out for

[jira] Updated: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated MAHOUT-247: - Due Date: 21/Jan/10 Fix Version/s: 0.3 Assignee: Sean Owen Good one. I will review this

[jira] Created: (MAHOUT-247) GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences

2010-01-14 Thread Tolga Oral (JIRA)
GenericUserBasedRecommender.recommend causes connection leak when called for user with no preferences - Key: MAHOUT-247 URL: https://issues.apache.org/jira/browse/

[jira] Commented: (MAHOUT-228) Need sequential logistic regression implementation using SGD techniques

2010-01-14 Thread Ted Dunning (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800248#action_12800248 ] Ted Dunning commented on MAHOUT-228: We need a few things: - a few functions should b

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Priority: Minor (was: Major) > upgrade to new lucene TokenStream API to cleanup deprecation wa

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation warnings

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Summary: upgrade to new lucene TokenStream API to cleanup deprecation warnings (was: upgrade t

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
> > > > many map/reduce passes for partial vectorization > > Have you tried setting the Reducer as a Combiner too? > I am not sure that will work too, as the Combiner only gets a small chunk. > The cost of loading the dictionary into memory outweighs the cost of reading > the chunk in the combin

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil : > Thanks Olivier! Could you file a JIRA issue on that? There are a couple of > places where the old API is used. I think I tracked them all down, patch available here: https://issues.apache.org/jira/browse/MAHOUT-246 , all tests pass in maven. -- Olivier http://twitter.com

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil : > Some issues I am encountering. > > I use a chunk of the dictionary on every map/reduce pass. to create partial > vectors. > > >   - If i do the vectorization in the reducer. Lot of data(the entire >   dataset) gets thrown around the network during shuffle >   - If i do the

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Status: Patch Available (was: Open) > upgrade to new lucene TokenStream API to cleanup depreca

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Attachment: remove-lucene-deprecation-warnings.diff > upgrade to new lucene TokenStream API to

[jira] Updated: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Grisel updated MAHOUT-246: -- Description: The attached patch use the new ts.incrementToken() / TermAttribute API instead of

[jira] Created: (MAHOUT-246) upgrade to new lucene TokenStream API to cleanup deprecation API

2010-01-14 Thread Olivier Grisel (JIRA)
upgrade to new lucene TokenStream API to cleanup deprecation API Key: MAHOUT-246 URL: https://issues.apache.org/jira/browse/MAHOUT-246 Project: Mahout Issue Type: Improvement

[jira] Commented: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800196#action_12800196 ] Drew Farris commented on MAHOUT-244: Thanks Isabel > Add root log-likelihood method to

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Some issues I am encountering. I use a chunk of the dictionary on every map/reduce pass to create partial vectors. - If I do the vectorization in the reducer, a lot of data (the entire dataset) gets thrown around the network during shuffle - If I do the vectorization in the mapper, the in
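
The chunked approach described above can be illustrated with a minimal plain-Java sketch. This is not Mahout's DictionaryVectorizer and uses no Hadoop; the class and method names (PartialVectorizer, partialVector, merge) are hypothetical. It only assumes that each pass sees one slice of a term-to-index dictionary and that the partial vectors are summed afterwards.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not the Mahout implementation.
public class PartialVectorizer {

    // One pass: count only the tokens whose term is in this dictionary chunk.
    public static Map<Integer, Integer> partialVector(List<String> tokens,
                                                      Map<String, Integer> dictionaryChunk) {
        Map<Integer, Integer> partial = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionaryChunk.get(token);
            if (index != null) {
                partial.merge(index, 1, Integer::sum);
            }
        }
        return partial;
    }

    // Final step: sum the partial vectors produced by all passes.
    public static Map<Integer, Integer> merge(List<Map<Integer, Integer>> partials) {
        Map<Integer, Integer> full = new HashMap<>();
        for (Map<Integer, Integer> partial : partials) {
            partial.forEach((index, count) -> full.merge(index, count, Integer::sum));
        }
        return full;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("a", "b", "a", "c");
        Map<Integer, Integer> p1 = partialVector(doc, Map.of("a", 0, "b", 1));
        Map<Integer, Integer> p2 = partialVector(doc, Map.of("c", 2));
        System.out.println(merge(List.of(p1, p2)));
    }
}
```

Only one dictionary chunk has to be in memory per pass, at the price of reading the documents once per chunk.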

[jira] Commented: (MAHOUT-206) Separate and clearly label different SparseVector implementations

2010-01-14 Thread Sean Owen (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800188#action_12800188 ] Sean Owen commented on MAHOUT-206: -- You want me to try this and commit it? > Separate and

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Ha! In that case fanfares to all of you. D. On Thu, Jan 14, 2010 at 1:09 PM, Benson Margulies wrote: > Hey, credit where credit is due. I did \not/ do the initial > license-sorting port. I've been cleaning up and filling in after that > (aside from one little licensing problem that was not relat

Re: Collections of primitives.

2010-01-14 Thread Benson Margulies
Hey, credit where credit is due. I did \not/ do the initial license-sorting port. I've been cleaning up and filling in after that (aside from one little licensing problem that was not related to LGPL). I think that Jake and Sean get the credit for the heavy lifting. On Thu, Jan 14, 2010 at 6:52 AM

Re: forrest 1, benson 0

2010-01-14 Thread Benson Margulies
Been there, done that. Believe me, running Forrest in a Linux VM is less work. On Thu, Jan 14, 2010 at 6:57 AM, Grant Ingersoll wrote: > http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/ > > On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote: > >> Sn

Re: Fisheye?

2010-01-14 Thread Benson Margulies
I'll open the ticket. On Thu, Jan 14, 2010 at 3:31 AM, Isabel Drost wrote: > On Wed Benson Margulies wrote: >> Are we set up? > > If we are, then at least I am not aware of it. > > Isabel >

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Retrying with C1.medium. It's almost 4 times faster at 2.4 times the cost. Robin On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore wrote: > On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning > wrote: > > Large instances may give you more cost effective throughput. Even if > they > > are a break-even,
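
A quick worked calculation of what those two figures imply per job, as a sketch. It assumes the cost of a job is proportional to runtime times hourly price (ignoring per-hour billing granularity) and treats "almost 4 times" as exactly 4.0; the class and method names are hypothetical.

```java
// Hypothetical helper: relative cost per job = relative hourly price / relative speed.
public class CostPerJob {

    public static double relativeCostPerJob(double priceRatio, double speedup) {
        if (speedup <= 0) {
            throw new IllegalArgumentException("speedup must be positive");
        }
        return priceRatio / speedup;
    }

    public static void main(String[] args) {
        // Figures from the message above: 2.4x the cost, about 4x faster.
        double ratio = relativeCostPerJob(2.4, 4.0);
        System.out.printf("cost per job relative to the small instance: %.2f%n", ratio);
    }
}
```

Under these assumptions the same job costs roughly 0.6 times as much on the faster instance, which is consistent with the point quoted above that larger instances may give more cost-effective throughput.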

Re: forrest 1, benson 0

2010-01-14 Thread Grant Ingersoll
http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/ On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote: > SnowLeopard creates links named 1.5 which point to 1.6. There is this > extremely finicky procedure for installing an actual 1.5, but it > doesn't

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Grant Ingersoll
On Jan 14, 2010, at 6:11 AM, Robin Anil wrote: > On the question of analyzer quality. (Assuming speed could be circumvented > by adding more machines) > > Wikipedia data is in wikitext format > > so there are many {{Title}} [[Link|LinkText]] some html tags There is a Wikipedia Tokenizer in Lu

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Oh, as a side note to Benson -- your effort on porting these COLT collections is appreciated from more than one angle, regardless of the HPPC discussion. We have been using COLT's math/ matrix packages in C2 and had a long-open issue of getting rid of the LGPL parts. Solved now, thanks! http://is

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/14 Robin Anil : > On the question of analyzer quality. (Assuming speed could be circumvented > by adding more machines) > > Wikipedia data is in wikitext format > > so there are many {{Title}} [[Link|LinkText]] some html tags > > Should I be writing my own stream based analyzer maybe some r

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
On the question of analyzer quality (assuming speed could be addressed by adding more machines): Wikipedia data is in wikitext format, so there are many {{Title}} and [[Link|LinkText]] constructs and some HTML tags. Should I be writing my own stream-based analyzer, or will some regex rules to filter them do?
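
For illustration, the "regex rules" option might look roughly like the sketch below. It uses only java.util.regex; it is not Lucene's Wikipedia tokenizer, and the class name WikitextFilter and its rules are assumptions that cover just the three constructs mentioned above ({{templates}}, [[links]], HTML tags), not full wikitext.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a few regex rules for the constructs named in the message.
public class WikitextFilter {

    // {{Template}} with no braces inside; applied repeatedly to peel nested templates.
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[Target|Text]] keeps only Text.
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[[^\\[\\]|]*\\|([^\\[\\]]*)\\]\\]");
    // [[Target]] keeps Target.
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\[\\]|]*)\\]\\]");
    // Opening or closing HTML-like tags.
    private static final Pattern HTML_TAG = Pattern.compile("</?[A-Za-z][^<>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    public static String strip(String wikitext) {
        String s = wikitext;
        String previous;
        do {
            previous = s;
            s = TEMPLATE.matcher(s).replaceAll("");
        } while (!s.equals(previous));
        s = PIPED_LINK.matcher(s).replaceAll("$1");
        s = PLAIN_LINK.matcher(s).replaceAll("$1");
        s = HTML_TAG.matcher(s).replaceAll(" ");
        return SPACES.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("{{Title}} See [[Link|LinkText]] and <b>bold</b> text."));
    }
}
```

Rules like these run over whole strings rather than a token stream, so they would sit in front of an analyzer rather than replace one; real wikitext (tables, references, deeply nested markup) needs more than this.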

[jira] Updated: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost updated MAHOUT-244: Resolution: Fixed Status: Resolved (was: Patch Available) Patch applies cleanly and looks

[jira] Assigned: (MAHOUT-244) Add root log-likelihood method to LogLikehood class.

2010-01-14 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Drost reassigned MAHOUT-244: --- Assignee: Drew Farris > Add root log-likelihood method to LogLikehood class. > --

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Thanks Olivier! Could you file a JIRA issue on that? There are a couple of places where the old API is used. On Thu, Jan 14, 2010 at 4:09 PM, Olivier Grisel wrote: > 2010/1/13 Robin Anil : > > I have fired up a small instance of EC2(Single node for the moment) and > have > > been dabbling with th

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Robin Anil
Cool. Now I know what c1 types are for. Thanks for the tip! On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore wrote: > On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning > wrote: > > Large instances may give you more cost effective throughput. Even if > they > > are a break-even, the cost/job should

Re: DictionaryVectorizer meets Wikipedia.

2010-01-14 Thread Olivier Grisel
2010/1/13 Robin Anil : > I have fired up a small instance of EC2(Single node for the moment) and have > been dabbling with the latest XML dump of the articles base of Wikipedia > > wiki XML is around 25GB which was split into 128MB chunks and stored on hdfs > WikipediaToSequenceFile class runs an M

[jira] Updated: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim updated MAHOUT-245: Status: Patch Available (was: Open) > Better handling of Categorical attributes when build

[jira] Commented: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156 ] Deneche A. Hakim commented on MAHOUT-245: - I modified the code to not select Catego

[jira] Updated: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim updated MAHOUT-245: Attachment: mahout-245.patch > Better handling of Categorical attributes when building Deci

[jira] Created: (MAHOUT-245) Better handling of Categorical attributes when building Decision Forests

2010-01-14 Thread Deneche A. Hakim (JIRA)
Better handling of Categorical attributes when building Decision Forests Key: MAHOUT-245 URL: https://issues.apache.org/jira/browse/MAHOUT-245 Project: Mahout Issue Typ

[jira] Resolved: (MAHOUT-216) Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data

2010-01-14 Thread Deneche A. Hakim (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deneche A. Hakim resolved MAHOUT-216. - Resolution: Fixed Done. > Improve the results of MAHOUT-145 by uniformly distributing th

Re: [math] no-such-integer value

2010-01-14 Thread Isabel Drost
On Mon Grant Ingersoll wrote: > I'm sensing a theme. I think for this stuff we should prune fairly > aggressively, then add back in places once we have a need. +1 Isabel

Re: Fisheye?

2010-01-14 Thread Isabel Drost
On Wed Benson Margulies wrote: > Are we set up? If we are, then at least I am not aware of it. Isabel

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Isabel Drost
On Wed Grant Ingersoll wrote: > The Lucene PMC is pleased to welcome the addition of Benson Marguiles > as a committer on Mahout. Welcome Benson - thanks for all the great work you have done so far on the mahout-math stuff. Looking forward to working together with you. Isabel

Re: Collections of primitives.

2010-01-14 Thread Dawid Weiss
Let's do this, guys: I have finished the implementation of basic data structures. I will try to merge this code with Carrot2, replacing PCJ; this should give me an additional level of confidence that everything is working fine. I plan to have this step done by Friday. Then, I will make this code a

Re: Welcome Benson Marguiles as Mahout Committer

2010-01-14 Thread Dawid Weiss
Congratulations, Benson! D. On Wed, Jan 13, 2010 at 9:28 PM, Grant Ingersoll wrote: > The Lucene PMC is pleased to welcome the addition of Benson Marguiles as a > committer on Mahout.  I hope you'll join me in offering Benson a warm welcome. > > Benson, Lucene tradition is that new committers pr