Congratulations, Benson!
D.
On Wed, Jan 13, 2010 at 9:28 PM, Grant Ingersoll gsing...@apache.org wrote:
The Lucene PMC is pleased to welcome the addition of Benson Margulies as a
committer on Mahout. I hope you'll join me in offering Benson a warm welcome.
Benson, Lucene tradition is that
Let's do this, guys: I have finished the implementation of basic data
structures. I will try to merge this code with Carrot2, replacing PCJ;
this should give me an additional level of confidence that everything
is working fine. I plan to have this step done by Friday.
Then, I will make this code
On Wed Grant Ingersoll gsing...@apache.org wrote:
The Lucene PMC is pleased to welcome the addition of Benson Margulies
as a committer on Mahout.
Welcome Benson - thanks for all the great work you have done so far on
the mahout-math stuff. Looking forward to working together with you.
Isabel
On Wed Benson Margulies bimargul...@gmail.com wrote:
Are we set up?
If we are, then at least I am not aware of it.
Isabel
On Mon Grant Ingersoll gsing...@apache.org wrote:
I'm sensing a theme. I think for this stuff we should prune fairly
aggressively, then add back in places once we have a need.
+1
Isabel
[
https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim resolved MAHOUT-216.
-
Resolution: Fixed
Done.
Improve the results of MAHOUT-145 by uniformly distributing
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-245:
Attachment: mahout-245.patch
Better handling of Categorical attributes when building
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156
]
Deneche A. Hakim commented on MAHOUT-245:
-
I modified the code to not select
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-245:
Status: Patch Available (was: Open)
Better handling of Categorical attributes when
Cool. Now I know what c1 types are for. Thanks for the tip!
On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore shashik...@gmail.com wrote:
On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Large instances may give you more cost effective throughput. Even if
they
are
Thanks, Olivier! Could you file a JIRA issue on that? There are a couple of
places where the old API is used.
On Thu, Jan 14, 2010 at 4:09 PM, Olivier Grisel olivier.gri...@ensta.org wrote:
2010/1/13 Robin Anil robin.a...@gmail.com:
I have fired up a small instance of EC2 (single node for the
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Isabel Drost reassigned MAHOUT-244:
---
Assignee: Drew Farris
Add root log-likelihood method to LogLikehood class.
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Isabel Drost updated MAHOUT-244:
Resolution: Fixed
Status: Resolved (was: Patch Available)
Patch applies cleanly and looks
On the question of analyzer quality (assuming speed could be addressed
by adding more machines):
Wikipedia data is in wikitext format,
so there are many {{Title}}, [[Link|LinkText]], and some HTML tags.
Should I be writing my own stream-based analyzer, or will some regex rules to
filter them do?
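A minimal sketch of the regex option raised in this message, assuming plain java.util.regex rather than a Lucene analyzer. The class name WikitextCleaner and its rules are hypothetical, not part of Mahout or Lucene, and nested templates are not handled:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not Mahout or Lucene code: strips common wikitext markup with regexes.
final class WikitextCleaner {
    // {{Title}} style templates (non-nested only)
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[Link|LinkText]] keeps only the text after the first pipe
    private static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[[^\\[\\]|]*\\|([^\\[\\]]*)\\]\\]");
    // [[Link]] keeps the link target as text
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\[\\]|]*)\\]\\]");
    // Any HTML-like tag such as <b> or </ref>
    private static final Pattern HTML_TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String clean(String wikitext) {
        String s = TEMPLATE.matcher(wikitext).replaceAll(" ");
        s = PIPED_LINK.matcher(s).replaceAll("$1");
        s = PLAIN_LINK.matcher(s).replaceAll("$1");
        s = HTML_TAG.matcher(s).replaceAll(" ");
        // Collapse the whitespace left behind by the removals
        return SPACES.matcher(s).replaceAll(" ").trim();
    }
}
```

Calling WikitextCleaner.clean on each article body before tokenizing would be one way to use it; a stream-based tokenizer is likely the better fit for large dumps, since this sketch holds the whole string in memory.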
2010/1/14 Robin Anil robin.a...@gmail.com:
On the question of analyzer quality (assuming speed could be addressed
by adding more machines):
Wikipedia data is in wikitext format,
so there are many {{Title}}, [[Link|LinkText]], and some HTML tags.
Should I be writing my own stream-based analyzer
Oh, as a side note to Benson -- your effort on porting these COLT
collections is appreciated from more than one angle, regardless of the
HPPC discussion. We have been using COLT's math/ matrix packages in C2
and had a long-open issue of getting rid of the LGPL parts. Solved
now, thanks!
On Jan 14, 2010, at 6:11 AM, Robin Anil wrote:
On the question of analyzer quality (assuming speed could be addressed
by adding more machines):
Wikipedia data is in wikitext format,
so there are many {{Title}}, [[Link|LinkText]], and some HTML tags.
There is a Wikipedia Tokenizer in Lucene
http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/
On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote:
Snow Leopard creates links named 1.5 which point to 1.6. There is this
extremely finicky procedure for installing an actual 1.5, but it
doesn't
Retrying with C1.medium. It's almost 4 times faster at 2.4 times the cost.
Robin
On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore shashik...@gmail.com wrote:
On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Large instances may give you more cost effective throughput.
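A quick check of the arithmetic in this message, assuming "cost" means the hourly price and the reported speedup holds for the whole job. The class and method names are hypothetical:

```java
// Hypothetical helper for comparing instance types on the same job.
final class InstanceCost {
    /** Relative cost of finishing the same job: price ratio divided by speedup. */
    static double relativeJobCost(double priceRatio, double speedup) {
        if (speedup <= 0) {
            throw new IllegalArgumentException("speedup must be positive");
        }
        return priceRatio / speedup;
    }
}
```

With the figures quoted here, relativeJobCost(2.4, 4.0) is 2.4 / 4, about 0.6, so the same job would cost roughly 60% as much on the faster instance under these assumptions.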
I'll open the ticket.
On Thu, Jan 14, 2010 at 3:31 AM, Isabel Drost isa...@apache.org wrote:
On Wed Benson Margulies bimargul...@gmail.com wrote:
Are we set up?
If we are, then at least I am not aware of it.
Isabel
Been there, done that. Believe me, running Forrest in a Linux VM is less work.
On Thu, Jan 14, 2010 at 6:57 AM, Grant Ingersoll gsing...@apache.org wrote:
http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/
On Jan 13, 2010, at 6:47 PM, Benson
Hey, credit where credit is due. I did \not/ do the initial
license-sorting port. I've been cleaning up and filling in after that
(aside from one little licensing problem that was not related to
LGPL). I think that Jake and Sean get the credit for the heavy
lifting.
On Thu, Jan 14, 2010 at 6:52
Ha! In that case fanfares to all of you.
D.
On Thu, Jan 14, 2010 at 1:09 PM, Benson Margulies bimargul...@gmail.com wrote:
Hey, credit where credit is due. I did \not/ do the initial
license-sorting port. I've been cleaning up and filling in after that
(aside from one little licensing problem
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800188#action_12800188
]
Sean Owen commented on MAHOUT-206:
--
You want me to try this and commit it?
Separate and
Some issues I am encountering.
I use a chunk of the dictionary on every map/reduce pass to create partial
vectors.
- If I do the vectorization in the reducer, a lot of data (the entire
dataset) gets thrown around the network during the shuffle.
- If I do the vectorization in the mapper, the
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800196#action_12800196
]
Drew Farris commented on MAHOUT-244:
Thanks Isabel
Add root log-likelihood method to
upgrade to new lucene TokenStream API to cleanup deprecation API
Key: MAHOUT-246
URL: https://issues.apache.org/jira/browse/MAHOUT-246
Project: Mahout
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Attachment: remove-lucene-deprecation-warnings.diff
upgrade to new lucene TokenStream API to
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Status: Patch Available (was: Open)
upgrade to new lucene TokenStream API to cleanup
2010/1/14 Robin Anil robin.a...@gmail.com:
Some issues I am encountering.
I use a chunk of the dictionary on every map/reduce pass to create partial
vectors.
- If I do the vectorization in the reducer, a lot of data (the entire
dataset) gets thrown around the network during the shuffle.
-
2010/1/14 Robin Anil robin.a...@gmail.com:
Thanks, Olivier! Could you file a JIRA issue on that? There are a couple of
places where the old api is used.
I think I tracked them all down, patch available here:
https://issues.apache.org/jira/browse/MAHOUT-246 , all tests pass in
maven.
--
Olivier
many map/reduce passes for partial vectorization
Have you tried setting the Reducer as a Combiner too?
I am not sure that will work either, as the Combiner only gets a small chunk.
The cost of loading the dictionary into memory outweighs the cost of reading
the chunk in the combiner.
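The chunked scheme discussed in this thread can be sketched in plain Java, without Hadoop. This is an illustration only: the class and method names are hypothetical and this is not Mahout's implementation. Each call to partialVector stands in for one map/reduce pass that loads a single dictionary chunk, and merge stands in for the final step that combines the partial vectors:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of partial vectorization with dictionary chunks.
final class PartialVectorizer {
    /** Builds a sparse partial vector (term index to count) using only one dictionary chunk. */
    static Map<Integer, Integer> partialVector(List<String> tokens,
                                               Map<String, Integer> dictionaryChunk) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokens) {
            Integer index = dictionaryChunk.get(token);
            if (index != null) {
                // Tokens outside this chunk are skipped; a later pass picks them up
                vector.merge(index, 1, Integer::sum);
            }
        }
        return vector;
    }

    /** Merges the partial vectors from separate passes into one document vector. */
    static Map<Integer, Integer> merge(List<Map<Integer, Integer>> partials) {
        Map<Integer, Integer> full = new TreeMap<>();
        for (Map<Integer, Integer> partial : partials) {
            partial.forEach((index, count) -> full.merge(index, count, Integer::sum));
        }
        return full;
    }
}
```

The sketch shows why the number of passes grows with the number of chunks: every pass rereads the document tokens, which is the cost being weighed against holding the whole dictionary in memory.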
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Summary: upgrade to new lucene TokenStream API to cleanup deprecation
warnings (was: upgrade
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Priority: Minor (was: Major)
upgrade to new lucene TokenStream API to cleanup deprecation
[
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800248#action_12800248
]
Ted Dunning commented on MAHOUT-228:
We need a few things:
- a few functions should
GenericUserBasedRecommender.recommend causes connection leak when called for
user with no preferences
-
Key: MAHOUT-247
URL:
[
https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated MAHOUT-247:
-
Due Date: 21/Jan/10
Fix Version/s: 0.3
Assignee: Sean Owen
Good one. I will review
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800268#action_12800268
]
Jake Mannix commented on MAHOUT-206:
If you (or anyone else) wants to try this out for
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800271#action_12800271
]
Sean Owen commented on MAHOUT-206:
--
It all passes for me. I had to delete another instance
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-206.
--
Resolution: Fixed
Separate and clearly label different SparseVector implementations
Congrats Benson!
Basis primarily uses a JNI wrapper to integrate with Lucene? I'm
indexing using Hadoop and it'd be great if it were all in Java... So
yeah, We shall see. :)
Jason
On Wed, Jan 13, 2010 at 7:33 PM, Benson Margulies bimargul...@gmail.com wrote:
I'm a somewhat grizzled software
Ah, well, a longer story.
We sell segmenters and lemmatizers that plug into Lucene. Until recently,
JNI all the way down. We've delivered a new version to a customer that
does some European languages entirely in Java, and we expect to be
able to do this for many more languages this year.
On Thu,
2010/1/14 Benson Margulies bimargul...@gmail.com:
If there's one NLP thing I know something about, now, it is named
entity extraction with averaged perceptrons and passive-aggressive
training. This has the advantage of being mathematically trivial
unless you want to prove that it works, which
That paper's new to me.
Our work combines MillerGuinness on perceptrons and entities with
Cramer on passive-aggressive, plus some secret sauce. I'm not in a
position to open source it just now, but that may change.
On Thu, Jan 14, 2010 at 2:12 PM, Olivier Grisel
olivier.gri...@ensta.org wrote:
Manning was suggesting we post a cover/link to Mahout in Action MEAP,
like the Lucene Java page has such a link to Lucene in Action MEAP:
http://lucene.apache.org/java/docs/index.html
I think it's a good idea since it points people to more resources, and
I want feedback on the text via MEAP. But
+1
We should definitely have it posted.
-jake
On Jan 14, 2010 4:25 PM, Sean Owen sro...@gmail.com wrote:
Manning was suggesting we post a cover/link to Mahout in Action MEAP,
like the Lucene Java page has such a link to Lucene in Action MEAP:
http://lucene.apache.org/java/docs/index.html
I
[
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-248:
Attachment: MAHOUT-248.patch
Next collections expansion kit: OpenObjectWhateverHashMapT
[
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-248:
Status: Patch Available (was: Open)
I am not going to let this sit, since it's just more
+1. (BTW, great read so far, I've got 3 more chapters to go in the first 6!)
We should state up front, just like in Lucene land, that anyone who has a book
on Mahout is welcome to link it on the page. The more books on Mahout the
merrier!
On Jan 14, 2010, at 7:25 PM, Sean Owen wrote:
+1
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
From: Grant Ingersoll gsing...@apache.org
To: mahout-dev@lucene.apache.org
Sent: Thu, January 14, 2010 9:14:09 PM
Subject: Re: New MEAP: Mahout in Action
+1. (BTW, great read so far, I've
+1 (belatedly)
On Thu, Jan 14, 2010 at 6:34 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
+1
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
From: Grant Ingersoll gsing...@apache.org
To: mahout-dev@lucene.apache.org
Sent:
+1 as well.
On Thu, Jan 14, 2010 at 9:41 PM, Ted Dunning ted.dunn...@gmail.com wrote:
+1 (belatedly)
On 15.01.2010 Grant Ingersoll wrote:
(BTW, great read so far, I've got 3 more chapters to go in the first
6!)
Can second that: Great book indeed.
We should state up front, just like in Lucene land, that anyone who has a
book on Mahout is welcome to link it on the page. The more books on