On 15.01.2010 Grant Ingersoll wrote:
> (BTW, great read so far, I've got 3 more chapters to go in the first
> 6!)
Can second that: Great book indeed.
> We should state up front, just like in Lucene land, that anyone who has a
> book on Mahout is welcome to link it on the page. The more books
+1 as well.
On Thu, Jan 14, 2010 at 9:41 PM, Ted Dunning wrote:
> +1 (belatedly)
>
>
+1 (belatedly)
On Thu, Jan 14, 2010 at 6:34 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:
> +1
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> - Original Message
> > From: Grant Ingersoll
> > To: mahout-dev@lucene.apache.org
> > Sent: Thu,
+1
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Grant Ingersoll
> To: mahout-dev@lucene.apache.org
> Sent: Thu, January 14, 2010 9:14:09 PM
> Subject: Re: New MEAP: Mahout in Action
>
> +1. (BTW, great read so far, I've got 3 more cha
+1. (BTW, great read so far, I've got 3 more chapters to go in the first 6!)
We should state up front, just like in Lucene land, that anyone who has a book
on Mahout is welcome to link it on the page. The more books on Mahout the
merrier!
On Jan 14, 2010, at 7:25 PM, Sean Owen wrote:
> Manni
[
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-248:
Status: Patch Available (was: Open)
I am not going to let this sit, since it's just more o
[
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-248:
Attachment: MAHOUT-248.patch
> Next collections expansion kit: OpenObjectWhateverHashMap
>
Next collections expansion kit: OpenObjectWhateverHashMap
Key: MAHOUT-248
URL: https://issues.apache.org/jira/browse/MAHOUT-248
Project: Mahout
Issue Type: Improvement
Co
[
https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-247.
--
Resolution: Fixed
Tentatively resolving this after applying a fix to the most obvious case where
this
[
https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800466#action_12800466
]
Sean Owen commented on MAHOUT-247:
--
This is actually very messy to really fix -- LongPrimi
+1
We should definitely have it posted.
-jake
On Jan 14, 2010 4:25 PM, "Sean Owen" wrote:
Manning was suggesting we post a cover/link to Mahout in Action MEAP,
like the Lucene Java page has such a link to Lucene in Action MEAP:
http://lucene.apache.org/java/docs/index.html
I think it's a go
Manning was suggesting we post a cover/link to Mahout in Action MEAP,
like the Lucene Java page has such a link to Lucene in Action MEAP:
http://lucene.apache.org/java/docs/index.html
I think it's a good idea since it points people to more resources, and
I want feedback on the text via MEAP. But I
that paper's new to me.
Our work combines Miller&Guinness on perceptrons and entities with
Cramer on passive-aggressive, plus some secret sauce. I'm not in a
position to open source it just now, but that may change.
On Thu, Jan 14, 2010 at 2:12 PM, Olivier Grisel
wrote:
> 2010/1/14 Benson Margul
2010/1/14 Benson Margulies :
>
> If there's one NLP thing I know something about, now, it is named
> entity extraction with averaged perceptrons and passive-aggressive
> training. This has the advantage of being mathematically trivial
> unless you want to prove that it works, which is as about as u
Ah, well, a longer story.
We sell segmenters and lemmatizers that plug into Lucene. Until recently,
JNI all the way down. We've delivered a new version to a customer that
does some European languages entirely in Java, and we expect to be
able to do this for many more languages this year.
On Thu, Jan
Congrats Benson!
Basis primarily uses a JNI wrapper to integrate with Lucene? I'm
indexing using Hadoop and it'd be great if it were all in Java... So
yeah, "We shall see". :)
Jason
On Wed, Jan 13, 2010 at 7:33 PM, Benson Margulies wrote:
> I'm a somewhat grizzled software guy. My background i
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved MAHOUT-206.
--
Resolution: Fixed
> Separate and clearly label different SparseVector implementations
> ---
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800273#action_12800273
]
Jake Mannix commented on MAHOUT-206:
bq. I had to delete another instance of TestVector
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800271#action_12800271
]
Sean Owen commented on MAHOUT-206:
--
It all passes for me. I had to delete another instance
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800268#action_12800268
]
Jake Mannix commented on MAHOUT-206:
If you (or anyone else) wants to try this out for
[
https://issues.apache.org/jira/browse/MAHOUT-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated MAHOUT-247:
-
Due Date: 21/Jan/10
Fix Version/s: 0.3
Assignee: Sean Owen
Good one. I will review this
GenericUserBasedRecommender.recommend causes connection leak when called for
user with no preferences
-
Key: MAHOUT-247
URL: https://issues.apache.org/jira/browse/
[
https://issues.apache.org/jira/browse/MAHOUT-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800248#action_12800248
]
Ted Dunning commented on MAHOUT-228:
We need a few things:
- a few functions should b
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Priority: Minor (was: Major)
> upgrade to new lucene TokenStream API to cleanup deprecation wa
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Summary: upgrade to new lucene TokenStream API to cleanup deprecation
warnings (was: upgrade t
>
>
> > many map/reduce passes for partial vectorization
>
> Have you tried setting the Reducer as a Combiner too?
> I am not sure that will work too, as the Combiner only gets a small chunk.
> The cost of loading the dictionary into memory outweighs the cost of reading
> the chunk in the combin
2010/1/14 Robin Anil :
> Thanks Olivier! Could you file a JIRA issue on that? There are a couple of
> places where the old API is used.
I think I tracked them all down, patch available here:
https://issues.apache.org/jira/browse/MAHOUT-246 , all tests pass in
maven.
--
Olivier
http://twitter.com
2010/1/14 Robin Anil :
> Some issues I am encountering.
>
> I use a chunk of the dictionary on every map/reduce pass to create partial
> vectors.
>
>
> - If I do the vectorization in the reducer, a lot of data (the entire
> dataset) gets thrown around the network during the shuffle.
> - If I do the
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Status: Patch Available (was: Open)
> upgrade to new lucene TokenStream API to cleanup depreca
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Attachment: remove-lucene-deprecation-warnings.diff
> upgrade to new lucene TokenStream API to
[
https://issues.apache.org/jira/browse/MAHOUT-246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel updated MAHOUT-246:
--
Description:
The attached patch uses the new ts.incrementToken() / TermAttribute API instead
of
upgrade to new lucene TokenStream API to cleanup deprecation API
Key: MAHOUT-246
URL: https://issues.apache.org/jira/browse/MAHOUT-246
Project: Mahout
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800196#action_12800196
]
Drew Farris commented on MAHOUT-244:
Thanks Isabel
> Add root log-likelihood method to
Some issues I am encountering.
I use a chunk of the dictionary on every map/reduce pass to create partial
vectors.
- If I do the vectorization in the reducer, a lot of data (the entire
dataset) gets thrown around the network during the shuffle.
- If I do the vectorization in the mapper, the in
[
https://issues.apache.org/jira/browse/MAHOUT-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800188#action_12800188
]
Sean Owen commented on MAHOUT-206:
--
You want me to try this and commit it?
> Separate and
Ha! In that case fanfares to all of you.
D.
On Thu, Jan 14, 2010 at 1:09 PM, Benson Margulies wrote:
> Hey, credit where credit is due. I did \not/ do the initial
> license-sorting port. I've been cleaning up and filling in after that
> (aside from one little licensing problem that was not relat
Hey, credit where credit is due. I did \not/ do the initial
license-sorting port. I've been cleaning up and filling in after that
(aside from one little licensing problem that was not related to
LGPL). I think that Jake and Sean get the credit for the heavy
lifting.
On Thu, Jan 14, 2010 at 6:52 AM
Been there, done that. Believe me, running Forrest in a Linux VM is less work.
On Thu, Jan 14, 2010 at 6:57 AM, Grant Ingersoll wrote:
> http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/
>
> On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote:
>
>> Sn
I'll open the ticket.
On Thu, Jan 14, 2010 at 3:31 AM, Isabel Drost wrote:
> On Wed Benson Margulies wrote:
>> Are we set up?
>
> If we are, then at least I am not aware of it.
>
> Isabel
>
Retrying with C1.medium. It's almost 4 times faster at 2.4 times the cost.
Robin
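The trade-off in those two figures can be worked out directly. The sketch below is a rough illustration only: it assumes the bill scales linearly with running time (hourly rounding is ignored), and the class and method names are hypothetical. Under that assumption, an instance that costs 2.4 times as much per hour but finishes about 4 times sooner costs roughly 2.4 / 4 = 0.6 of the original price per job.

```java
final class InstanceCostComparison {

    /**
     * Cost of one job on the new instance type relative to the baseline,
     * assuming cost = hourly price * running time.
     *
     * @param priceRatio hourly price of the new type divided by the baseline price
     * @param speedup    baseline running time divided by the new running time
     */
    static double relativeCostPerJob(double priceRatio, double speedup) {
        if (priceRatio <= 0 || speedup <= 0) {
            throw new IllegalArgumentException("ratios must be positive");
        }
        return priceRatio / speedup;
    }

    public static void main(String[] args) {
        // Figures quoted in the message above: ~4x faster at 2.4x the cost.
        double ratio = relativeCostPerJob(2.4, 4.0);
        System.out.printf("relative cost per job: %.2f%n", ratio);
    }
}
```

A value below 1.0 means the faster instance is cheaper per job as well as quicker.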
On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore wrote:
> On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning
> wrote:
> > Large instances may give you more cost effective throughput. Even if they
> > are a break-even,
http://www.grantingersoll.com/2009/09/11/how-to-restore-java-1-5-and-1-4-on-os-x-snow-leopard/
On Jan 13, 2010, at 6:47 PM, Benson Margulies wrote:
> Snow Leopard creates links named 1.5 which point to 1.6. There is this
> extremely finicky procedure for installing an actual 1.5, but it
> doesn't
On Jan 14, 2010, at 6:11 AM, Robin Anil wrote:
> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
>
> Wikipedia data is in wikitext format
>
> so there are many {{Title}}, [[Link|LinkText]], and some HTML tags
There is a Wikipedia Tokenizer in Lu
Oh, as a side note to Benson -- your effort on porting these COLT
collections is appreciated from more than one angle, regardless of the
HPPC discussion. We have been using COLT's math/ matrix packages in C2
and had a long-open issue of getting rid of the LGPL parts. Solved
now, thanks!
http://is
2010/1/14 Robin Anil :
> On the question of analyzer quality. (Assuming speed could be circumvented
> by adding more machines)
>
> Wikipedia data is in wikitext format
>
> so there are many {{Title}}, [[Link|LinkText]], and some HTML tags
>
> Should I be writing my own stream based analyzer maybe some r
On the question of analyzer quality. (Assuming speed could be circumvented
by adding more machines)
Wikipedia data is in wikitext format
so there are many {{Title}}, [[Link|LinkText]], and some HTML tags
Should I be writing my own stream-based analyzer, or will some regex rules to
filter them do?
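The regex-rule option raised above can be sketched roughly as follows. This is a minimal illustration in plain Java, not an analyzer: the class name WikitextCleaner and its clean method are hypothetical, and the patterns assume simple, well-formed markup (no tables, references, or links nested inside link labels).

```java
import java.util.regex.Pattern;

final class WikitextCleaner {

    // {{...}} with no braces inside; applied repeatedly to peel nested templates.
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[Target|Label]] is replaced by Label.
    private static final Pattern PIPED_LINK =
        Pattern.compile("\\[\\[[^\\[\\]|]*\\|([^\\[\\]]*)\\]\\]");
    // [[Target]] is replaced by Target.
    private static final Pattern PLAIN_LINK = Pattern.compile("\\[\\[([^\\[\\]|]*)\\]\\]");
    // Opening or closing HTML tags such as <b> or </ref>.
    private static final Pattern HTML_TAG = Pattern.compile("</?[A-Za-z][^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String clean(String wikitext) {
        String text = wikitext;
        String previous;
        do {
            // Remove innermost templates until nothing changes.
            previous = text;
            text = TEMPLATE.matcher(text).replaceAll("");
        } while (!text.equals(previous));
        text = PIPED_LINK.matcher(text).replaceAll("$1");
        text = PLAIN_LINK.matcher(text).replaceAll("$1");
        text = HTML_TAG.matcher(text).replaceAll(" ");
        // Collapse the gaps left behind by removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("{{Title}} See [[Link|LinkText]] and <b>bold</b> text"));
    }
}
```

Rules like these are probably good enough for a rough first pass over the dump; markup they do not anticipate is passed through unchanged, so a tokenizer that understands wikitext may be the safer choice for production indexing.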
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Isabel Drost updated MAHOUT-244:
Resolution: Fixed
Status: Resolved (was: Patch Available)
Patch applies cleanly and looks
[
https://issues.apache.org/jira/browse/MAHOUT-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Isabel Drost reassigned MAHOUT-244:
---
Assignee: Drew Farris
> Add root log-likelihood method to LogLikehood class.
> --
Thanks Olivier! Could you file a JIRA issue on that? There are a couple of
places where the old API is used.
On Thu, Jan 14, 2010 at 4:09 PM, Olivier Grisel wrote:
> 2010/1/13 Robin Anil :
> > I have fired up a small instance of EC2 (single node for the moment) and have
> > been dabbling with th
Cool. Now I know what c1 types are for. Thanks for the tip!
On Thu, Jan 14, 2010 at 1:25 PM, Shashikant Kore wrote:
> On Thu, Jan 14, 2010 at 3:20 AM, Ted Dunning
> wrote:
> > Large instances may give you more cost effective throughput. Even if they
> > are a break-even, the cost/job should
2010/1/13 Robin Anil :
> I have fired up a small instance of EC2 (single node for the moment) and have
> been dabbling with the latest XML dump of the articles base of Wikipedia
>
> The wiki XML is around 25 GB, which was split into 128 MB chunks and stored on HDFS
> WikipediaToSequenceFile class runs an M
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-245:
Status: Patch Available (was: Open)
> Better handling of Categorical attributes when build
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800156#action_12800156
]
Deneche A. Hakim commented on MAHOUT-245:
-
I modified the code to not select Catego
[
https://issues.apache.org/jira/browse/MAHOUT-245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim updated MAHOUT-245:
Attachment: mahout-245.patch
> Better handling of Categorical attributes when building Deci
Better handling of Categorical attributes when building Decision Forests
Key: MAHOUT-245
URL: https://issues.apache.org/jira/browse/MAHOUT-245
Project: Mahout
Issue Typ
[
https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deneche A. Hakim resolved MAHOUT-216.
-
Resolution: Fixed
Done.
> Improve the results of MAHOUT-145 by uniformly distributing th
On Mon Grant Ingersoll wrote:
> I'm sensing a theme. I think for this stuff we should prune fairly
> aggressively, then add back in places once we have a need.
+1
Isabel
On Wed Benson Margulies wrote:
> Are we set up?
If we are, then at least I am not aware of it.
Isabel
On Wed Grant Ingersoll wrote:
> The Lucene PMC is pleased to welcome the addition of Benson Margulies
> as a committer on Mahout.
Welcome Benson - thanks for all the great work you have done so far on
the mahout-math stuff. Looking forward to working together with you.
Isabel
Let's do this, guys: I have finished the implementation of basic data
structures. I will try to merge this code with Carrot2, replacing PCJ;
this should give me an additional level of confidence that everything
is working fine. I plan to have this step done by Friday.
Then, I will make this code a
Congratulations, Benson!
D.
On Wed, Jan 13, 2010 at 9:28 PM, Grant Ingersoll wrote:
> The Lucene PMC is pleased to welcome the addition of Benson Margulies as a
> committer on Mahout. I hope you'll join me in offering Benson a warm welcome.
>
> Benson, Lucene tradition is that new committers pr