Abbreviations?

2010-01-16 Thread Benson Margulies
I have approval from the CEO to contribute our collection of abbreviations to Mahout. We use them with the ICU breakers. I guess IP clearance is called for here, but, thinking ahead, where would people like to see files of abbreviations in various languages show up?

[jira] Updated: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-252: Summary: Sets (primitive types) (was: Sets (primitive to primitive)) Sets (primitive

[jira] Created: (MAHOUT-253) Proposal for high performance primitive collections.

2010-01-16 Thread Dawid Weiss (JIRA)
Proposal for high performance primitive collections. Key: MAHOUT-253 URL: https://issues.apache.org/jira/browse/MAHOUT-253 Project: Mahout Issue Type: New Feature Components:

[jira] Updated: (MAHOUT-253) Proposal for high performance primitive collections.

2010-01-16 Thread Dawid Weiss (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated MAHOUT-253: Attachment: hppc-1.0-dev.zip Proposal for high performance primitive collections.

Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
Currently Java strings use double the space of the characters in them because it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when loaded into a HashMap<String, Integer>. Is there some optimization we could do in terms of storing them and ensuring that Chinese, Devanagari and

Re: Efficient dictionary storage in memory

2010-01-16 Thread Sean Owen
I'm speaking only off the top of my head, but my hunch is it's not worth optimizing this. Yes, the alternative is to store the string's UTF-8 encoding as a byte[]. That's going to incur overhead in translating back and forth to String where needed, and my guess is that's going to be big enough to
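To make the size trade-off in this exchange concrete, here is a minimal sketch that compares the character payload of a term held as UTF-16 code units with the same term encoded as a UTF-8 byte[]. The class name, method names, and sample terms are illustrative assumptions, not anything from Mahout, and the figures ignore object headers and whatever internal string representation a given JVM uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Mahout: compares raw character payload only.
final class EncodedSizeSketch {

    /** Bytes of character data when the term is held as UTF-16 code units. */
    static int utf16Bytes(String term) {
        return term.length() * 2;
    }

    /** Bytes of character data when the term is held as a UTF-8 encoded byte[]. */
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Sample terms (assumptions): one ASCII, one Arabic, one Devanagari.
        String[] terms = {
            "dictionary",
            "\u0645\u0639\u062c\u0645",
            "\u0936\u092c\u094d\u0926"
        };
        for (String term : terms) {
            System.out.println(term + ": UTF-16 = " + utf16Bytes(term)
                    + " bytes, UTF-8 = " + utf8Bytes(term) + " bytes");
        }
    }
}
```

For ASCII text the UTF-8 form is half the size, for Arabic the two are equal, and for Devanagari UTF-8 is larger, so any saving from a byte[] representation depends heavily on the language mix of the dictionary.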

Re: Efficient dictionary storage in memory

2010-01-16 Thread Benson Margulies
While I egged Robin on to some extent on this topic by IM, I should point out the following. We run large amounts of text through Java at Basis, and we always use String. I have an 8G laptop :-), but there you have it. Anything we do in English we do shortly afterwards in Arabic (UTF-8=UTF-16)

Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
In this specific scenario, the ability to handle a bigger dictionary per node, where the dictionary is loaded once, is a big win for the dictionary vectorizer. This in turn reduces the number of partial vector generation passes. I ran the whole Wikipedia. I got an 880MB dictionary. I pruned words which

Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
If there is an option of storing keys in compressed form in memory, I am all for exploring that. On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil robin.a...@gmail.com wrote: In this specific scenario, the ability to handle a bigger dictionary per node, where the dictionary is loaded once, is a big win for

Re: Efficient dictionary storage in memory

2010-01-16 Thread Sean Owen
351MB isn't so bad. I do think the next-best idea to explore is a trie, which could use a char->Object map data structure provided by our new collections module? To the extent this data is more compact when encoded in UTF-8, it will be *much* more compact encoded in a trie. Sean On Sat, Jan 16,

Re: Efficient dictionary storage in memory

2010-01-16 Thread Olivier Grisel
2010/1/16 Sean Owen sro...@gmail.com: 351MB isn't so bad. I do think the next-best idea to explore is a trie, which could use a char->Object map data structure provided by our new collections module? To the extent this data is more compact when encoded in UTF-8, it will be *much* more compact

[jira] Updated: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-01-16 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-185: Attachment: MAHOUT-185.patch This patch adds bin/mahout, a simple bash script based heavily on

Re: Abbreviations?

2010-01-16 Thread Olivier Grisel
2010/1/16 Grant Ingersoll gsing...@apache.org: I think we should start a new module, that will be the seed for a subproject, called NLP and that contains the stuff for NLP. Either that or put them in the utils module, which is where I envision all of the things that are helpful for ML go, but

Re: Efficient dictionary storage in memory

2010-01-16 Thread Drew Farris
I agree the overhead of byte[] <-> UTF-8 probably isn't too good for lookup performance. In line with Sean's suggestion, I've used tries in the past for doing this sort of string -> integer mapping. They generally perform well enough for adding entries as well as retrieval. Not nearly as efficient
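The string-to-integer trie discussed here could be sketched roughly as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, ids are assumed to be non-negative, and a java.util.HashMap per node stands in for the primitive char-keyed map mentioned in the thread. With boxed keys and one HashMap per node, this version shows the structure only and would likely use more memory than it saves.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a term -> id trie; not Mahout code.
final class TermTrieSketch {
    /** Returned by get() when the term has no id (assumes real ids are >= 0). */
    private static final int ABSENT = -1;

    private static final class Node {
        // Stand-in for a primitive char-keyed map from a collections library.
        final Map<Character, Node> children = new HashMap<>();
        int id = ABSENT;
    }

    private final Node root = new Node();

    /** Stores the id at the node reached by walking the term one char at a time. */
    void put(String term, int id) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
        }
        node.id = id;
    }

    /** Returns the id stored for the term, or -1 if the term was never added. */
    int get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return ABSENT;
            }
        }
        return node.id;
    }

    public static void main(String[] args) {
        TermTrieSketch trie = new TermTrieSketch();
        trie.put("map", 0);
        trie.put("mahout", 1);
        System.out.println("mahout -> " + trie.get("mahout"));
        System.out.println("ma -> " + trie.get("ma"));
    }
}
```

Terms that share a prefix, such as "map" and "mahout", share the nodes for that prefix, which is where the compaction comes from once the per-node overhead is kept small.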

[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801204#action_12801204 ] Drew Farris commented on MAHOUT-252: Is this committed? It seems like there are classes

Re: Abbreviations?

2010-01-16 Thread Benson Margulies
Sure. However, the immediate contribution is data. src/main/resources? Something else? On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2010/1/16 Grant Ingersoll gsing...@apache.org: I think we should start a new module, that will be the seed for a subproject,

A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
I propose a branch. Diffs from the branch to the trunk can still be posted on the JIRA, but I think that a branch would be worthwhile in facilitating collaboration. I volunteer to fight with the maven-release-plugin to make it.

[jira] Created: (MAHOUT-254) Primitive set unit tests

2010-01-16 Thread Benson Margulies (JIRA)
Primitive set unit tests Key: MAHOUT-254 URL: https://issues.apache.org/jira/browse/MAHOUT-254 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter:

[jira] Created: (MAHOUT-255) Open hash set and map that plug into java.util

2010-01-16 Thread Benson Margulies (JIRA)
Open hash set and map that plug into java.util Key: MAHOUT-255 URL: https://issues.apache.org/jira/browse/MAHOUT-255 Project: Mahout Issue Type: New Feature Components: Math

[jira] Created: (MAHOUT-256) Clean up raw type usage

2010-01-16 Thread Benson Margulies (JIRA)
Clean up raw type usage Key: MAHOUT-256 URL: https://issues.apache.org/jira/browse/MAHOUT-256 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson

[jira] Created: (MAHOUT-257) Get rid of GenericSorting.java

2010-01-16 Thread Benson Margulies (JIRA)
Get rid of GenericSorting.java Key: MAHOUT-257 URL: https://issues.apache.org/jira/browse/MAHOUT-257 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3

[jira] Created: (MAHOUT-258) Unit test failure in CDInfo example

2010-01-16 Thread Benson Margulies (JIRA)
Unit test failure in CDInfo example Key: MAHOUT-258 URL: https://issues.apache.org/jira/browse/MAHOUT-258 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Benson Margulies

Re: Abbreviations?

2010-01-16 Thread Ted Dunning
How about src/main/resources/nlp? On Sat, Jan 16, 2010 at 9:31 AM, Benson Margulies bimargul...@gmail.com wrote: Sure. However, the immediate contribution is data. src/main/resources? Something else? On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel olivier.gri...@ensta.org wrote:

Re: Abbreviations?

2010-01-16 Thread Ted Dunning
+1 as well. I think it should be in core rather than utils due to dependency issues. On Sat, Jan 16, 2010 at 7:16 AM, Olivier Grisel olivier.gri...@ensta.org wrote: 2010/1/16 Grant Ingersoll gsing...@apache.org: I think we should start a new module, that will be the seed for a subproject,

[jira] Commented: (MAHOUT-258) Unit test failure in CDInfo example

2010-01-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801236#action_12801236 ] Benson Margulies commented on MAHOUT-258: manually deleting 'target' and running

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Ted Dunning
How can we say no? On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies bimargul...@gmail.com wrote: I volunteer to fight with the maven-release-plugin to make it. -- Ted Dunning, CTO DeepDyve

[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap<T>

2010-01-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Resolution: Fixed Status: Resolved (was: Patch Available) Committed. Next

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
Sure you could. The 'refine patches attached to JIRA' approach is the classic Lucene project methodology, and I'm the new kid on the block here. On Sat, Jan 16, 2010 at 12:50 PM, Ted Dunning ted.dunn...@gmail.com wrote: How can we say no? On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Ted Dunning
I try to never say anything that decreases the output of a very productive person. I often fail, but I try. On Sat, Jan 16, 2010 at 10:11 AM, Benson Margulies bimargul...@gmail.com wrote: Sure you could. The 'refine patches attached to JIRA' approach is the classic Lucene project methodology,

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Dawid Weiss
I propose a branch. Diffs from the branch to the trunk can still be posted on the JIRA, but I think that a branch would be worthwhile in facilitating collaboration. Do you mean -- for merging with the code I posted earlier? By the way, I've integrated Colt from Mahout with our code base.

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
On Sat, Jan 16, 2010 at 1:15 PM, Dawid Weiss dawid.we...@gmail.com wrote: I propose a branch. Diffs from the branch to the trunk can still be posted on the JIRA, but I think that a branch would be worthwhile in facilitating collaboration. Do you mean -- for merging with the code I posted

[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Drew Farris (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801247#action_12801247 ] Drew Farris commented on MAHOUT-252: It was:

[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801248#action_12801248 ] Benson Margulies commented on MAHOUT-252: That was the 'Map' patch, which was

[jira] Updated: (MAHOUT-254) Primitive set unit tests

2010-01-16 Thread Benson Margulies (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-254: Attachment: MAHOUT-254.patch Primitive set unit tests

Re: Unit test failure

2010-01-16 Thread deneche abdelhakim
Yeah, it's probably due to the way I used to generate random data... the problem is that I never get this error =P so it's very difficult to fix... I'll try my best as soon as I have some time. In the meantime, rerunning 'mvn clean install' generally does the trick. On Sat, Jan 16, 2010 at

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Dawid Weiss
Have you finished with Colt? I think this is still worth completing before we proceed to HPPC. Just talked to Staszek, we will move HPPC code to Carrot2 labs SVN repository (sourceforge) because we want to get rid of PCJ as soon as possible and need something versioned and sticky. I plan to make a

Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
I'm not quite done with Colt. If you think you can refine a patch to go straight into the mahout trunk, don't let me stop you. On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss dawid.we...@gmail.com wrote: Have you finished with Colt? I think this is still worth completing before we proceed to

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool was refilled. To test I moved /dev/random, and created a link from /dev/urandom to /dev/random (the former doesn't

[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-16 Thread Isabel Drost (JIRA)
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280 ] Isabel Drost commented on MAHOUT-153: Welcome to Mahout. Thanks for stepping up and

Re: Unit test lag?

2010-01-16 Thread Drew Farris
On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies bimargul...@gmail.com wrote: Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool was refilled. To test I moved

Re: Unit test lag?

2010-01-16 Thread Olivier Grisel
2010/1/16 Benson Margulies bimargul...@gmail.com: Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool was refilled. To test I moved /dev/random, and created a link from

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
It looks as if this could be related to the loading of the SecureRandomSeedGenerator class. Let's fix that class to defer until there's a good reason to make a seed.

Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
Here is my attempt at making a dictionary lookup using Lucene. I need some pointers on optimising it. Currently it takes 30 secs for a million lookups using a dictionary of 500K words, about 30x that of a HashMap. But the space used is almost the same; as far as I can see, the memory sizes look almost the

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
This is going to be a lot of fun. That class is in uncommons-math, and the connection to it from Mahout is hardly obvious. On Sat, Jan 16, 2010 at 5:34 PM, Benson Margulies bimargul...@gmail.com wrote: It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
I see a way, but it involves loading this class explicitly with reflection. I'll make a patch.

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
Oh, I see. We have to give up on the MersenneTwisterRNG in tests and just use the JRE. Is that OK? On Sat, Jan 16, 2010 at 5:44 PM, Olivier Grisel olivier.gri...@ensta.org wrote: 2010/1/16 Drew Farris drew.far...@gmail.com: On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies bimargul...@gmail.com

Re: Unit test lag?

2010-01-16 Thread Olivier Grisel
Some tests are probably not calling RandomUtils.useTestSeed() in a setUp() or static init. Maybe a MahoutTestCase base class with a default static init that calls it would do. Otherwise, I confirm that setting forkMode to once in maven/pom.xml solves the issue (and all tests

Re: Efficient dictionary storage in memory

2010-01-16 Thread Grant Ingersoll
On the indexing side, add in batches and reuse the document and fields. On the search side, no need for a BooleanQuery and no need for scoring, so you will likely want your own Collector (dead simple to write). It _MAY_ even be faster to simply do the indexing as a word w/ the id as a

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily only if the test seed is not in use. The problem, as I see it, is that the uncommons-math
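The lazy-loading idea suggested here can be illustrated with the initialization-on-demand holder idiom, where the secure source is constructed only the first time a non-test seed is actually requested. This is a sketch under assumptions: the class below is hypothetical and is not Mahout's RandomUtils or any uncommons class, the fixed seed value is arbitrary, and java.util.Random stands in for whatever generator the project really uses.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical sketch; not Mahout's RandomUtils.
final class LazySeedSketch {
    /** Arbitrary fixed seed for tests (an assumption, chosen for illustration). */
    private static final long FIXED_TEST_SEED = 42L;

    private static volatile boolean testSeed = false;

    /**
     * Holder idiom: the JVM initializes this nested class, and therefore
     * constructs the SecureRandom, only when SOURCE is first accessed.
     */
    private static final class SecureSeed {
        static final SecureRandom SOURCE = new SecureRandom();
    }

    /** Switches to the fixed seed; tests would call this during setup. */
    static void useTestSeed() {
        testSeed = true;
    }

    /** Returns a generator, touching the secure source only when no test seed is set. */
    static Random getRandom() {
        if (testSeed) {
            return new Random(FIXED_TEST_SEED);
        }
        return new Random(SecureSeed.SOURCE.nextLong());
    }

    public static void main(String[] args) {
        useTestSeed();
        // Both generators start from the same fixed seed, so the values match.
        System.out.println(getRandom().nextInt() == getRandom().nextInt());
    }
}
```

With this shape, a test run that calls useTestSeed() before asking for a generator never triggers the secure seeding path at all, which is the behaviour the message asks for.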

Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote: On the indexing side, add in batches and reuse the document and fields. Done. Squeezed out 5 secs there (25 from 30), and further to 22 by increasing max merge docs. On the search side, no need for a BooleanQuery and no

Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04. I suspect that the problem is not -only- caused by RandomUtils because: 1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot) but the test time used to be reported accurately by Maven. Now Maven reports that a test

Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
Removing the Maven repository does not solve the problem, and neither does a fresh checkout of the trunk. But older revisions don't show any slowdown! I tried the following revisions: Those old revisions seem OK: r896946 | srowen