Re: Unit test lag?
Removing the Maven repository does not solve the problem, nor does a fresh checkout of the trunk. But older revisions don't show any slowdown! I tried the following revisions.

These old revisions seem OK:

r896946 | srowen | 2010-01-07 19:02:41 +0100 (Thu, 07 Jan 2010) | 1 line
MAHOUT-238

r897134 | robinanil | 2010-01-08 09:23:22 +0100 (Fri, 08 Jan 2010) | 1 line
MAHOUT-221 Missed out two files while checking in FP-Bonsai

r897405 | adeneche | 2010-01-09 11:02:49 +0100 (Sat, 09 Jan 2010) | 1 line
MAHOUT-216

>>> The slowdowns start at this revision:

r897440 | srowen | 2010-01-09 13:53:25 +0100 (Sat, 09 Jan 2010) | 1 line
Code style adjustments; enabled/fixed TestSamplingIterator

On Sun, Jan 17, 2010 at 5:47 AM, deneche abdelhakim wrote:
> I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04.
>
> I'm suspecting that the problem is not -only- caused by RandomUtils because:
>
> 1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but the test time used to be reported accurately by Maven. Now Maven reports that a test took less than a second when it actually took a lot more!
>
> 2. Most of my tests actually call RandomUtils.useTestSeed() in setUp() (InMemInputSplitTest included), but the tests still take a lot of time, and again it's not reported accurately by Maven.
>
> 3. I generally launch a 'mvn clean install' every Thursday. I never got these slowdowns until last Thursday (did we change anything that could have caused them?)
>
> On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies wrote:
>>> Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily only if the test seed is not in use.
>>
>> The problem, as I see it, is that the uncommons-math package starts initializing a random seed as soon as you touch it, whether you need it or not. RandomUtils can only avoid this by avoiding uncommons-math in unit test mode.
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
Re: Unit test lag?
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04.

I'm suspecting that the problem is not -only- caused by RandomUtils because:

1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but the test time used to be reported accurately by Maven. Now Maven reports that a test took less than a second when it actually took a lot more!

2. Most of my tests actually call RandomUtils.useTestSeed() in setUp() (InMemInputSplitTest included), but the tests still take a lot of time, and again it's not reported accurately by Maven.

3. I generally launch a 'mvn clean install' every Thursday. I never got these slowdowns until last Thursday (did we change anything that could have caused them?)

On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies wrote:
>> Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily only if the test seed is not in use.
>
> The problem, as I see it, is that the uncommons-math package starts initializing a random seed as soon as you touch it, whether you need it or not. RandomUtils can only avoid this by avoiding uncommons-math in unit test mode.
>
> --
> Ted Dunning, CTO
> DeepDyve
Re: Efficient dictionary storage in memory
On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll wrote:
> On the indexing side, add in batches and reuse the document and fields.

Done. Squeezed out 5 secs there, 25 from 30, and further down to 22 by increasing max merge docs.

> On the search side, no need for a BooleanQuery and no need for scoring, so you will likely want your own Collector (dead simple to write).

Brought it down to 15 secs from 30 for 1 mil lookups using a TermQuery and a Collector which is instantiated once.

> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a payload and then use TermPositions (and no query at all) and forgo searching altogether. Then you just need an IndexReader. First search will always be slow, unless you "warm" it first. This should help avoid the cost of going to document storage, which is almost always the most expensive thing one does in Lucene due to its random nature. Might even be beneficial to be able to retrieve IDs in batches (sorted lexicographically, too).

Since all the words have unique ids, I don't think there is any need for assigning ids. Will re-use the Lucene document id. Testing shows that this decreased index time to 13 sec and lookup time to 11 sec. But I still don't get the "not searching" part. Will take a look at TermPositions and how it's done.

> Don't get me wrong, it will likely be slower than a hash map, but the hash map won't scale and the Lucene term dictionary is delta encoded, so it will compress a fair amount. Also, as you grow, you will need to use an FSDirectory.

I still haven't seen the size diff for what I was doing previously. But after I removed the ID field I get 1/3 savings (220MB) for a 5 million word dictionary as compared to a HashMap. With 5 mil words and 10 mil lookups, HashMap is 4x faster on add and 6x faster on lookup. The in-memory Lucene dict gives around 100K lookups per second, which is like 1MB/s for 10-byte tokens -- a bit far away from the 50MB/s disk speed limit. Then again, it just needs to match the speed with which the Lucene Analyzer processes tokens.

> -Grant
>
> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>
>> Here is my attempt at making a dictionary lookup using Lucene. Need some pointers in optimising. Currently it takes 30 secs for a million lookups using a dictionary of 500K words, about 30x that of a hashmap. But the space used is almost the same as far as I can see (from the process manager).
>>
>> private static final String ID = "id";
>> private static final String WORD = "word";
>> private IndexWriter iwriter;
>> private IndexSearcher isearcher;
>> private RAMDirectory idx = new RAMDirectory();
>> private Analyzer analyzer = new WhitespaceAnalyzer();
>>
>> public void init() throws Exception {
>>   this.iwriter = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
>> }
>>
>> public void destroy() throws Exception {
>>   iwriter.close();
>>   isearcher.close();
>> }
>>
>> public void ready() throws Exception {
>>   iwriter.optimize();
>>   iwriter.close();
>>   this.isearcher = new IndexSearcher(idx, true);
>> }
>>
>> public void addToDictionary(String word, Integer id) throws IOException {
>>   Document doc = new Document();
>>   doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
>>   doc.add(new Field(ID, id.toString(), Store.YES, Field.Index.NOT_ANALYZED));
>>   // ?? Is there a way other than storing the id as a string?
>>   iwriter.addDocument(doc);
>> }
>>
>> public Integer get(String word) throws IOException, ParseException {
>>   BooleanQuery query = new BooleanQuery();
>>   query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>>   TopDocs top = isearcher.search(query, null, 1);
>>   ScoreDoc[] hits = top.scoreDocs;
>>   if (hits.length == 0) return null;
>>   return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>> }
>>
>> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:
>>
>>> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will be very efficient. Plus, there is virtually no code to write. I've seen bare bones indexes be as little as 20% of the original w/ very fast lookup. Furthermore, there are many options available for controlling how much is loaded into memory, etc. Finally, it will handle all the languages you throw at it.
>>>
>>> -Grant
>>>
>>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>>>
>>>> Currently java strings use double the space of the characters in them because it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when loaded into a HashMap. Is there some optimization we could do in terms of storing them and ensuring that Chinese, Devanagari and other characters don't get messed up in the process.
>>>>
>>>> Some options benson suggested was: storing just the byte[] form and adding the option of supplying the hash function in OpenObjectIntHashmap, or even using a UTF-8 string.
Re: Unit test lag?
> Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily only if the test seed is not in use.

The problem, as I see it, is that the uncommons-math package starts initializing a random seed as soon as you touch it, whether you need it or not. RandomUtils can only avoid this by avoiding uncommons-math in unit test mode.

>
> --
> Ted Dunning, CTO
> DeepDyve
Re: Unit test lag?
On Sat, Jan 16, 2010 at 1:40 PM, Drew Farris wrote:
> Mahout does per-test forking, which means we're forking off a new JVM for each unit test execution; this adds overhead to tests that take 0.2s to complete. Is per-test forking strictly needed?

It shouldn't be. I would count it a bug if it were.

> ... wall time 30s (!) or so. ... attempting to read from /dev/random.

Unit tests should generally be using a fixed seed and not need to load a secure seed from /dev/random. I would say that RandomUtils is probably the problem here. The secure seed should be loaded lazily only if the test seed is not in use.

--
Ted Dunning, CTO
DeepDyve
Re: Efficient dictionary storage in memory
On the indexing side, add in batches and reuse the document and fields.

On the search side, no need for a BooleanQuery and no need for scoring, so you will likely want your own Collector (dead simple to write).

It _MAY_ even be faster to simply do the indexing as a word w/ the id as a payload and then use TermPositions (and no query at all) and forgo searching altogether. Then you just need an IndexReader. First search will always be slow, unless you "warm" it first. This should help avoid the cost of going to document storage, which is almost always the most expensive thing one does in Lucene due to its random nature. Might even be beneficial to be able to retrieve IDs in batches (sorted lexicographically, too).

Don't get me wrong, it will likely be slower than a hash map, but the hash map won't scale and the Lucene term dictionary is delta encoded, so it will compress a fair amount. Also, as you grow, you will need to use an FSDirectory.

-Grant

On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:

> Here is my attempt at making a dictionary lookup using Lucene. Need some pointers in optimising. Currently it takes 30 secs for a million lookups using a dictionary of 500K words, about 30x that of a hashmap. But the space used is almost the same as far as I can see (from the process manager).
>
> private static final String ID = "id";
> private static final String WORD = "word";
> private IndexWriter iwriter;
> private IndexSearcher isearcher;
> private RAMDirectory idx = new RAMDirectory();
> private Analyzer analyzer = new WhitespaceAnalyzer();
>
> public void init() throws Exception {
>   this.iwriter = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
> }
>
> public void destroy() throws Exception {
>   iwriter.close();
>   isearcher.close();
> }
>
> public void ready() throws Exception {
>   iwriter.optimize();
>   iwriter.close();
>   this.isearcher = new IndexSearcher(idx, true);
> }
>
> public void addToDictionary(String word, Integer id) throws IOException {
>   Document doc = new Document();
>   doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
>   doc.add(new Field(ID, id.toString(), Store.YES, Field.Index.NOT_ANALYZED));
>   // ?? Is there a way other than storing the id as a string?
>   iwriter.addDocument(doc);
> }
>
> public Integer get(String word) throws IOException, ParseException {
>   BooleanQuery query = new BooleanQuery();
>   query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>   TopDocs top = isearcher.search(query, null, 1);
>   ScoreDoc[] hits = top.scoreDocs;
>   if (hits.length == 0) return null;
>   return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
> }
>
> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:
>
>> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will be very efficient. Plus, there is virtually no code to write. I've seen bare bones indexes be as little as 20% of the original w/ very fast lookup. Furthermore, there are many options available for controlling how much is loaded into memory, etc. Finally, it will handle all the languages you throw at it.
>>
>> -Grant
>>
>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>>
>>> Currently java strings use double the space of the characters in them because it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when loaded into a HashMap. Is there some optimization we could do in terms of storing them and ensuring that Chinese, Devanagari and other characters don't get messed up in the process.
>>>
>>> Some options benson suggested was: storing just the byte[] form and adding the option of supplying the hash function in OpenObjectIntHashmap, or even using a UTF-8 string.
>>>
>>> Or we could leave this alone. I currently estimate the memory requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when generating the dictionary split for the vectorizer.
>>>
>>> Robin

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
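For concreteness, the estimate Robin quotes charges 2 bytes per char plus roughly 45 bytes of String/char[] object overhead, truncated to an 8-byte boundary. A small sketch of that arithmetic (class and method names here are illustrative, not Mahout's):

```java
// Illustrative implementation of the per-string memory estimate quoted in the
// thread: 8 * ((int) (num_chars * 2 + 45) / 8). The 45-byte constant
// approximates the String object plus backing char[] overhead.
final class StringMemoryEstimate {
  static long estimatedStringBytes(int numChars) {
    // Integer division truncates, so the result is a multiple of 8.
    return 8L * ((numChars * 2 + 45) / 8);
  }
}
```

For a 10-character word this gives 64 bytes, which is consistent with the thread's observation that a 190MB UTF-8 dictionary file balloons to roughly 600MB of heap as a HashMap of Strings.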
Re: Unit test lag?
Some tests are probably not calling:

  RandomUtils.useTestSeed();

in a setUp() or static init. Maybe a MahoutTestCase base class with a default static init that calls it would do. Otherwise, I confirm that setting forkMode to "once" in maven/pom.xml solves the issue (and all tests pass).

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
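For reference, the forkMode setting Olivier mentions lives in the Surefire plugin configuration; a sketch of the relevant pom.xml fragment (standard plugin coordinates, version element omitted):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- "once" forks a single JVM for the whole test run
         instead of a fresh JVM per test class -->
    <forkMode>once</forkMode>
  </configuration>
</plugin>
```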
Re: Unit test lag?
Oh, I see. We have to give up on the MersenneTwisterRNG in tests and just use the JRE. Is that OK?

On Sat, Jan 16, 2010 at 5:44 PM, Olivier Grisel wrote:
> 2010/1/16 Drew Farris :
>> On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies wrote:
>>> Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.
>>>
>>> Why not use a fixed random seed for unit tests? That would make them more repeatable and avoid this problem, no?
>>
>> It appears we are. In RandomUtils:
>>
>> public static Random getRandom() {
>>   return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new MersenneTwisterRNG();
>> }
>>
>> But something somewhere is forcing SecureRandomSeedGenerator to get loaded by the classloader, which in turn does a 'new SecureRandom()' in a private static final field assignment. Trying to track down what is causing the generator to get loaded in the first place.
>
> The MersenneTwisterRNG constructor calls:
>
> this(DefaultSeedGenerator.getInstance().generateSeed(SEED_SIZE_BYTES));
>
> which in turn triggers, in the definition of the class DefaultSeedGenerator:
>
> private static final SeedGenerator[] GENERATORS = new SeedGenerator[] {
>   new DevRandomSeedGenerator(),
>   new RandomDotOrgSeedGenerator(),
>   new SecureRandomSeedGenerator()
> };
>
> Unless the forking tests are disabled I don't see how to prevent MersenneTwisterRNG from indirectly fetching entropy from /dev/random / SecureRandom.
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
Re: Unit test lag?
I see a way, but it involves loading this class explicitly with reflection. I'll make a patch.
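As a rough illustration of the reflection idea (the class and method names below are placeholders, not Benson's actual patch): by naming the expensive RNG class only as a string, the JVM never loads or initializes it — and so never seeds SecureRandom — unless the non-test branch actually runs:

```java
import java.util.Random;

// Illustrative sketch: defer loading an RNG class whose static initialization
// is expensive (entropy gathering) by referring to it only via reflection.
final class DeferredRng {
  // Placeholder constant; the culprit discussed in the thread is
  // org.uncommons.maths.random.MersenneTwisterRNG from uncommons-math.
  private static final String RNG_CLASS = "org.uncommons.maths.random.MersenneTwisterRNG";

  static Random create(boolean testSeed) {
    if (testSeed) {
      return new Random(42L); // fixed seed: no class loading, no /dev/random
    }
    try {
      // The class is loaded and initialized only here, on the non-test path.
      return (Random) Class.forName(RNG_CLASS).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      // Fallback for environments without the library on the classpath.
      return new Random();
    }
  }
}
```

With this shape, touching the factory from a unit test never reaches the seed generator at all, which is the behavior the thread is after.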
Re: Unit test lag?
2010/1/16 Drew Farris :
> On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies wrote:
>> Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.
>>
>> Why not use a fixed random seed for unit tests? That would make them more repeatable and avoid this problem, no?
>
> It appears we are. In RandomUtils:
>
> public static Random getRandom() {
>   return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new MersenneTwisterRNG();
> }
>
> But something somewhere is forcing SecureRandomSeedGenerator to get loaded by the classloader, which in turn does a 'new SecureRandom()' in a private static final field assignment. Trying to track down what is causing the generator to get loaded in the first place.

The MersenneTwisterRNG constructor calls:

this(DefaultSeedGenerator.getInstance().generateSeed(SEED_SIZE_BYTES));

which in turn triggers, in the definition of the class DefaultSeedGenerator:

private static final SeedGenerator[] GENERATORS = new SeedGenerator[] {
  new DevRandomSeedGenerator(),
  new RandomDotOrgSeedGenerator(),
  new SecureRandomSeedGenerator()
};

Unless the forking tests are disabled I don't see how to prevent MersenneTwisterRNG from indirectly fetching entropy from /dev/random / SecureRandom.

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
Re: Unit test lag?
This is going to be a lot of fun. That class is in uncommons-math, and the connection to it from Mahout is hardly obvious.

On Sat, Jan 16, 2010 at 5:34 PM, Benson Margulies wrote:
>>> It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.
>
> Let's fix that class to defer until there's a good reason to make a seed.
Re: Efficient dictionary storage in memory
Here is my attempt at making a dictionary lookup using Lucene. Need some pointers in optimising. Currently it takes 30 secs for a million lookups using a dictionary of 500K words, about 30x that of a hashmap. But the space used is almost the same as far as I can see (from the process manager).

private static final String ID = "id";
private static final String WORD = "word";
private IndexWriter iwriter;
private IndexSearcher isearcher;
private RAMDirectory idx = new RAMDirectory();
private Analyzer analyzer = new WhitespaceAnalyzer();

public void init() throws Exception {
  this.iwriter = new IndexWriter(idx, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
}

public void destroy() throws Exception {
  iwriter.close();
  isearcher.close();
}

public void ready() throws Exception {
  iwriter.optimize();
  iwriter.close();
  this.isearcher = new IndexSearcher(idx, true);
}

public void addToDictionary(String word, Integer id) throws IOException {
  Document doc = new Document();
  doc.add(new Field(WORD, word, Field.Store.NO, Field.Index.NOT_ANALYZED));
  doc.add(new Field(ID, id.toString(), Store.YES, Field.Index.NOT_ANALYZED));
  // ?? Is there a way other than storing the id as a string?
  iwriter.addDocument(doc);
}

public Integer get(String word) throws IOException, ParseException {
  BooleanQuery query = new BooleanQuery();
  query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
  TopDocs top = isearcher.search(query, null, 1);
  ScoreDoc[] hits = top.scoreDocs;
  if (hits.length == 0) return null;
  return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
}

On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:
> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will be very efficient. Plus, there is virtually no code to write. I've seen bare bones indexes be as little as 20% of the original w/ very fast lookup. Furthermore, there are many options available for controlling how much is loaded into memory, etc. Finally, it will handle all the languages you throw at it.
>
> -Grant
>
> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>
>> Currently java strings use double the space of the characters in them because it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when loaded into a HashMap. Is there some optimization we could do in terms of storing them and ensuring that Chinese, Devanagari and other characters don't get messed up in the process.
>>
>> Some options benson suggested was: storing just the byte[] form and adding the option of supplying the hash function in OpenObjectIntHashmap, or even using a UTF-8 string.
>>
>> Or we could leave this alone. I currently estimate the memory requirement using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when generating the dictionary split for the vectorizer.
>>
>> Robin
Re: Unit test lag?
> It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.

Let's fix that class to defer until there's a good reason to make a seed.
Re: Unit test lag?
2010/1/16 Benson Margulies :
>> Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.

I also experience the same slowdown Drew describes. Ubuntu machines too.

> Why not use a fixed random seed for unit tests? That would make them more repeatable and avoid this problem, no?

+1 for the fixed seed (42 is my favorite seed).

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
Re: Unit test lag?
On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies wrote:
>> Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.
>
> Why not use a fixed random seed for unit tests? That would make them more repeatable and avoid this problem, no?

It appears we are. In RandomUtils:

public static Random getRandom() {
  return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new MersenneTwisterRNG();
}

But something somewhere is forcing SecureRandomSeedGenerator to get loaded by the classloader, which in turn does a 'new SecureRandom()' in a private static final field assignment. Trying to track down what is causing the generator to get loaded in the first place.
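A minimal sketch of the lazy scheme Ted proposed (names are hypothetical, not RandomUtils' real fields): in test mode hand out a deterministic JRE Random, and only reach for an entropy-backed generator when the test seed is not in use, so merely touching the class never triggers seeding:

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical lazily-seeding variant of RandomUtils: the secure path is
// isolated in a nested holder class so its initialization cost is paid
// only if getRandom() is called without the test seed enabled.
final class LazyRandomUtils {
  private static final long STANDARD_SEED = 0xCAFEBABEL; // illustrative fixed test seed
  private static volatile boolean testSeed = false;

  static void useTestSeed() {
    testSeed = true;
  }

  static Random getRandom() {
    if (testSeed) {
      return new Random(STANDARD_SEED); // deterministic; no entropy source touched
    }
    return SecureHolder.create(); // SecureRandom loaded only on demand
  }

  private static final class SecureHolder {
    static Random create() {
      return new SecureRandom();
    }
  }
}
```

Two consecutive calls under useTestSeed() produce identical streams, which is exactly what repeatable unit tests want, and the blocking /dev/random read can only happen on the non-test branch.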
[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans
[ https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280 ] Isabel Drost commented on MAHOUT-153: - Welcome to Mahout. Thanks for stepping up and volunteering to take over the work for this issue. > Implement kmeans++ for initial cluster selection in kmeans > -- > > Key: MAHOUT-153 > URL: https://issues.apache.org/jira/browse/MAHOUT-153 > Project: Mahout > Issue Type: New Feature > Components: Clustering >Affects Versions: 0.2 > Environment: OS Independent >Reporter: Panagiotis Papadimitriou > Fix For: 0.3 > > Original Estimate: 336h > Remaining Estimate: 336h > > The current implementation of k-means includes the following algorithms for > initial cluster selection (seed selection): 1) random selection of k points, > 2) use of canopy clusters. > I plan to implement k-means++. The details of the algorithm are available > here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf. > Design Outline: I will create an abstract class SeedGenerator and a subclass > KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will > become a subclass of SeedGenerator. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
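As a reference for what the proposed KMeansPlusPlusSeedGenerator would compute (class and method names below are illustrative, not the patch's API), the k-means++ rule picks the first seed uniformly and each subsequent seed with probability proportional to D(x)², the squared distance to the nearest already-chosen seed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative k-means++ seed selection over 1-D points.
final class KMeansPlusPlus {
  static List<Double> chooseSeeds(double[] points, int k, Random rng) {
    List<Double> seeds = new ArrayList<>();
    seeds.add(points[rng.nextInt(points.length)]); // first seed: uniform at random
    while (seeds.size() < k) {
      double[] d2 = new double[points.length];
      double total = 0;
      for (int i = 0; i < points.length; i++) {
        double best = Double.POSITIVE_INFINITY;
        for (double s : seeds) { // squared distance to nearest chosen seed
          best = Math.min(best, (points[i] - s) * (points[i] - s));
        }
        d2[i] = best;
        total += best;
      }
      double r = rng.nextDouble() * total; // weighted draw over D(x)^2
      int chosen = points.length - 1;
      for (int i = 0; i < points.length; i++) {
        r -= d2[i];
        if (r <= 0) { chosen = i; break; }
      }
      seeds.add(points[chosen]);
    }
    return seeds;
  }
}
```

The D(x)² weighting is why k-means++ spreads seeds across clusters instead of occasionally dropping two random seeds into the same one.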
Re: Unit test lag?
> Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.

Why not use a fixed random seed for unit tests? That would make them more repeatable and avoid this problem, no?
Unit test lag?
Recently I've been noticing that Mahout's unit tests generally take a considerably long time to run, generally longer than what is reported in the individual test output. I took a look at why this was the case and found a couple of things:

Mahout does per-test forking, which means we're forking off a new JVM for each unit test execution; this adds overhead even to tests that take 0.2s to complete. Is per-test forking strictly needed?

I captured the command line used to execute one of the forked tests (InMemInputSplitTest) by running mvn -X and executed it from the shell repeatedly using time to see what was going on. In one of every few invocations, the test in question would report completion in 3s, but time reported a wall time of 30s (!) or so. Running through strace showed that something was attempting to read from /dev/random. Sometimes it ran fine, but at least 25-30% of the time it ended up blocking until the entropy pool is refilled. To test, I moved /dev/random and created a link from /dev/urandom to /dev/random (the former doesn't block, but isn't cryptographically secure). It looks as if this could be related to the loading of the SecureRandomSeedGenerator class.

I'm running on Ubuntu 9.04, kernel 2.6.28-17-server with the latest patches. Is anyone else experiencing similar slowness?

Drew
Re: A modest proposal for the Carrot integration
> I'm not quite done with Colt.

No, no -- you didn't understand me right. Let's work in parallel: I'll try to polish the edges of HPPC in those places that I know are not exactly the way I feel they should be, and you finish with Colt's integration -- having Apache-licensed Colt is a value on its own. I will provide a cleaner patch, but branching is a good idea since moving from Colt collections may require major code sweeps and we don't want everyone to suffer because of this. I think I should be done with this "cleaner" HPPC release by Wednesday, if that's all right.

D.

> If you think you can refine a patch to go straight into the mahout trunk, don't let me stop you.
>
> On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss wrote:
>> Have you finished with Colt? I think this is still worth completing before we proceed to HPPC. Just talked to Staszek; we will move HPPC code to the Carrot2 labs SVN repository (SourceForge) because we want to get rid of PCJ as soon as possible and need something versioned and sticky. I plan to make a few additions to HPPC that I could work on while you're completing the Colt stuff. Hopefully we can also get this ArrayIndexOutOfBounds beast in the meantime.
>>
>> If you're done with Colt, I can commit directly to Mahout's branch and work from there.
>>
>> Dawid
Re: A modest proposal for the Carrot integration
I'm not quite done with Colt. If you think you can refine a patch to go straight into the mahout trunk, don't let me stop you. On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss wrote: > Have you finished with Colt? I think this is still worth completing > before we proceed to HPPC. Just talked to Staszek, we will move HPPC > code to Carrot2 labs SVN repository (sourceforge) because we want to > get rid of PCJ as soon as possible and need something versioned and > sticky. I plan to make a few additions to HPPC that I could work on > while you're completing the Colt stuff. Hopefully we can also get this > ArrayIndexOutOfBounds beast in the mean time. > > If you're done with Colt, I can commit directly to Mahout's branch and > work from there. > > Dawid >
[jira] Updated: (MAHOUT-242) LLR Collocation Identifier
[ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-242: --- Attachment: MAHOUT-242.patch Log-likelihood collocation identifier in patch form. This puts itself in o.a.m.nlp.collocations.llr I think there are some improvements that can be made, but if possible it would be nice to review, commit this version and add on to it later through additional patches More specifically, I'd like to see this: * include the ability to avoid forming collocations around sentence boundaries and other boundaries per: http://www.lucidimagination.com/search/document/d259def498803ffe/collocation_clarification#29fbb050cf5fa64 * work for non-whitespace delimited languages, e.g: anything an analyzer can produce tokens for. I removed the ability to read in files from a directory, Robin's document -> sequence file work fits into this well. > LLR Collocation Identifier > -- > > Key: MAHOUT-242 > URL: https://issues.apache.org/jira/browse/MAHOUT-242 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.3 >Reporter: Drew Farris >Priority: Minor > Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, > mahout-colloc.tar.gz > > > Identifies interesting Collocations in text using ngrams scored via the > LogLikelihoodRatio calculation. > As discussed in: > * > http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2 > * > http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e > Current form is a tar of a maven project that depends on mahout. 
Build as > usual with 'mvn clean install', can be executed using: > {noformat} > mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" > -Dexec.args="--input src/test/resources/article --colloc target/colloc > --output target/output -w" > {noformat} > Output will be placed in target/output and can be viewed nicely using: > {noformat} > sort -rn -k1 target/output/part-0 > {noformat} > Includes rudimentary unit tests. Please review and comment. Needs more work > to get this into patch state and integrate with Robin's document vectorizer > work in MAHOUT-237 > Some basic TODO/FIXME's include: > * use mahout math's ObjectInt map implementation when available > * make the analyzer configurable > * better input validation + negative unit tests. > * more flexible ways to generate units of analysis (n-1)grams. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: A modest proposal for the Carrot integration
Have you finished with Colt? I think this is still worth completing before we proceed to HPPC. Just talked to Staszek, we will move HPPC code to Carrot2 labs SVN repository (sourceforge) because we want to get rid of PCJ as soon as possible and need something versioned and sticky. I plan to make a few additions to HPPC that I could work on while you're completing the Colt stuff. Hopefully we can also get this ArrayIndexOutOfBounds beast in the mean time. If you're done with Colt, I can commit directly to Mahout's branch and work from there. Dawid
Re: Unit test failure
Yeah, it's probably due to the way I used to generate random data... the problem is that I never get this error =P so it's very difficult to fix... I'll try my best as soon as I have some time. In the meantime, rerunning 'mvn clean install' again generally does the trick. On Sat, Jan 16, 2010 at 6:58 PM, Grant Ingersoll wrote: > try rerunning... I think that one has intermittent failures. Perhaps Deneche > can dig in. You will likely need to look in the Hadoop logs too. > On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote: > >> https://issues.apache.org/jira/browse/MAHOUT-258 >> >> The error message: >> >> testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) >> Time elapsed: 6.731 sec <<< ERROR! >> java.io.IOException: Job failed! >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) >> >> does not give me much to go on. >> >> I don't see how adding new Set classes to my tree could cause this ... > >
[jira] Updated: (MAHOUT-254) Primitive set unit tests
[ https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-254: Fix Version/s: 0.3 Status: Patch Available (was: Open) > Primitive set unit tests > > > Key: MAHOUT-254 > URL: https://issues.apache.org/jira/browse/MAHOUT-254 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-254.patch > > > The primitive sets need unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-254) Primitive set unit tests
[ https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-254: Attachment: MAHOUT-254.patch > Primitive set unit tests > > > Key: MAHOUT-254 > URL: https://issues.apache.org/jira/browse/MAHOUT-254 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-254.patch > > > The primitive sets need unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801248#action_12801248 ] Benson Margulies commented on MAHOUT-252: - That was the 'Map' patch, which was indeed committed some time before. > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801247#action_12801247 ] Drew Farris commented on MAHOUT-252: It was: ./target/classes/org/apache/mahout/math/map/OpenObjectIntHashMap.class, which I hadn't expected, but it's a moot point since this is now committed. Thx. > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: A modest proposal for the Carrot integration
On Sat, Jan 16, 2010 at 1:15 PM, Dawid Weiss wrote: >> I propose a branch. Diffs from the branch to the trunk can still be >> posted on the JIRA, but I think that a branch would be worthwhile in >> facilitating collaboration. > > Do you mean -- for merging with the code I posted earlier? Yes. To be specific:

1) Make a branch.
2) In the branch, make a module for HPPC, and check in.
3) In the branch, fiddle the other math code to use HPPC instead of the Colt collections.
4) Stir vigorously until the sort of thing you're reporting is dealt with.
5) Patch across to the trunk.
Re: A modest proposal for the Carrot integration
> I propose a branch. Diffs from the branch to the trunk can still be > posted on the JIRA, but I think that a branch would be worthwhile in > facilitating collaboration. Do you mean -- for merging with the code I posted earlier? By the way, I've integrated Colt from Mahout with our code base. Interesting things started to happen. First, we had this: Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.mahout.math.matrix.doublealgo.Sorting$4.compare(Sorting.java:214) at org.apache.mahout.math.Sorting.quickSort0(Sorting.java:725) at org.apache.mahout.math.Sorting.quickSort0(Sorting.java:773) at org.apache.mahout.math.Sorting.quickSort(Sorting.java:662) at org.apache.mahout.math.matrix.doublealgo.Sorting.runSort(Sorting.java:80) at org.apache.mahout.math.matrix.doublealgo.Sorting.sort(Sorting.java:236) at org.carrot2.matrix.factorization.IterativeMatrixFactorizationBase.order(IterativeMatrixFactorizationBase.java:149) When we added debugging statements -- the exception was gone. After a (longer) while, I checked for VM bugs. Yes, that was it -- there was a bug in the release of Sun's JVM 1.5 that we had on our server (for running 1.5-compliance builds). We upgraded that release and... we still have random exceptions with the above stack. More -- we have them with the newest 1.6 as well... Adding debugging statements makes the builds pass with flying colors. The bug only happens on one machine (which does have memory correction and is server-class hardware). In other words -- I've no idea what is happening. D.
Re: A modest proposal for the Carrot integration
I try to never say anything that decreases the output of a very productive person. I often fail, but I try. On Sat, Jan 16, 2010 at 10:11 AM, Benson Margulies wrote: > Sure you could. The 'refine patches attached to JIRA' approach is the > classic Lucene project methodology, and I'm the new kid on the block > here. > -- Ted Dunning, CTO DeepDyve
Re: Efficient dictionary storage in memory
I would recommend either the hashed representation (which cannot be easily reversed) or the Lucene version. No need to go to great lengths to rewrite this code. On Sat, Jan 16, 2010 at 8:50 AM, Grant Ingersoll wrote: > A Lucene index, w/ no storage, positions, etc. (optionally) turned off will > be very efficient. Plus, there is virtually no code to write. I've seen > bare bones indexes be as little as 20% of the original w/ very fast lookup. > Furthermore, there are many options available for controlling how much is > loaded into memory, etc. Finally, it will handle all the languages you > throw at it. > > -Grant > >
Re: A modest proposal for the Carrot integration
Sure you could. The 'refine patches attached to JIRA' approach is the classic Lucene project methodology, and I'm the new kid on the block here. On Sat, Jan 16, 2010 at 12:50 PM, Ted Dunning wrote: > How can we say no? > > On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies > wrote: > >> I volunteer to fight with the maven-release-plugin to make it. > > > > > -- > Ted Dunning, CTO > DeepDyve >
Re: Unit test failure
try rerunning... I think that one has intermittent failures. Perhaps Deneche can dig in. You will likely need to look in the Hadoop logs too. On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote: > https://issues.apache.org/jira/browse/MAHOUT-258 > > The error message: > > testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) > Time elapsed: 6.731 sec <<< ERROR! > java.io.IOException: Job failed! > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) > > does not give me much to go on. > > I don't see how adding new Set classes to my tree could cause this ...
[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap
[ https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-248: Resolution: Fixed Status: Resolved (was: Patch Available) Committed. > Next collections expansion kit: OpenObjectWhateverHashMap > > > Key: MAHOUT-248 > URL: https://issues.apache.org/jira/browse/MAHOUT-248 > Project: Mahout > Issue Type: Improvement > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-248.patch > > > Here's the next slice. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-252: Resolution: Fixed Status: Resolved (was: Patch Available) OK, now it is committed. > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-258) Unit test failure in CDInfo example
[ https://issues.apache.org/jira/browse/MAHOUT-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801236#action_12801236 ] Benson Margulies commented on MAHOUT-258: - manually deleting 'target' and running mvn again worked. Something is wrong with 'clean'. > Unit test failure in CDInfo example > --- > > Key: MAHOUT-258 > URL: https://issues.apache.org/jira/browse/MAHOUT-258 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.3 >Reporter: Benson Margulies > > {noformat} > --- > Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest > --- > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.844 sec <<< > FAILURE! > testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) > Time elapsed: 6.731 sec <<< ERROR! > java.io.IOException: Job failed! > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) > at > org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool.gatherInfos(CDInfosTool.java:90) > at > org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:220) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:168) > at junit.framework.TestCase.runBare(TestCase.java:134) > at junit.framework.TestResult$1.protect(TestResult.java:110) > at junit.framework.TestResult.runProtected(TestResult.java:128) > at junit.framework.TestResult.run(TestResult.java:113) > at junit.framework.TestCase.run(TestCase.java:124) > at junit.framework.TestSuite.runTest(TestSuite.java:232) > at junit.framework.TestSuite.run(TestSuite.java:227) > at > org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83) > at > 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62) > at > org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140) > at > org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165) > at org.apache.maven.surefire.Surefire.run(Surefire.java:107) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289) > at > org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005) > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: A modest proposal for the Carrot integration
How can we say no? On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies wrote: > I volunteer to fight with the maven-release-plugin to make it. -- Ted Dunning, CTO DeepDyve
Re: Abbreviations?
+1 as well. I think it should be in core rather than utils due to dependency issues. On Sat, Jan 16, 2010 at 7:16 AM, Olivier Grisel wrote: > 2010/1/16 Grant Ingersoll : > > I think we should start a new module, that will be the seed for a > subproject, called NLP and that contains the stuff for NLP. > > > > Either that or put them in the utils module, which is where I envision > all of things that are "helpful" for ML go, but aren't required. > > +1 for an explicit "org.apache.mahout.nlp module". Tools to turn > wikipedia dumps into term freq vectors could also move there instead > of "examples". > > -- > Olivier > http://twitter.com/ogrisel - http://code.oliviergrisel.name > -- Ted Dunning, CTO DeepDyve
Re: Abbreviations?
How about src/main/resources/nlp? On Sat, Jan 16, 2010 at 9:31 AM, Benson Margulies wrote: > Sure. > > However, the immediate contribution is data. src/main/resources? Something > else? > > On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel > wrote: > > 2010/1/16 Grant Ingersoll : > >> I think we should start a new module, that will be the seed for a > subproject, called NLP and that contains the stuff for NLP. > >> > >> Either that or put them in the utils module, which is where I envision > all of things that are "helpful" for ML go, but aren't required. > > > > +1 for an explicit "org.apache.mahout.nlp module". Tools to turn > > wikipedia dumps into term freq vectors could also move there instead > > of "examples". > > > > -- > > Olivier > > http://twitter.com/ogrisel - http://code.oliviergrisel.name > > > -- Ted Dunning, CTO DeepDyve
Unit test failure
https://issues.apache.org/jira/browse/MAHOUT-258 The error message: testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) Time elapsed: 6.731 sec <<< ERROR! java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) does not give me much to go on. I don't see how adding new Set classes to my tree could cause this ...
[jira] Created: (MAHOUT-258) Unit test failure in CDInfo example
Unit test failure in CDInfo example --- Key: MAHOUT-258 URL: https://issues.apache.org/jira/browse/MAHOUT-258 Project: Mahout Issue Type: Bug Affects Versions: 0.3 Reporter: Benson Margulies {noformat} --- Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest --- Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.844 sec <<< FAILURE! testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest) Time elapsed: 6.731 sec <<< ERROR! java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool.gatherInfos(CDInfosTool.java:90) at org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:220) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:168) at junit.framework.TestCase.runBare(TestCase.java:134) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:124) at junit.framework.TestSuite.runTest(TestSuite.java:232) at junit.framework.TestSuite.run(TestSuite.java:227) at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83) at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62) at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140) at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165) at org.apache.maven.surefire.Surefire.run(Surefire.java:107) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289) at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005) {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-257) Get rid of GenericSorting.java
Get rid of GenericSorting.java -- Key: MAHOUT-257 URL: https://issues.apache.org/jira/browse/MAHOUT-257 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies GenericSorting.java has one function left in it. Let's move that to Sorting.java and delete the class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-256) Clean up raw type usage
Clean up raw type usage --- Key: MAHOUT-256 URL: https://issues.apache.org/jira/browse/MAHOUT-256 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies Turning the Object-related Colt maps into Generics has left a number of other classes referencing raw types (e.g. matrices). These need to be made generic and cleaned up. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-255) Open hash set and map that plug into java.util
Open hash set and map that plug into java.util -- Key: MAHOUT-255 URL: https://issues.apache.org/jira/browse/MAHOUT-255 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies Aside from the primitive type issues, the usual java.util.HashMap/Set classes suffer from horrible storage inefficiency. The Colt code can be adapted to add OpenHashSet and OpenHashMap that use open hashing and implement the full Collections interfaces. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-254) Primitive set unit tests
Primitive set unit tests Key: MAHOUT-254 URL: https://issues.apache.org/jira/browse/MAHOUT-254 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.3 Reporter: Benson Margulies Assignee: Benson Margulies The primitive sets need unit tests. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
A modest proposal for the Carrot integration
I propose a branch. Diffs from the branch to the trunk can still be posted on the JIRA, but I think that a branch would be worthwhile in facilitating collaboration. I volunteer to fight with the maven-release-plugin to make it.
Re: Abbreviations?
Sure. However, the immediate contribution is data. src/main/resources? Something else? On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel wrote: > 2010/1/16 Grant Ingersoll : >> I think we should start a new module, that will be the seed for a >> subproject, called NLP and that contains the stuff for NLP. >> >> Either that or put them in the utils module, which is where I envision all >> of things that are "helpful" for ML go, but aren't required. > > +1 for an explicit "org.apache.mahout.nlp module". Tools to turn > wikipedia dumps into term freq vectors could also move there instead > of "examples". > > -- > Olivier > http://twitter.com/ogrisel - http://code.oliviergrisel.name >
[jira] Commented: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801233#action_12801233 ] Benson Margulies commented on MAHOUT-252: - I hope not. What are you seeing? {noformat} A math/src/test/java-templates/org/apache/mahout/math/set M math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeValueTypeHashMapTest.java.t M math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeObjectHashMapTest.java.t M math/src/test/java-templates/org/apache/mahout/math/map/OpenObjectValueTypeHashMapTest.java.t ! math/src/main/ObjectValueTypeProcedure.java.t M math/src/main/java/org/apache/mahout/math/matrix/impl/SelectedSparseObjectMatrix1D.java D math/src/main/java/org/apache/mahout/math/map/AbstractMap.java A math/src/main/java/org/apache/mahout/math/set A math/src/main/java/org/apache/mahout/math/set/AbstractSet.java A math/src/main/java-templates/org/apache/mahout/math/set A math/src/main/java-templates/org/apache/mahout/math/set/AbstractKeyTypeSet.java.t A math/src/main/java-templates/org/apache/mahout/math/set/OpenKeyTypeHashSet.java.t M math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeObjectMap.java.t M math/src/main/java-templates/org/apache/mahout/math/map/AbstractObjectValueTypeMap.java.t M math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeValueTypeMap.java.t {noformat} > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Efficient dictionary storage in memory
A Lucene index, w/ no storage, positions, etc. (optionally) turned off will be very efficient. Plus, there is virtually no code to write. I've seen bare bones indexes be as little as 20% of the original w/ very fast lookup. Furthermore, there are many options available for controlling how much is loaded into memory, etc. Finally, it will handle all the languages you throw at it. -Grant On Jan 16, 2010, at 9:10 AM, Robin Anil wrote: > Currently java strings use double the space of the characters in it because > its all in utf-16. A 190MB dictionary file therefore uses around 600MB when > loaded into a HashMap. Is there some optimization we could > do in terms of storing them and ensuring that chinese, devanagiri and other > characters dont get messed up in the process. > > Some options benson suggested was: storing just the byte[] form and adding > the the option of supplying the hash function in OpenObjectIntHashmap or > even using a UTF-8 string. > > Or we could leave this alone. I currently estimate the memory requirement > using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings when > generating the dictionary split for the vectorizer > > Robin
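Robin's sizing formula quoted above, 8 * ( (int) ( num_chars *2 + 45)/8 ), can be made concrete with a small helper: two bytes per char (UTF-16) plus roughly 45 bytes of String/char[] object overhead, rounded down to an 8-byte allocation boundary. This is only a sketch of the heuristic from the thread; the class and method names are made up for illustration.

```java
// Sketch of the per-String heap estimate discussed in the thread.
// The 45-byte constant is the thread's rough allowance for the String
// and char[] object headers; names here are illustrative only.
public class StringMemoryEstimate {

    /** Estimated heap bytes for a java.lang.String of the given length. */
    public static long estimateBytes(int numChars) {
        // Integer division rounds down to an 8-byte boundary.
        return 8L * ((numChars * 2 + 45) / 8);
    }

    public static void main(String[] args) {
        // A 10-character term: 10*2 + 45 = 65 -> 65/8 = 8 -> 64 bytes.
        System.out.println(estimateBytes(10)); // prints 64
    }
}
```

At ~64 bytes for a 10-character term, a multi-million-entry dictionary quickly reaches the hundreds of megabytes Robin reports.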
Re: Efficient dictionary storage in memory
2010/1/16 Drew Farris : > I agree the overhead of byte[] -> UTF-8 probably isn't too good for > lookup performance. > > In line with Sean's suggestion, I've used tries in the past for doing > this sort of string -> integer mapping. They generally perform well > enough for adding entries as well as retrieval. Not nearly as > efficient as a hash, but there is usually enough of a memory savings > to make it worth it. They have the added benefit of making it easy to > do prefix searches, although that isn't a strict requirement here. > > As Olivier suggests, a bloom filter may be an option, but wouldn't a > secondary data structure be required to hold the actual values? Would > false positives really be an issue with a dictionary scale problem? > > I presume there's a need for compact integer -> string representation > which can be achieved by using string difference compression. Seeking > to a mod of the id and then building up the final string by scanning > forward through the list of incremental changes. IIRC, Lucene does > something like this. AFAIK we only use a dictionary for term value (string representation) to term index (or more generally feature index) mapping. But then the value is no longer needed for training and testing the models. Only Vectors of feature values (term counts, frequencies, TF-IDF) are needed to classify / cluster a document or train a model. Hence the use of a hashed representation where the dictionary from term representations to feature indexes is only implicitly represented by a hash function up to some adjustable hash collisions rate. In practice the collisions do not hurt convergence of models such as linear SVMs a.k.a. large margin perceptrons (or regularized logistic regression and probably naive bayesian classifiers too) thanks to the redundant nature of dataset features in NLP tasks (see papers cited by John Langford in the previous webpage for reference). -- Olivier http://twitter.com/ogrisel - http://code.oliviergrisel.name
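The hashed representation Olivier describes can be sketched as follows: the explicit dictionary disappears, and a term's feature index is computed directly from its hash, accepting an adjustable collision rate. This is an illustrative sketch only, not a Mahout API; all names are hypothetical.

```java
// Minimal sketch of a "dictionary-free" hashed term vectorizer:
// instead of looking a term up in a dictionary, its feature index is
// derived directly from a hash of the term, so occasional collisions
// merge two terms into one feature. Names are illustrative only.
public class HashedVectorizer {

    private final int dimensions;

    public HashedVectorizer(int dimensions) {
        this.dimensions = dimensions;
    }

    /** Feature index for a term: hash reduced modulo the vector size. */
    public int indexFor(String term) {
        // Math.floorMod keeps the result non-negative for negative hashes.
        return Math.floorMod(term.hashCode(), dimensions);
    }

    /** Term-count vector for a token stream; no dictionary needed. */
    public int[] countVector(String[] tokens) {
        int[] counts = new int[dimensions];
        for (String token : tokens) {
            counts[indexFor(token)]++;
        }
        return counts;
    }
}
```

The memory cost is now fixed by the chosen dimensionality rather than by vocabulary size, which is exactly the trade Olivier's cited hash-representation work argues is benign for redundant NLP features.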
[jira] Commented: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801204#action_12801204 ] Drew Farris commented on MAHOUT-252: Is this committed? It seems like there are classes related to this in mahout-math now. > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Efficient dictionary storage in memory
I agree the overhead of byte[] -> UTF-8 probably isn't too good for lookup performance. In line with Sean's suggestion, I've used tries in the past for doing this sort of string -> integer mapping. They generally perform well enough for adding entries as well as retrieval. Not nearly as efficient as a hash, but there is usually enough of a memory savings to make it worth it. They have the added benefit of making it easy to do prefix searches, although that isn't a strict requirement here. As Olivier suggests, a bloom filter may be an option, but wouldn't a secondary data structure be required to hold the actual values? Would false positives really be an issue with a dictionary scale problem? I presume there's a need for compact integer -> string representation which can be achieved by using string difference compression. Seeking to a mod of the id and then building up the final string by scanning forward through the list of incremental changes. IIRC, Lucene does something like this. Drew On Sat, Jan 16, 2010 at 9:15 AM, Sean Owen wrote: > I'm speaking only off the top of my head, but my hunch it's not worth > optimizing this. Yes, the alternative is to store the string's UTF-8 > encoding as a byte[]. That's going to incur overhead in translating > back and forth to String where needed, and my guess is that's going to > be big enough to make this not worthwhile. > > The only other idea I have is a trie, which is typically a great data > structure for dictionaries like this. > > Sean > > > On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil wrote: >> Currently java strings use double the space of the characters in it because >> its all in utf-16. A 190MB dictionary file therefore uses around 600MB when >> loaded into a HashMap. Is there some optimization we could >> do in terms of storing them and ensuring that chinese, devanagiri and other >> characters dont get messed up in the process. 
>> >> Some options benson suggested was: storing just the byte[] form and adding >> the the option of supplying the hash function in OpenObjectIntHashmap or >> even using a UTF-8 string. >> >> Or we could leave this alone. I currently estimate the memory requirement >> using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings when >> generating the dictionary split for the vectorizer >> >> Robin >> >
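The string -> integer trie Drew mentions can be sketched minimally as below. A memory-conscious version would back each node with a primitive char -> Object map (e.g. from the new collections module) rather than a java.util.HashMap, so treat this as an illustrative sketch with hypothetical names, not the structure Drew actually used.

```java
// Minimal string -> int trie sketch for a dictionary: shared prefixes
// are stored once, trading some lookup speed for memory versus a flat
// HashMap<String, Integer>, and prefix queries come for free.
import java.util.HashMap;
import java.util.Map;

public class DictionaryTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int value = -1; // -1 means "no entry ends here"
    }

    private final Node root = new Node();

    public void put(String term, int id) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
        }
        node.value = id;
    }

    /** Returns the term's id, or -1 if the term is absent. */
    public int get(String term) {
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            node = node.children.get(term.charAt(i));
            if (node == null) {
                return -1;
            }
        }
        return node.value;
    }
}
```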
Re: Abbreviations?
2010/1/16 Grant Ingersoll : > I think we should start a new module, that will be the seed for a subproject, > called NLP and that contains the stuff for NLP. > > Either that or put them in the utils module, which is where I envision all of > things that are "helpful" for ML go, but aren't required. +1 for an explicit "org.apache.mahout.nlp module". Tools to turn wikipedia dumps into term freq vectors could also move there instead of "examples". -- Olivier http://twitter.com/ogrisel - http://code.oliviergrisel.name
[jira] Updated: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms
[ https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Farris updated MAHOUT-185: --- Attachment: MAHOUT-185.patch This patch adds bin/mahout, a simple bash script based heavily on similar scripts found in hadoop and nutch. Doesn't follow Robin's original spec to the letter, but perhaps this is a reasonable start upon which we can build. I really put this together because I'm tired of typing 'mvn exec:java -D [...]' all the time. > Add mahout shell script for easy launching of various algorithms > > > Key: MAHOUT-185 > URL: https://issues.apache.org/jira/browse/MAHOUT-185 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.2 > Environment: linux, bash >Reporter: Robin Anil > Fix For: 0.3 > > Attachments: MAHOUT-185.patch > > > Currently, Each algorithm has a different point of entry. At its too > complicated to understand and launch each one. A mahout shell script needs > to be made in the bin directory which does something like the following > mahout classify -algorithm bayes [OPTIONS] > mahout cluster -algorithm canopy [OPTIONS] > mahout fpm -algorithm pfpgrowth [OPTIONS] > mahout taste -algorithm slopeone [OPTIONS] > mahout misc -algorithm createVectorsFromText [OPTIONS] > mahout examples WikipediaExample -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Abbreviations?
I think we should start a new module, that will be the seed for a subproject, called NLP and that contains the stuff for NLP. Either that or put them in the utils module, which is where I envision all of things that are "helpful" for ML go, but aren't required. On Jan 16, 2010, at 8:41 AM, Benson Margulies wrote: > I have approval from the CEO to contribute our collection of > abbreviations to Mahout. > > We use them with the ICU breakers. > > I guess IP clearance is called for here, but, thinking ahead, where > would people like to see files of abbreviations in various languages > show up?
Re: Efficient dictionary storage in memory
2010/1/16 Sean Owen : > 351MB isn't so bad. > > I do think the next-best idea to explore is a trie, which could use a > char->Object map data structure provided by our new collections > module? To the extent this data is more compact when encoded in UTF-8, > it will be *much* more compact encoded in a trie. A more radical way to solve this dictionary memory issue would be to use a hashed representation of the term counts: http://hunch.net/~jl/projects/hash_reps/index.html or maybe a less radical yet more complicated approach to implement, such as Counting Filters (a variant of Bloom Filters, http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters ). Maybe it would be best implemented by extracting the public API of DictionaryVectorizer into an interface, TermVectorizer or just Vectorizer, and providing alternative implementations such as HashingVectorizer and CountingFiltersVectorizer (though I haven't checked yet whether they are iso-functional, even setting aside the collision / false-negative probabilities). -- Olivier http://twitter.com/ogrisel - http://code.oliviergrisel.name
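The hashed-representation idea Olivier links to can be sketched in a few lines: hash each term straight into a fixed-size count array and never store the term itself, trading a dictionary's memory footprint for some collision noise. This is an illustrative sketch only; the class name, the use of String.hashCode, and the dimension are my assumptions, not anything in Mahout:

```java
/**
 * Toy "hashing trick" vectorizer: term counts are kept in a flat array
 * indexed by a hash of the term, so no dictionary is held in memory.
 * Distinct terms can collide and share a bucket; counts are therefore
 * upper-bound estimates, not exact.
 */
public class HashingVectorizerSketch {
  private final int dimension;
  private final int[] counts;

  public HashingVectorizerSketch(int dimension) {
    this.dimension = dimension;
    this.counts = new int[dimension];
  }

  /** Map a term to a bucket without ever storing the term. */
  public int indexOf(String term) {
    // floorMod keeps the index non-negative even for negative hash codes.
    return Math.floorMod(term.hashCode(), dimension);
  }

  public void add(String term) {
    counts[indexOf(term)]++;
  }

  /** Count for this term's bucket (may include colliding terms). */
  public int countOf(String term) {
    return counts[indexOf(term)];
  }
}
```

A counting Bloom filter variant would use several independent hash functions and take the minimum of the addressed counters, reducing (but not eliminating) the overestimation from collisions.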
Re: Efficient dictionary storage in memory
351MB isn't so bad. I do think the next-best idea to explore is a trie, which could use a char->Object map data structure provided by our new collections module? To the extent this data is more compact when encoded in UTF-8, it will be *much* more compact encoded in a trie. Sean On Sat, Jan 16, 2010 at 2:29 PM, Robin Anil wrote: > In this specific scenario. Ability to handle bigger dictionary per node > where the dictionary is load once is a big win for the dictionary > vectorizer. This in turn reduces the number of partial vector generation > passes. >
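For concreteness, the trie idea Sean describes can be sketched as follows. Shared prefixes are stored once, and each complete term carries an int id (its dictionary index). This is a toy using boxed HashMap nodes; a real implementation would use a primitive char -> node map of the kind the new collections module could provide:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal trie mapping terms to int dictionary ids. Prefix sharing is
 * what makes a trie compact for large dictionaries: "dict" and
 * "dictionary" share their first four nodes.
 */
public class DictionaryTrie {
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    int id = -1; // -1 means no term ends at this node
  }

  private final Node root = new Node();
  private int nextId = 0;

  /** Insert a term, returning its (possibly pre-existing) id. */
  public int add(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
    }
    if (node.id < 0) {
      node.id = nextId++;
    }
    return node.id;
  }

  /** Look up a term's id, or -1 if it was never added. */
  public int idOf(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return -1;
      }
    }
    return node.id;
  }
}
```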
Re: Efficient dictionary storage in memory
If there is an option of storing keys in compressed form in memory, I am all for exploring that On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil wrote: > In this specific scenario. Ability to handle bigger dictionary per node > where the dictionary is load once is a big win for the dictionary > vectorizer. This in turn reduces the number of partial vector generation > passes. > > > I ran the whole wikipedia. I got an 880MB dictionary. I pruned words which > occur only once in the entire set and i got a 351MB dictionary file. I had > to split it on c1.medium(2 core 1.7GB ec2 instance) at about 180-190 mb each > so that it could be loaded in to the memory. This added another 1-2 hours to > the whole job. > > Currently the stats are as follows > > 20 GB of wikipedia data in sequence files(uncompressed) > Counting Job took 1:20 mins > 2 partial vector generation each took 2 hours each > vector merging took about 40 mins more. > finally generated a gzip compressed vectors file of 3.50GB(which i think is > too large) > > Total 6 hours to run. I could easily cut down the 2 pass into one pass had > I was able to fit the whole dictionary in memory > > Robin > > > > On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies > wrote: > >> While I egged Robin on to some extent on this topic by IM, I should >> point out the following. >> >> We run large amounts of text through Java at Basis, and we always use >> String. I have an 8G laptop :-), but there you have it. Anything we do >> in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi >> (UTF-8>UTF-16) so it doesn't make sense for us to optimize this. >> Obviously, compression is an option in various ways, and we could >> imagine some magic containers that optimized string storage in one way >> or the other. >> >> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil wrote: >> > Currently java strings use double the space of the characters in it >> because >> > its all in utf-16. 
A 190MB dictionary file therefore uses around 600MB >> when >> > loaded into a HashMap. Is there some optimization we >> could >> > do in terms of storing them and ensuring that chinese, devanagiri and >> other >> > characters dont get messed up in the process. >> > >> > Some options benson suggested was: storing just the byte[] form and >> adding >> > the the option of supplying the hash function in OpenObjectIntHashmap or >> > even using a UTF-8 string. >> > >> > Or we could leave this alone. I currently estimate the memory >> requirement >> > using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings >> when >> > generating the dictionary split for the vectorizer >> > >> > Robin >> > >> > >
Re: Efficient dictionary storage in memory
In this specific scenario, the ability to handle a bigger dictionary per node, where the dictionary is loaded once, is a big win for the dictionary vectorizer. This in turn reduces the number of partial vector generation passes. I ran the whole Wikipedia set and got an 880MB dictionary. I pruned words which occur only once in the entire set and got a 351MB dictionary file. I had to split it on c1.medium (2-core, 1.7GB EC2 instance) at about 180-190 MB each so that it could be loaded into memory. This added another 1-2 hours to the whole job. Currently the stats are as follows: 20 GB of Wikipedia data in sequence files (uncompressed); the counting job took 1:20 mins; the 2 partial vector generation passes took 2 hours each; vector merging took about 40 mins more; and it finally generated a gzip-compressed vectors file of 3.50GB (which I think is too large). Total: 6 hours to run. I could easily have cut the 2 passes down to one pass had I been able to fit the whole dictionary in memory. Robin On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies wrote: > While I egged Robin on to some extent on this topic by IM, I should > point out the following. > > We run large amounts of text through Java at Basis, and we always use > String. I have an 8G laptop :-), but there you have it. Anything we do > in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi > (UTF-8>UTF-16) so it doesn't make sense for us to optimize this. > Obviously, compression is an option in various ways, and we could > imagine some magic containers that optimized string storage in one way > or the other. > > On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil wrote: > > Currently java strings use double the space of the characters in it > because > > its all in utf-16. A 190MB dictionary file therefore uses around 600MB > when > > loaded into a HashMap. Is there some optimization we > could > > do in terms of storing them and ensuring that chinese, devanagiri and > other > > characters dont get messed up in the process.
> > > > Some options benson suggested was: storing just the byte[] form and > adding > > the the option of supplying the hash function in OpenObjectIntHashmap or > > even using a UTF-8 string. > > > > Or we could leave this alone. I currently estimate the memory requirement > > using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings when > > generating the dictionary split for the vectorizer > > > > Robin > > >
Re: Efficient dictionary storage in memory
While I egged Robin on to some extent on this topic by IM, I should point out the following. We run large amounts of text through Java at Basis, and we always use String. I have an 8G laptop :-), but there you have it. Anything we do in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi (UTF-8>UTF-16) so it doesn't make sense for us to optimize this. Obviously, compression is an option in various ways, and we could imagine some magic containers that optimized string storage in one way or the other. On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil wrote: > Currently java strings use double the space of the characters in it because > its all in utf-16. A 190MB dictionary file therefore uses around 600MB when > loaded into a HashMap. Is there some optimization we could > do in terms of storing them and ensuring that chinese, devanagiri and other > characters dont get messed up in the process. > > Some options benson suggested was: storing just the byte[] form and adding > the the option of supplying the hash function in OpenObjectIntHashmap or > even using a UTF-8 string. > > Or we could leave this alone. I currently estimate the memory requirement > using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings when > generating the dictionary split for the vectorizer > > Robin >
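The byte[]-plus-custom-hash option under discussion might look roughly like the wrapper below: keep each term as its UTF-8 bytes (half the size of UTF-16 for ASCII-heavy text) and give the wrapper value-based hashCode/equals so it can serve as a hash key. Utf8Term is a hypothetical name; to use raw byte[] keys directly, Mahout's OpenObjectIntHashMap would additionally need the pluggable hash function Benson mentioned:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/**
 * Hypothetical UTF-8-backed term key. The round-trip through
 * StandardCharsets.UTF_8 is lossless, so Chinese, Devanagari, etc.
 * are preserved; only the in-memory encoding changes.
 */
public final class Utf8Term {
  private final byte[] bytes;

  public Utf8Term(String term) {
    this.bytes = term.getBytes(StandardCharsets.UTF_8);
  }

  /** Decode back to a String when the caller actually needs one. */
  public String asString() {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public int lengthInBytes() {
    return bytes.length;
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(bytes);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof Utf8Term && Arrays.equals(bytes, ((Utf8Term) o).bytes);
  }
}
```

The decode cost in asString() is exactly the translation overhead Sean warns about; whether the memory saved is worth it depends on how often terms must be turned back into Strings.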
Re: Efficient dictionary storage in memory
I'm speaking only off the top of my head, but my hunch is that it's not worth optimizing this. Yes, the alternative is to store the string's UTF-8 encoding as a byte[]. That's going to incur overhead in translating back and forth to String where needed, and my guess is that's going to be big enough to make this not worthwhile. The only other idea I have is a trie, which is typically a great data structure for dictionaries like this. Sean On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil wrote: > Currently java strings use double the space of the characters in it because > its all in utf-16. A 190MB dictionary file therefore uses around 600MB when > loaded into a HashMap. Is there some optimization we could > do in terms of storing them and ensuring that chinese, devanagiri and other > characters dont get messed up in the process. > > Some options benson suggested was: storing just the byte[] form and adding > the the option of supplying the hash function in OpenObjectIntHashmap or > even using a UTF-8 string. > > Or we could leave this alone. I currently estimate the memory requirement > using the formula 8 * ( (int) ( num_chars *2 + 45)/8 ) for strings when > generating the dictionary split for the vectorizer > > Robin >
Efficient dictionary storage in memory
Currently Java strings use double the space of the characters in them, because it's all UTF-16. A 190MB dictionary file therefore uses around 600MB when loaded into a HashMap. Is there some optimization we could do in terms of storing them, while ensuring that Chinese, Devanagari and other characters don't get messed up in the process? Some options Benson suggested were: storing just the byte[] form and adding the option of supplying the hash function in OpenObjectIntHashMap, or even using a UTF-8 string. Or we could leave this alone. I currently estimate the memory requirement using the formula 8 * ( (int) ( num_chars * 2 + 45) / 8 ) for strings when generating the dictionary split for the vectorizer. Robin
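Robin's estimate formula, written out as code so the arithmetic is explicit. The constant 45 is his empirical per-entry overhead figure (object and array headers plus HashMap entry cost); actual overhead varies by JVM, and note that the integer division rounds down rather than up to the next 8-byte boundary:

```java
/**
 * Per-String memory estimate from Robin's message:
 * roughly 2 bytes per char (UTF-16) plus ~45 bytes of overhead,
 * truncated to a multiple of 8 (JVM object alignment).
 */
public class StringMemoryEstimate {
  public static long estimateBytes(int numChars) {
    return 8L * ((numChars * 2 + 45) / 8);
  }
}
```

For example, a 10-character term comes out at 8 * ((20 + 45) / 8) = 8 * 8 = 64 bytes under this estimate, versus the 10 bytes or so its UTF-8 form might occupy on disk.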
[jira] Updated: (MAHOUT-253) Proposal for high performance primitive collections.
[ https://issues.apache.org/jira/browse/MAHOUT-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated MAHOUT-253: --- Attachment: hppc-1.0-dev.zip > Proposal for high performance primitive collections. > > > Key: MAHOUT-253 > URL: https://issues.apache.org/jira/browse/MAHOUT-253 > Project: Mahout > Issue Type: New Feature > Components: Utils >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Attachments: hppc-1.0-dev.zip > > > A proposal for template-driven collections library (lists, sets, maps, > deques), with specializations for Java primitive types to save memory and > increase performance. The "templates" are regular Java classes written with > generics and certain "intrinsics", that is blocks replaceable by a > regexp-preprocessor. This lets one write the code once, immediately test it > (tests are also templates) and generate primitive versions from a single > source. > An additional interesting part is the benchmarking subsystem written on top > of JUnit ;) > There are major differences from the Java Collections API, most notably no > interfaces and interface-compatible views over sub-collections or key/value > sets. These classes also expose their internal implementation (buffers, > addressing, etc.) so that the code can be optimized for a particular use case. > These motivations are further discussed here, together with an API overview. > http://www.carrot-search.com/download/hppc/index.html > I am curious what you think about it. If folks like it, Carrot Search will > donate the code to Mahout (or Apache Commons-?) and will maintain it (because > we plan to use it in our internal projects anyway). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-253) Proposal for high performance primitive collections.
Proposal for high performance primitive collections. Key: MAHOUT-253 URL: https://issues.apache.org/jira/browse/MAHOUT-253 Project: Mahout Issue Type: New Feature Components: Utils Reporter: Dawid Weiss Assignee: Dawid Weiss Priority: Minor A proposal for a template-driven collections library (lists, sets, maps, deques), with specializations for Java primitive types to save memory and increase performance. The "templates" are regular Java classes written with generics and certain "intrinsics", that is, blocks replaceable by a regexp preprocessor. This lets one write the code once, immediately test it (tests are also templates) and generate primitive versions from a single source. An additional interesting part is the benchmarking subsystem written on top of JUnit ;) There are major differences from the Java Collections API, most notably no interfaces and no interface-compatible views over sub-collections or key/value sets. These classes also expose their internal implementation (buffers, addressing, etc.) so that the code can be optimized for a particular use case. These motivations are further discussed here, together with an API overview. http://www.carrot-search.com/download/hppc/index.html I am curious what you think about it. If folks like it, Carrot Search will donate the code to Mahout (or Apache Commons-?) and will maintain it (because we plan to use it in our internal projects anyway). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
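To illustrate why primitive specializations save memory, here is a toy open-addressed int -> int map in the spirit of the proposal: keys and values sit in flat primitive arrays, with no per-entry Entry objects and no Integer boxing. This is illustrative only, not HPPC code; the real library is template-generated and handles resizing, removal, and capacity edge cases properly:

```java
/**
 * Toy fixed-capacity int -> int hash map using open addressing with
 * linear probing. Assumes the map is never filled to capacity
 * (otherwise the probe loop would not terminate).
 */
public class IntIntOpenMapSketch {
  private final int[] keys;
  private final int[] values;
  private final boolean[] used;

  public IntIntOpenMapSketch(int capacity) {
    keys = new int[capacity];
    values = new int[capacity];
    used = new boolean[capacity];
  }

  /** Find the slot holding this key, or the first free slot for it. */
  private int slotFor(int key) {
    int slot = Math.floorMod(key, keys.length);
    while (used[slot] && keys[slot] != key) {
      slot = (slot + 1) % keys.length; // linear probe on collision
    }
    return slot;
  }

  public void put(int key, int value) {
    int slot = slotFor(key);
    used[slot] = true;
    keys[slot] = key;
    values[slot] = value;
  }

  /** Returns the mapped value, or defaultValue if the key is absent. */
  public int getOrDefault(int key, int defaultValue) {
    int slot = slotFor(key);
    return used[slot] ? values[slot] : defaultValue;
  }
}
```

Compared with HashMap&lt;Integer, Integer&gt;, the same data here costs about 9 bytes per entry (two ints and a boolean) instead of several boxed objects per entry, which is the memory argument the proposal makes.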
[jira] Updated: (MAHOUT-252) Sets (primitive types)
[ https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benson Margulies updated MAHOUT-252: Summary: Sets (primitive types) (was: Sets (primitive to primitive)) > Sets (primitive types) > -- > > Key: MAHOUT-252 > URL: https://issues.apache.org/jira/browse/MAHOUT-252 > Project: Mahout > Issue Type: New Feature > Components: Math >Affects Versions: 0.3 >Reporter: Benson Margulies >Assignee: Benson Margulies > Fix For: 0.3 > > Attachments: MAHOUT-252.patch > > > Here come the sets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Abbreviations?
I have approval from the CEO to contribute our collection of abbreviations to Mahout. We use them with the ICU breakers. I guess IP clearance is called for here, but, thinking ahead, where would people like to see files of abbreviations in various languages show up?