Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
Removing the Maven repository does not solve the problem, and neither does a
fresh checkout of the trunk.

But older revisions don't show any slowdown! I tried the following revisions:

These older revisions seem OK:

r896946 | srowen | 2010-01-07 19:02:41 +0100 (Thu, 07 Jan 2010) | 1 line
MAHOUT-238

r897134 | robinanil | 2010-01-08 09:23:22 +0100 (Fri, 08 Jan 2010) | 1 line
MAHOUT-221 Missed out two files while checking in FP-Bonsai

r897405 | adeneche | 2010-01-09 11:02:49 +0100 (Sat, 09 Jan 2010) | 1 line
MAHOUT-216


>>> The slowdowns start at this revision!

r897440 | srowen | 2010-01-09 13:53:25 +0100 (Sat, 09 Jan 2010) | 1 line
Code style adjustments; enabled/fixed TestSamplingIterator



On Sun, Jan 17, 2010 at 5:47 AM, deneche abdelhakim  wrote:
> I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04 setup.
>
> I suspect the problem is not caused -only- by RandomUtils, because:
>
> 1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but
> the test time used to be reported accurately by Maven. Now Maven
> reports that a test took less than a second when it actually took a lot
> more!
>
> 2. Most of my tests actually call RandomUtils.useTestSeed() in setUp()
> (InMemInputSplitTest included), but the tests still take a lot of time,
> and again it's not reported accurately by Maven.
>
> 3. I generally launch a 'mvn clean install' every Thursday. I never
> got these slowdowns until last Thursday (did we change anything that
> could have caused them?).
>
> On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies
>  wrote:

>>> Unit tests should generally be using a fixed seed and not need to load a
>>> secure seed from /dev/random.  I would say that RandomUtils is probably the
>>> problem here.  The secure seed should be loaded lazily only if the test seed
>>> is not in use.
>>
>> The problem, as I see it, is that the uncommons-math package starts
>> initializing a random seed as soon as you touch it, whether you need
>> it or not. RandomUtils can only avoid this by avoiding uncommons-math
>> in unit test mode.
>>
>>>
>>>
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
>>>
>>
>


Re: Unit test lag?

2010-01-16 Thread deneche abdelhakim
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04 setup.

I suspect the problem is not caused -only- by RandomUtils, because:

1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot), but
the test time used to be reported accurately by Maven. Now Maven
reports that a test took less than a second when it actually took a lot
more!

2. Most of my tests actually call RandomUtils.useTestSeed() in setUp()
(InMemInputSplitTest included), but the tests still take a lot of time,
and again it's not reported accurately by Maven.

3. I generally launch a 'mvn clean install' every Thursday. I never
got these slowdowns until last Thursday (did we change anything that
could have caused them?).

On Sun, Jan 17, 2010 at 12:33 AM, Benson Margulies
 wrote:
>>>
>> Unit tests should generally be using a fixed seed and not need to load a
>> secure seed from /dev/random.  I would say that RandomUtils is probably the
>> problem here.  The secure seed should be loaded lazily only if the test seed
>> is not in use.
>
> The problem, as I see it, is that the uncommons-math package starts
> initializing a random seed as soon as you touch it, whether you need
> it or not. RandomUtils can only avoid this by avoiding uncommons-math
> in unit test mode.
>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll wrote:

> On the indexing side, add in batches and reuse the document and fields.
>
Done. Squeezed out 5 secs there (25 down from 30), and got it further down to
22 by increasing max merge docs.

>
> On the search side, no need for a BooleanQuery and no need for scoring, so
> you will likely want your own Collector (dead simple to write).
>
Brought it down to 15 secs from 30 for 1 million lookups, using a TermQuery and
a Collector that is instantiated once.
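
For reference, a minimal sketch of the kind of scoring-free collector meant here, assuming a Lucene 2.9/3.0-style Collector API; the class and field names are illustrative, not the actual code:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Records the first matching doc id; no scoring, no TopDocs allocation.
public class FirstDocCollector extends Collector {

  private int docBase;
  private int firstDoc = -1;

  @Override
  public void setScorer(Scorer scorer) {
    // scoring is not needed for a dictionary lookup
  }

  @Override
  public void collect(int doc) throws IOException {
    if (firstDoc < 0) {
      firstDoc = docBase + doc;
    }
  }

  @Override
  public void setNextReader(IndexReader reader, int docBase) {
    this.docBase = docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  /** Returns -1 if nothing matched. */
  public int getFirstDoc() {
    return firstDoc;
  }
}

It would be used as isearcher.search(new TermQuery(new Term(WORD, word)), collector); to reuse a single instance across lookups you would reset firstDoc between calls.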


>
> It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
> payload and then use TermPositions (and no query at all) and forgo searching
> altogether.  Then you just need an IndexReader.  First search will always
> be slow, unless you "warm" it first.  This should help avoid the cost of
> going to document storage, which is almost always the most expensive thing
> one does in Lucene due to its random nature.  Might even be beneficial to be
> able to retrieve IDs in batches (sorted lexicographically, too).
>

Since all the words have unique ids, I don't think there is any need for
assigning ids; I will reuse the Lucene document id.
Testing shows that this decreased index time to 13 sec and lookup time to 11
sec.

But I still don't get the "not searching" part. I will take a look at
TermPositions and how it's done.

>
> Don't get me wrong, it will likely be slower than a hash map, but the hash
> map won't scale and the Lucene term dictionary is delta encoded, so it will
> compress a fair amount.  Also, as you grow, you will need to use an
> FSDirectory.

I still haven't seen the size diff for what I was doing previously. But after
I removed the ID field I get a 1/3 saving (220MB) for a 5 million word
dictionary as compared to a HashMap.

With 5 million words and 10 million lookups, the HashMap is 4x faster at adds
and 6x faster at lookups. The in-memory Lucene dict gives around 100K lookups
per second, which is about 1MB/s for 10-byte tokens, a fair way from the
50MB/s disk speed limit. Then again, it just needs to match the speed with
which the Lucene Analyzer produces tokens.

> -Grant
>
> On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:
>
> > Here is my attempt at making a dictionary lookup using Lucene. I need some
> > pointers on optimising it. Currently it takes 30 secs for a million lookups
> > using a dictionary of 500K words, about 30x that of a hashmap. But the space
> > used is almost the same, as far as I can see from the memory sizes in the
> > process manager.
> >
> >
> > private static final String ID = "id";
> >  private static final String WORD = "word";
> >  private IndexWriter iwriter;
> >  private IndexSearcher isearcher;
> >  private RAMDirectory idx = new RAMDirectory();
> >  private Analyzer analyzer = new WhitespaceAnalyzer();
> >
> >  public void init() throws Exception {
> >this.iwriter =
> >new IndexWriter(idx, analyzer, true,
> > IndexWriter.MaxFieldLength.LIMITED);
> >
> >  }
> >
> >  public void destroy() throws Exception {
> >iwriter.close();
> >isearcher.close();
> >  }
> >
> >  public void ready() throws Exception {
> >iwriter.optimize();
> >iwriter.close();
> >
> >this.isearcher = new IndexSearcher(idx, true);
> >  }
> >
> >  public void addToDictionary(String word, Integer id) throws IOException
> {
> >
> >Document doc = new Document();
> >doc.add(new Field(WORD, word, Field.Store.NO,
> > Field.Index.NOT_ANALYZED));
> >doc.add(new Field(ID, id.toString(), Store.YES,
> > Field.Index.NOT_ANALYZED));
> > // ?? Is there a way other than storing the id as a string?
> >iwriter.addDocument(doc);
> >  }
> >
> >  public Integer get(String word) throws IOException, ParseException {
> >BooleanQuery query = new BooleanQuery();
> >query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
> >TopDocs top = isearcher.search(query, null, 1);
> >ScoreDoc[] hits = top.scoreDocs;
> >if (hits.length == 0) return null;
> >return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
> >  }
> >
> > On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:
> >
> >> A Lucene index, w/ no storage, positions, etc. (optionally) turned off
> will
> >> be very efficient.  Plus, there is virtually no code to write.  I've
> seen
> >> bare bones indexes be as little as 20% of the original w/ very fast
> lookup.
> >> Furthermore, there are many options available for controlling how much
> is
> >> loaded into memory, etc.  Finally, it will handle all the languages you
> >> throw at it.
> >>
> >> -Grant
> >>
> >> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
> >>
> >>> Currently Java strings use double the space of the characters in them
> >>> because it's all in UTF-16. A 190MB dictionary file therefore uses around
> >>> 600MB when loaded into a HashMap. Is there some optimization we could do
> >>> in terms of storing them while ensuring that Chinese, Devanagari and other
> >>> characters don't get messed up in the process?
> >>>
> >>> Some options Benson su

Re: Unit test lag?

2010-01-16 Thread Benson Margulies
>>
> Unit tests should generally be using a fixed seed and not need to load a
> secure seed from /dev/random.  I would say that RandomUtils is probably the
> problem here.  The secure seed should be loaded lazily only if the test seed
> is not in use.

The problem, as I see it, is that the uncommons-math package starts
initializing a random seed as soon as you touch it, whether you need
it or not. RandomUtils can only avoid this by avoiding uncommons-math
in unit test mode.

>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Unit test lag?

2010-01-16 Thread Ted Dunning
On Sat, Jan 16, 2010 at 1:40 PM, Drew Farris  wrote:

> Mahout does per-test forking, which means we're forking off a new JVM
> for each unit test execution; this adds overhead to tests that take
> 0.2s to complete. Is per-test forking strictly needed?
>

It shouldn't be.  I would count it a bug if it were.


>  ... wall time of 30s (!) or so. ... attempting to read from /dev/random.
>
>
Unit tests should generally be using a fixed seed and not need to load a
secure seed from /dev/random.  I would say that RandomUtils is probably the
problem here.  The secure seed should be loaded lazily only if the test seed
is not in use.



-- 
Ted Dunning, CTO
DeepDyve


Re: Efficient dictionary storage in memory

2010-01-16 Thread Grant Ingersoll
On the indexing side, add in batches and reuse the document and fields.

On the search side, no need for a BooleanQuery and no need for scoring, so you 
will likely want your own Collector (dead simple to write).  

It _MAY_ even be faster to simply do the indexing as a word w/ the id as a 
payload and then use TermPositions (and no query at all) and forgo searching 
altogether.  Then you just need an IndexReader.  First search will always be 
slow, unless you "warm" it first.  This should help avoid the cost of going to 
document storage, which is almost always the most expensive thing one does in 
Lucene due to its random nature.  Might even be beneficial to be able to 
retrieve IDs in batches (sorted lexicographically, too).

Don't get me wrong, it will likely be slower than a hash map, but the hash map 
won't scale and the Lucene term dictionary is delta encoded, so it will 
compress a fair amount.  Also, as you grow, you will need to use an FSDirectory.
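
To make the "no query at all" idea concrete, here is a minimal sketch against the Lucene 2.x/3.0 IndexReader/TermDocs API. It looks the word up directly in the term dictionary and uses the matching Lucene doc id as the dictionary id (the doc-id variant Robin describes elsewhere in the thread) rather than payloads; class and field names are illustrative, not Mahout code:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermDocsDictionary {

  private static final String WORD = "word";
  private final IndexReader reader;

  public TermDocsDictionary(IndexReader reader) {
    this.reader = reader;
  }

  /** Returns the Lucene doc id for the word, or null if the word is absent. */
  public Integer get(String word) throws IOException {
    TermDocs termDocs = reader.termDocs(new Term(WORD, word));
    try {
      // each word was indexed as its own document, so the first posting is the answer
      return termDocs.next() ? Integer.valueOf(termDocs.doc()) : null;
    } finally {
      termDocs.close();
    }
  }
}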

-Grant

On Jan 16, 2010, at 5:37 PM, Robin Anil wrote:

> Here is my attempt at making a dictionary lookup using Lucene. I need some
> pointers on optimising it. Currently it takes 30 secs for a million lookups
> using a dictionary of 500K words, about 30x that of a hashmap. But the space
> used is almost the same, as far as I can see from the memory sizes in the
> process manager.
> 
> 
> private static final String ID = "id";
>  private static final String WORD = "word";
>  private IndexWriter iwriter;
>  private IndexSearcher isearcher;
>  private RAMDirectory idx = new RAMDirectory();
>  private Analyzer analyzer = new WhitespaceAnalyzer();
> 
>  public void init() throws Exception {
>this.iwriter =
>new IndexWriter(idx, analyzer, true,
> IndexWriter.MaxFieldLength.LIMITED);
> 
>  }
> 
>  public void destroy() throws Exception {
>iwriter.close();
>isearcher.close();
>  }
> 
>  public void ready() throws Exception {
>iwriter.optimize();
>iwriter.close();
> 
>this.isearcher = new IndexSearcher(idx, true);
>  }
> 
>  public void addToDictionary(String word, Integer id) throws IOException {
> 
>Document doc = new Document();
>doc.add(new Field(WORD, word, Field.Store.NO,
> Field.Index.NOT_ANALYZED));
>doc.add(new Field(ID, id.toString(), Store.YES,
> Field.Index.NOT_ANALYZED));
> // ?? Is there a way other than storing the id as a string?
>iwriter.addDocument(doc);
>  }
> 
>  public Integer get(String word) throws IOException, ParseException {
>BooleanQuery query = new BooleanQuery();
>query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
>TopDocs top = isearcher.search(query, null, 1);
>ScoreDoc[] hits = top.scoreDocs;
>if (hits.length == 0) return null;
>return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
>  }
> 
> On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:
> 
>> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
>> be very efficient.  Plus, there is virtually no code to write.  I've seen
>> bare bones indexes be as little as 20% of the original w/ very fast lookup.
>> Furthermore, there are many options available for controlling how much is
>> loaded into memory, etc.  Finally, it will handle all the languages you
>> throw at it.
>> 
>> -Grant
>> 
>> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>> 
>>> Currently Java strings use double the space of the characters in them
>>> because it's all in UTF-16. A 190MB dictionary file therefore uses around
>>> 600MB when loaded into a HashMap. Is there some optimization we could do
>>> in terms of storing them while ensuring that Chinese, Devanagari and other
>>> characters don't get messed up in the process?
>>> 
>>> Some options Benson suggested were: storing just the byte[] form and
>>> adding the option of supplying the hash function in OpenObjectIntHashMap,
>>> or even using a UTF-8 string.
>>> 
>>> Or we could leave this alone. I currently estimate the memory requirement
>>> using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
>>> generating the dictionary split for the vectorizer.
>>> 
>>> Robin
>> 
>> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Unit test lag?

2010-01-16 Thread Olivier Grisel
Some tests are probably not calling:

RandomUtils.useTestSeed();

in a setUp() or static init. Maybe a MahoutTestCase base class with a
default static init that calls it would do.

Otherwise, I confirm that setting forkMode to "once" in maven/pom.xml
solves the issue (and all tests pass).
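
A minimal sketch of what such a base class could look like, assuming JUnit 3-style tests and that RandomUtils lives in org.apache.mahout.common (adjust the import to the real package); this is a suggestion, not an existing Mahout class:

import junit.framework.TestCase;
import org.apache.mahout.common.RandomUtils; // package assumed here

public abstract class MahoutTestCase extends TestCase {

  static {
    // runs once per (forked) test JVM, before any subclass test executes
    RandomUtils.useTestSeed();
  }

  @Override
  protected void setUp() throws Exception {
    super.setUp();
    RandomUtils.useTestSeed(); // harmless if already set; guards tests that reset it
  }
}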

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


Re: Unit test lag?

2010-01-16 Thread Benson Margulies
Oh, I see. We have to give up on the MersenneTwisterRNG in tests and
just use the JRE. Is that OK?

On Sat, Jan 16, 2010 at 5:44 PM, Olivier Grisel
 wrote:
> 2010/1/16 Drew Farris :
>> On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies  
>> wrote:
>>> . Running through strace showed
 that something was attempting to read from /dev/random. Sometimes
 it ran fine, but at least 25-30% of the time it ended up blocking until the
 entropy pool was refilled. To test, I moved /dev/random and created a
 link from /dev/urandom to /dev/random (the former doesn't block, but
 isn't cryptographically secure). It looks as if this could be related
 to the loading of the SecureRandomSeedGenerator class.
>>>
>>> Why not use a fixed random seed for unit tests? That would make them
>>> more repeatable and avoid this problem, no?
>>>
>>
>> It appears we are. in RandomUtils:
>>
>>  public static Random getRandom() {
>>    return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new
>> MersenneTwisterRNG();
>>  }
>>
>> But something somewhere is forcing SecureRandomSeedGenerator to get
>> loaded by the classloader which in turn does a 'new SecureRandom()' in
>> a private static final field assignment. Trying to track down what is
>> causing the generator to get loaded in the first place.
>>
>> But something is forcing the SecureRandomSeedGenerator class to get
>> loaded, which I suspect
>>
>
>
> MersenneTwisterRNG constructor calls:
>
>  this(DefaultSeedGenerator.getInstance().generateSeed(SEED_SIZE_BYTES));
>
> Which in turn calls:
>
>    private static final SeedGenerator[] GENERATORS = new SeedGenerator[]
>    {
>        new DevRandomSeedGenerator(),
>        new RandomDotOrgSeedGenerator(),
>        new SecureRandomSeedGenerator()
>    };
>
> In the definition of the class: DefaultSeedGenerator
>
> Unless forked tests are disabled, I don't see how to prevent the
> MersenneTwisterRNG from indirectly fetching entropy from /dev/random /
> SecureRandom.
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>


Re: Unit test lag?

2010-01-16 Thread Benson Margulies
I see a way, but it involves loading this class explicitly with reflection.

I'll make a patch.
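
For the record, one possible shape of that idea (a sketch only, not the actual patch): keep no reference to the uncommons-maths RNG on the test path, so its seed generators only come into play when a real RNG is requested. The class and field names below are illustrative:

import java.util.Random;

public final class LazyRandomFactory {

  private static volatile boolean testSeed = false;
  private static final long FIXED_TEST_SEED = 42L; // stand-in for the existing STANDARD_SEED

  private LazyRandomFactory() { }

  public static void useTestSeed() {
    testSeed = true;
  }

  public static Random getRandom() {
    if (testSeed) {
      return new Random(FIXED_TEST_SEED); // plain JRE RNG, no entropy fetched
    }
    try {
      // loaded and initialized only on this branch
      Class<?> rng = Class.forName("org.uncommons.maths.random.MersenneTwisterRNG");
      return (Random) rng.newInstance();
    } catch (Exception e) {
      throw new IllegalStateException("uncommons-maths RNG not available", e);
    }
  }
}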


Re: Unit test lag?

2010-01-16 Thread Olivier Grisel
2010/1/16 Drew Farris :
> On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies  
> wrote:
>> . Running through strace showed
>>> that something was attempting to read from /dev/random. Sometimes
>>> it ran fine, but at least 25-30% of the time it ended up blocking until the
>>> entropy pool was refilled. To test, I moved /dev/random and created a
>>> link from /dev/urandom to /dev/random (the former doesn't block, but
>>> isn't cryptographically secure). It looks as if this could be related
>>> to the loading of the SecureRandomSeedGenerator class.
>>
>> Why not use a fixed random seed for unit tests? That would make them
>> more repeatable and avoid this problem, no?
>>
>
> It appears we are. in RandomUtils:
>
>  public static Random getRandom() {
>    return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new
> MersenneTwisterRNG();
>  }
>
> But something somewhere is forcing SecureRandomSeedGenerator to get
> loaded by the classloader which in turn does a 'new SecureRandom()' in
> a private static final field assignment. Trying to track down what is
> causing the generator to get loaded in the first place.
>
> But something is forcing the SecureRandomSeedGenerator class to get
> loaded, which I suspect
>


MersenneTwisterRNG constructor calls:

  this(DefaultSeedGenerator.getInstance().generateSeed(SEED_SIZE_BYTES));

Which in turn calls:

    private static final SeedGenerator[] GENERATORS = new SeedGenerator[]
    {
        new DevRandomSeedGenerator(),
        new RandomDotOrgSeedGenerator(),
        new SecureRandomSeedGenerator()
    };

In the definition of the class: DefaultSeedGenerator

Unless forked tests are disabled, I don't see how to prevent the
MersenneTwisterRNG from indirectly fetching entropy from /dev/random /
SecureRandom.
-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


Re: Unit test lag?

2010-01-16 Thread Benson Margulies
This is going to be a lot of fun. That class is in uncommons-math, and
the connection to it from Mahout is hardly obvious.

On Sat, Jan 16, 2010 at 5:34 PM, Benson Margulies  wrote:
> It looks as if this could be related
 to the loading of the SecureRandomSeedGenerator class.
>>>
>
> Let's fix that class to defer until there's a good reason to make a seed.
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
Here is my attempt at making a dictionary lookup using Lucene. I need some
pointers on optimising it. Currently it takes 30 secs for a million lookups
using a dictionary of 500K words, about 30x that of a hashmap. But the space
used is almost the same, as far as I can see from the memory sizes in the
process manager.


 private static final String ID = "id";
  private static final String WORD = "word";
  private IndexWriter iwriter;
  private IndexSearcher isearcher;
  private RAMDirectory idx = new RAMDirectory();
  private Analyzer analyzer = new WhitespaceAnalyzer();

  public void init() throws Exception {
this.iwriter =
new IndexWriter(idx, analyzer, true,
IndexWriter.MaxFieldLength.LIMITED);

  }

  public void destroy() throws Exception {
iwriter.close();
isearcher.close();
  }

  public void ready() throws Exception {
iwriter.optimize();
iwriter.close();

this.isearcher = new IndexSearcher(idx, true);
  }

  public void addToDictionary(String word, Integer id) throws IOException {

Document doc = new Document();
doc.add(new Field(WORD, word, Field.Store.NO,
Field.Index.NOT_ANALYZED));
doc.add(new Field(ID, id.toString(), Store.YES,
Field.Index.NOT_ANALYZED));
// ?? Is there a way other than storing the id as a string?
iwriter.addDocument(doc);
  }

  public Integer get(String word) throws IOException, ParseException {
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term(WORD, word)), Occur.SHOULD);
TopDocs top = isearcher.search(query, null, 1);
ScoreDoc[] hits = top.scoreDocs;
if (hits.length == 0) return null;
return Integer.valueOf(isearcher.doc(hits[0].doc).get(ID));
  }

On Sat, Jan 16, 2010 at 10:20 PM, Grant Ingersoll wrote:

> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
> be very efficient.  Plus, there is virtually no code to write.  I've seen
> bare bones indexes be as little as 20% of the original w/ very fast lookup.
>  Furthermore, there are many options available for controlling how much is
> loaded into memory, etc.  Finally, it will handle all the languages you
> throw at it.
>
> -Grant
>
> On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:
>
> > Currently Java strings use double the space of the characters in them
> > because it's all in UTF-16. A 190MB dictionary file therefore uses around
> > 600MB when loaded into a HashMap. Is there some optimization we could do
> > in terms of storing them while ensuring that Chinese, Devanagari and other
> > characters don't get messed up in the process?
> >
> > Some options Benson suggested were: storing just the byte[] form and
> > adding the option of supplying the hash function in OpenObjectIntHashMap,
> > or even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory requirement
> > using the formula 8 * ((int) (num_chars * 2 + 45) / 8) for strings when
> > generating the dictionary split for the vectorizer.
> >
> > Robin
>
>


Re: Unit test lag?

2010-01-16 Thread Benson Margulies
It looks as if this could be related
>>> to the loading of the SecureRandomSeedGenerator class.
>>

Let's fix that class to defer until there's a good reason to make a seed.


Re: Unit test lag?

2010-01-16 Thread Olivier Grisel
2010/1/16 Benson Margulies :
> . Running through strace showed
>> that something was attempting to read from /dev/random. Sometimes
>> it ran fine, but at least 25-30% of the time it ended up blocking until the
>> entropy pool was refilled. To test, I moved /dev/random and created a
>> link from /dev/urandom to /dev/random (the former doesn't block, but
>> isn't cryptographically secure). It looks as if this could be related
>> to the loading of the SecureRandomSeedGenerator class.
>>

I also experience the same slowdown Drew describes, on Ubuntu machines too.

> Why not use a fixed random seed for unit tests? That would make them
> more repeatable and avoid this problem, no?
>

+1 for the fixed seed (42 is my favorite seed).

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


Re: Unit test lag?

2010-01-16 Thread Drew Farris
On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies  wrote:
> . Running through strace showed
>> that something was attempting to read from /dev/random. Sometimes
>> it ran fine, but at least 25-30% of the time it ended up blocking until the
>> entropy pool was refilled. To test, I moved /dev/random and created a
>> link from /dev/urandom to /dev/random (the former doesn't block, but
>> isn't cryptographically secure). It looks as if this could be related
>> to the loading of the SecureRandomSeedGenerator class.
>
> Why not use a fixed random seed for unit tests? That would make them
> more repeatable and avoid this problem, no?
>

It appears we are, in RandomUtils:

  public static Random getRandom() {
    return testSeed ? new MersenneTwisterRNG(STANDARD_SEED) : new MersenneTwisterRNG();
  }

But something somewhere is forcing SecureRandomSeedGenerator to get
loaded by the classloader, which in turn does a 'new SecureRandom()' in
a private static final field assignment. I'm trying to track down what is
causing the generator to get loaded in the first place.

But something is forcing the SecureRandomSeedGenerator class to get
loaded, which I suspect
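
A tiny self-contained demo of the pattern being described (illustrative only, not Mahout or uncommons-maths code): a 'private static final' field is constructed the moment the class is initialized, whether or not the seed is ever requested, and generateSeed() is the call that can block reading /dev/random on Linux.

import java.security.SecureRandom;

public class EagerSeedHolder {

  // constructed at class-initialization time, used or not
  private static final SecureRandom SEED_SOURCE = new SecureRandom();

  public static byte[] seed(int numBytes) {
    // may block until the kernel entropy pool has enough bits
    return SEED_SOURCE.generateSeed(numBytes);
  }
}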


[jira] Commented: (MAHOUT-153) Implement kmeans++ for initial cluster selection in kmeans

2010-01-16 Thread Isabel Drost (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280
 ] 

Isabel Drost commented on MAHOUT-153:
-

Welcome to Mahout. Thanks for stepping up and volunteering to take over the 
work for this issue.

> Implement kmeans++ for initial cluster selection in kmeans
> --
>
> Key: MAHOUT-153
> URL: https://issues.apache.org/jira/browse/MAHOUT-153
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Affects Versions: 0.2
> Environment: OS Independent
>Reporter: Panagiotis Papadimitriou
> Fix For: 0.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current implementation of k-means includes the following algorithms for 
> initial cluster selection (seed selection): 1) random selection of k points, 
> 2) use of canopy clusters.
> I plan to implement k-means++. The details of the algorithm are available 
> here: http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf.
> Design Outline: I will create an abstract class SeedGenerator and a subclass 
> KMeansPlusPlusSeedGenerator. The existing class RandomSeedGenerator will 
> become a subclass of SeedGenerator.
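
For readers who haven't seen it, a minimal sketch of the k-means++ seeding rule (D^2 weighting) on plain double[] points; this is illustrative only, not the SeedGenerator/KMeansPlusPlusSeedGenerator API proposed above:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public final class KMeansPlusPlusSketch {

  /** Chooses k initial centers: the first uniformly, the rest proportionally to D^2. */
  public static List<double[]> chooseSeeds(List<double[]> points, int k, Random rng) {
    List<double[]> seeds = new ArrayList<double[]>();
    seeds.add(points.get(rng.nextInt(points.size())));
    while (seeds.size() < k) {
      double[] weights = new double[points.size()];
      double total = 0.0;
      for (int i = 0; i < points.size(); i++) {
        weights[i] = nearestSquaredDistance(points.get(i), seeds);
        total += weights[i];
      }
      // sample the next center with probability proportional to its weight
      double r = rng.nextDouble() * total;
      int chosen = 0;
      for (double acc = weights[0]; acc < r && chosen < points.size() - 1; ) {
        chosen++;
        acc += weights[chosen];
      }
      seeds.add(points.get(chosen));
    }
    return seeds;
  }

  private static double nearestSquaredDistance(double[] p, List<double[]> centers) {
    double best = Double.MAX_VALUE;
    for (double[] c : centers) {
      double d = 0.0;
      for (int j = 0; j < p.length; j++) {
        double diff = p[j] - c[j];
        d += diff * diff;
      }
      best = Math.min(best, d);
    }
    return best;
  }
}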

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Unit test lag?

2010-01-16 Thread Benson Margulies
. Running through strace showed
> that something was attempting to read from /dev/random. Sometimes
> it ran fine, but at least 25-30% of the time it ended up blocking until the
> entropy pool was refilled. To test, I moved /dev/random and created a
> link from /dev/urandom to /dev/random (the former doesn't block, but
> isn't cryptographically secure). It looks as if this could be related
> to the loading of the SecureRandomSeedGenerator class.
>

Why not use a fixed random seed for unit tests? That would make them
more repeatable and avoid this problem, no?


Unit test lag?

2010-01-16 Thread Drew Farris
Recently I've been noticing that Mahout's unit tests take a considerable
amount of time to run, generally longer than what is reported in the
individual test output. I took a look at why this was the case and found a
couple of things:

Mahout does per-test forking, which means we're forking off a new JVM
for each unit test execution; this adds overhead to tests that take
0.2s to complete. Is per-test forking strictly needed?

I captured the command line used to execute one of the forked tests
(InMemInputSplitTest) by running mvn -X and executed it from the shell
repeatedly using time to see what was going on. In one of every few
invocations, the test in question would report completion in 3s, but
time reported a wall time of 30s (!) or so. Running through strace showed
that something was attempting to read from /dev/random. Sometimes
it ran fine, but at least 25-30% of the time it ended up blocking until the
entropy pool was refilled. To test, I moved /dev/random and created a
link from /dev/urandom to /dev/random (the former doesn't block, but
isn't cryptographically secure). It looks as if this could be related
to the loading of the SecureRandomSeedGenerator class.

I'm running on Ubuntu 9.04, kernel 2.6.28-17-server with the latest patches.

Is anyone else experiencing similar slowness?

Drew


Re: A modest proposal for the Carrot integration

2010-01-16 Thread Dawid Weiss
> I'm not quite done with Colt.

No, no -- you didn't understand me right. Let's work in parallel: I'll
try to polish the edges of HPPC in those places that I know are not
exactly the way I feel they should be, and you finish the Colt
integration -- having an Apache-licensed Colt is a value on its own. I
will provide a cleaner patch, but branching is a good idea since
moving from the Colt collections may require major code sweeps and we
don't want everyone to suffer because of this.

I think I should be done with this "cleaner" HPPC release by
Wednesday, if it's all right.

D.

>
> If you think you can refine a patch to go straight into the mahout
> trunk, don't let me stop you.
>
>
> On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss  wrote:
>> Have you finished with Colt? I think this is still worth completing
>> before we proceed to HPPC. Just talked to Staszek, we will move HPPC
>> code to Carrot2 labs SVN repository (sourceforge) because we want to
>> get rid of PCJ as soon as possible and need something versioned and
>> sticky. I plan to make a few additions to HPPC that I could work on
>> while you're completing the Colt stuff. Hopefully we can also get this
>> ArrayIndexOutOfBounds beast in the mean time.
>>
>> If you're done with Colt, I can commit directly to Mahout's branch and
>> work from there.
>>
>> Dawid
>>
>


Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
I'm not quite done with Colt.

If you think you can refine a patch to go straight into the mahout
trunk, don't let me stop you.


On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss  wrote:
> Have you finished with Colt? I think this is still worth completing
> before we proceed to HPPC. Just talked to Staszek, we will move HPPC
> code to Carrot2 labs SVN repository (sourceforge) because we want to
> get rid of PCJ as soon as possible and need something versioned and
> sticky. I plan to make a few additions to HPPC that I could work on
> while you're completing the Colt stuff. Hopefully we can also get this
> ArrayIndexOutOfBounds beast in the mean time.
>
> If you're done with Colt, I can commit directly to Mahout's branch and
> work from there.
>
> Dawid
>


[jira] Updated: (MAHOUT-242) LLR Collocation Identifier

2010-01-16 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-242:
---

Attachment: MAHOUT-242.patch

Log-likelihood collocation identifier in patch form. This puts itself in 
o.a.m.nlp.collocations.llr.

I think there are some improvements that can be made, but if possible it would 
be nice to review and commit this version, and add on to it later through 
additional patches. More specifically, I'd like to see this:

* include the ability to avoid forming collocations around sentence boundaries 
and other boundaries per: 
http://www.lucidimagination.com/search/document/d259def498803ffe/collocation_clarification#29fbb050cf5fa64
* work for non-whitespace delimited languages, e.g: anything an analyzer can 
produce tokens for.

I removed the ability to read in files from a directory; Robin's document -> 
sequence file work fits into this well.


> LLR Collocation Identifier
> --
>
> Key: MAHOUT-242
> URL: https://issues.apache.org/jira/browse/MAHOUT-242
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.3
>Reporter: Drew Farris
>Priority: Minor
> Attachments: MAHOUT-242.patch, mahout-colloc.tar.gz, 
> mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the 
> LogLikelihoodRatio calculation. 
> As discussed in: 
> * 
> http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * 
> http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as 
> usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" 
> -Dexec.args="--input src/test/resources/article --colloc target/colloc 
> --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-0
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work 
> to get this into patch state and integrate with Robin's document vectorizer 
> work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A modest proposal for the Carrot integration

2010-01-16 Thread Dawid Weiss
Have you finished with Colt? I think this is still worth completing
before we proceed to HPPC. I just talked to Staszek; we will move the HPPC
code to the Carrot2 labs SVN repository (SourceForge) because we want to
get rid of PCJ as soon as possible and need something versioned and
sticky. I plan to make a few additions to HPPC that I could work on
while you're completing the Colt stuff. Hopefully we can also get to this
ArrayIndexOutOfBounds beast in the meantime.

If you're done with Colt, I can commit directly to Mahout's branch and
work from there.

Dawid


Re: Unit test failure

2010-01-16 Thread deneche abdelhakim
Yeah, it's probably due to the way I used to generate random data... the
problem is that I never get this error =P so it's very difficult to
fix... I'll try my best as soon as I have some time. In the meantime,
rerunning 'mvn clean install' generally does the trick.

On Sat, Jan 16, 2010 at 6:58 PM, Grant Ingersoll  wrote:
> try rerunning... I think that one has intermittent failures.  Perhaps Deneche 
> can dig in.  You will likely need to look in the Hadoop logs too.
> On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote:
>
>> https://issues.apache.org/jira/browse/MAHOUT-258
>>
>> The error message:
>>
>> testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)
>> Time elapsed: 6.731 sec  <<< ERROR!
>> java.io.IOException: Job failed!
>>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>
>> does not give me much to go on.
>>
>> I don't see how adding new Set classes to my tree could cause this ...
>
>


[jira] Updated: (MAHOUT-254) Primitive set unit tests

2010-01-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-254:


Fix Version/s: 0.3
   Status: Patch Available  (was: Open)

> Primitive set unit tests
> 
>
> Key: MAHOUT-254
> URL: https://issues.apache.org/jira/browse/MAHOUT-254
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-254.patch
>
>
> The primitive sets need unit tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-254) Primitive set unit tests

2010-01-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-254:


Attachment: MAHOUT-254.patch

> Primitive set unit tests
> 
>
> Key: MAHOUT-254
> URL: https://issues.apache.org/jira/browse/MAHOUT-254
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-254.patch
>
>
> The primitive sets need unit tests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801248#action_12801248
 ] 

Benson Margulies commented on MAHOUT-252:
-

That was the 'Map' patch, which was indeed committed some time before.

> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801247#action_12801247
 ] 

Drew Farris commented on MAHOUT-252:


It was: ./target/classes/org/apache/mahout/math/map/OpenObjectIntHashMap.class, 
which I hadn't expected, but it's a moot point since this is now committed. Thx.

> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
On Sat, Jan 16, 2010 at 1:15 PM, Dawid Weiss  wrote:
>> I propose a branch. Diffs from the branch to the trunk can still be
>> posted on the JIRA, but I think that a branch would be worthwhile in
>> facilitating collaboration.
>
> Do you mean -- for merging with the code I posted earlier?

Yes, To be specific:

1) make a branch
2) in the branch, make a module for HPPC, and check in.
3) in the branch, fiddle the other math code to use HPPC instead of
the colt collections.
4) Stir vigorously until the sort of thing you're reporting is dealt with.
5) Patch across to the trunk.


Re: A modest proposal for the Carrot integration

2010-01-16 Thread Dawid Weiss
> I propose a branch. Diffs from the branch to the trunk can still be
> posted on the JIRA, but I think that a branch would be worthwhile in
> facilitating collaboration.

Do you mean -- for merging with the code I posted earlier?

By the way, I've integrated Colt from Mahout with our code base.
Interesting things started to happen. First, we had this:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at 
org.apache.mahout.math.matrix.doublealgo.Sorting$4.compare(Sorting.java:214)
at org.apache.mahout.math.Sorting.quickSort0(Sorting.java:725)
at org.apache.mahout.math.Sorting.quickSort0(Sorting.java:773)
at org.apache.mahout.math.Sorting.quickSort(Sorting.java:662)
at 
org.apache.mahout.math.matrix.doublealgo.Sorting.runSort(Sorting.java:80)
at 
org.apache.mahout.math.matrix.doublealgo.Sorting.sort(Sorting.java:236)
at 
org.carrot2.matrix.factorization.IterativeMatrixFactorizationBase.order(IterativeMatrixFactorizationBase.java:149)

When we added debugging statements -- the exception was gone. After a
(longer) while, I checked for VM bugs. Yes, that was it -- there was a
bug in the release of Sun's JVM 1.5 that we had on our server (for
running 1.5-compliance builds). We upgraded that release and... we
still have random exceptions with the above stack. More -- we have
them with the newest 1.6 as well... Adding debugging statements makes
the builds pass with flying colors. The bug only happens on one machine
(which does have error-correcting memory and is server-class hardware).

In other words -- I've no idea what is happening.

D.


Re: A modest proposal for the Carrot integration

2010-01-16 Thread Ted Dunning
I try to never say anything that decreases the output of a very productive
person.  I often fail, but I try.

On Sat, Jan 16, 2010 at 10:11 AM, Benson Margulies wrote:

> Sure you could. The 'refine patches attached to JIRA' approach is the
> classic Lucene project methodology, and I'm the new kid on the block
> here.
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Efficient dictionary storage in memory

2010-01-16 Thread Ted Dunning
I would recommend either the hashed representation (which cannot be easily
reversed) or the Lucene version.  No need to go to great lengths to rewrite
this code.

On Sat, Jan 16, 2010 at 8:50 AM, Grant Ingersoll wrote:

> A Lucene index, w/ no storage, positions, etc. (optionally) turned off will
> be very efficient.  Plus, there is virtually no code to write.  I've seen
> bare bones indexes be as little as 20% of the original w/ very fast lookup.
>  Furthermore, there are many options available for controlling how much is
> loaded into memory, etc.  Finally, it will handle all the languages you
> throw at it.
>
> -Grant
>
>


Re: A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
Sure you could. The 'refine patches attached to JIRA' approach is the
classic Lucene project methodology, and I'm the new kid on the block
here.

On Sat, Jan 16, 2010 at 12:50 PM, Ted Dunning  wrote:
> How can we say no?
>
> On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies 
> wrote:
>
>> I volunteer to fight with the maven-release-plugin to make it.
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Unit test failure

2010-01-16 Thread Grant Ingersoll
try rerunning... I think that one has intermittent failures.  Perhaps Deneche 
can dig in.  You will likely need to look in the Hadoop logs too.
On Jan 16, 2010, at 12:49 PM, Benson Margulies wrote:

> https://issues.apache.org/jira/browse/MAHOUT-258
> 
> The error message:
> 
> testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)
> Time elapsed: 6.731 sec  <<< ERROR!
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> 
> does not give me much to go on.
> 
> I don't see how adding new Set classes to my tree could cause this ...



[jira] Updated: (MAHOUT-248) Next collections expansion kit: OpenObjectWhateverHashMap

2010-01-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-248:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed.

> Next collections expansion kit: OpenObjectWhateverHashMap
> 
>
> Key: MAHOUT-248
> URL: https://issues.apache.org/jira/browse/MAHOUT-248
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-248.patch
>
>
> Here's the next slice.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-252:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

OK, now it is committed.


> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-258) Unit test failure in CDInfo example

2010-01-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801236#action_12801236
 ] 

Benson Margulies commented on MAHOUT-258:
-

Manually deleting 'target' and running mvn again worked. Something is wrong 
with 'clean'.


> Unit test failure in CDInfo example
> ---
>
> Key: MAHOUT-258
> URL: https://issues.apache.org/jira/browse/MAHOUT-258
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.3
>Reporter: Benson Margulies
>
> {noformat}
> ---
> Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest
> ---
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.844 sec <<< 
> FAILURE!
> testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)  
> Time elapsed: 6.731 sec  <<< ERROR!
> java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>   at 
> org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool.gatherInfos(CDInfosTool.java:90)
>   at 
> org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:220)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:168)
>   at junit.framework.TestCase.runBare(TestCase.java:134)
>   at junit.framework.TestResult$1.protect(TestResult.java:110)
>   at junit.framework.TestResult.runProtected(TestResult.java:128)
>   at junit.framework.TestResult.run(TestResult.java:113)
>   at junit.framework.TestCase.run(TestCase.java:124)
>   at junit.framework.TestSuite.runTest(TestSuite.java:232)
>   at junit.framework.TestSuite.run(TestSuite.java:227)
>   at 
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
>   at 
> org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
>   at 
> org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
>   at 
> org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165)
>   at org.apache.maven.surefire.Surefire.run(Surefire.java:107)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289)
>   at 
> org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005)
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A modest proposal for the Carrot integration

2010-01-16 Thread Ted Dunning
How can we say no?

On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies wrote:

> I volunteer to fight with the maven-release-plugin to make it.




-- 
Ted Dunning, CTO
DeepDyve


Re: Abbreviations?

2010-01-16 Thread Ted Dunning
+1 as well.

I think it should be in core rather than utils due to dependency issues.

On Sat, Jan 16, 2010 at 7:16 AM, Olivier Grisel wrote:

> 2010/1/16 Grant Ingersoll :
> > I think we should start a new module, that will be the seed for a
> subproject, called NLP and that contains the stuff for NLP.
> >
> > Either that or put them in the utils module, which is where I envision
> all of things that are "helpful" for ML go, but aren't required.
>
> +1 for an explicit "org.apache.mahout.nlp module". Tools to turn
> wikipedia dumps into term freq vectors could also move there instead
> of "examples".
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Abbreviations?

2010-01-16 Thread Ted Dunning
How about src/main/resources/nlp?

On Sat, Jan 16, 2010 at 9:31 AM, Benson Margulies wrote:

> Sure.
>
> However, the immediate contribution is data. src/main/resources? Something
> else?
>
> On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel
>  wrote:
> > 2010/1/16 Grant Ingersoll :
> >> I think we should start a new module, that will be the seed for a
> subproject, called NLP and that contains the stuff for NLP.
> >>
> >> Either that or put them in the utils module, which is where I envision
> all of things that are "helpful" for ML go, but aren't required.
> >
> > +1 for an explicit "org.apache.mahout.nlp module". Tools to turn
> > wikipedia dumps into term freq vectors could also move there instead
> > of "examples".
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://code.oliviergrisel.name
> >
>



-- 
Ted Dunning, CTO
DeepDyve


Unit test failure

2010-01-16 Thread Benson Margulies
https://issues.apache.org/jira/browse/MAHOUT-258

The error message:

testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)
 Time elapsed: 6.731 sec  <<< ERROR!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)

does not give me much to go on.

I don't see how adding new Set classes to my tree could cause this ...


[jira] Created: (MAHOUT-258) Unit test failure in CDInfo example

2010-01-16 Thread Benson Margulies (JIRA)
Unit test failure in CDInfo example
---

 Key: MAHOUT-258
 URL: https://issues.apache.org/jira/browse/MAHOUT-258
 Project: Mahout
  Issue Type: Bug
Affects Versions: 0.3
Reporter: Benson Margulies


{noformat}
---
Test set: org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.844 sec <<< 
FAILURE!
testGatherInfos(org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest)  Time 
elapsed: 6.731 sec  <<< ERROR!
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at 
org.apache.mahout.ga.watchmaker.cd.tool.CDInfosTool.gatherInfos(CDInfosTool.java:90)
at 
org.apache.mahout.ga.watchmaker.cd.tool.CDInfosToolTest.testGatherInfos(CDInfosToolTest.java:220)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
at 
org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:62)
at 
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:140)
at 
org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:165)
at org.apache.maven.surefire.Surefire.run(Surefire.java:107)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:289)
at 
org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1005)
{noformat}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-257) Get rid of GenericSorting.java

2010-01-16 Thread Benson Margulies (JIRA)
Get rid of GenericSorting.java
--

 Key: MAHOUT-257
 URL: https://issues.apache.org/jira/browse/MAHOUT-257
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies


GenericSorting.java has one function left in it. Let's move that to 
Sorting.java and delete the class.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-256) Clean up raw type usage

2010-01-16 Thread Benson Margulies (JIRA)
Clean up raw type usage
---

 Key: MAHOUT-256
 URL: https://issues.apache.org/jira/browse/MAHOUT-256
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies


Turning the Object-related Colt maps into Generics has left a number of other 
classes referencing raw types (e.g. matrices). These need to be made generic 
and cleaned up.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-255) Open hash set and map that plug into java.util

2010-01-16 Thread Benson Margulies (JIRA)
Open hash set and map that plug into java.util
--

 Key: MAHOUT-255
 URL: https://issues.apache.org/jira/browse/MAHOUT-255
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies


Aside from the primitive type issues, the usual java.util.HashMap/Set classes 
suffer from horrible storage inefficiency.

The Colt code can be adapted to add OpenHashSet and OpenHashMap that 
use open hashing and implement the full Collections interfaces.
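
As a rough illustration of what "open hashing" buys here (illustrative only; not the Colt-derived code this issue proposes), a toy int set using linear probing over a flat array, with no per-entry node objects:

import java.util.Arrays;

public class SimpleOpenIntHashSet {

  private static final int FREE = Integer.MIN_VALUE; // sentinel; cannot itself be stored in this toy
  private int[] table;
  private int size;

  public SimpleOpenIntHashSet(int initialCapacity) {
    table = new int[Math.max(4, initialCapacity)];
    Arrays.fill(table, FREE);
  }

  public boolean add(int value) {
    if (2 * size >= table.length) {
      rehash(2 * table.length); // keep the load factor under 0.5
    }
    int i = slotFor(value, table);
    if (table[i] == value) {
      return false; // already present
    }
    table[i] = value;
    size++;
    return true;
  }

  public boolean contains(int value) {
    return table[slotFor(value, table)] == value;
  }

  // linear probing: walk forward from the hash slot until we find the value or a free slot
  private static int slotFor(int value, int[] t) {
    int i = (value & 0x7fffffff) % t.length;
    while (t[i] != FREE && t[i] != value) {
      i = (i + 1) % t.length;
    }
    return i;
  }

  private void rehash(int newCapacity) {
    int[] old = table;
    table = new int[newCapacity];
    Arrays.fill(table, FREE);
    for (int v : old) {
      if (v != FREE) {
        table[slotFor(v, table)] = v;
      }
    }
  }
}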


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-254) Primitive set unit tests

2010-01-16 Thread Benson Margulies (JIRA)
Primitive set unit tests


 Key: MAHOUT-254
 URL: https://issues.apache.org/jira/browse/MAHOUT-254
 Project: Mahout
  Issue Type: New Feature
  Components: Math
Affects Versions: 0.3
Reporter: Benson Margulies
Assignee: Benson Margulies


The primitive sets need unit tests.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



A modest proposal for the Carrot integration

2010-01-16 Thread Benson Margulies
I propose a branch. Diffs from the branch to the trunk can still be
posted on the JIRA, but I think that a branch would be worthwhile in
facilitating collaboration.

I volunteer to fight with the maven-release-plugin to make it.


Re: Abbreviations?

2010-01-16 Thread Benson Margulies
Sure.

However, the immediate contribution is data. src/main/resources? Something else?

On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel
 wrote:
> 2010/1/16 Grant Ingersoll :
>> I think we should start a new module, that will be the seed for a 
>> subproject, called NLP and that contains the stuff for NLP.
>>
>> Either that or put them in the utils module, which is where I envision all 
>> of things that are "helpful" for ML go, but aren't required.
>
> +1 for an explicit "org.apache.mahout.nlp module". Tools to turn
> wikipedia dumps into term freq vectors could also move there instead
> of "examples".
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>


[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801233#action_12801233
 ] 

Benson Margulies commented on MAHOUT-252:
-

I hope not. What are you seeing?

{noformat}
A   math/src/test/java-templates/org/apache/mahout/math/set
M   math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeValueTypeHashMapTest.java.t
M   math/src/test/java-templates/org/apache/mahout/math/map/OpenKeyTypeObjectHashMapTest.java.t
M   math/src/test/java-templates/org/apache/mahout/math/map/OpenObjectValueTypeHashMapTest.java.t
!   math/src/main/ObjectValueTypeProcedure.java.t
M   math/src/main/java/org/apache/mahout/math/matrix/impl/SelectedSparseObjectMatrix1D.java
D   math/src/main/java/org/apache/mahout/math/map/AbstractMap.java
A   math/src/main/java/org/apache/mahout/math/set
A   math/src/main/java/org/apache/mahout/math/set/AbstractSet.java
A   math/src/main/java-templates/org/apache/mahout/math/set
A   math/src/main/java-templates/org/apache/mahout/math/set/AbstractKeyTypeSet.java.t
A   math/src/main/java-templates/org/apache/mahout/math/set/OpenKeyTypeHashSet.java.t
M   math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeObjectMap.java.t
M   math/src/main/java-templates/org/apache/mahout/math/map/AbstractObjectValueTypeMap.java.t
M   math/src/main/java-templates/org/apache/mahout/math/map/AbstractKeyTypeValueTypeMap.java.t
{noformat}


> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Efficient dictionary storage in memory

2010-01-16 Thread Grant Ingersoll
A Lucene index, with no storage and with positions, etc. (optionally) turned off, 
will be very efficient. Plus, there is virtually no code to write. I've seen 
bare-bones indexes come in at as little as 20% of the original size, with very 
fast lookup. Furthermore, there are many options available for controlling how 
much is loaded into memory, etc. Finally, it will handle all the languages you 
throw at it.

-Grant

On Jan 16, 2010, at 9:10 AM, Robin Anil wrote:

> Currently java strings use double the space of the characters in it because
> its all in utf-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap.  Is there some optimization we could
> do in terms of storing them and ensuring that chinese, devanagiri and other
> characters dont get messed up in the process.
> 
> Some options benson suggested was: storing just the byte[] form and adding
> the the option of supplying the hash function in OpenObjectIntHashmap or
> even using a UTF-8 string.
> 
> Or we could leave this alone. I currently estimate the memory requirement
> using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings when
> generating the dictionary split for the vectorizer
> 
> Robin



Re: Efficient dictionary storage in memory

2010-01-16 Thread Olivier Grisel
2010/1/16 Drew Farris :
> I agree the overhead of byte[] -> UTF-8 probably isn't too good for
> lookup performance.
>
> In line with Sean's suggestion, I've used tries in the past for doing
> this sort of string -> integer mapping. They generally perform well
> enough for adding entries as well as retrieval. Not nearly as
> efficient as a hash, but there is usually enough of a memory savings
> to make it worth it. They have the added benefit of making it easy to
> do prefix searches, although that isn't a strict requirement here.
>
> As Oliver suggests, a bloom filter may be an option, but wouldn't a
> secondary data structure be required to hold the actual values? Would
> false positives really be an issue with a dictionary scale problem?
>
> I presume there's a need for compact integer -> string representation
> which can be achieved by using string difference compression. Seeking
> to a mod of the id and then building up the final string by scanning
> forward through the list of incremental changes. iirc, lucene does
> something like this.

AFAIK we only use a dictionary for term value (string representation)
to term index (or, more generally, feature index) mapping. But then the
value is no longer needed for training and testing the models. Only
Vectors of feature values (term counts, frequencies, TF-IDF) are
needed to classify / cluster a document or train a model. Hence the
use of a hashed representation, where the dictionary from term
representations to feature indexes is only implicitly represented by a
hash function, up to some adjustable hash collision rate. In practice
the collisions do not hurt convergence of models such as linear SVMs,
a.k.a. large-margin perceptrons (or regularized logistic regression, and
probably naive Bayes classifiers too), thanks to the redundant
nature of dataset features in NLP tasks (see the papers cited by John
Langford on the webpage referenced earlier).
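
To make that concrete, here is a toy sketch of the hashed representation (illustration only, not Mahout code; the class name and the dense double[] are my own simplifications):

// Toy sketch of the hashed representation -- not Mahout code. Terms are mapped
// straight to vector indices by a hash function, so no term -> index dictionary
// is kept at all; colliding terms simply share a slot.
public final class HashingTermVectorizer {

  private final int dimension; // e.g. 1 << 20; more dimensions => fewer collisions

  public HashingTermVectorizer(int dimension) {
    this.dimension = dimension;
  }

  public double[] termCounts(Iterable<String> tokens) {
    double[] vector = new double[dimension]; // would be a sparse Vector in practice
    for (String token : tokens) {
      int index = (token.hashCode() & 0x7fffffff) % dimension;
      vector[index] += 1.0;
    }
    return vector;
  }
}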

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


[jira] Commented: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Drew Farris (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801204#action_12801204
 ] 

Drew Farris commented on MAHOUT-252:


Is this committed? It seems like there are classes related to this in 
mahout-math now.

> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Efficient dictionary storage in memory

2010-01-16 Thread Drew Farris
I agree the overhead of byte[] -> UTF-8 probably isn't too good for
lookup performance.

In line with Sean's suggestion, I've used tries in the past for doing
this sort of string -> integer mapping. They generally perform well
enough for adding entries as well as retrieval. Not nearly as
efficient as a hash, but there is usually enough of a memory savings
to make it worth it. They have the added benefit of making it easy to
do prefix searches, although that isn't a strict requirement here.

As Olivier suggests, a Bloom filter may be an option, but wouldn't a
secondary data structure be required to hold the actual values? Would
false positives really be an issue at dictionary scale?

I presume there's a need for a compact integer -> string representation,
which can be achieved by using string difference compression: seeking
to a mod of the id and then building up the final string by scanning
forward through the list of incremental changes. IIRC, Lucene does
something like this.
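
As an illustration of that last idea (a hypothetical layout, not Lucene's actual term dictionary format): store a full term every N entries and only a (shared-prefix length, suffix) pair for the rest, then rebuild a term by scanning forward from the nearest full entry.

// Rough sketch of prefix/delta compression for an id -> term lookup.
// Hypothetical layout, not Lucene's actual format: every INTERVAL-th entry
// stores the full term, the others store only the length of the prefix shared
// with the previous term plus the remaining suffix. Terms must be passed in
// id order. In a real implementation the suffixes would live in one shared
// byte[] rather than individual String objects; this only shows the access pattern.
import java.util.List;

public final class DeltaCompressedTermList {

  private static final int INTERVAL = 128;

  private final int[] prefixLengths;
  private final String[] suffixes;

  public DeltaCompressedTermList(List<String> termsInIdOrder) {
    prefixLengths = new int[termsInIdOrder.size()];
    suffixes = new String[termsInIdOrder.size()];
    String previous = "";
    for (int i = 0; i < termsInIdOrder.size(); i++) {
      String term = termsInIdOrder.get(i);
      int shared = (i % INTERVAL == 0) ? 0 : sharedPrefix(previous, term);
      prefixLengths[i] = shared;
      suffixes[i] = term.substring(shared);
      previous = term;
    }
  }

  /** Rebuilds the term for an id by scanning forward from the nearest full entry. */
  public String term(int id) {
    int start = (id / INTERVAL) * INTERVAL;
    String current = suffixes[start]; // full term at the interval boundary
    for (int i = start + 1; i <= id; i++) {
      current = current.substring(0, prefixLengths[i]) + suffixes[i];
    }
    return current;
  }

  private static int sharedPrefix(String a, String b) {
    int max = Math.min(a.length(), b.length());
    int i = 0;
    while (i < max && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }
}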

Drew

On Sat, Jan 16, 2010 at 9:15 AM, Sean Owen  wrote:
> I'm speaking only off the top of my head, but my hunch it's not worth
> optimizing this. Yes, the alternative is to store the string's UTF-8
> encoding as a byte[]. That's going to incur overhead in translating
> back and forth to String where needed, and my guess is that's going to
> be big enough to make this not worthwhile.
>
> The only other idea I have is a trie, which is typically a great data
> structure for dictionaries like this.
>
> Sean
>
>
> On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil  wrote:
>> Currently java strings use double the space of the characters in it because
>> its all in utf-16. A 190MB dictionary file therefore uses around 600MB when
>> loaded into a HashMap.  Is there some optimization we could
>> do in terms of storing them and ensuring that chinese, devanagiri and other
>> characters dont get messed up in the process.
>>
>> Some options benson suggested was: storing just the byte[] form and adding
>> the the option of supplying the hash function in OpenObjectIntHashmap or
>> even using a UTF-8 string.
>>
>> Or we could leave this alone. I currently estimate the memory requirement
>> using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings when
>> generating the dictionary split for the vectorizer
>>
>> Robin
>>
>


Re: Abbreviations?

2010-01-16 Thread Olivier Grisel
2010/1/16 Grant Ingersoll :
> I think we should start a new module, that will be the seed for a subproject, 
> called NLP and that contains the stuff for NLP.
>
> Either that or put them in the utils module, which is where I envision all of 
> things that are "helpful" for ML go, but aren't required.

+1 for an explicit "org.apache.mahout.nlp module". Tools to turn
wikipedia dumps into term freq vectors could also move there instead
of "examples".

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


[jira] Updated: (MAHOUT-185) Add mahout shell script for easy launching of various algorithms

2010-01-16 Thread Drew Farris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Farris updated MAHOUT-185:
---

Attachment: MAHOUT-185.patch

This patch adds bin/mahout, a simple bash script based heavily on similar 
scripts found in Hadoop and Nutch. It doesn't follow Robin's original spec to the 
letter, but perhaps it's a reasonable start upon which we can build. 

I really put this together because I'm tired of typing 'mvn exec:java -D [...]' 
all the time. 




> Add mahout shell script for easy launching of various algorithms
> 
>
> Key: MAHOUT-185
> URL: https://issues.apache.org/jira/browse/MAHOUT-185
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.2
> Environment: linux, bash
>Reporter: Robin Anil
> Fix For: 0.3
>
> Attachments: MAHOUT-185.patch
>
>
> Currently, each algorithm has a different point of entry. As it is, it's too 
> complicated to understand and launch each one.  A mahout shell script needs 
> to be made in the bin directory which does something like the following:
> mahout classify -algorithm bayes [OPTIONS]
> mahout cluster -algorithm canopy  [OPTIONS]
> mahout fpm -algorithm pfpgrowth [OPTIONS]
> mahout taste -algorithm slopeone [OPTIONS] 
> mahout misc -algorithm createVectorsFromText [OPTIONS]
> mahout examples WikipediaExample

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Abbreviations?

2010-01-16 Thread Grant Ingersoll
I think we should start a new module, called NLP, that will be the seed for a 
subproject and that contains the NLP-related stuff.  

Either that, or put them in the utils module, which is where I envision everything 
that is "helpful" for ML, but not required, going.

On Jan 16, 2010, at 8:41 AM, Benson Margulies wrote:

> I have approval from the CEO to contribute our collection of
> abbreviations to Mahout.
> 
> We use them with the ICU breakers.
> 
> I guess IP clearance is called for here, but, thinking ahead, where
> would people like to see files of abbreviations in various languages
> show up?



Re: Efficient dictionary storage in memory

2010-01-16 Thread Olivier Grisel
2010/1/16 Sean Owen :
> 351MB isn't so bad.
>
> I do think the next-best idea to explore is a trie, which could use a
> char->Object map data structure provided by our new collections
> module? To the extent this data is more compact when encoded in UTF-8,
> it will be *much* more compact encoded in a trie.

A more radical way to solve this dictionary memory issue would be to
use a hashed representation of the term counts:
http://hunch.net/~jl/projects/hash_reps/index.html or maybe a less
radical, yet more complicated to implement, approach such as Counting
Filters (a variant of Bloom filters:
http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters ).

Maybe it would be best implemented by extracting the public API of
DictionaryVectorizer into an interface, TermVectorizer or just Vectorizer,
and providing alternative implementations such as HashingVectorizer
and CountingFiltersVectorizer (though I haven't checked yet whether they
are iso-functional, even setting aside the collision / false negative
probabilities).
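
To sketch the counting-filter option (a bare-bones illustration, not iso-functional with DictionaryVectorizer and not proposed code): term counts live in a fixed array of counters indexed by k hashes, and a lookup returns the minimum of the k counters, which can over-estimate but never under-estimate.

// Bare-bones counting-filter sketch for approximate term counts -- not Mahout
// code. The counter array size and the number of hashes control the collision
// (over-count) probability; lookups can over-estimate, never under-estimate.
import java.nio.charset.Charset;

public final class CountingFilterTermCounts {

  private static final Charset UTF8 = Charset.forName("UTF-8");

  private final int[] counters;
  private final int numHashes;

  public CountingFilterTermCounts(int numCounters, int numHashes) {
    this.counters = new int[numCounters];
    this.numHashes = numHashes;
  }

  public void increment(String term) {
    for (int i = 0; i < numHashes; i++) {
      counters[index(term, i)]++;
    }
  }

  public int count(String term) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < numHashes; i++) {
      min = Math.min(min, counters[index(term, i)]);
    }
    return min;
  }

  // Simple seeded FNV-style hash over the UTF-8 bytes; a real implementation
  // would use something stronger such as MurmurHash.
  private int index(String term, int seed) {
    byte[] bytes = term.getBytes(UTF8);
    int h = 0x811c9dc5 ^ seed;
    for (byte b : bytes) {
      h = (h ^ b) * 0x01000193;
    }
    return (h & 0x7fffffff) % counters.length;
  }
}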

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name


Re: Efficient dictionary storage in memory

2010-01-16 Thread Sean Owen
351MB isn't so bad.

I do think the next-best idea to explore is a trie, which could use a
char->Object map data structure provided by our new collections
module? To the extent this data is more compact when encoded in UTF-8,
it will be *much* more compact encoded in a trie.
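
A minimal sketch of that (illustration only, not proposed Mahout code): one node per character, with the term id stored on the node where a term ends. The char -> child map per node is exactly where a primitive-keyed open hash map would replace java.util.HashMap.

// Minimal String -> int trie sketch (illustration only). Each node keeps a
// char -> child map; a primitive char-keyed open hash map from the new
// collections work would replace java.util.HashMap<Character, Node> here.
import java.util.HashMap;
import java.util.Map;

public final class TermIdTrie {

  private static final class Node {
    final Map<Character, Node> children = new HashMap<Character, Node>();
    int termId = -1; // -1 means no term ends at this node
  }

  private final Node root = new Node();

  public void put(String term, int termId) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      Node child = node.children.get(c);
      if (child == null) {
        child = new Node();
        node.children.put(c, child);
      }
      node = child;
    }
    node.termId = termId;
  }

  /** Returns the term's id, or -1 if the term is not in the dictionary. */
  public int get(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return -1;
      }
    }
    return node.termId;
  }
}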

Sean

On Sat, Jan 16, 2010 at 2:29 PM, Robin Anil  wrote:
> In this specific scenario. Ability to handle bigger dictionary per node
> where the dictionary is load once is a big win for the dictionary
> vectorizer. This in turn reduces the number of partial vector generation
> passes.
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
If there is an option of storing keys in compressed form in memory, I am all
for exploring that


On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil  wrote:

> In this specific scenario. Ability to handle bigger dictionary per node
> where the dictionary is load once is a big win for the dictionary
> vectorizer. This in turn reduces the number of partial vector generation
> passes.
>
>
> I ran the whole wikipedia. I got an 880MB dictionary. I pruned words which
> occur only once in the entire set and i got a 351MB dictionary file. I had
> to split it on c1.medium(2 core 1.7GB ec2 instance) at about 180-190 mb each
> so that it could be loaded in to the memory. This added another 1-2 hours to
> the whole job.
>
> Currently the stats are as follows
>
> 20 GB of wikipedia data in sequence files(uncompressed)
> Counting Job took 1:20 mins
> 2 partial vector generation each took 2 hours each
> vector merging took about 40 mins more.
> finally generated a gzip compressed vectors file of 3.50GB(which i think is
> too large)
>
> Total 6 hours to run. I could easily cut down the 2 pass into one pass had
> I was able to fit the whole dictionary in memory
>
> Robin
>
>
>
> On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies 
> wrote:
>
>> While I egged Robin on to some extent on this topic by IM, I should
>> point out the following.
>>
>> We run large amounts of text through Java at Basis, and we always use
>> String. I have an 8G laptop :-), but there you have it. Anything we do
>> in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi
>> (UTF-8>UTF-16) so it doesn't make sense for us to optimize this.
>> Obviously, compression is an option in various ways, and we could
>> imagine some magic containers that optimized string storage in one way
>> or the other.
>>
>> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil  wrote:
>> > Currently java strings use double the space of the characters in it
>> because
>> > its all in utf-16. A 190MB dictionary file therefore uses around 600MB
>> when
>> > loaded into a HashMap.  Is there some optimization we
>> could
>> > do in terms of storing them and ensuring that chinese, devanagiri and
>> other
>> > characters dont get messed up in the process.
>> >
>> > Some options benson suggested was: storing just the byte[] form and
>> adding
>> > the the option of supplying the hash function in OpenObjectIntHashmap or
>> > even using a UTF-8 string.
>> >
>> > Or we could leave this alone. I currently estimate the memory
>> requirement
>> > using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings
>> when
>> > generating the dictionary split for the vectorizer
>> >
>> > Robin
>> >
>>
>
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
In this specific scenario, the ability to handle a bigger dictionary per node,
where the dictionary is loaded once, is a big win for the dictionary
vectorizer. This in turn reduces the number of partial vector generation
passes.


I ran the whole Wikipedia dump. I got an 880MB dictionary. I pruned words which
occur only once in the entire set and I got a 351MB dictionary file. I had
to split it on c1.medium (2-core, 1.7GB EC2 instance) at about 180-190 MB each
so that it could be loaded into memory. This added another 1-2 hours to
the whole job.

Currently the stats are as follows:

20 GB of Wikipedia data in sequence files (uncompressed)
Counting job took 1:20 mins
2 partial vector generation passes, each taking 2 hours
vector merging took about 40 mins more
finally generated a gzip-compressed vectors file of 3.50GB (which I think is
too large)

Total: 6 hours to run. I could easily cut the 2 passes down to one pass had
I been able to fit the whole dictionary in memory

Robin



On Sat, Jan 16, 2010 at 7:45 PM, Benson Margulies wrote:

> While I egged Robin on to some extent on this topic by IM, I should
> point out the following.
>
> We run large amounts of text through Java at Basis, and we always use
> String. I have an 8G laptop :-), but there you have it. Anything we do
> in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi
> (UTF-8>UTF-16) so it doesn't make sense for us to optimize this.
> Obviously, compression is an option in various ways, and we could
> imagine some magic containers that optimized string storage in one way
> or the other.
>
> On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil  wrote:
> > Currently java strings use double the space of the characters in it
> because
> > its all in utf-16. A 190MB dictionary file therefore uses around 600MB
> when
> > loaded into a HashMap.  Is there some optimization we
> could
> > do in terms of storing them and ensuring that chinese, devanagiri and
> other
> > characters dont get messed up in the process.
> >
> > Some options benson suggested was: storing just the byte[] form and
> adding
> > the the option of supplying the hash function in OpenObjectIntHashmap or
> > even using a UTF-8 string.
> >
> > Or we could leave this alone. I currently estimate the memory requirement
> > using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings when
> > generating the dictionary split for the vectorizer
> >
> > Robin
> >
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Benson Margulies
While I egged Robin on to some extent on this topic by IM, I should
point out the following.

We run large amounts of text through Java at Basis, and we always use
String. I have an 8G laptop :-), but there you have it. Anything we do
in English we do shortly afterwards in Arabic (UTF-8=UTF-16) and Hanzi
(UTF-8>UTF-16) so it doesn't make sense for us to optimize this.
Obviously, compression is an option in various ways, and we could
imagine some magic containers that optimized string storage in one way
or the other.

On Sat, Jan 16, 2010 at 9:10 AM, Robin Anil  wrote:
> Currently java strings use double the space of the characters in it because
> its all in utf-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap.  Is there some optimization we could
> do in terms of storing them and ensuring that chinese, devanagiri and other
> characters dont get messed up in the process.
>
> Some options benson suggested was: storing just the byte[] form and adding
> the the option of supplying the hash function in OpenObjectIntHashmap or
> even using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory requirement
> using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings when
> generating the dictionary split for the vectorizer
>
> Robin
>


Re: Efficient dictionary storage in memory

2010-01-16 Thread Sean Owen
I'm speaking only off the top of my head, but my hunch is that it's not worth
optimizing this. Yes, the alternative is to store the string's UTF-8
encoding as a byte[]. That's going to incur overhead in translating
back and forth to String where needed, and my guess is that overhead is
going to be big enough to make this not worthwhile.

The only other idea I have is a trie, which is typically a great data
structure for dictionaries like this.

Sean


On Sat, Jan 16, 2010 at 2:10 PM, Robin Anil  wrote:
> Currently java strings use double the space of the characters in it because
> its all in utf-16. A 190MB dictionary file therefore uses around 600MB when
> loaded into a HashMap.  Is there some optimization we could
> do in terms of storing them and ensuring that chinese, devanagiri and other
> characters dont get messed up in the process.
>
> Some options benson suggested was: storing just the byte[] form and adding
> the the option of supplying the hash function in OpenObjectIntHashmap or
> even using a UTF-8 string.
>
> Or we could leave this alone. I currently estimate the memory requirement
> using the formula 8 *  ( (int) ( num_chars *2  + 45)/8 ) for strings when
> generating the dictionary split for the vectorizer
>
> Robin
>


Efficient dictionary storage in memory

2010-01-16 Thread Robin Anil
Currently Java strings use double the space of the characters in them because
it's all UTF-16. A 190MB dictionary file therefore uses around 600MB when
loaded into a HashMap.  Is there some optimization we could
do in terms of storing them while ensuring that Chinese, Devanagari and other
characters don't get messed up in the process?

Some options Benson suggested were: storing just the byte[] form and adding
the option of supplying the hash function in OpenObjectIntHashmap, or
even using a UTF-8 string.

Or we could leave this alone. I currently estimate the memory requirement
using the formula 8 * ( (int) ( num_chars * 2 + 45) / 8 ) for strings when
generating the dictionary split for the vectorizer
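
For reference, a small sketch of that estimate next to the UTF-8 byte length of the same term (hypothetical helper class; I'm assuming the 45 in the formula stands for per-String object overhead):

// Sketch of the estimate quoted above next to the UTF-8 size of the same term.
// Hypothetical helper; the 45 is assumed to be per-String object overhead.
import java.nio.charset.Charset;

public final class DictionaryMemoryEstimate {

  private static final Charset UTF8 = Charset.forName("UTF-8");

  /** 8 * ( (int) ( num_chars * 2 + 45) / 8 ), i.e. the formula quoted above. */
  public static long estimatedStringBytes(int numChars) {
    return 8L * ((numChars * 2 + 45) / 8);
  }

  /** UTF-8 byte length of the same term, for comparison. */
  public static int utf8Bytes(String term) {
    return term.getBytes(UTF8).length;
  }

  public static void main(String[] args) {
    String ascii = "dictionary";                    // 10 chars, 10 UTF-8 bytes
    String devanagari = "\u0936\u092c\u094d\u0926"; // 4 chars, 12 UTF-8 bytes
    System.out.println(estimatedStringBytes(ascii.length()) + " vs " + utf8Bytes(ascii));
    System.out.println(estimatedStringBytes(devanagari.length()) + " vs " + utf8Bytes(devanagari));
  }
}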

Robin


[jira] Updated: (MAHOUT-253) Proposal for high performance primitive collections.

2010-01-16 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated MAHOUT-253:
---

Attachment: hppc-1.0-dev.zip

> Proposal for high performance primitive collections.
> 
>
> Key: MAHOUT-253
> URL: https://issues.apache.org/jira/browse/MAHOUT-253
> Project: Mahout
>  Issue Type: New Feature
>  Components: Utils
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: hppc-1.0-dev.zip
>
>
> A proposal for template-driven collections library (lists, sets, maps, 
> deques), with specializations for Java primitive types to save memory and 
> increase performance. The "templates" are regular Java classes written with 
> generics and certain "intrinsics", that is blocks replaceable by a 
> regexp-preprocessor. This lets one write the code once, immediately test it 
> (tests are also templates) and generate primitive versions from a single 
> source.
> An additional interesting part is the benchmarking subsystem written on top 
> of JUnit ;)
> There are major differences from the Java Collections API, most notably no 
> interfaces and interface-compatible views over sub-collections or key/value 
> sets. These classes also expose their internal implementation (buffers, 
> addressing, etc.) so that the code can be optimized for a particular use case.
> These motivations are further discussed here, together with an API overview.
> http://www.carrot-search.com/download/hppc/index.html
> I am curious what you think about it. If folks like it, Carrot Search will 
> donate the code to Mahout (or Apache Commons-?) and will maintain it (because 
> we plan to use it in our internal projects anyway).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAHOUT-253) Proposal for high performance primitive collections.

2010-01-16 Thread Dawid Weiss (JIRA)
Proposal for high performance primitive collections.


 Key: MAHOUT-253
 URL: https://issues.apache.org/jira/browse/MAHOUT-253
 Project: Mahout
  Issue Type: New Feature
  Components: Utils
Reporter: Dawid Weiss
Assignee: Dawid Weiss
Priority: Minor


A proposal for a template-driven collections library (lists, sets, maps, deques), 
with specializations for Java primitive types to save memory and increase 
performance. The "templates" are regular Java classes written with generics and 
certain "intrinsics", that is, blocks replaceable by a regexp preprocessor. This 
lets one write the code once, immediately test it (tests are also templates), 
and generate primitive versions from a single source.

An additional interesting part is the benchmarking subsystem written on top of 
JUnit ;)

There are major differences from the Java Collections API, most notably no 
interfaces and interface-compatible views over sub-collections or key/value 
sets. These classes also expose their internal implementation (buffers, 
addressing, etc.) so that the code can be optimized for a particular use case.
These motivations are further discussed here, together with an API overview.

http://www.carrot-search.com/download/hppc/index.html

I am curious what you think about it. If folks like it, Carrot Search will 
donate the code to Mahout (or Apache Commons-?) and will maintain it (because 
we plan to use it in our internal projects anyway).
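
Purely to illustrate the single-source idea (made-up template text and substitution rules, not HPPC's actual intrinsics syntax or tooling):

{noformat}
// Toy illustration of the regexp-preprocessor idea -- NOT HPPC's actual
// template syntax: a generic "KType" source is rewritten into a primitive
// specialization by plain text substitution.
public final class TemplatePreprocessorDemo {

  public static void main(String[] args) {
    String template =
        "public class KTypeArrayList {\n"
      + "  private KType[] buffer;\n"
      + "  public void add(KType e) { /* ... */ }\n"
      + "}\n";

    // Generate the int specialization: KTypeArrayList -> IntArrayList, KType -> int.
    String intVersion = template
        .replaceAll("KTypeArrayList", "IntArrayList")
        .replaceAll("\\bKType\\b", "int");

    System.out.println(intVersion);
  }
}
{noformat}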



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-252) Sets (primitive types)

2010-01-16 Thread Benson Margulies (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benson Margulies updated MAHOUT-252:


Summary: Sets (primitive types)  (was: Sets (primitive to primitive))

> Sets (primitive types)
> --
>
> Key: MAHOUT-252
> URL: https://issues.apache.org/jira/browse/MAHOUT-252
> Project: Mahout
>  Issue Type: New Feature
>  Components: Math
>Affects Versions: 0.3
>Reporter: Benson Margulies
>Assignee: Benson Margulies
> Fix For: 0.3
>
> Attachments: MAHOUT-252.patch
>
>
> Here come the sets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Abbreviations?

2010-01-16 Thread Benson Margulies
I have approval from the CEO to contribute our collection of
abbreviations to Mahout.

We use them with the ICU breakers.

I guess IP clearance is called for here, but, thinking ahead, where
would people like to see files of abbreviations in various languages
show up?