I have approval from the CEO to contribute our collection of
abbreviations to Mahout.
We use them with the ICU breakers.
I guess IP clearance is called for here, but, thinking ahead, where
would people like to see files of abbreviations in various languages
show up?
[
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-252:
Summary: Sets (primitive types) (was: Sets (primitive to primitive))
Sets (primitive
Proposal for high performance primitive collections.
Key: MAHOUT-253
URL: https://issues.apache.org/jira/browse/MAHOUT-253
Project: Mahout
Issue Type: New Feature
Components:
[
https://issues.apache.org/jira/browse/MAHOUT-253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dawid Weiss updated MAHOUT-253:
---
Attachment: hppc-1.0-dev.zip
Proposal for high performance primitive collections.
Currently Java strings use double the space of the characters in them because
it's all in UTF-16. A 190MB dictionary file therefore uses around 600MB when
loaded into a HashMap<String, Integer>. Is there some optimization we could
do in terms of storing them and ensuring that Chinese, Devanagari and
I'm speaking only off the top of my head, but my hunch is it's not worth
optimizing this. Yes, the alternative is to store the string's UTF-8
encoding as a byte[]. That's going to incur overhead in translating
back and forth to String where needed, and my guess is that's going to
be big enough to
While I egged Robin on to some extent on this topic by IM, I should
point out the following.
We run large amounts of text through Java at Basis, and we always use
String. I have an 8G laptop :-), but there you have it. Anything we do
in English we do shortly afterwards in Arabic (UTF-8=UTF-16)
In this specific scenario, the ability to handle a bigger dictionary per node,
where the dictionary is loaded once, is a big win for the dictionary
vectorizer. This in turn reduces the number of partial vector generation
passes.
I ran the whole wikipedia. I got an 880MB dictionary. I pruned words which
If there is an option of storing keys in compressed form in memory, I am all
for exploring that
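To make the trade-off concrete, here is a minimal sketch of a map whose keys are held as UTF-8 byte arrays rather than String objects. The class name Utf8Key is hypothetical and is not part of Mahout. It also shows why the saving depends on the script: an ASCII term shrinks by half, while an Arabic term takes the same number of bytes in UTF-8 as in UTF-16.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Utf8KeyDemo {

  /** Hypothetical key type (not a Mahout class): a term stored as UTF-8 bytes. */
  static final class Utf8Key {
    private final byte[] bytes;
    private final int hash; // cached, because byte[] itself hashes by identity

    Utf8Key(String term) {
      this.bytes = term.getBytes(StandardCharsets.UTF_8);
      this.hash = Arrays.hashCode(bytes);
    }

    int byteLength() {
      return bytes.length;
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof Utf8Key && Arrays.equals(bytes, ((Utf8Key) other).bytes);
    }

    @Override
    public int hashCode() {
      return hash;
    }

    @Override
    public String toString() {
      return new String(bytes, StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) {
    String latin = "mahout";
    String arabic = "\u0643\u062A\u0627\u0628"; // a four-letter Arabic word

    Map<Utf8Key, Integer> dictionary = new HashMap<>();
    dictionary.put(new Utf8Key(latin), 0);
    dictionary.put(new Utf8Key(arabic), 1);

    for (String term : new String[] {latin, arabic}) {
      // Every lookup re-encodes the query term; this is the translation overhead.
      Utf8Key key = new Utf8Key(term);
      System.out.println(key + ": id=" + dictionary.get(key)
          + ", UTF-8 bytes=" + key.byteLength()
          + ", UTF-16 bytes=" + term.length() * 2);
    }
  }
}
```

This only counts character payload. Per-object headers and the HashMap entries themselves are not reduced by this change, so the real saving would have to be measured.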
On Sat, Jan 16, 2010 at 7:59 PM, Robin Anil robin.a...@gmail.com wrote:
In this specific scenario, the ability to handle a bigger dictionary per node,
where the dictionary is loaded once, is a big win for
351MB isn't so bad.
I do think the next-best idea to explore is a trie, which could use a
char->Object map data structure provided by our new collections
module? To the extent this data is more compact when encoded in UTF-8,
it will be *much* more compact encoded in a trie.
Sean
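As an illustration of the trie idea, here is a minimal sketch of a term-to-id trie. It uses java.util.HashMap as a stand-in for the char->Object map from the collections module, so it shows the structure only and will not show the memory benefit. The class name TermTrie and the -1 "absent" convention are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a trie mapping terms to non-negative integer ids. */
public class TermTrie {
  private static final int ABSENT = -1;

  private static final class Node {
    // Stand-in for a primitive char->Object map; HashMap boxes every key.
    final Map<Character, Node> children = new HashMap<>();
    int id = ABSENT; // set only on nodes that end a stored term
  }

  private final Node root = new Node();

  /** Stores the term; shared prefixes reuse the nodes that already exist. */
  public void put(String term, int id) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.computeIfAbsent(term.charAt(i), c -> new Node());
    }
    node.id = id;
  }

  /** Returns the id of the term, or -1 if the term was never stored. */
  public int get(String term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return ABSENT;
      }
    }
    return node.id;
  }

  public static void main(String[] args) {
    TermTrie trie = new TermTrie();
    trie.put("map", 0);
    trie.put("mahout", 1);
    trie.put("math", 2);
    System.out.println("mahout -> " + trie.get("mahout"));
    System.out.println("ma -> " + trie.get("ma")); // a prefix only, so absent
  }
}
```

The three terms share the nodes for "ma", which is where the compaction would come from on a large dictionary with many common prefixes.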
On Sat, Jan 16,
2010/1/16 Sean Owen sro...@gmail.com:
351MB isn't so bad.
I do think the next-best idea to explore is a trie, which could use a
char->Object map data structure provided by our new collections
module? To the extent this data is more compact when encoded in UTF-8,
it will be *much* more compact
[
https://issues.apache.org/jira/browse/MAHOUT-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Drew Farris updated MAHOUT-185:
---
Attachment: MAHOUT-185.patch
This patch adds bin/mahout, a simple bash script based heavily on
2010/1/16 Grant Ingersoll gsing...@apache.org:
I think we should start a new module, that will be the seed for a subproject,
called NLP and that contains the stuff for NLP.
Either that or put them in the utils module, which is where I envision all of
the things that are helpful for ML go, but
I agree the overhead of byte[] - UTF-8 probably isn't too good for
lookup performance.
In line with Sean's suggestion, I've used tries in the past for doing
this sort of string -> integer mapping. They generally perform well
enough for adding entries as well as retrieval. Not nearly as
efficient
[
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801204#action_12801204
]
Drew Farris commented on MAHOUT-252:
Is this committed? It seems like there are classes
Sure.
However, the immediate contribution is data. src/main/resources? Something else?
On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel
olivier.gri...@ensta.org wrote:
2010/1/16 Grant Ingersoll gsing...@apache.org:
I think we should start a new module, that will be the seed for a
subproject,
I propose a branch. Diffs from the branch to the trunk can still be
posted on the JIRA, but I think that a branch would be worthwhile in
facilitating collaboration.
I volunteer to fight with the maven-release-plugin to make it.
Primitive set unit tests
Key: MAHOUT-254
URL: https://issues.apache.org/jira/browse/MAHOUT-254
Project: Mahout
Issue Type: New Feature
Components: Math
Affects Versions: 0.3
Reporter:
Open hash set and map that plug into java.util
--
Key: MAHOUT-255
URL: https://issues.apache.org/jira/browse/MAHOUT-255
Project: Mahout
Issue Type: New Feature
Components: Math
Clean up raw type usage
---
Key: MAHOUT-256
URL: https://issues.apache.org/jira/browse/MAHOUT-256
Project: Mahout
Issue Type: New Feature
Components: Math
Affects Versions: 0.3
Reporter: Benson
Get rid of GenericSorting.java
--
Key: MAHOUT-257
URL: https://issues.apache.org/jira/browse/MAHOUT-257
Project: Mahout
Issue Type: New Feature
Components: Math
Affects Versions: 0.3
Unit test failure in CDInfo example
---
Key: MAHOUT-258
URL: https://issues.apache.org/jira/browse/MAHOUT-258
Project: Mahout
Issue Type: Bug
Affects Versions: 0.3
Reporter: Benson Margulies
How about src/main/resources/nlp?
On Sat, Jan 16, 2010 at 9:31 AM, Benson Margulies bimargul...@gmail.com wrote:
Sure.
However, the immediate contribution is data. src/main/resources? Something
else?
On Sat, Jan 16, 2010 at 10:16 AM, Olivier Grisel
olivier.gri...@ensta.org wrote:
+1 as well.
I think it should be in core rather than utils due to dependency issues.
On Sat, Jan 16, 2010 at 7:16 AM, Olivier Grisel olivier.gri...@ensta.org wrote:
2010/1/16 Grant Ingersoll gsing...@apache.org:
I think we should start a new module, that will be the seed for a
subproject,
[
https://issues.apache.org/jira/browse/MAHOUT-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801236#action_12801236
]
Benson Margulies commented on MAHOUT-258:
-
manually deleting 'target' and running
How can we say no?
On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies bimargul...@gmail.com wrote:
I volunteer to fight with the maven-release-plugin to make it.
--
Ted Dunning, CTO
DeepDyve
[
https://issues.apache.org/jira/browse/MAHOUT-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-248:
Resolution: Fixed
Status: Resolved (was: Patch Available)
Committed.
Next
Sure you could. The 'refine patches attached to JIRA' approach is the
classic Lucene project methodology, and I'm the new kid on the block
here.
On Sat, Jan 16, 2010 at 12:50 PM, Ted Dunning ted.dunn...@gmail.com wrote:
How can we say no?
On Sat, Jan 16, 2010 at 9:33 AM, Benson Margulies
I try to never say anything that decreases the output of a very productive
person. I often fail, but I try.
On Sat, Jan 16, 2010 at 10:11 AM, Benson Margulies bimargul...@gmail.com wrote:
Sure you could. The 'refine patches attached to JIRA' approach is the
classic Lucene project methodology,
I propose a branch. Diffs from the branch to the trunk can still be
posted on the JIRA, but I think that a branch would be worthwhile in
facilitating collaboration.
Do you mean -- for merging with the code I posted earlier?
By the way, I've integrated Colt from Mahout with our code base.
On Sat, Jan 16, 2010 at 1:15 PM, Dawid Weiss dawid.we...@gmail.com wrote:
I propose a branch. Diffs from the branch to the trunk can still be
posted on the JIRA, but I think that a branch would be worthwhile in
facilitating collaboration.
Do you mean -- for merging with the code I posted
[
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801247#action_12801247
]
Drew Farris commented on MAHOUT-252:
It was:
[
https://issues.apache.org/jira/browse/MAHOUT-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801248#action_12801248
]
Benson Margulies commented on MAHOUT-252:
-
That was the 'Map' patch, which was
[
https://issues.apache.org/jira/browse/MAHOUT-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benson Margulies updated MAHOUT-254:
Attachment: MAHOUT-254.patch
Primitive set unit tests
Yeah, it's probably due to the way I used to generate random data... the
problem is that I never get this error =P so it's very difficult to
fix... I'll try my best as soon as I have some time. In the meantime,
rerunning 'mvn clean install' generally does the trick.
On Sat, Jan 16, 2010 at
Have you finished with Colt? I think this is still worth completing
before we proceed to HPPC. Just talked to Staszek, we will move HPPC
code to Carrot2 labs SVN repository (sourceforge) because we want to
get rid of PCJ as soon as possible and need something versioned and
sticky. I plan to make a
I'm not quite done with Colt.
If you think you can refine a patch to go straight into the mahout
trunk, don't let me stop you.
On Sat, Jan 16, 2010 at 3:48 PM, Dawid Weiss dawid.we...@gmail.com wrote:
Have you finished with Colt? I think this is still worth completing
before we proceed to
. Running through strace showed
that something was attempting to read from /dev/random. Sometimes
it ran fine, but at least 25-30% of the time it ended up blocking until the
entropy pool was refilled. To test I moved /dev/random, and created a
link from /dev/urandom to /dev/random (the former doesn't
[
https://issues.apache.org/jira/browse/MAHOUT-153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12801280#action_12801280
]
Isabel Drost commented on MAHOUT-153:
-
Welcome to Mahout. Thanks for stepping up and
On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies bimargul...@gmail.com wrote:
. Running through strace showed
that something was attempting to read from /dev/random. Sometimes
it ran fine, but at least 25-30% of the time it ended up blocking until the
entropy pool was refilled. To test I moved
2010/1/16 Benson Margulies bimargul...@gmail.com:
. Running through strace showed
that something was attempting to read from /dev/random. Sometimes
it ran fine, but at least 25-30% of the time it ended up blocking until the
entropy pool was refilled. To test I moved /dev/random, and created a
link from
It looks as if this could be related
to the loading of the SecureRandomSeedGenerator class.
Let's fix that class to defer until there's a good reason to make a seed.
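A minimal sketch of what deferring could look like, using the initialization-on-demand holder idiom so that the expensive seed is read only on first real use. All names here (LazySeedDemo, SecureSeedHolder, FIXED_TEST_SEED) are hypothetical, java.util.Random stands in for the actual RNG, and this is not the real RandomUtils or uncommons-math code.

```java
import java.security.SecureRandom;
import java.util.Random;

/** Hypothetical sketch: pay for a secure seed only if a caller really needs one. */
public final class LazySeedDemo {
  private static final long FIXED_TEST_SEED = 42L; // arbitrary value for the sketch
  private static volatile boolean testSeed = false;

  // Counts how often the expensive path ran; only here to make the demo observable.
  static int secureSeedLoads = 0;

  /** The JVM initializes this holder on first access to SEED, not at startup. */
  private static final class SecureSeedHolder {
    static final long SEED = loadSecureSeed();
  }

  private static long loadSecureSeed() {
    secureSeedLoads++;
    // generateSeed may read the operating system's entropy source and can block.
    byte[] bytes = new SecureRandom().generateSeed(8);
    long seed = 0L;
    for (byte b : bytes) {
      seed = (seed << 8) | (b & 0xFF);
    }
    return seed;
  }

  /** Switches to a fixed, repeatable seed, as unit tests would want. */
  public static void useTestSeed() {
    testSeed = true;
  }

  public static Random getRandom() {
    // The holder class is touched only on the non-test branch.
    return testSeed ? new Random(FIXED_TEST_SEED) : new Random(SecureSeedHolder.SEED);
  }

  private LazySeedDemo() {
  }

  public static void main(String[] args) {
    useTestSeed();
    long first = getRandom().nextLong();
    long second = getRandom().nextLong();
    System.out.println("repeatable: " + (first == second));
    System.out.println("secure seed loads: " + secureSeedLoads);
  }
}
```

With the test seed switched on before the first call, the holder class is never initialized, so nothing touches the entropy source during a test run.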
Here is my attempt at making a dictionary lookup using Lucene. I need some
pointers on optimising it. Currently it takes 30 secs for a million lookups
using a dictionary of 500K words, about 30x that of a HashMap. But the space
used is almost the same; as far as I can see, the memory sizes look almost the
This is going to be a lot of fun. That class is in uncommons-math, and
the connection to it from Mahout is hardly obvious.
On Sat, Jan 16, 2010 at 5:34 PM, Benson Margulies bimargul...@gmail.com wrote:
It looks as if this could be related
to the loading of the SecureRandomSeedGenerator class.
I see a way, but it involves loading this class explicitly with reflection.
I'll make a patch.
Oh, I see. We have to give up on the MersenneTwisterRNG in tests and
just use the JRE. Is that OK?
On Sat, Jan 16, 2010 at 5:44 PM, Olivier Grisel
olivier.gri...@ensta.org wrote:
2010/1/16 Drew Farris drew.far...@gmail.com:
On Sat, Jan 16, 2010 at 4:42 PM, Benson Margulies bimargul...@gmail.com
Some tests are probably not calling:
RandomUtils.useTestSeed();
in a setUp() or static init. Maybe a MahoutTestCase base
class with a default static init that calls it would do.
Otherwise, I confirm that setting forkMode to once in maven/pom.xml
solves the issue (and all tests
On the indexing side, add in batches and reuse the document and fields.
On the search side, no need for a BooleanQuery and no need for scoring, so you
will likely want your own Collector (dead simple to write).
It _MAY_ even be faster to simply do the indexing as a word w/ the id as a
Unit tests should generally be using a fixed seed and not need to load a
secure seed from /dev/random. I would say that RandomUtils is probably the
problem here. The secure seed should be loaded lazily only if the test seed
is not in use.
The problem, as I see it, is that the uncommons-math
On Sun, Jan 17, 2010 at 4:53 AM, Grant Ingersoll gsing...@apache.org wrote:
On the indexing side, add in batches and reuse the document and fields.
Done. Squeezed out 5 secs there (30 down to 25), and further down to 22 by increasing max
merge docs.
On the search side, no need for a BooleanQuery and no
I'm getting similar slowdowns with my VirtualBox Ubuntu 9.04.
I suspect that the problem is not -only- caused by RandomUtils because:
1. I'm familiar with MersenneTwisterRNG slowdowns (I use it a lot) but
the test time used to be reported accurately by Maven. Now Maven
reports that a test
removing the Maven repository does not solve the problem, and neither does a
fresh checkout of the trunk.
But older revisions don't show any slowdown! I tried the following revisions:
Those old revisions seem OK:
r896946 | srowen