[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-08 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12865446#action_12865446
 ] 

Robin Anil commented on MAHOUT-391:
---

I am sure it's just a bug somewhere; conceptually, it should reduce the
space. A parallel comparison with Hadoop's VIntWritable should help pinpoint it.

> Make vector more space efficient with variable-length encoding, et al
> -
>
> Key: MAHOUT-391
> URL: https://issues.apache.org/jira/browse/MAHOUT-391
> Project: Mahout
>  Issue Type: Improvement
>Affects Versions: 0.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 0.4
>
> Attachments: MAHOUT-391.patch
>
>
> There are a few things we can do to make Vector representations smaller on 
> disk:
> - Use variable-length encoding for integer values like size and element 
> indices in sparse representations
> - Further, delta-encode indices in sequential representations
> - Let caller specify that precision isn't crucial in values, allowing it to 
> store values as floats
> Since indices are usually small-ish, I'd guess this saves 2 bytes or so on 
> average, out of 12 bytes per element now.
> Using floats where applicable saves another 4. Not bad.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
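The variable-length integer encoding proposed in the issue can be sketched as follows. This is a hypothetical illustration (the class and method names are invented, not Mahout's actual Varint code): each byte carries 7 data bits and the high bit signals that another byte follows, so indices under 128 take one byte and indices under 16,384 take two.

```java
import java.io.ByteArrayOutputStream;

public final class VarintSketch {

  // Encode a non-negative int using 7 data bits per byte;
  // the high bit of each byte signals that more bytes follow.
  static byte[] encode(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
    return out.toByteArray();
  }

  // Reassemble the int from its 7-bit groups (least significant first).
  static int decode(byte[] bytes) {
    int result = 0;
    int shift = 0;
    for (byte b : bytes) {
      result |= (b & 0x7F) << shift;
      shift += 7;
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(encode(127).length);   // 1 byte
    System.out.println(encode(16383).length); // 2 bytes
    System.out.println(decode(encode(300)));  // 300
  }
}
```

For typical sparse-vector indices, which are small, this is where the "2 bytes or so on average, out of 12 bytes per element" estimate comes from: a 4-byte index shrinks to 1-2 bytes, and storing values as floats drops another 4.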



Re: [jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-06 Thread Robin Anil
I am guessing SequenceFile is doing something to it. Compressing the two files
with gzip yields similar sizes.
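The observation above, that gzip largely hides the savings of a compact raw encoding, can be reproduced with a small experiment. This is a hypothetical sketch (invented names, not the actual SequenceFile benchmark): the same small integers are written once as fixed 4-byte ints and once as single bytes, and both streams are gzipped.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

public final class GzipComparisonSketch {

  // 10000 small values written as fixed 4-byte ints.
  static byte[] fixedWidthData() {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      for (int i = 0; i < 10000; i++) {
        out.writeInt(i % 100);
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The same values written as single bytes (a varint-like best case).
  static byte[] compactData() {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    for (int i = 0; i < 10000; i++) {
      bytes.write(i % 100);
    }
    return bytes.toByteArray();
  }

  // Size of the data after gzip compression.
  static int gzippedSize(byte[] data) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
        gz.write(data);
      }
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Raw sizes differ 4x; the gzipped sizes are both tiny and much
    // closer together, since gzip exploits the same redundancy.
    System.out.println("raw: " + fixedWidthData().length
        + " vs " + compactData().length);
    System.out.println("gzipped: " + gzippedSize(fixedWidthData())
        + " vs " + gzippedSize(compactData()));
  }
}
```

The compact encoding still matters for uncompressed I/O and for decode speed, but once block compression is in play the on-disk gap narrows, which matches what Robin saw.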


[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864304#action_12864304
 ] 

Robin Anil commented on MAHOUT-391:
---

import org.apache.mahout.common.MahoutTestCase;

This import was missing; that was the cause of the error, though the error
message was way off. I don't know how it compiles on your side.




[jira] Commented: (MAHOUT-391) Make vector more space efficient with variable-length encoding, et al

2010-05-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864292#action_12864292
 ] 

Robin Anil commented on MAHOUT-391:
---

I am getting build errors. Maybe you are on a different JUnit version?

/Users/robinanil/mahout/core/src/test/java/org/apache/mahout/math/VarintTest.java:[65,6]
 cannot find symbol
symbol  : method assertEquals(long,long)
location: class org.apache.mahout.math.VarintTest





Fwd: License for Google's patent

2010-05-04 Thread Robin Anil
Pasted from Hadoop General


From:   Owen O'Malley 
Subject:Re: License for Google's patent
Date:   Fri, 23 Apr 2010 05:27:14 GMT

All,
We got the following email from Larry Rosen, Apache's legal counsel.

-- Owen

On Apr 22, 2010, at 7:49 PM, Lawrence Rosen wrote:

> To: ASF Board
>
> Several weeks ago I sought clarification from Google about its
> recent patent 7,650,331 ["System and method for efficient large-
> scale data processing"] that may be infringed by implementation of
> the Apache Hadoop and Apache MapReduce projects.  I just received
> word from Google's general counsel that "we have granted a license
> for Hadoop, terms of which are specified in the CLA."
>
> I am very pleased to reassure the Apache community about Google's
> continued generosity and commitment to ASF and open source. Will
> someone here please inform the Apache Hadoop and Apache MapReduce
> projects that they need not worry about this patent.
>
> Best regards,
>
> /Larry
>
>
> Lawrence Rosen
> Rosenlaw & Einschlag, a technology law firm (www.rosenlaw.com)
> 3001 King Ranch Road, Ukiah, CA 95482
> Office: 707-485-1242Cell: 707-478-8932
> Apache Software Foundation, member and counsel (www.apache.org)
> Open Web Foundation, board member (www.openwebfoundation.org)
> Stanford University, Instructor in Law
> Author, Open Source Licensing: Software Freedom and Intellectual
> Property Law (Prentice Hall 2004)
>
>


Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
LZO is supposedly the best option, but due to GPL restrictions it was
removed. QuickLZ hasn't yet been integrated into the Hadoop codebase.

Robin

On Mon, May 3, 2010 at 1:15 AM, Drew Farris  wrote:

> Is this what is commonly referred to as zig-zag encoding? Avro uses the
> same
> technique:
> http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding
>
> For sequential sparse vectors we could get an additional win by delta
> encoding the indexes. This would allow the index, stored as the difference
> from the previous index, to be kept to two bytes in many cases.
>
> Regardless, vint encoding will produce a significant space savings and
> Sean's right: it has also been my experience that space savings often trump
> speed simply because of the speed of network or storage.
>
> Does anyone have any idea whether greater gains are to be found by finely
> tuning the base encoding vs. relying on some form of SequenceFile block
> compression? (Or do the two approaches complement each other nicely?)
>
> On Sun, May 2, 2010 at 12:33 PM, Sean Owen  wrote:
>
> > That's the one! I actually didn't know this was how PBs did the
> > variable-length encoding, but it makes sense; it's about the most
> > efficient thing I can imagine.
> >
> > Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
> > the 3 bytes or so it would take the other scheme. Could add up over
> > thousands of elements times millions of vectors.
> >
> > Decoding isn't too slow and if one believes this isn't an unusual
> > encoding, it's not so problematic to use it in a format that others
> > outside Mahout may wish to consume.
> >
> > On Sun, May 2, 2010 at 5:23 PM, Robin Anil  wrote:
> > > You mean this type of encoding instead?
> > >  http://code.google.com/apis/protocolbuffers/docs/encoding.html
> >
>
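Drew's delta-encoding idea from this thread can be sketched as follows. This is a hypothetical illustration (invented names): sorted sparse-vector indices are stored as differences from their predecessors, which keeps most values small enough for a one- or two-byte varint.

```java
import java.util.Arrays;

public final class DeltaIndexSketch {

  // Store each sorted index as the difference from its predecessor,
  // so most stored values are small even when the indices are large.
  static int[] deltaEncode(int[] sortedIndices) {
    int[] deltas = new int[sortedIndices.length];
    int prev = 0;
    for (int i = 0; i < sortedIndices.length; i++) {
      deltas[i] = sortedIndices[i] - prev;
      prev = sortedIndices[i];
    }
    return deltas;
  }

  // Rebuild the original indices by accumulating the differences.
  static int[] deltaDecode(int[] deltas) {
    int[] indices = new int[deltas.length];
    int prev = 0;
    for (int i = 0; i < deltas.length; i++) {
      prev += deltas[i];
      indices[i] = prev;
    }
    return indices;
  }

  public static void main(String[] args) {
    int[] indices = {3, 70000, 70002, 70900};
    int[] deltas = deltaEncode(indices); // {3, 69997, 2, 898}
    System.out.println(Arrays.toString(deltas));
    System.out.println(Arrays.equals(indices, deltaDecode(deltas)));
  }
}
```

Note this only works for sequential (sorted-index) representations, which is why the thread limits the trick to sequential sparse vectors; random-access representations would produce negative deltas and need zig-zag encoding on top.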


Re: Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
On Sun, May 2, 2010 at 9:40 PM, Sean Owen  wrote:

> What's the specific improvement idea?
>
> Size and speed improvements would be good. The Hadoop serialization
> mechanism is already pretty low-level, dealing directly in bytes (as
> opposed to fancier stuff like Avro). It's if anything fast and lean
> but quite manual. The latest Writable updates squeezed out most of the
> remaining overhead.
>
> One thing to recall is that in the tradeoff between size and speed, a
> test against a local ramdisk will make the cost of reading/writing
> bytes artificially low. That is to say I'd just err more on the side
> of compactness unless it makes a very big difference in decode time,
> as I imagine the cost of decoding bytes is nothing compared to that of
> storing and transmitting over a network. (Not to mention HDFS's work
> to replicate those bytes, etc.)
>
> I suspect there might be some value in storing vector indices as
> variable length ints, since they're usually not so large. I can also
> imagine more compact variable length encodings than the one in
> WritableUtils -- thinking of the encoding used in MIDI (and elsewhere
> I'd guess), where 7 bits per byte are used and the top bit signals the
> final value. IIRC WritableUtils always spends 8 bits writing the
> length of the encoding.
>
You mean this type of encoding instead?
 http://code.google.com/apis/protocolbuffers/docs/encoding.html

>
> On Sun, May 2, 2010 at 5:02 PM, Robin Anil  wrote:
> > I am getting more and  more ideas as I try to write about scaling Mahout
> > clustering. I added serialize and de serialize benchmark for Vectors and
> > checked the speed of our vectors.
> >
> > Here is the output with Cardinality=1000 Sparsity=1000(dense)
> numVectors=100
> > loop=100 (hence writing 10K(int-doubles) to and reading back from disk)
> > Note: that these are not disk MB/s but the size of vectors/per sec
> > deserialized and the filesystem is a Ramdisk.
>


Optimization opportunity: Speed up serialization and deserialization

2010-05-02 Thread Robin Anil
I am getting more and more ideas as I try to write about scaling Mahout
clustering. I added a serialize and deserialize benchmark for Vectors and
checked the speed of our vectors.

Here is the output with Cardinality=1000, Sparsity=1000 (dense), numVectors=100,
loop=100 (hence writing 10K (int, double) pairs to disk and reading them back).
Note that these are not disk MB/s but the number of vectors per second
deserialized, and the filesystem is a ramdisk.

robinanil$ ls -lh /tmp/*vector
-rwxrwxrwx  1 robinanil  77M May  2 21:25 /tmp/ram/dense-vector
-rwxrwxrwx  1 robinanil 115M May  2 21:25 /tmp/ram/randsparse-vector
-rwxrwxrwx  1 robinanil 115M May  2 21:25 /tmp/ram/seqsparse-vector

BenchMarks      DenseVector           RandSparseVector      SeqSparseVector
Deserialize
  nCalls        10000                 10000                 10000
  sum           1.30432s              2.207437s             1.681144s
  min           0.045ms               0.152ms               0.114ms
  max           74.549ms              8.446ms               3.748ms
  mean          0.130432ms            0.220743ms            0.168114ms
  stdDev        0.904858ms            0.206271ms            0.087123ms
  Speed         7666.83 /sec          4530.1406 /sec        5948.33 /sec
  Rate          92.00197 MB/s         54.361687 MB/s        71.37997 MB/s

Serialize
  nCalls        10000                 10000                 10000
  sum           3.391168s             6.300965s             5.304873s
  min           0.068ms               0.135ms               0.12ms
  max           254.635ms             1183.891ms            639.583ms
  mean          0.339116ms            0.630096ms            0.530487ms
  stdDev        5.558922ms            13.460321ms           8.618806ms
  Speed         2948.8364 /sec        1587.0585 /sec        1885.0592 /sec
  Rate          35.38604 MB/s         19.044703 MB/s        22.620712 MB/s


Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
I don't think you got the algorithm correct. The canopy list is empty at the
start and is automatically populated using the distance threshold. This may
work; I don't have a clue how to get to this point.

On Sun, May 2, 2010 at 6:15 PM, Sean Owen  wrote:

> How about this for the first phase? I think you can imagine how the
> rest goes, more later...
>
>
> Mapper 1A.
> map() input: One canopy
> map() output: canopy ID -> canopy
>
> Mapper 1B.
> (Has in memory all canopy IDs, read at startup)
> map() input: one point
> map() output: for each canopy ID, canopy ID -> point
>
> Reducer 1.
> reduce() input: canopy ID mapped to many points, one canopy
> reduce() output: for each point, compute distance from point to
> canopy, output (canopy ID, point ID) -> distance
>


Re: Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
On Sun, May 2, 2010 at 5:45 PM, Sean Owen  wrote:

> Not surprising indeed, that won't scale at some point.
> What is the stage that needs everything in memory? maybe describing
> that helps imagine solutions.
>
The algorithm is simple. For each point read into the mapper:
   Find the canopy it is closest to (from the in-memory List<Canopy>) and add
the point to that canopy;
   else, if the distance is greater than a threshold t1, create a
new canopy (added to the in-memory List<Canopy>).


> The typical reason for this, in my experience back in the day, was
> needing to look up data infrequently in a key-value way.
> "Side-loading" off HDFS (well, GFS via Bigtable) was reasonable. For
> whatever reason I cannot get any reasonable performance out of MapFile
> in this regard.
>
> Another common pattern seems to be that you need two or more kinds of
> values for a key in order to perform a computation. (For example in
> recommendations I'd need user vectors and matrix rows, both). The
> natural solution is to load one of them into memory and map the others
> into the computation.
>
> Instead I very much like Ankur's trick(s) for this situation: use two
> mappers, which Hadoop allows. They output different value types
> though, V1 and V2. So create a sort of "V1OrV2Writable" that can hold
> one or the other. It's simple to tell them apart in the mapper.
>
> There are even further tricks to ensure you get V1 or V2 first if needed.
>
> Don't know if that helps but might inspire ideas.
>
>
>
> On Sun, May 2, 2010 at 12:14 PM, Robin Anil  wrote:
> > Keeping all canopies in memory is not making things scale. I frequently
> run
> > into out of memory errors when the distance thresholds are not good on
> > reuters. Any ideas on optimizing this?
> >
> > Robin
> >
>


Canopy Clustering not scaling

2010-05-02 Thread Robin Anil
Keeping all canopies in memory is not making things scale. I frequently run
into out-of-memory errors when the distance thresholds are not good on
Reuters. Any ideas on optimizing this?

Robin


Re: svn commit: r939867 - in /lucene/mahout/trunk: core/src/main/java/org/apache/mahout/clustering/dirichlet/ core/src/main/java/org/apache/mahout/clustering/kmeans/ core/src/main/java/org/apache/ma

2010-05-02 Thread Robin Anil
Works fine :) Sorry about that.


[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-27 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861273#action_12861273
 ] 

Robin Anil commented on MAHOUT-236:
---

No Jeff, I don't have any implementations with me. Sorry for not replying 
earlier. I will have to start from scratch on it.

> Cluster Evaluation Tools
> 
>
> Key: MAHOUT-236
> URL: https://issues.apache.org/jira/browse/MAHOUT-236
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Reporter: Grant Ingersoll
> Attachments: MAHOUT-236.patch, MAHOUT-236.patch, MAHOUT-236.patch, 
> MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.




[jira] Commented: (MAHOUT-297) Canopy and Kmeans clustering slows down on using SeqAccVector for center

2010-04-27 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861272#action_12861272
 ] 

Robin Anil commented on MAHOUT-297:
---

There was a discussion about this on the dev list. Check the util vector 
benchmarks and see how much faster clustering became after this change. It 
shouldn't necessarily be SeqAcc if the points are all dense vectors, but the 
obvious savings for sparse data outweigh the slight loss in performance for 
dense (you will see that in the vector benchmarks code).



> Canopy and Kmeans clustering slows down on using SeqAccVector for center
> 
>
> Key: MAHOUT-297
> URL: https://issues.apache.org/jira/browse/MAHOUT-297
> Project: Mahout
>  Issue Type: Improvement
>  Components: Clustering
>Affects Versions: 0.4
>Reporter: Robin Anil
>Assignee: Robin Anil
> Fix For: 0.4
>
> Attachments: MAHOUT-297.patch, MAHOUT-297.patch, MAHOUT-297.patch, 
> MAHOUT-297.patch, MAHOUT-297.patch
>
>





Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
I think it changed after Jeff committed his code. It was there earlier.


On Mon, Apr 26, 2010 at 12:24 AM, Sean Owen  wrote:

> Where though, I just deleted all the methods to try it and every test
> passes.
>
> On Sun, Apr 25, 2010 at 7:51 PM, Robin Anil  wrote:
> > Its used in clustering to generate clusterid -> point id. Also to be used
> in
> > classification(by end of this summer) to keep class labels.
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
On Mon, Apr 26, 2010 at 12:17 AM, Sean Owen  wrote:

> I agree that it'd be good to kind of finalize the Vector stuff. I
> don't think it's reasonable for users to expect data output by 0.3 to
> be compatible with 0.4 though, so wouldn't worry about that.
>
> I think we're on the verge of wanting a proper serialization system
> like Avro for vectors here -- but not quite. About 3 flags describe
> any vector: denseness, sequential access-ness, and whether it has a
> name, if you want to unify that too. A simple byte of bit flags seems
> not so bad, if that's about as complex as this will ever get.
>
> What about label bindings, which I brought up earlier?
> Actually, I cannot find where labels are used except in tests. They're
> not serialized or cloned consistently. Are these used? Seems like the
> reason to package them together would be serialization but that's not
> it.
>
It's used in clustering to generate the cluster ID -> point ID mapping. It is
also to be used in classification (by the end of this summer) to keep class
labels.

>
> On Sun, Apr 25, 2010 at 4:36 PM, Robin Anil  wrote:
> > Let more comments come in before tearing it down. This affects
> everything.
> > We *have to *get it right by the next release, not necessarily today or
> > tomorrow. Or that would kind of kill the whole 0.3 users. Once fixed, we
> can
> > provide a convertor to convert to the new representation.
>


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
A Vector is simply any one of (an array of doubles) or (an array of int:double
pairs), and this info and other metadata are stored in a MetadataWritable.
Makes sense to me, assuming MetadataWritable allows us to skip over it
efficiently without deserializing.


On Sun, Apr 25, 2010 at 8:58 PM, Sean Owen  wrote:

> Yes, I think if we can convince ourselves that there won't be that
> many different possibilities for representing a vector, then a simple
> boolean might unify everything. This approach doesn't 'scale' but I
> don't know there are other representations we must have.
>
> The issue of named vectors is interesting. There's not really such a
> thing as an optional field in Hadoop serialization. You can fake it
> with a boolean but that starts to be messy.
>
> Messy might be necessary as vectors perhaps take on more metadata --
> though I can't envision much more. So perhaps it is right and proper
> to retain a second serialization format, in NamedVectorWritable, which
> is really the "vector with metadata" serializer versus
> VectorWritable's "pure vector" serializer.
>
> It has a logic to me. It gets rid of writing the class name which is
> indeed unpalatable.
>
> Thoughts before I go tearing through again?
>
Let more comments come in before tearing it down. This affects everything.
We *have to* get it right by the next release, not necessarily today or
tomorrow; otherwise it would kind of kill all the 0.3 users. Once fixed, we can
provide a converter to convert to the new representation.

Robin


Re: How to tackle Vector->NamedVector and back conversion

2010-04-25 Thread Robin Anil
>
>
> - How about moving label bindings out to NamedVector?
> - How about similar restructuring of matrices?
>
I don't know what the correct choice is here. It depends on whether we
should keep a single written representation for all vectors on disk; then an
optional field could hold the name.


- And how about not writing
> "org.apache.mahout.math.RandomAccessSparseVectorWritable" whenever
> VectorWritable does its wrapping.. I think making the package name and
> "Writable" implicit is perhaps worth the loss of generality.
>
Agreed. If we fix the on-disk representation, the only need is a bool which
says whether the dimensions are stored in sorted order and another bool
which tells whether the vector is dense. Dense vectors and sequential-access
vectors could then be deserialized faster (when the conditions are right). But
we keep the same written format for all vectors and state explicitly to the
algorithm what format we want to deserialize the vector into.
Robin
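The "byte of bit flags" discussed in this thread can be sketched as follows. This is a hypothetical illustration (invented constants and names, not VectorWritable's actual format): denseness, sequential access, and the presence of a name packed into a single header byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public final class VectorFlagsSketch {

  // The three traits the thread says describe any vector.
  static final int FLAG_DENSE = 0x01;
  static final int FLAG_SEQUENTIAL = 0x02;
  static final int FLAG_NAMED = 0x04;

  // Pack the three booleans into one byte of flags.
  static int headerByte(boolean dense, boolean sequential, boolean named) {
    return (dense ? FLAG_DENSE : 0)
        | (sequential ? FLAG_SEQUENTIAL : 0)
        | (named ? FLAG_NAMED : 0);
  }

  // Write the header; the vector payload would follow, and a reader
  // would branch on the flags to pick the right deserialization path.
  static void writeHeader(DataOutput out, boolean dense,
                          boolean sequential, boolean named) throws IOException {
    out.writeByte(headerByte(dense, sequential, named));
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writeHeader(new DataOutputStream(bytes), true, false, true);
    int flags = bytes.toByteArray()[0];
    System.out.println((flags & FLAG_DENSE) != 0);      // true
    System.out.println((flags & FLAG_SEQUENTIAL) != 0); // false
    System.out.println((flags & FLAG_NAMED) != 0);      // true
  }
}
```

One byte replaces writing a full class name like "org.apache.mahout.math.RandomAccessSparseVectorWritable" per vector, which is the saving Sean describes.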


Clean checkout Test broken

2010-04-25 Thread Robin Anil
Is this happening to anyone else?

---
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.184 sec
<<< FAILURE!
testProcessOutput(org.apache.mahout.df.mapreduce.partial.PartialBuilderTest)
 Time elapsed: 0.171 sec  <<< ERROR!
java.io.IOException: wrong value class: {null | null} is not class org.apache.mahout.math.VectorWritable
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1874)
	at org.apache.mahout.df.mapreduce.partial.PartialBuilder.processOutput(PartialBuilder.java:185)
	at org.apache.mahout.df.mapreduce.partial.PartialBuilderTest.testProcessOutput(PartialBuilderTest.java:82)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at junit.framework.TestCase.runTest(TestCase.java:168)
	at junit.framework.TestCase.runBare(TestCase.java:134)
	at junit.framework.TestResult$1.protect(TestResult.java:110)
	at junit.framework.TestResult.runProtected(TestResult.java:128)
	at junit.framework.TestResult.run(TestResult.java:113)
	at junit.framework.TestCase.run(TestCase.java:124)
	at junit.framework.TestSuite.runTest(TestSuite.java:232)
	at junit.framework.TestSuite.run(TestSuite.java:227)
	at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
	at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:616)
	at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
	at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)


Re: How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Robin Anil
On Sat, Apr 24, 2010 at 11:50 PM, Ted Dunning  wrote:

> If we are talking about the Writable aspect of this, then whatever input
> format we use should reasonably be able to handle both kinds of data with
> the conversions as you suggest.
>
Yes, having two separate Writable classes at the moment creates this
issue.

>
> For algorithms that are accepting arguments of a particular type, it might
> be reasonable to let NVW extend VW (I am not at all sure about the
> unintended consequences of this, but it sounds plausible).   Then all we
> need is a facade that exposes an NVW interface for a wrapped VectorWritable
> with some kind of default labels (say the indexes as strings).
>
Or the other way around: let everything be a NamedVectorWritable, and during
deserialization use explicit methods to use or skip the name.


>
> On Sat, Apr 24, 2010 at 11:04 AM, Robin Anil  wrote:
>
> > Some algorithms are using NamedVectorWritable, Some using VectorWritable.
> > Shouldn't we need an identity convertor for forward and some form of
> naming
> > assign convertor for backward conversion. Otherwise its going to be messy
> >
> > Robin
> >
>


How to tackle Vector->NamedVector and back conversion

2010-04-24 Thread Robin Anil
Some algorithms are using NamedVectorWritable, some using VectorWritable.
Shouldn't we need an identity converter for the forward direction and some form
of name-assigning converter for the backward conversion? Otherwise it's going
to be messy.

Robin


Re: Mahout In Action

2010-04-23 Thread Robin Anil
It's not aimed at 0.3 per se. Right now it's evolving with the code; for
example, the quality factor is something that will go in there. I keep updating
the code with the latest changes, and so does Sean. There isn't much that was
affected by your latest commit (it compiles), though I haven't fully
tested the code with the dataset after the commit, something I plan to do
soon.

Robin

On Fri, Apr 23, 2010 at 9:51 PM, Jeff Eastman wrote:

> I also wonder how much my recent clustering changes have affected the
> examples in the clustering sections. I know the book is currently aimed at
> Mahout 0.3 but users trying the examples with trunk may be frustrated by the
> recent changes in file naming. Do the examples exist in an unannotated
> version somewhere that I could get working again on trunk?
>
> On 4/23/10 9:10 AM, Sean Owen wrote:
>
>> Good eye, this was fixed in the manuscript a while ago.
>>
>> I will ping Manning to re-publish Chapters 1-6 since a lot of small
>> updates have happened since then.
>>
>> On Fri, Apr 23, 2010 at 4:53 PM, Jeff Eastman
>>   wrote:
>>
>>
>>> Section 4.5.1 says:
>>> "The third line shows how it is based on item-item similarities, not
>>> user-user similarities as before. The algorithms are similar, but not
>>> entirely symmetric. They do have notably different properties. For
>>> instance,
>>> the running time of an item-based recommender scales up as the number of
>>> items increases, whereas a user-based recommender’s running time goes up
>>> as
>>> the number of users increases.
>>>
>>> This suggests one reason that you might choose an item-based recommender:
>>> if
>>> the number of users is relatively low compared to the number of items,
>>> the
>>> performance advantage could be significant."
>>>
>>> Shouldn't the second paragraph be?
>>>
>>> "This suggests one reason that you might choose an item-based
>>> recommender:
>>> if the number of users is relatively *high* compared to the number of
>>> items,
>>> the performance advantage could be significant."
>>>
>>>
>>>
>>>
>>
>>
>
>


Re: [proposal] Create Mahout TLP

2010-04-22 Thread Robin Anil
Redirect lucene.apache.org/mahout to mahout.apache.org.

How about Hudson and other such infrastructure services that I am not fully
aware of? Wouldn't there be a Mahout quota now?

Robin

On Thu, Apr 22, 2010 at 10:05 PM, Drew Farris  wrote:

> Here's the JIRA issue I'll file with INFRA to get everything setup --
> comments anyone?
> (borrowed from the other INFRA issues to create TLPs)
>
> ---snip!---
>
> The board has agreed to create the Mahout project, formerly a Lucene
> subproject
>
> To aid in the process, would the infrastructure team please do the
> following:
>
> #===
>
> [0] Root Tasks
>
> Create unix group "mahout"
>
> Create new DNS entry "mahout.apache.org" and configure into website
> server instance
>
> #===
> [1] Mailing List (i) addresses
>
> Please migrate the existing archives, subscribers and moderators to
> the following lists:
>
> d...@mahout.apache.org from mahout-dev@lucene.apache.org
> u...@mahout.apache.org from mahout-u...@lucene.apache.org
> comm...@mahout.apache.org from mahout-comm...@lucene.apache.org
> gene...@mahout.apache.org new!
> priv...@mahout.apache.org new!
>
> (iii) initial moderators for new lists
>
> ... = apache.org
>
> Grant Ingersoll (gsing...@...)
> Sean Owen (sro...@...)
>
> (iv) options
>
> All lists except private should be set up to require subscription
> before posting, have the Reply-To header set, and suppress trailers.
>
> private list should also require moderation for subscription requests
>
> #===
> [2] Source Tracker
>
> (i) subversion
>
> Create mahout, we will handle the move from lucene/mahout once
> everything else is ready.
>
> #===
> [3] Initial Commiter/PMC list
>
> The initial PMC/Commiters:
>
> ... = apache.org
>
> Abdelhakim Deneche 
> Isabel Drost (isa...@...)
> Ted Dunning (tdunn...@...)
> Jeff Eastman (jeast...@...)
> Drew Farris (d...@...)
> Grant Ingersoll (gsing...@...)
> Benson Margulies (bimargul...@...)
> Sean Owen (sro...@...)
> Robin Anil (robina...@...)
> Jake Mannix  (jman...@...)
>
> Please add them to the new "mahout" group and private and general mailing
> lists
>
> #===
> [4] cwiki
>
> the (confluence) cwiki does not need to be moved (space = MAHOUT)
>
> Add the following admins:
>
> Robin Anil (robina...@...)
> Sean Owen (sro...@...)
>
> wiki diffs go to comm...@mahout.apache.org
>
> #===
> [5] Add "srowen" PMC chair to the appropriate authorizations in auth file.
>
> #===
>
> [6] create "mahout" directory in /www/www.apache.org/dist directory
> for publishing software
>
> #===
>
> [7] JIRA
>
> Create new category MAHOUT and move the MAHOUT project from the Lucene
> category to the new MAHOUT category.
>
> Set 'srowen' as the project owner and grant full administrative rights
> to the MAHOUT project.
>
> notifications go to d...@mahout.apache.org
> notifications come from d...@mahout.apache.org
>


[jira] Commented: (MAHOUT-384) Implement of AVF algorithm

2010-04-22 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859693#action_12859693
 ] 

Robin Anil commented on MAHOUT-384:
---

Hi Tony. Nice work on the patch. But before we commit this, there are a couple 
of things you need to cover. I still have to read the algorithm in detail to 
know whats happening. But I have some queries and suggestions below which is a 
kind of a checklist to make this a commitable patch

1) I am not a fan of Text based input, though it is what most of the algorithms 
in Mahout was first implement in. The idea of splitting and joining text files 
based on comma is not very clean. Can you convert this to deal with 
SequenceFile of VectorWritable OR some other Writable Format? Whats your input 
schema?
2) There is a code style we enforce in Mahout. You can run mvn 
checkstyle:checkstyle to see the violations. We also have an Eclipse formatter 
which formats code to almost match the checkstyle (rare manual interventions 
are required). Take a look at 
https://cwiki.apache.org/MAHOUT/howtocontribute.html and you will find the Eclipse 
formatter file at the bottom.
3) For parsing args use the apache commons cli2 library. Take a look at 
o/a/m/clustering/kmeans/KMeansDriver to see usage
4) What is Utils being used for?
5) @Override
   public void setup(Context context) throws IOException, InterruptedException {
     String filePath = context.getConfiguration().get("a");
     sumAttribute = Utils.readFile(filePath + "/part-r-0");
   }
Please use distributed cache to read the file in a map/reduce context. See the 
DictionaryVectorizer Map/Reduce classes for usage
6) job.setNumReduceTasks(1)? Is this necessary? Doesn't it hurt the scalability 
of this algorithm? Is the single reducer going to get a lot of data from the 
mappers? If yes, then you should think of removing this constraint and either 
letting the Hadoop parameters decide or parameterizing it.
7) Can this job be optimised using a Combiner? If yes, it's really worth 
spending the time to make one.
8) Tests! :)
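For context on what the patch computes: per the cited paper, a record's AVF score is the mean frequency of its attribute values, and low scores flag likely outliers. Here is a single-machine Java sketch for illustration only (the class and method names are invented; this is not the patch's Map/Reduce code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AvfSketch {

  // AVF score of a record = mean dataset-wide frequency of its attribute
  // values; lower scores mark likelier outliers.
  // Assumption: all records have the same number of attributes.
  public static double[] avfScores(List<List<String>> records) {
    int m = records.get(0).size();

    // Count the frequency of each value at each attribute position.
    List<Map<String, Integer>> freq = new ArrayList<>();
    for (int j = 0; j < m; j++) {
      freq.add(new HashMap<>());
    }
    for (List<String> r : records) {
      for (int j = 0; j < m; j++) {
        freq.get(j).merge(r.get(j), 1, Integer::sum);
      }
    }

    // Score each record as the average of its value frequencies.
    double[] scores = new double[records.size()];
    for (int i = 0; i < records.size(); i++) {
      long sum = 0;
      for (int j = 0; j < m; j++) {
        sum += freq.get(j).get(records.get(i).get(j));
      }
      scores[i] = (double) sum / m;
    }
    return scores;
  }
}
```

The map/reduce version splits the same two passes: one job counts the per-attribute value frequencies, a second job scores the records.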

> Implement of AVF algorithm
> --
>
> Key: MAHOUT-384
> URL: https://issues.apache.org/jira/browse/MAHOUT-384
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Reporter: tony cui
> Attachments: mahout-384.patch
>
>
> This program implements an outlier detection algorithm called AVF, which is a 
> kind of 
> Fast Parallel Outlier Detection for Categorical Datasets using MapReduce, and 
> is introduced by this paper: 
> http://thepublicgrid.org/papers/koufakou_wcci_08.pdf
> Following is an example of how to run this program under hadoop:
> $hadoop jar programName.jar avfDriver inputData interTempData outputData
> The output data contains the ordered avfValue in the first column, followed by 
> the original input data. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [Idea] Support Facebook Opengraph JSON format as an input

2010-04-21 Thread Robin Anil
Basically, support extraction of fields and vectors. Here is my public info:
graph.facebook.com/robin.anil. Here is Coca-Cola: graph.facebook.com/cocacola

Robin

On Thu, Apr 22, 2010 at 6:08 AM, Jeff Eastman wrote:

> Mahout Vectors and Clusters currently support JSON encodings for input and
> output. What else is needed?
>
> Jeff
>
>
> On 4/21/10 4:18 PM, Robin Anil wrote:
>
>> The details are not clear at the moment. But, I am sure this will help
>> adoption of the mahout quickly.
>>
>> Things to do. Parse JSON and make the SequenceFiles for use for
>> clustering,
>> classification and recommendation.
>>
>>
>> Robin
>>
>>
>>
>
>


[Idea] Support Facebook Opengraph JSON format as an input

2010-04-21 Thread Robin Anil
The details are not clear at the moment, but I am sure this will help
Mahout's adoption quickly.

Things to do: parse JSON and build SequenceFiles for use in clustering,
classification, and recommendation.


Robin


Re: Status of Mahout TLP

2010-04-21 Thread Robin Anil
I can help out in the redesign. Is there a CMS approved by Apache security,
something which will get patched automatically?

Robin


On Wed, Apr 21, 2010 at 3:34 PM, Grant Ingersoll wrote:

>
> On Apr 21, 2010, at 5:28 AM, Robin Anil wrote:
>
> > Today is the day :)
>
> Assuming it passes...  (which it should.)  We'll have some heavy lifting to
> do for a few days/weeks before any practical part of it is noticeable, just
> so people have reasonable expectations.
>
> Anyone up for a website redesign?  I'm kind of thinking we do like OFBiz
> and have a nice landing page, and then everything else is driven off
> Confluence.  Thoughts?
>
> -Grant
>
> >
> >
> > On Tue, Apr 13, 2010 at 5:16 AM, Benson Margulies  >wrote:
> >
> >> Here's a practical matter:
> >>
> >> svn layout.
> >>
> >> starting at the root we get, I propose:
> >>
> >>  - sandboxes
> >>  - mahout(/trunk,tag,branches)
> >>  - collections(/trunk/tag/branches)
> >>
> >> sandboxes gives us a home for experimental branches; mahout will
> >> contain the core product modules, collections the codegen and
> >> collections itself, and any other ideas that want loose coupling go in
> >> at the same level.
> >>
> >>
> >> On Mon, Apr 12, 2010 at 7:34 PM, Grant Ingersoll 
> >> wrote:
> >>> Yep.  Meeting is on the 21st.  I will be attending and letting y'all
> know
> >> what happens (I can't imagine it fails).  From the sounds of it, a good
> >> chunk of subprojects will be splitting from Lucene.
> >>>
> >>> Also, we should potentially start thinking about a Press Release to go
> >> with two things:
> >>> 1. Mahout as TLP
> >>> 2. Mahout 0.5 and/or 1.0 or whatever you want to call it.
> >>>
> >>> If we could hit these two things together in the next month or so, I
> >> think we could make some good noise that will be further built on by
> some
> >> successful GSOC projects.
> >>>
> >>> -Grant
> >>>
> >>> On Apr 12, 2010, at 5:50 PM, Jake Mannix wrote:
> >>>
> >>>> From what Grant said last time we talked about this, we need to wait
> >>>> until the next Apache directors meeting (or whatever it's called)
> before
> >>>> we move forward with that, I thought.
> >>>>
> >>>> -jake
> >>>>
> >>>> On Mon, Apr 12, 2010 at 2:43 PM, Robin Anil 
> >> wrote:
> >>>>
> >>>>> Hi everyone,
> >>>>>   I am just checking on the status/plan to move to
> >>>>> mahout.apache.org and the corresponding changes needed as a TLP, any
> >>>>> estimates of when that would happen or are we still on hold for this?
> >>>>>
> >>>>> Robin
> >>>>>
> >>>
> >>>
> >>>
> >>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


Re: Status of Mahout TLP

2010-04-21 Thread Robin Anil
Today is the day :)


On Tue, Apr 13, 2010 at 5:16 AM, Benson Margulies wrote:

> Here's a practical matter:
>
> svn layout.
>
> starting at the root we get, I propose:
>
>   - sandboxes
>   - mahout(/trunk,tag,branches)
>   - collections(/trunk/tag/branches)
>
> sandboxes gives us a home for experimental branches; mahout will
> contain the core product modules, collections the codegen and
> collections itself, and any other ideas that want loose coupling go in
> at the same level.
>
>
> On Mon, Apr 12, 2010 at 7:34 PM, Grant Ingersoll 
> wrote:
> > Yep.  Meeting is on the 21st.  I will be attending and letting y'all know
> what happens (I can't imagine it fails).  From the sounds of it, a good
> chunk of subprojects will be splitting from Lucene.
> >
> > Also, we should potentially start thinking about a Press Release to go
> with two things:
> > 1. Mahout as TLP
> > 2. Mahout 0.5 and/or 1.0 or whatever you want to call it.
> >
> > If we could hit these two things together in the next month or so, I
> think we could make some good noise that will be further built on by some
> successful GSOC projects.
> >
> > -Grant
> >
> > On Apr 12, 2010, at 5:50 PM, Jake Mannix wrote:
> >
> >> From what Grant said last time we talked about this, we need to wait
> >> until the next Apache directors meeting (or whatever it's called) before
> >> we move forward with that, I thought.
> >>
> >>  -jake
> >>
> >> On Mon, Apr 12, 2010 at 2:43 PM, Robin Anil 
> wrote:
> >>
> >>> Hi everyone,
> >>>I am just checking on the status/plan to move to
> >>> mahout.apache.org and the corresponding changes needed as a TLP, any
> >>> estimates of when that would happen or are we still on hold for this?
> >>>
> >>> Robin
> >>>
> >
> >
> >
>


Re: Urgent - Withdraw my application for Google summer of code

2010-04-20 Thread Robin Anil
Dear Yinghua,
  Most of us here at Mahout have full-time day jobs; we contribute
ideas and discussion, and code only when we get time. You are welcome anytime
to come and contribute to Mahout, code up your algorithm, and improve the
codebase. If you get time this summer, maybe over a weekend, do check out
what's happening with Mahout and just dive in.

Robin


On Wed, Apr 21, 2010 at 1:18 AM, yinghua hu  wrote:

> Mahout Development,
>
> I just got a summer internship offer today. I would like to withdraw my
> application for GSOC. I am really sorry to tell you this late. I did not
> hear anything from them until last week. I also did not know that I would
> be
> selected for this internship.
>
> Thank you very much for your recent help on the project! Maybe we can still
> get opportunity to work together in the future.
>


> Thanks!
>
> --
> Regards,
>
> Yinghua
>


[jira] Issue Comment Edited: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859042#action_12859042
 ] 

Robin Anil edited comment on MAHOUT-236 at 4/20/10 3:53 PM:


Yeah, for partial membership we can add multiple strategies: choose the top-k 
clusters, choose the top cluster, or choose the top cluster plus all clusters 
above a threshold. The CDbw computation will have to be modified to use the 
partial weights, that's all. 

So I think your idea does make sense; whether or not it gives a meaningful 
result, we have to experiment and see.

Robin

  was (Author: robinanil):
Yeah for partial membership, we can add multiple strategies like choose top 
K clusters or choose Top Cluster or choose Top Cluster or choose Top Cluster 
and all cluster > threshold. The CDbw computation will have to be modified to 
use the partial weights that all. 

So I think your idea do make sense and whether or not it gives meaningful 
result, that we have to experiment and see.

Robin
  
> Cluster Evaluation Tools
> 
>
> Key: MAHOUT-236
> URL: https://issues.apache.org/jira/browse/MAHOUT-236
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Reporter: Grant Ingersoll
> Attachments: MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.




Re: Urgent - Withdraw my application for Google summer of code

2010-04-20 Thread Robin Anil
Noted.

Robin


On Wed, Apr 21, 2010 at 1:18 AM, yinghua hu  wrote:

> Mahout Development,
>
> I just got a summer internship offer today. I would like to withdraw my
> application for GSOC. I am really sorry to tell you this late. I did not
> hear anything from them until last week. I also did not know that I would
> be
> selected for this internship.
>
> Thank you very much for your recent help on the project! Maybe we can still
> get opportunity to work together in the future.
>
> Thanks!
>
> --
> Regards,
>
> Yinghua
>


[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859042#action_12859042
 ] 

Robin Anil commented on MAHOUT-236:
---

Yeah, for partial membership we can add multiple strategies: choose the top-k 
clusters, choose the top cluster, or choose the top cluster plus all clusters 
above a threshold. The CDbw computation will have to be modified to use the 
partial weights, that's all. 

So I think your idea does make sense; whether or not it gives a meaningful 
result, we have to experiment and see.

Robin
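The "choose top-k clusters" strategy mentioned above can be sketched as follows. This is illustrative Java only (the class and method names are invented, not Mahout API); the real code would consume a point's partial-membership weights from fuzzy k-means:

```java
import java.util.Arrays;

public class TopKMembership {

  // Given one point's membership weights over all clusters, return the
  // indices of the k heaviest clusters, heaviest first.
  public static int[] topKClusters(double[] weights, int k) {
    Integer[] idx = new Integer[weights.length];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = i;
    }
    // Sort cluster indices by descending weight.
    Arrays.sort(idx, (a, b) -> Double.compare(weights[b], weights[a]));
    int[] out = new int[Math.min(k, idx.length)];
    for (int i = 0; i < out.length; i++) {
      out[i] = idx[i];
    }
    return out;
  }
}
```

The threshold variant is the same idea with a weight cutoff instead of a fixed k.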

> Cluster Evaluation Tools
> 
>
> Key: MAHOUT-236
> URL: https://issues.apache.org/jira/browse/MAHOUT-236
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Reporter: Grant Ingersoll
> Attachments: MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.




Re: SnowballAnalyzer

2010-04-20 Thread Robin Anil
+dev
@Delroy: Even if you correct the spelling, I believe SnowballAnalyzer cannot
be instantiated without a parameter the way StandardAnalyzer can.

Constructor signature is: SnowballAnalyzer(String name);
@dev: I am not a Java reflection expert, but is there a way we can find the
parameters of the constructor and automatically put some dummy values in?
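On the reflection question: `java.lang.reflect.Constructor#getParameterTypes` can at least report what each public constructor expects, so the driver could print a helpful message instead of a bare InstantiationException. A sketch under that assumption (the class and method names below are invented, not Mahout code); automatically inventing sensible dummy values is much harder and probably not worth it:

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

public class ConstructorInspector {

  // Describe every public constructor of a class as "Name(Type, Type, ...)",
  // so a driver can tell the user what the Analyzer actually requires.
  public static List<String> describeConstructors(Class<?> clazz) {
    List<String> out = new ArrayList<>();
    for (Constructor<?> c : clazz.getConstructors()) {
      StringBuilder sb = new StringBuilder(clazz.getSimpleName()).append('(');
      Class<?>[] params = c.getParameterTypes();
      for (int i = 0; i < params.length; i++) {
        if (i > 0) {
          sb.append(", ");
        }
        sb.append(params[i].getSimpleName());
      }
      out.add(sb.append(')').toString());
    }
    return out;
  }
}
```

A driver could call `clazz.getConstructor()` first and fall back to printing this list when no zero-arg constructor exists.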

Robin

On Wed, Apr 21, 2010 at 12:23 AM, Robin Anil  wrote:

> org.apache.lucene.analysis.snowball.SnowballAnalyzer
>
> Check spelling
>
> On Tue, Apr 20, 2010 at 10:23 PM, Delroy Cameron  > wrote:
>
>>
>> Grant,
>>
>> i'm trying to generate the Sequence Vectors using the SnowballAnlyzer as
>> opposed to the StandardAnlyzer. I've already gone through this process
>> using
>> the StandardAnlyzer and plotted the output clusters using the k-means dump
>> file, so i'm familiar with clustering in Mahout. i'd like to repeat this
>> exercise with the SnowballAnlyzer, running the following command.
>>
>> ./mahout seq2sparse -s 2 -a
>> org.apache.lucene.anlysis.snowball.SnowballAnlyzer -chunk 100 -i
>> /home/hadoop/tmp/trecdata-seqfiles/chunk-0 -o
>> /home/hadoop/tmp/trecdata-vectors -md 1 -x 75 -wt TFIDF -n 0
>>
>> 1) i've placed the lucene-snowball jar in the  m2 repository
>> /home/delroy/.m2/repository/org/apache/lucene/lucene-snowball/2.9.1
>>
>> 2) and i also updated the Mahout_CORE/pom xml to reflect the dependency
>> <dependency>
>>   <groupId>org.apache.lucene</groupId>
>>   <artifactId>lucene-snowball</artifactId>
>>   <version>2.9.1</version>
>> </dependency>
>>
>> 3) then i did a mvn install on the Mahout_CORE and on Mahout_ROOT, which
>> downloaded the lucene-snowball pom and lucene-snowball pom sha1 to the m2
>> repository
>>
>> this error seems to stem from developer code, which incidentally notes
>> that
>> you should not instantiate the anlyzer at
>> SparseVectorsFromSequenceFiles.java:176 any suggestions here?
>>
>> Output:
>> Exception in thread "main" java.lang.InstantiationException:
>> org.apache.lucene.anlysis.snowball.SnowballAnlyzer
>>at java.lang.Class.newInstance0(Class.java:357)
>>at java.lang.Class.newInstance(Class.java:325)
>>at org.apache.mahout.text.SparseVectorsFromSequenceFiles.main()
>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.lang.reflect.Method.invoke(Method.java:616)
>>at
>>
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>at
>> org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
>>
>> PS: I just love the spam filter..won't let me write too many variants of
>> the
>> word Analyzer because it contains the word anal.
>>
>>
>> -
>> --cheers
>> Delroy
>> --
>> View this message in context:
>> http://n3.nabble.com/SnowballAnalyzer-tp729983p732912.html
>> Sent from the Mahout User List mailing list archive at Nabble.com.
>>
>
>


[jira] Commented: (MAHOUT-236) Cluster Evaluation Tools

2010-04-20 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858811#action_12858811
 ] 

Robin Anil commented on MAHOUT-236:
---

Great start, Jeff. I will test it, see if the CDbw makes sense with the Reuters 
data, and post results.

> Cluster Evaluation Tools
> 
>
> Key: MAHOUT-236
> URL: https://issues.apache.org/jira/browse/MAHOUT-236
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>Reporter: Grant Ingersoll
> Attachments: MAHOUT-236.patch
>
>
> Per 
> http://www.lucidimagination.com/search/document/10b562f10288993c/validating_clustering_output#9d3f6a55f4a91cb6,
>  it would be great to have some utilities to help evaluate the effectiveness 
> of clustering.




Re: AbstractVector.minus(Vector)

2010-04-19 Thread Robin Anil
On Mon, Apr 19, 2010 at 9:43 PM, Sean Owen  wrote:

> More on Vector, as I'm browsing through it:
>
> AbstractVector.minus(Vector) says:
>
>  public Vector minus(Vector x) {
>if (size() != x.size()) {
>  throw new CardinalityException();
>}
>if (x instanceof RandomAccessSparseVector || x instanceof DenseVector) {
>  // TODO: if both are RandomAccess check the numNonDefault to
> determine which to iterate
>  Vector result = x.clone();
>  Iterator<Element> iter = iterateNonZero();
>  while (iter.hasNext()) {
>Element e = iter.next();
>result.setQuick(e.index(), e.get() - result.getQuick(e.index()));
>  }
>  return result;
>} else { // TODO: check the numNonDefault elements to further optimize
>  Vector result = clone();
>  Iterator<Element> iter = x.iterateNonZero();
>  while (iter.hasNext()) {
>Element e = iter.next();
>result.setQuick(e.index(), getQuick(e.index()) - e.get());
>  }
>  return result;
>}
>  }
>
>
> The stanza after the instanceof checks can just become the body of an
> overriding method in these two subclasses right?
>
> Since we're computing "this - that", makes sense to only look at
> "that" where it is nonzero. But the first version iterates over
> indices where "this" is nonzero, so it's wrong. (Yeah just checked
> with a test.)
>
> Was the intent to compute "that - this" in this case, so to be able to
> iterate over nonzero elements of "this", and then invert it at the
> end? This works:
>
>  @Override
>  public Vector minus(Vector that) {
>if (this.size() != that.size()) {
>  throw new CardinalityException();
>}
>Vector result = that.clone();
>Iterator<Element> iter = this.iterateNonZero();
>while (iter.hasNext()) {
>  Element thisElement = iter.next();
>  int index = thisElement.index();
>  result.setQuick(index, that.getQuick(index) -
> thisElement.get()); // this is "-(that - this)"
>}
>return result.times(-1.0);
>  }
>
> It's nice but involves another Vector allocation at the end.
>
> But then I'm also confused since this is the version intended for
> DenseVector and RandomAccessSparseVector. I'd imagine it's best used
> with SequentialAccessSparseVector, where "iterateNonZero()" has the
> most benefit?
>
>
> But that's only an issue to optimize if you iterate over "this", like
> in my update above. Could we just have one implementation that
> iterates over "that"? It's more straightforward. The issue isn't
> really implementation but number of non-zero elements in "this" versus
> "that", as the TODO comments point out.

You can test the minus performance using the Benchmark tool in utils. I
believe this optimization was done because inserting into or editing a
sequential-access vector is very expensive compared to the other implementations.
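The single-implementation idea from the thread (clone "this", then subtract only where "that" is nonzero) can be illustrated with plain dense arrays standing in for Vectors. This is a sketch of the semantics only; the real trade-off is which operand's nonzeros to iterate and how expensive writes into the result representation are:

```java
public class MinusSketch {

  // Compute "self - that": clone "self", then walk only the nonzero
  // entries of "that" and subtract them in place. Dense double[] arrays
  // stand in for Mahout Vectors here.
  public static double[] minus(double[] self, double[] that) {
    double[] result = self.clone();
    for (int i = 0; i < that.length; i++) {
      if (that[i] != 0.0) {
        result[i] -= that[i];
      }
    }
    return result;
  }
}
```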


Re: Web content

2010-04-19 Thread Robin Anil
https://cwiki.apache.org/MAHOUT/howtoupdatethewebsite.html


Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
Agreed, that's the correct way to go. But like I said, it warrants a complete
overhaul and a separate JIRA issue. The quick fix I indicated (i.e. putting
the ID back in but removing it from the compare/equals functions) was just for
this bug.

How does this structuring sound?

Vector(Interface) -> AbstractVector - > Dense|SparseVector
-> NamedDense|SparseVector OR LabelledDense|SparseVector  OR
MultiLabelledDense|SparseVector



Robin

On Sun, Apr 18, 2010 at 4:21 AM, Ted Dunning  wrote:

> That would be a very, very good thing (uniform data usage).
>
> On Sat, Apr 17, 2010 at 2:52 PM, Jake Mannix 
> wrote:
>
> > Currently, FuzzyKMeansClusterMapper has WritableComparable
> > keys which are ignored.  Could we instead have the identifier for the
> > vector live there, where it makes sense?  Then that same key could
> > be mapper output key, instead of the name of the Vector.
> >
> > This kind of change could get the clustering code to effectively be
> > able to run sensibly on the same SequenceFile
> > that DistributedRowMatrix is running on, and that would be very nice,
> > I think.
> >
>


Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
On Sun, Apr 18, 2010 at 12:11 AM, Drew Farris  wrote:

> On Sat, Apr 17, 2010 at 2:23 PM, Sean Owen  wrote:
> >
> > At the moment I want to understand how to patch up the fuzzy k-means
> > code in this regard -- will probably switch to something slightly less
> > state-dependent than asFormatString() as a key and be done with it for
> > the moment.
>
> After looking at it a bit, it seems like the most expedient solution
> would be to add 'name' back into the Vector class. Whether it needs to
> be part of equals(), I don't really know at this point, but I suspect
> not.
>
> It doesn't appear that asFormatString() will do the job simply because
> it's just an alternate representation of the entire vector, not an
> identifier. Not sure what the history with this is here, but why
> asFormatString() as opposed to toString()?
>
> It seems that the decorator alternative would involve something like a
> NamedVector class that adds an id, implements Vector, and holds any
> type of Vector to which it delegates all calls. This might work well,
> but would require more extensive modifications to the clustering code.
> Does anyone else think this is an approach worth exploring?
>
> Does the Vector really need a String name or could it simply hold an
> integer or long id?
>
I think a long id would do, as most gigantic tables these days are indexed by
a BIGINT (in MySQL). It is easy to assign random ids to documents/clusters
in a single map/reduce job by partitioning the int64 space across the mappers.
But changing that at the moment will modify a lot of things (all clustering
algorithms, the clusterdumper).
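The id-partitioning idea can be sketched like this, assuming each mapper knows its own index and the total mapper count (the class and method names are invented for illustration):

```java
public class IdRangePartitioner {

  // Carve the non-negative long space into equal slices so each mapper
  // can hand out ids from its own slice without any coordination.
  public static long sliceStart(int mapperIndex, int numMappers) {
    long sliceSize = Long.MAX_VALUE / numMappers;
    return mapperIndex * sliceSize;
  }

  // The id for the localCounter-th record seen by a given mapper.
  public static long idFor(int mapperIndex, int numMappers, long localCounter) {
    return sliceStart(mapperIndex, numMappers) + localCounter;
  }
}
```

Each slice holds Long.MAX_VALUE / numMappers ids, far more than any mapper will emit, so collisions are impossible by construction.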

For this bug, let's put the id back in and remove it from the
comparator/equals. Let's focus on getting the document structure correct.

Robin

> Drew
>
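For what it's worth, the decorator alternative discussed above might look roughly like this, using a pared-down stand-in for the real Mahout Vector interface (which has many more methods to delegate):

```java
public class NamedVectorSketch {

  // Pared-down stand-in for the real Mahout Vector interface; just enough
  // methods to show the decorator shape.
  public interface Vector {
    int size();
    double get(int index);
  }

  // Hypothetical decorator: wraps any Vector, adds a long id, and
  // delegates all vector operations to the wrapped instance.
  public static final class NamedVector implements Vector {
    private final Vector delegate;
    private final long id;

    public NamedVector(Vector delegate, long id) {
      this.delegate = delegate;
      this.id = id;
    }

    public long id() {
      return id;
    }

    @Override
    public int size() {
      return delegate.size();
    }

    @Override
    public double get(int index) {
      return delegate.get(index);
    }
  }
}
```

The id never participates in equals/hashCode of the underlying vector, which is exactly the separation the thread is after.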


Re: [jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil
Why not just keep the identifier and not compare it when doing equals? Let
it be like a tag on the vector.

On Sat, Apr 17, 2010 at 11:53 PM, Sean Owen  wrote:

> At the moment I'm already overreaching on the way to fix MAHOUT-379
> with this patch, as I've expanded to address some mildly related
> issues (equals, iterators).
>
> So I personally am not trying to change serialization formats in
> MAHOUT-379 / my current patch, no. The issue uncovered by removing
> name relates to serialization format (since that becomes a vector's
> new 'name') but is not a problem with the GSON format per se.
>
> I also don't really want to rip up Writable too much, no. I have other
> pet issues to foist on the project first.
>
> At the moment I want to understand how to patch up the fuzzy k-means
> code in this regard -- will probably switch to something slightly less
> state-dependent than asFormatString() as a key and be done with it for
> the moment.
>
>
> On Sat, Apr 17, 2010 at 6:39 PM, Drew Farris 
> wrote:
> > it is worth some investigation to determine if there is merit to
> > adapting Mahout's MR jobs to use avro. Doug has recently committed a
> > patch to avro (https://issues.apache.org/jira/browse/AVRO-493) that
> > involves considerably less complexity than what I had originally
> > proposed in https://issues.apache.org/jira/browse/MAHOUT-274, based on
> > the initial proposed avro/mapreduce integration in MAPREDUCE-815.
> >
> > I'm half waiting for avro 1.4 to be released (which will include
> > AVRO-493) before I dig into further proofs-of-concept of avro usage in
> > Mahout, but I think there is something there worth seriously
> > exploring. (half procrastinating otherwise)
> >
> > Drew
> >
> > On Sat, Apr 17, 2010 at 12:43 PM, Jeff Eastman
> >  wrote:
> >> Seems like a major rewrite to replace Writable within our MR jobs.
> >
>


[jira] Commented: (MAHOUT-379) SequentialAccessSparseVector.equals does not agree with AbstractVector.equivalent

2010-04-17 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858153#action_12858153
 ] 

Robin Anil commented on MAHOUT-379:
---

If the id is removed from the vector, I believe it will affect all clustering 
algorithms. The final stage is generating the (vector_id, cluster_id) pair; we 
will have to verify that this doesn't affect that step.

> SequentialAccessSparseVector.equals does not agree with 
> AbstractVector.equivalent
> -
>
> Key: MAHOUT-379
> URL: https://issues.apache.org/jira/browse/MAHOUT-379
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.4
>Reporter: Danny Leshem
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 0.3
>
> Attachments: MAHOUT-379.patch, MAHOUT-379.patch
>
>
> When a SequentialAccessSparseVector is serialized and deserialized using 
> VectorWritable, the result vector and the original vector are equivalent, yet 
> equals returns false.
> The following unit-test reproduces the problem:
> {code}
> @Test
> public void testSequentialAccessSparseVectorEquals() throws Exception {
> final Vector v = new SequentialAccessSparseVector(1);
> final VectorWritable vectorWritable = new VectorWritable(v);
> final VectorWritable vectorWritable2 = new VectorWritable();
> writeAndRead(vectorWritable, vectorWritable2);
> final Vector v2 = vectorWritable2.get();
> assertTrue(AbstractVector.equivalent(v, v2));
> assertEquals(v, v2); // This line fails!
> }
> private void writeAndRead(Writable toWrite, Writable toRead) throws 
> IOException {
> final ByteArrayOutputStream baos = new ByteArrayOutputStream();
> final DataOutputStream dos = new DataOutputStream(baos);
> toWrite.write(dos);
> final ByteArrayInputStream bais = new 
> ByteArrayInputStream(baos.toByteArray());
> final DataInputStream dis = new DataInputStream(bais);
> toRead.readFields(dis);
> }
> {code}
> The problem seems to be that the original vector name is null, while the new 
> vector's name is an empty string. The same issue probably also happens with 
> RandomAccessSparseVector.
> SequentialAccessSparseVectorWritable (line 40):
> {code}
> dataOutput.writeUTF(getName() == null ? "" : getName());
> {code}
> RandomAccessSparseVectorWritable (line 42):
> {code}
> dataOutput.writeUTF(this.getName() == null ? "" : this.getName());
> {code}
> The simplest fix is probably to change the default Vector's name from null to 
> the empty string.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: mahout/solr integration

2010-04-16 Thread Robin Anil
>
>
>
> Hmm... this was a bit scattered of a response, but I'm really loathe
> to turn away a) nice hooks between Solr and Mahout, b) scripting-style
> wrappers which could expand our community, and c) simply new
> functionality.
>
> +1
I definitely don't want to turn it away. I want to ensure that duplication
of effort doesn't happen and that the heavy lifting is indeed done in Java,
so that it is easy enough for anyone to consume in any existing app in any
language, even something as crazy as COBOL apps calling Mahout classifiers:
http://www.emunix.emich.edu/info/cobol/books/dpjafc.htm

An LSA implementation should actually be in Mahout core. I would love it to be.

Robin


Re: mahout/solr integration

2010-04-16 Thread Robin Anil

> > Java would have been nicer (Though I am saying that without knowing how
> > well
> > Clojure binaries can talk with Java ones and vice versa)
> >
>
> Clojure is on the JVM, Robin!  http://clojure.org/ - the "j" in the name
> should
> have given the hint! :)
>
:) Yeah, I saw that. I had a really bad ordeal when an F# prototype was later
re-coded in C# after it hit the performance wall. I just don't want the same
fate for Clojure, especially when it's a library that deals with algorithms
that need high performance.

Robin


Re: mahout/solr integration

2010-04-16 Thread Robin Anil
On Fri, Apr 16, 2010 at 7:52 PM, Anthony  wrote:

> All,
>
> I have begun work on an integration of Apache Solr and Mahout,
> http://github.com/algoriffic/lsa4solr   which is related to #MAHOUT-343
> (https://issues.apache.org/jira/browse/MAHOUT-343 ).  The
> implementation is in Clojure and interfaces with both the
> DistributedLanczosSolver and the distributed k-means clustering
> algorithm from Mahout.

Java would have been nicer (though I am saying that without knowing how well
Clojure binaries can talk to Java ones and vice versa).

>  I am about to begin implementing a
> hierarchical clustering algorithm so that the number of clusters does
> not need to be specified in advance.  Has anyone done anything like
> this in Mahout yet?  Also, I'd be happy to contribute the code to
> Mahout if anyone is interested.
>
Hierarchical clustering is missing from Mahout at the moment; it will be
great if you can help bring it to Mahout.

>
> Thanks,
> Anthony
>
> On Fri, Apr 16, 2010 at 9:50 AM, Jake Mannix  wrote:
> > Hey Anthony,
> >   We would love to have hierarchical clustering in Mahout, in Clojure or
> > pure java.  Come on over to the mahout-dev mailing list, and/or file
> > a JIRA ticket, and join the fun, we'd love to work with you (and over
> > on mahout-dev, you'll get even more positive feedback).
> >   If you'd rather, and aren't as familiar with the whole Apache process,
> > I can file a JIRA ticket for you, and you can just comment there and
> > start the conversation that way.
> >   Do you subscribe to the mahout-...@apache.org / mahout-user@
> > mailing lists?  They're not too high traffic.
> >   -jake
>


Status of Mahout TLP

2010-04-12 Thread Robin Anil
Hi everyone,
 I am just checking on the status/plan to move to
mahout.apache.org and the corresponding changes needed as a TLP. Any
estimates of when that will happen, or are we still on hold for this?

Robin


Re: VOTE: take 2: mahout-collections-1.0

2010-04-11 Thread Robin Anil
+1


On Mon, Apr 12, 2010 at 10:29 AM, deneche abdelhakim wrote:

> +1
>
> On Mon, Apr 12, 2010 at 4:50 AM, Ted Dunning 
> wrote:
> > +1 (on trust, really)
> >
> > On Sun, Apr 11, 2010 at 6:49 PM, Benson Margulies  >wrote:
> >
> >> https://repository.apache.org/content/repositories/orgapachemahout-015/
> >>
> >> contains (this time for sure) all the artifacts for release 1.0 of the
> >> mahout-collections component. This is the first independent release of
> >> collections from the rest of mahout; it differs from the version
> >> released with mahout 0.3 only in removing a dependency on slf4j.
> >>
> >> This vote will remain open for 72 hours.
> >>
> >
>


Re: Mahout GSoC 2010: Association Mining

2010-04-10 Thread Robin Anil
Like Ted said, it's a bit late for a GSoC proposal, but I am excited at the
possibility of improving the frequent pattern mining package. Check out the
current Parallel FPGrowth implementation in the code; you can find more
explanation of its usage on the Mahout wiki. Apriori should be trivially
parallelizable without the extra memory cost of PFPGrowth and should
scale well for large datasets. You can contribute it separately from GSoC;
the Apache community always welcomes such contributions. The wiki should
help you get started on Mahout development, with the correct code style and
practices. Let me know if you have any doubts or thoughts.

Robin
On Sat, Apr 10, 2010 at 5:51 AM, Ted Dunning  wrote:

> Neal, I think that this might well be a useful contribution to Mahout, but,
> if I am not mistaken, I think that the deadline for student proposals for
> GSoC has just passed.
>
> That likely means that making this contribution an official GSoC project is
> not possible.  I am sure that the Mahout community would welcome you as a
> contributor even without official Google status.  If you would like to do
> this, go ahead and propose what you want to do (when JIRA comes back or
> just
> by email discussion) and you can get started.
>
> On Fri, Apr 9, 2010 at 2:11 PM, Neal Clark  wrote:
>
> > Hello,
> >
> > I just wanted to introduce myself. I am a MSc. Computer Science
> > student at the University of Victoria. My research over the past year
> > has been focused on developing and implementing an Apriori based
> > frequent item-set mining algorithm for mining large data sets at low
> > support counts.
> >
> >
> >
> https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en
> >
> > The main finding of the above report is that support levels as low as
> > 0.001% on the webdocs (1.4GB) dataset can be efficiently calculated.
> > On a 100 core cluster all frequent k2 pairs can calculated in
> > approximately 6 minutes.
> >
> > I currently have an optimized k2 Hadoop implementation and algorithm
> > for generating frequent pairs and I am currently extending my work to
> > items of any length. The analysis of the extended approach will be
> > complete within the next two weeks.
> >
> > Would you be interesting in moving forward with such an implementation
> >  as a GSoC project? If so any comments/feedback would be very much
> > appreciated. If you are interested I can create a proposal and submit
> > it to your issue tracker when it comes back online.
> >
> > Thanks,
> >
> > Neal.
> >
>
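
The k2 pair counting Neal describes is easy to prototype on a single machine before distributing it with Hadoop. A minimal sketch in plain Java (my own illustration — the class and method names are hypothetical, not from Neal's implementation or any Mahout code):

```java
import java.util.*;

/** Toy Apriori k2 step: count co-occurring item pairs and keep those
 *  meeting a minimum support count. A Hadoop job would emit the pair
 *  keys from a mapper and sum them in a reducer instead. */
public class PairCounter {
  public static Map<String, Integer> frequentPairs(List<List<String>> transactions, int minSupport) {
    Map<String, Integer> counts = new HashMap<>();
    for (List<String> txn : transactions) {
      // dedupe and sort items so each pair has one canonical "a,b" key
      List<String> items = new ArrayList<>(new TreeSet<>(txn));
      for (int i = 0; i < items.size(); i++) {
        for (int j = i + 1; j < items.size(); j++) {
          counts.merge(items.get(i) + ',' + items.get(j), 1, Integer::sum);
        }
      }
    }
    counts.values().removeIf(c -> c < minSupport); // prune infrequent pairs
    return counts;
  }

  public static void main(String[] args) {
    List<List<String>> txns = Arrays.asList(
        Arrays.asList("milk", "bread", "beer"),
        Arrays.asList("milk", "bread"),
        Arrays.asList("bread", "beer"));
    // "beer,bread" and "bread,milk" survive with count 2; "beer,milk" is pruned
    System.out.println(frequentPairs(txns, 2));
  }
}
```

The same counting logic maps directly onto a mapper (emit pair keys) and a reducer (sum and threshold), which is what makes the k2 step so parallel-friendly.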


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-09 Thread Robin Anil
Hi Lukáš,
It would have been great if you could have participated in GSoC while
there is time left. You still have your proposal in the GSoC system;
take your time to decide, but if you choose not to participate, do remove the
application from the SoC website.

A wiki page for association mining would be a good start. The pattern mining
package needs to grow beyond just FPGrowth. Resource-intensive
operations on large datasets using Hadoop are what Mahout should do best. I
can help around the code as much as you like to make it more generic and
suitable for association mining.

Regards
Robin

On Fri, Apr 9, 2010 at 4:56 PM, Lukáš Vlček  wrote:

> Robin,
>
> I think it does not make sense for me to catch up with the GSoC timeline now, as I
> am quite busy with other stuff. However, I will develop the proposal for
> Association Mining (or GUHA if you like) and keep this discussion going.
> I am really interested in contributing some implementation to Mahout, but as
> of now the GSoC timeline is not of any help to me.
>
> Let me look at this in detail and I will get back to mahout community with
> more details.
>
> As for the use cases for association mining, a lot of examples can be found
> in the literature. When it comes to missing or negative attributes of the data
> (of the transaction), I think there are a lot of examples as well. One
> example would be analysis of a click stream, where you can learn that
> people visiting a negative comment on a product blog never enter the order
> form. Not saying this is the best example, but in general this is the essence of
> it. You simply need to take all possible values from the transaction into
> account, even those missing from the market basket. I also remember that
> Simpson's paradox is often referred to in connection with this issue. As for
> GUHA, its power is that it has a well-developed theoretical background. This,
> for example, means that it states a formalized framework for various analyses
> that probably have their origin in psychology and similar soft sciences, and
> "association-like functions" between data attributes can be expressed using
> 4ft table members and user thresholds.
>
> The biggest challenge in implementing this would be the fact that the
> analysis has to deal with all the data (not just the most frequent
> patterns) and combinations. It is very resource-expensive.
>
> I am thinking of starting a wiki page for Mahout about association mining
> ... this could help, what do you think?
>
> Regards,
> Lukas
>
> On Tue, Apr 6, 2010 at 12:01 AM, Robin Anil  wrote:
>
> > PS: Current TopK FPGrowth is pretty tightly coupled. But it can be easily
> > refactored out or even a vanilla implementation of FPGrowth is not so
> > difficult to re-create by re-using the existing methods.
> >
> > Robin
> >
> >
> > On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:
> >
> > > Hi Lukas,
> > >Sorry for being late to getting back to you on this.
> > > Association rule mining is a great addition to FPGrowth. I am not sure I
> > > understand the GUHA method well, but then again I understood Ted's LLR after some
> > > deep reading. Could you put up an interesting example to help us understand
> > > this method, maybe starting from a transaction of shopping cart items? A
> > > great demo is a big plus for a GSoC project.
> > >
> > > Robin
> > >
> > >
> > > On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  > >wrote:
> > >
> > >> Hello,
> > >>
> > >> I would like to apply for Mahout GSoC 2010. My proposal is to
> implement
> > >> Association Mining algorithm utilizing existing PFPGrowth
> implementation
> > (
> > >> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
> > >>
> > >> As for the Association Mining I would like to implement a very
> general
> > >> algorithm based on old and established method called GUHA [1]. Short
> and
> > >> informative description of GUHA method can be found here [2]. Very
> > >> exhaustive theoretical foundation can be found here [3]. Since the
> GUHA
> > >> method has been developing from 60's and is very rich now and given
> the
> > >> limited time during GSoC program it would be wise to reduce scope of
> > >> initial
> > >> implementation.
> > >>
> > >> There are two main topics that I would like to focus on during GSoC:
> > >>
> > >> 1) Enhancing and generalizing PFPGrowth:
> > >> Frequent pattern minin

[jira] Commented: (MAHOUT-368) should package core ,math and collections to one Jar package for hadoop recommendations

2010-04-07 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854429#action_12854429
 ] 

Robin Anil commented on MAHOUT-368:
---

Check out mahout-examples.job in examples/target; it's the jar file you need.

> should package core ,math and collections to one Jar package for hadoop 
> recommendations
> ---
>
> Key: MAHOUT-368
> URL: https://issues.apache.org/jira/browse/MAHOUT-368
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
>Affects Versions: 0.4
>Reporter: Hui Wen Han
> Fix For: 0.4
>
>
> Core, math and collections should be packaged into one jar for 
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob,
> because RecommenderJob uses classes (for example 
> org.apache.mahout.math.VectorWritable) of the math module of the Mahout project,
> but the math and core modules are separate jar packages.
> So when working in the Hadoop environment, the classes of the math module cannot be loaded by the 
> classloader on the datanode,
> which causes a ClassNotFoundException.
> The workaround is to package all Mahout classes into one jar manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: VOTE: release mahout-collections-codegen 1.0

2010-04-07 Thread Robin Anil
I will take your word for it :) Just offering a solution to Ted's concern.

+1 from me.

Robin

On Wed, Apr 7, 2010 at 4:50 PM, Benson Margulies wrote:

> You have two choices: 1) take my word (or diff) about the zero code
> changes :-), or
>
> 2) I can set up a patch that will cause the 'collections' directory to
> pull this. I have to add a  to the pom so that it gets
> fetched from the special staging repo. I'll do that later today.
>
>
>
> On Wed, Apr 7, 2010 at 1:26 AM, Robin Anil  wrote:
> > Is there a patch which pulls in this dependency to build Mahout? That's a
> > good test for it.
> >
> > Robin
> >
> > On Wed, Apr 7, 2010 at 10:45 AM, Ted Dunning 
> wrote:
> >
> >> I confirm that the components exist and appear in good order.
> >>
> >> Is there a way for me to test this component?  Is there any testing
> needed
> >> beyond checking existence?
> >>
> >> On Tue, Apr 6, 2010 at 7:13 PM, Benson Margulies  >> >wrote:
> >>
> >> > On Tue, Apr 6, 2010 at 9:40 PM, Ted Dunning 
> >> wrote:
> >> > > Is that possible here instead:
> >> > >
> >> >
> >>
> https://repository.apache.org/content/repositories/staging/org/apache/mahout/
> >> > ?
> >> >
> >> > No, that's not right. That path has our last (0.3) release in it.
> >> > However, I had forgotten to close it.
> >> >
> >> >
> https://repository.apache.org/content/repositories/orgapachemahout-006/
> >> >
> >> > It should work better now.
> >> >
> >> >
> >> > >
> >> > > On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies <
> >> bimargul...@gmail.com
> >> > >wrote:
> >> > >
> >> > >> In order to decouple the mahout-collections library from the rest
> of
> >> > >> Mahout, to allow more frequent releases and other good things, we
> >> > >> propose to release the code generator for the collections library
> as a
> >> > >> separate Maven artifact. (Followed, in short order, by the
> collections
> >> > >> library proper.) This is proposed release 1.0 of
> >> > >> mahout-collections-codegen-plugin. This is intended as a maven-only
> >> > >> release; we'll put the artifacts in the Mahout download area as
> well,
> >> > >> but we don't ever expect anyone to use this except from Maven,
> >> > >> inasmuch as it is a maven plugin.
> >> > >>
> >> > >> The release artifacts are in the Nexus stage, as follows.
> >> > >>
> >> > >>
> >> https://repository.apache.org/content/repositories/orgapachemahout-006/
> >> > >>
> >> > >> This vote will remain open for 72 hours.
> >> > >>
> >> > >
> >> >
> >>
> >
>


Introducing Gizzard, a framework for creating distributed datastores

2010-04-06 Thread Robin Anil
It's Apache-licensed and looks like a great option for storing and querying
large graphs. It may be useful as a model store for a classifier.

http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html
http://github.com/twitter/gizzard

Robin


Re: VOTE: release mahout-collections-codegen 1.0

2010-04-06 Thread Robin Anil
Is there a patch which pulls in this dependency to build Mahout? That's a good
test for it.

Robin

On Wed, Apr 7, 2010 at 10:45 AM, Ted Dunning  wrote:

> I confirm that the components exist and appear in good order.
>
> Is there a way for me to test this component?  Is there any testing needed
> beyond checking existence?
>
> On Tue, Apr 6, 2010 at 7:13 PM, Benson Margulies  >wrote:
>
> > On Tue, Apr 6, 2010 at 9:40 PM, Ted Dunning 
> wrote:
> > > Is that possible here instead:
> > >
> >
> https://repository.apache.org/content/repositories/staging/org/apache/mahout/
> > ?
> >
> > No, that's not right. That path has our last (0.3) release in it.
> > However, I had forgotten to close it.
> >
> > https://repository.apache.org/content/repositories/orgapachemahout-006/
> >
> > It should work better now.
> >
> >
> > >
> > > On Tue, Apr 6, 2010 at 6:08 PM, Benson Margulies <
> bimargul...@gmail.com
> > >wrote:
> > >
> > >> In order to decouple the mahout-collections library from the rest of
> > >> Mahout, to allow more frequent releases and other good things, we
> > >> propose to release the code generator for the collections library as a
> > >> separate Maven artifact. (Followed, in short order, by the collections
> > >> library proper.) This is proposed release 1.0 of
> > >> mahout-collections-codegen-plugin. This is intended as a maven-only
> > >> release; we'll put the artifacts in the Mahout download area as well,
> > >> but we don't ever expect anyone to use this except from Maven,
> > >> inasmuch as it is a maven plugin.
> > >>
> > >> The release artifacts are in the Nexus stage, as follows.
> > >>
> > >>
> https://repository.apache.org/content/repositories/orgapachemahout-006/
> > >>
> > >> This vote will remain open for 72 hours.
> > >>
> > >
> >
>


Re: [GSOC] 2010 Timelines

2010-04-06 Thread Robin Anil
Two days to go till the close of student submissions. A request to mentors to
provide feedback on all the queries on the list, so that students can go and
work on tuning their proposals.

Robin

On Sat, Apr 3, 2010 at 10:50 PM, Grant Ingersoll wrote:

>
> http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline


GSOC [mentor idea]: Clustering visualization with GraphViz

2010-04-06 Thread Robin Anil
Here is a good project wish-list item. If anyone wishes to take it forward I
would be willing to help mentor.

http://www.graphviz.org/
Check out one of the graphs below, which I believe is a good way to represent
clusters. Creating such a graph is as easy as writing the cluster output in the
Graphviz format:
http://www.bioconductor.org/overview/Screenshots/photoalbum_photo_view?b_start=6

This is an excellent project which allows you to display Graphviz output in a
browser. Maybe we can create a generic webapp from the current Taste webapp
and add clustering functionality there:
http://code.google.com/p/canviz/

Robin
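
Writing cluster output in the Graphviz DOT format really is only a few lines of code. A hedged sketch (the cluster and document names are hypothetical; a real Mahout cluster dump would need a small adapter to produce this map):

```java
import java.util.*;

/** Emit a cluster -> members mapping as a Graphviz DOT digraph. */
public class DotWriter {
  public static String toDot(Map<String, List<String>> clusters) {
    StringBuilder sb = new StringBuilder("digraph clusters {\n");
    for (Map.Entry<String, List<String>> e : clusters.entrySet()) {
      for (String member : e.getValue()) {
        // one edge per cluster->member; Graphviz lays out the star-shaped groups
        sb.append("  \"").append(e.getKey()).append("\" -> \"").append(member).append("\";\n");
      }
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    Map<String, List<String>> clusters = new LinkedHashMap<>();
    clusters.put("cluster0", Arrays.asList("doc1", "doc2"));
    clusters.put("cluster1", Arrays.asList("doc3"));
    // the resulting text can be rendered with `dot -Tpng` or loaded by canviz
    System.out.print(toDot(clusters));
  }
}
```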


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-06 Thread Robin Anil
Great proposal. Hopefully this will push Mahout core to have faster releases


Robin


On Wed, Apr 7, 2010 at 3:29 AM, Grant Ingersoll  wrote:

> +1.  Release early, release often.
>
> -Grant
>
> On Apr 6, 2010, at 5:12 PM, Benson Margulies wrote:
>
> > Indeed. Off I go.
> >
> > On Tue, Apr 6, 2010 at 4:23 PM, Ted Dunning 
> wrote:
> >> Very cool.  Very exciting.
> >>
> >> Benson, that sounds like consensus to me.
> >>
> >> On Tue, Apr 6, 2010 at 1:02 PM, Jake Mannix 
> wrote:
> >>
> >>> ... I'm in favor, I guess, of:
> >>>
> >>> 1: remove collections-codegen and collections from the top-level pom's
> >>> module list.
> >>> 2: change their parents to point to the apache parent.
> >>> 3: tweak their poms so that the release plugin works right with them.
> >>> 4: release them
> >>> 5: change rest of mahout to consume release.
> >>>
> >>>   -jake
> >>>
> >>>
> >>
>
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Robin Anil
I have seen this before; that was when I didn't do a clean install before
checking in.

On Tue, Apr 6, 2010 at 3:13 PM, Sean Owen  wrote:

> Weak, surely my changes that did it but I don't know why I didn't see
> this in a local build / test.
>
> On Tue, Apr 6, 2010 at 10:41 AM, Apache Hudson Server
>  wrote:
> > See <
> http://hudson.zones.apache.org/hudson/job/Mahout%20Trunk/584/changes>
> >
> > Changes:
> >
> > [srowen] MAHOUT-362 last refactorings for now
> >
> > [srowen] MAHOUT-362 More refinement of writables
> >
> > [srowen] MAHOUT-362 Fix test location and merge ItemWritable/UserWritable
> into EntityWritable
> >
> > [srowen] Oops, fixed compile error from last commit which missed out some
> changes
> >
> > [srowen] Initial commit of MAHOUT-362. Refactoring to come.
> >
> > [srowen] Restore logging to SVD related code
> >
> > [srowen] MAHOUT-361 Hearing no objection and believing Math shouldn't
> have log statements and seeing that they're not really used much, I commit
> >
> > [adeneche] MAHOUT-323 Added a Basic Mapreduce version of TestForest
> >
> > [srowen] MAHOUT-361 Remove logging from collections -- uncontroversial it
> seems
> >
>


Re: Build failed in Hudson: Mahout Trunk #584

2010-04-06 Thread Robin Anil
Running org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest
Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.039
sec <<< FAILURE!



On Tue, Apr 6, 2010 at 3:13 PM, Sean Owen  wrote:

> Weak, surely my changes that did it but I don't know why I didn't see
> this in a local build / test.
>
> On Tue, Apr 6, 2010 at 10:41 AM, Apache Hudson Server
>  wrote:
> > See <
> http://hudson.zones.apache.org/hudson/job/Mahout%20Trunk/584/changes>
> >
> > Changes:
> >
> > [srowen] MAHOUT-362 last refactorings for now
> >
> > [srowen] MAHOUT-362 More refinement of writables
> >
> > [srowen] MAHOUT-362 Fix test location and merge ItemWritable/UserWritable
> into EntityWritable
> >
> > [srowen] Oops, fixed compile error from last commit which missed out some
> changes
> >
> > [srowen] Initial commit of MAHOUT-362. Refactoring to come.
> >
> > [srowen] Restore logging to SVD related code
> >
> > [srowen] MAHOUT-361 Hearing no objection and believing Math shouldn't
> have log statements and seeing that they're not really used much, I commit
> >
> > [adeneche] MAHOUT-323 Added a Basic Mapreduce version of TestForest
> >
> > [srowen] MAHOUT-361 Remove logging from collections -- uncontroversial it
> seems
> >
>


Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
PS: The current TopK FPGrowth is pretty tightly coupled, but it can easily be
refactored out; even a vanilla implementation of FPGrowth is not so
difficult to re-create by re-using the existing methods.

Robin


On Tue, Apr 6, 2010 at 3:29 AM, Robin Anil  wrote:

> Hi Lukas,
>    Sorry for being late in getting back to you on this.
> Association rule mining is a great addition to FPGrowth. I am not sure I
> understand the GUHA method well, but then again I understood Ted's LLR after some
> deep reading. Could you put up an interesting example to help us understand
> this method, maybe starting from a transaction of shopping cart items? A
> great demo is a big plus for a GSoC project.
>
> Robin
>
>
> On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček wrote:
>
>> Hello,
>>
>> I would like to apply for Mahout GSoC 2010. My proposal is to implement
>> Association Mining algorithm utilizing existing PFPGrowth implementation (
>> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
>>
>> As for the Association Mining I would like to implement a very general
>> algorithm based on an old and established method called GUHA [1]. A short and
>> informative description of the GUHA method can be found here [2]. A very
>> exhaustive theoretical foundation can be found here [3]. Since the GUHA
>> method has been developing since the 60's and is very rich now, and given the
>> limited time during the GSoC program, it would be wise to reduce the scope of
>> the initial
>> implementation.
>>
>> There are two main topics that I would like to focus on during GSoC:
>>
>> 1) Enhancing and generalizing PFPGrowth:
>> Frequent pattern mining is usually part of each association mining task. In
>> Mahout there is an existing implementation of the PFPGrowth algorithm. I would like
>> to utilize this implementation, but it would be necessary to enhance or
>> generalize it first. The goal here would be to keep the current functionality
>> and performance of PFPGrowth but allow more generalized input/output data
>> and conditions. We will need to get frequencies of very rare patterns, thus
>> if it turns out that mining only the top K is a limiting factor, we would
>> need to allow relaxation of this approach. We will also need to take into account
>> negative features (those missing in an individual transaction). Negative
>> features can be either directly supported by the implementation of the FP
>> algorithm
>> or coded at the feature level (deciding which is the better
>> approach will be part of the GSoC program). It should be noted that for the
>> GSoC program we will narrow the scope of association mining antecedents and
>> succedents to conjunctions of data features only.
>>
>> 2) API for custom association mining functions based on the 4ft table:
>> Association mining in the GUHA sense means testing hypotheses on a four-fold
>> table (4ft) [see 2, item 5]. Many association functions have been proposed
>> for GUHA, some of them based on statistical tests (for example the Fisher
>> test and the chi-square test), some based on frequency analysis while not
>> explicitly referring to statistical tests, but in both cases all
>> frequencies from the four-fold table are needed. Some tests/functions do not
>> require all frequencies, however; keeping this association mining
>> implementation general means that we are targeting all frequencies from the
>> 4ft. The goal here would be to provide implementations of a few GUHA
>> functions
>> and design an API for custom functions based on the 4ft table (if a custom
>> function can be expressed using 4ft table frequencies, then it should be very
>> easy to implement it for an association mining job).
>>
>> General use case scenario:
>> Before the association mining task is executed, one would have to decide
>> which features can be on the left hand side of the rule (antecedents) and
>> which on the right hand side (succedents). It may be practical
>> to limit the number of features on both sides (rules with many features may
>> not be very useful). Then a specific test or function for the 4ft table will
>> be specified with additional custom thresholds.
>>
>> Note: The terminology used in GUHA theory is not unified. Various
>> researchers used different terminology. This may be a source of confusion.
>>
>> My background:
>> I have been working as a full-time Java developer for 9 years. Currently,
>> I
>> am working for JBoss (thus being paid for working on open source). A few
>> years ago I also started Ph.

Re: Mahout GSoC 2010 proposal: Association Mining

2010-04-05 Thread Robin Anil
Hi Lukas,
   Sorry for being late in getting back to you on this.
Association rule mining is a great addition to FPGrowth. I am not sure I
understand the GUHA method well, but then again I understood Ted's LLR after some
deep reading. Could you put up an interesting example to help us understand
this method, maybe starting from a transaction of shopping cart items? A
great demo is a big plus for a GSoC project.

Robin

On Mon, Mar 29, 2010 at 1:46 AM, Lukáš Vlček  wrote:

> Hello,
>
> I would like to apply for Mahout GSoC 2010. My proposal is to implement
> Association Mining algorithm utilizing existing PFPGrowth implementation (
> http://cwiki.apache.org/MAHOUT/parallelfrequentpatternmining.html).
>
> As for the Association Mining I would like to implement a very general
> algorithm based on an old and established method called GUHA [1]. A short and
> informative description of the GUHA method can be found here [2]. A very
> exhaustive theoretical foundation can be found here [3]. Since the GUHA
> method has been developing since the 60's and is very rich now, and given the
> limited time during the GSoC program, it would be wise to reduce the scope of
> the initial
> implementation.
>
> There are two main topics that I would like to focus on during GSoC:
>
> 1) Enhancing and generalizing PFPGrowth:
> Frequent pattern mining is usually part of each association mining task. In
> Mahout there is an existing implementation of the PFPGrowth algorithm. I would
> like
> to utilize this implementation, but it would be necessary to enhance or
> generalize it first. The goal here would be to keep the current functionality
> and performance of PFPGrowth but allow more generalized input/output data
> and conditions. We will need to get frequencies of very rare patterns, thus
> if it turns out that mining only the top K is a limiting factor, we would
> need to allow relaxation of this approach. We will also need to take into
> account
> negative features (those missing in an individual transaction). Negative
> features can be either directly supported by the implementation of the FP algorithm
> or coded at the feature level (deciding which is the better
> approach will be part of the GSoC program). It should be noted that for the
> GSoC program we will narrow the scope of association mining antecedents and
> succedents to conjunctions of data features only.
>
> 2) API for custom association mining functions based on the 4ft table:
> Association mining in the GUHA sense means testing hypotheses on a four-fold
> table (4ft) [see 2, item 5]. Many association functions have been proposed
> for GUHA, some of them based on statistical tests (for example the Fisher
> test and the chi-square test), some based on frequency analysis while not
> explicitly referring to statistical tests, but in both cases all
> frequencies from the four-fold table are needed. Some tests/functions do not
> require all frequencies, however; keeping this association mining
> implementation general means that we are targeting all frequencies from the
> 4ft. The goal here would be to provide implementations of a few GUHA functions
> and design an API for custom functions based on the 4ft table (if a custom
> function can be expressed using 4ft table frequencies, then it should be
> very
> easy to implement it for an association mining job).
>
> General use case scenario:
> Before the association mining task is executed, one would have to decide
> which features can be on the left hand side of the rule (antecedents) and
> which on the right hand side (succedents). It may be practical
> to limit the number of features on both sides (rules with many features may
> not be very useful). Then a specific test or function for the 4ft table will
> be specified with additional custom thresholds.
>
> Note: The terminology used in GUHA theory is not unified. Various
> researchers used different terminology. This may be a source of confusion.
>
> My background:
> I have been working as a full-time Java developer for 9 years. Currently, I
> am working for JBoss (thus being paid for working on open source). A few
> years ago I also started Ph.D. studies in Ostrava, Czech Republic and
> association mining is the topic I am focusing on right now. Implementing
> association mining in the context of Mahout makes sense because it is one of
> the
> areas of data mining which is not yet covered in Mahout at all. A MapReduce
> implementation is one possible way to tackle the challenge of
> mining
> association rules from large data.
>
> Regards,
> Lukas Vlcek
>
> [1]
> http://en.wikipedia.org/wiki/Association_rule_learning#GUHA_procedure_ASSOC
> [2] http://www.cs.cas.cz/coufal/guha/index.htm
> [3] http://www.uivt.cas.cz/~hajek/guhabook/index.html<
> http://www.uivt.cas.cz/%7Ehajek/guhabook/index.html>
>
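
For readers unfamiliar with the 4ft table, its frequencies are simple to compute. A hedged sketch (my own illustration, not code from any GUHA library): given boolean antecedent/succedent evaluations per transaction, tally a (both hold), b (only the antecedent), c (only the succedent), d (neither), then apply a quantifier such as founded implication, a/(a+b) >= p with a >= base:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Four-fold (4ft) contingency table for one candidate rule A => S. */
public class FourFoldTable {
  public int a, b, c, d; // a: A&S, b: A&!S, c: !A&S, d: !A&!S

  public static <T> FourFoldTable of(List<T> txns, Predicate<T> ante, Predicate<T> succ) {
    FourFoldTable t = new FourFoldTable();
    for (T txn : txns) {
      boolean ha = ante.test(txn), hs = succ.test(txn);
      if (ha && hs) t.a++; else if (ha) t.b++; else if (hs) t.c++; else t.d++;
    }
    return t;
  }

  /** GUHA "founded implication": confidence >= p and at least `base` supporting rows. */
  public boolean foundedImplication(double p, int base) {
    return a >= base && a >= p * (a + b);
  }

  public static void main(String[] args) {
    // transactions encoded as strings of present items, for brevity
    FourFoldTable t = of(Arrays.asList("ab", "ab", "a", "b", ""),
        (String s) -> s.contains("a"), s -> s.contains("b"));
    // a=2, b=1 -> confidence 2/3
    System.out.println("a=" + t.a + " b=" + t.b + " c=" + t.c + " d=" + t.d);
  }
}
```

Statistical quantifiers such as the Fisher or chi-square test plug in the same way: they are just functions of (a, b, c, d), which is what makes a common API over the 4ft table attractive.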


[GSOC] Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use *

2010-04-05 Thread Robin Anil
+changing subject line.

Hi Necati, like I mentioned on the JIRA ticket, you need to take a look at the
current data representation format (Vectors) and how structured data (ARFF
format) is converted to vectors. You will find a basic converter in the
utils folder under trunk.

With regard to NOSQL, the Bayes classifier already interfaces with HBase to
store and access the model on an HBase server. We want to extend that to a
generic Matrix adapter which can be consumed by any algorithm in Mahout.

Take a look at these open issues
https://issues.apache.org/jira/browse/MAHOUT-78
https://issues.apache.org/jira/browse/MAHOUT-202

You can follow what I did for Mahout Bayes code last year, here
https://issues.apache.org/jira/browse/MAHOUT-124


Taste already has some wrappers which read from a MySQL database.

What I would like in a proposal is this: at least for the first cut, implement
a data dump tool which can dump selected fields (from SQL/NOSQL) and write
them to a sequence file, or better, to the Avro document format (@Drew, you can
explain more here).

Similar to the ARFF-to-vector conversion, we need to convert this document file
to a SequenceFile of vectors with pluggable weighting strategies.

To understand all this for a proposal, you would have to read a bit of what
is there in the code and decide what you think can be re-used. Feel free to post in
case you have any doubts.


Robin


On Tue, Apr 6, 2010 at 2:58 AM, Necati Batur  wrote:

> *IDEA:Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data
> for all the algorithms to use *
>
> *Summary*
>
> *First of all, I am very excited to join an organization like
> GSoC and, most importantly, to work for a big open source project like Apache. I am
> looking for good collaboration and new challenges in software
> development. Especially information management issues sound great to me. I am
> confident working with all new technologies. I took the data structures I and
> II courses at university, so I am OK with data structures. Most importantly, I
> am interested in databases. From my software engineering course experience
> I
> know how to work on a project with iterative development and timelines.*
>
> *About Me*
>
> I am a senior student in computer engineering at
> Iztech in
> Turkey. My areas of interest are information management, OOP (Object
> Oriented Programming) and currently bioinformatics. I have been working
> with
> an Assistant Professor (Jens Allmer) in the molecular
> biology and genetics department for one year. Firstly, we worked on a protein
> database, 2DB, and we presented the project at the
> HIBIT09 organization. The project
> was "Database management system independence by amending 2DB with a
> database access layer". Currently, I am working on another project (Kerb)
> as
> my senior project, which is a general sequential task management system
> intended to reduce errors and increase time saving in biological
> experiments. We will present this project at
> HIBIT2010 too.
>
> *My Offer for the Project *
>
> *The data adapters for the higher level languages will require
> a good command of data structures and some knowledge of
> finite mathematics, issues I am confident about. Also, the code given
> in
> the svn repository seems to need some improvements and documentation.
>
> Briefly, I would do the following for this project:
>
>   1. Understand the underlying maths for the adapters
>   2. Determine the data structures that would be used for the adapters
>   3. Implement the necessary methods to handle creation of these
>   structures
>   4. Write the test cases we would probably need to check whether our code
>   covers all the issues required by data retrieval operations
>   5. Iterate on the code to make the algorithms robust
>   6. Document the overall project to connect our particular project to the
>   overall scope
>
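
Robin's "pluggable weighting strategies" suggestion can be modeled with a tiny strategy interface. A hedged sketch (the interface and class names are mine for illustration, not Mahout's actual vectorizer API):

```java
import java.util.*;

/** Pluggable term-weighting strategy for turning token counts into vector values. */
interface Weight {
  double weight(int tf, int df, int numDocs);
}

public class Vectorizer {
  // two example strategies: raw term frequency, and TF-IDF with a natural-log IDF
  static final Weight TF = (tf, df, n) -> tf;
  static final Weight TFIDF = (tf, df, n) -> tf * Math.log((double) n / df);

  /** Weight one document's term frequencies against corpus document frequencies. */
  public static Map<String, Double> vectorize(Map<String, Integer> tfs,
                                              Map<String, Integer> dfs,
                                              int numDocs, Weight w) {
    Map<String, Double> v = new HashMap<>();
    tfs.forEach((term, tf) -> v.put(term, w.weight(tf, dfs.getOrDefault(term, 1), numDocs)));
    return v;
  }

  public static void main(String[] args) {
    Map<String, Integer> tfs = Map.of("hadoop", 3, "the", 10);
    Map<String, Integer> dfs = Map.of("hadoop", 2, "the", 100);
    // "the" appears in every document, so its TF-IDF weight drops to zero
    System.out.println(vectorize(tfs, dfs, 100, TFIDF));
  }
}
```

In the real pipeline the weighted map would be written as a Mahout Vector into a Hadoop SequenceFile; swapping TF for TF-IDF (or anything else) is then just a matter of passing a different Weight.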


[jira] Commented: (MAHOUT-363) Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)

2010-04-05 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853409#action_12853409
 ] 

Robin Anil commented on MAHOUT-363:
---

Hi Shannon, did you take time to explore the Mahout code? I believe the k-means 
you are looking to implement is already there; it will shave 2 weeks off your 
GSoC :). Reading the code/wiki is a great exercise that will help you make your 
proposal more realistic.

> Proposal for GSoC 2010 (EigenCuts clustering algorithm for Mahout)
> --
>
> Key: MAHOUT-363
> URL: https://issues.apache.org/jira/browse/MAHOUT-363
> Project: Mahout
>  Issue Type: Task
>Reporter: Shannon Quinn
>
> Proposal Title: EigenCuts spectral clustering implementation on map/reduce 
> for Apache Mahout (addresses issue Mahout-328)
> Student Name: Shannon Quinn
> Student E-mail: mag...@gmail.com
> Organization/Project:Assigned Mentor:
> Proposal Abstract:
> Clustering algorithms are advantageous when the number of classes are not 
> known a priori. However, most techniques still require an explicit K to be 
> chosen, and most spectral algorithms' use of piecewise constant approximation 
> of eigenvectors breaks down when the clusters are tightly coupled. 
> EigenCuts[1] solves both these problems by choosing an eigenvector to create 
> a new cluster boundary and iterating until no more edges are cut.
> Detailed Description
> Clustering techniques are extremely useful unsupervised methods, particularly 
> within my field of computational biology, for situations where the number 
> (and often the characteristics as well) of classes expressed in the data are 
> not known a priori. K-means is a classic technique which, given some K, 
> attempts to label data points within a cluster as a function of their 
> distance (e.g. Euclidean) from the cluster's mean, iterating to convergence.
> Another approach is spectral clustering, which models the data as a weighted, 
> undirected graph in some n-dimensional space, and creates a matrix M of 
> transition probabilities between nodes. By computing the eigenvalues and 
> eigenvectors of this matrix, most spectral clustering techniques take 
> advantage of the fact that, for data with loosely coupled clusters, the K 
> leading eigenvectors will identify the roughly piecewise constant regions in 
> the data that correspond to clusters.
> However, these techniques all suffer from drawbacks, the two most significant 
> of which are having to choose an arbitrary K a priori, and in the situation 
> of tightly coupled clusters where the piecewise constant approximation on the 
> eigenvectors no longer holds.
> The EigenCuts algorithm addresses both these issues. As a type of spectral 
> clustering algorithm it works by constructing a Markov chain representation 
> of the data and computing the eigenvectors and eigenvalues of the transition 
> matrix. Eigenflows, or flows of probability along each eigenvector, have an 
> associated half-life of flow decay. By perturbing the weights between 
> nodes, it can be observed where bottlenecks exist in each eigenflow's 
> half-life, allowing for the identification of boundaries between clusters. 
> Thus, this algorithm iterates until no more cuts between clusters need to be 
> made, eliminating the need for an a priori K, and conferring the ability to 
> separate tightly coupled clusters.
> The only disadvantage of EigenCuts is the need to recompute eigenvectors and 
> eigenvalues at each iterative step, incurring a large computational overhead. 
> This problem can be adequately addressed within the map/reduce framework and 
> on a Hadoop cluster by parallelizing the computation of each eigenvector and 
> its associated eigenvalue. Apache Hama in particular, with its 
> specializations in graph and matrix data, will be crucial in parallelizing 
> the computations of transition matrices and their corresponding eigenvalues 
> and eigenvectors at each iteration.
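As a rough illustration of the objects discussed above (not part of the proposal itself): the sketch below builds a row-stochastic transition matrix from symmetric edge weights and runs power iteration. Note that for a stochastic matrix the leading right eigenvector is trivially constant; EigenCuts relies on the subsequent eigenpairs, which a real solver (e.g. Lanczos) would compute. All names here are illustrative.

```java
import java.util.Arrays;

public class PowerIterationDemo {
  // Normalize each row of a symmetric weight matrix to sum to 1,
  // giving the Markov transition matrix the proposal describes.
  static double[][] transitionMatrix(double[][] weights) {
    int n = weights.length;
    double[][] m = new double[n][n];
    for (int i = 0; i < n; i++) {
      double rowSum = 0;
      for (int j = 0; j < n; j++) rowSum += weights[i][j];
      for (int j = 0; j < n; j++) m[i][j] = weights[i][j] / rowSum;
    }
    return m;
  }

  // Plain power iteration: repeatedly apply M and renormalize.
  static double[] leadingEigenvector(double[][] m, int iters) {
    int n = m.length;
    double[] v = new double[n];
    Arrays.fill(v, 1.0 / n);
    for (int it = 0; it < iters; it++) {
      double[] next = new double[n];
      for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
          next[i] += m[i][j] * v[j];
      double norm = 0;
      for (double x : next) norm += x * x;
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) next[i] /= norm;
      v = next;
    }
    return v;
  }

  public static void main(String[] args) {
    // Two loosely coupled pairs of nodes.
    double[][] w = {
      {0, 1, 0.01, 0.01},
      {1, 0, 0.01, 0.01},
      {0.01, 0.01, 0, 1},
      {0.01, 0.01, 1, 0}
    };
    double[] v = leadingEigenvector(transitionMatrix(w), 100);
    // Leading eigenvector of a row-stochastic matrix is constant;
    // cluster structure lives in the later eigenvectors.
    System.out.println(Arrays.toString(v));
  }
}
```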
> Since Dr Chennubhotla is currently a member of the faculty at the University 
> of Pittsburgh, I have been in contact with him for the past few weeks, and we 
> both envision and eagerly anticipate continued collaboration over the course 
> of the summer and this project's implementation. He has thus far been 
> instrumental in highlighting the finer points of the underlying theory, and 
> coupled with my experience in and knowledge of software engineering, this is 
> a project we are both extremely excited about implementing.
> Timeline
> At the end of each sprint, there should be a concrete, functional 
> deliverable. I

Re: svn commit: r930796 - in /lucene/mahout/trunk/math: ./ src/main/java/org/apache/mahout/math/ src/main/java/org/apache/mahout/math/decomposer/hebbian/ src/main/java/org/apache/mahout/math/decompo

2010-04-04 Thread Robin Anil
SVD shouldn't really be in Math. I agree it's "Math", but in principle it's a
core Mahout algorithm like clustering or recommendations. I know it's a very
debatable thought, but for me collections and Math are just tools to aid
complex algorithms in Mahout core. Maybe we can move it under core and
add the required logging.

Robin


On Mon, Apr 5, 2010 at 11:03 AM, Jake Mannix  wrote:

> Umm, I actually depend pretty heavily on the logging in the SVD solvers.
>  They are very long-running processes, and give off a ton of useful
> information about what the heck is going on.
>
> Reducing dependencies is great, but logging?  I think the math stuff could
> really use logging.  I haven't been able to follow all the JIRA tickets
> lately, things have been crazy, sorry.
>
>  -jake
>
> On Sun, Apr 4, 2010 at 10:21 PM,  wrote:
>
> > Author: srowen
> > Date: Mon Apr  5 05:21:27 2010
> > New Revision: 930796
> >
> > URL: http://svn.apache.org/viewvc?rev=930796&view=rev
> > Log:
> > MAHOUT-361 Hearing no objection and believing Math shouldn't have log
> > statements and seeing that they're not really used much, I commit
> >
> > Modified:
> >lucene/mahout/trunk/math/pom.xml
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonMatrixAdapter.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonVectorAdapter.java
> >
>  lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/Timer.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/lanczos/LanczosSolver.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/jet/random/sampling/RandomSampler.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/jet/stat/quantile/QuantileCalc.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/jet/stat/quantile/QuantileFinderFactory.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/matrix/DoubleFactory2D.java
> >
> >
>  
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/matrix/doublealgo/Formatter.java
> >
> >
>  
> lucene/mahout/trunk/math/src/test/java/org/apache/mahout/math/decomposer/SolverTest.java
> >
> > Modified: lucene/mahout/trunk/math/pom.xml
> > URL:
> >
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/pom.xml?rev=930796&r1=930795&r2=930796&view=diff
> >
> >
> ==
> > --- lucene/mahout/trunk/math/pom.xml (original)
> > +++ lucene/mahout/trunk/math/pom.xml Mon Apr  5 05:21:27 2010
> > @@ -100,19 +100,6 @@
> > 
> >
> > 
> > -  org.slf4j
> > -  slf4j-api
> > -  1.5.8
> > -
> > -
> > -
> > -  org.slf4j
> > -  slf4j-jcl
> > -  1.5.8
> > -  test
> > -
> > -
> > -
> >   junit
> >   junit
> >   test
> >
> > Modified:
> >
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonMatrixAdapter.java
> > URL:
> >
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonMatrixAdapter.java?rev=930796&r1=930795&r2=930796&view=diff
> >
> >
> ==
> > ---
> >
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonMatrixAdapter.java
> > (original)
> > +++
> >
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonMatrixAdapter.java
> > Mon Apr  5 05:21:27 2010
> > @@ -27,15 +27,12 @@ import com.google.gson.JsonPrimitive;
> >  import com.google.gson.JsonSerializationContext;
> >  import com.google.gson.JsonSerializer;
> >  import com.google.gson.reflect.TypeToken;
> > -import org.slf4j.Logger;
> > -import org.slf4j.LoggerFactory;
> >
> >  import java.lang.reflect.Type;
> >
> >  public class JsonMatrixAdapter implements JsonSerializer,
> > JsonDeserializer {
> >
> > -  private static final Logger log =
> > LoggerFactory.getLogger(JsonMatrixAdapter.class);
> >   public static final String CLASS = "class";
> >   public static final String MATRIX = "matrix";
> >
> > @@ -73,7 +70,7 @@ public class JsonMatrixAdapter implement
> > try {
> >   cl = ccl.loadClass(klass);
> > } catch (ClassNotFoundException e) {
> > -  log.warn("Error while loading class", e);
> > +  throw new JsonParseException(e);
> > }
> > return (Matrix) gson.fromJson(matrix, cl);
> >   }
> >
> > Modified:
> >
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonVectorAdapter.java
> > URL:
> >
> http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/JsonVectorAdapter.java?rev=930796&r1=930795&r2=930796&view=diff
> >
> >
> ==
> > ---
> >
> lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math

[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853182#action_12853182
 ] 

Robin Anil commented on MAHOUT-332:
---

Conversion of any arbitrary data in a database to vectors would be along the 
same lines as how the ARFF format is converted to vectors. You can find the 
code under trunk/utils. It treats boolean, enum, numeric and string 
datatypes separately. That code may still need some tweaking so that the 
entire ARFF spec is supported, but it's a good starting point for 
understanding how data is converted to vectors. Also look at 
SparseVectorsFromSequenceFiles to understand how text documents in a 
SequenceFile (you need to understand this as well) are converted to vectors 
using tf-idf based weighting. In short, there could be many weighting 
strategies. It would be really nice if you could make this pluggable so that 
users of the library could plug in custom weighting techniques for each field. 
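A minimal sketch of what such a pluggable per-field weighting could look like. All interface, class, and method names here are illustrative, not actual Mahout API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical strategy interface: one weighting rule per field.
interface FieldWeight {
  double weight(String field, double rawValue);
}

// Dampen large counts with a log transform.
class LogWeight implements FieldWeight {
  public double weight(String field, double rawValue) {
    return Math.log1p(rawValue);
  }
}

// Presence/absence only, e.g. for boolean fields.
class BinaryWeight implements FieldWeight {
  public double weight(String field, double rawValue) {
    return rawValue > 0 ? 1.0 : 0.0;
  }
}

public class WeightingDemo {
  // Apply a per-field strategy (or a fallback) while building
  // a (field -> weight) map from raw field values.
  static Map<String, Double> vectorize(Map<String, Double> raw,
                                       Map<String, FieldWeight> strategies,
                                       FieldWeight fallback) {
    Map<String, Double> out = new HashMap<>();
    for (Map.Entry<String, Double> e : raw.entrySet()) {
      FieldWeight w = strategies.getOrDefault(e.getKey(), fallback);
      out.put(e.getKey(), w.weight(e.getKey(), e.getValue()));
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Double> raw = new HashMap<>();
    raw.put("title_terms", 3.0);
    raw.put("is_spam", 1.0);
    Map<String, FieldWeight> strategies = new HashMap<>();
    strategies.put("is_spam", new BinaryWeight());
    System.out.println(vectorize(raw, strategies, new LogWeight()));
  }
}
```

Users of the library would then register a strategy per field instead of being locked into one global weighting such as tf-idf.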

> Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
> the algorithms to use
> ---
>
> Key: MAHOUT-332
> URL: https://issues.apache.org/jira/browse/MAHOUT-332
> Project: Mahout
>  Issue Type: New Feature
>    Reporter: Robin Anil
>
> A student with a good proposal 
> - should be free to work for Mahout in the summer and should be thrilled to 
> work in this area :)
> - should be able to program in Java and be comfortable with data structures 
> and algorithms
> - must explore SQL and NOSQL implementations, and design a framework with 
> which data from them could be fetched and converted to mahout format or used 
> directly as a matrix transparently
> - should have a plan to make it high performance with ample caching 
> strategies or the ability to use it on a map/reduce job
> - should focus more on getting a working version than on implementing all 
> functionality, so it's recommended that you divide features into milestones
> - must have clear deadlines and pace it evenly across the span of 3 months.
> If you can do something extra it counts, but make sure the plan is reasonable 
> within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12853174#action_12853174
 ] 

Robin Anil commented on MAHOUT-332:
---

Hi Necati. Take a look at the matrix and vector classes in Mahout, and read up 
on how Mahout converts text into vectors. We need a generic framework where 
data from databases can be iterated over as vectors, so that algorithms can 
use it seamlessly. The current VectorWritable could be extended to, say, a 
database-backed vector, which would read each field and convert it to a vector 
entry on the fly using a pre-populated dictionary. This could be easily 
consumed by the Mahout algorithms. The database-backed vector should be 
configurable enough that fields can be selected. I am sure there are 
frameworks which already do this. Drew Farris is working on a document 
structure for Mahout using Avro; I am sure he will have more input on how 
these adapters should fit with his structure. 
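As a rough, hypothetical illustration of the on-the-fly conversion described above (the database row is simulated with a plain Map, and none of these names exist in Mahout):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class RowVectorDemo {
  // Convert one "row" into a sparse (index -> value) vector using a
  // pre-populated dictionary mapping field names to vector indices.
  // Fields absent from the dictionary are simply skipped, mirroring
  // the idea that the vector is configurable over selected fields.
  static SortedMap<Integer, Double> toVector(Map<String, Double> row,
                                             Map<String, Integer> dictionary) {
    SortedMap<Integer, Double> vec = new TreeMap<>();
    for (Map.Entry<String, Double> e : row.entrySet()) {
      Integer idx = dictionary.get(e.getKey());
      if (idx != null) {
        vec.put(idx, e.getValue());
      }
    }
    return vec;
  }

  public static void main(String[] args) {
    Map<String, Integer> dict = new HashMap<>();
    dict.put("age", 0);
    dict.put("income", 1);
    Map<String, Double> row = new HashMap<>();
    row.put("age", 34.0);
    row.put("income", 52000.0);
    row.put("name_hash", 7.0); // not in dictionary, so dropped
    System.out.println(toVector(row, dict));
  }
}
```

A real database-backed vector would do this lazily per field while iterating a JDBC/NoSQL cursor, rather than materializing a Map first.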

> Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
> the algorithms to use
> ---
>
> Key: MAHOUT-332
> URL: https://issues.apache.org/jira/browse/MAHOUT-332
> Project: Mahout
>  Issue Type: New Feature
>    Reporter: Robin Anil
>
> A student with a good proposal 
> - should be free to work for Mahout in the summer and should be thrilled to 
> work in this area :)
> - should be able to program in Java and be comfortable with data structures 
> and algorithms
> - must explore SQL and NOSQL implementations, and design a framework with 
> which data from them could be fetched and converted to mahout format or used 
> directly as a matrix transparently
> - should have a plan to make it high performance with ample caching 
> strategies or the ability to use it on a map/reduce job
> - should focus more on getting a working version than on implementing all 
> functionality, so it's recommended that you divide features into milestones
> - must have clear deadlines and pace it evenly across the span of 3 months.
> If you can do something extra it counts, but make sure the plan is reasonable 
> within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A request for prospective GSOC students

2010-04-03 Thread Robin Anil
Thanks! I just noticed your proposal. My advice to everyone would be to be
clear on what you want to do, rather than on related content and theory about
the algorithm. So really expand the design, implementation and timeline
sections.

Robin

On Sat, Apr 3, 2010 at 9:18 PM, yinghua hu  wrote:

> Dear Robin and other contributors,
>
> Nice to meet you.
>
> I am a PhD student in University of Central Florida. I submitted a
> proposal to Google Summer of Code 2010 with title "Implement
> Map/Reduce Enabled Neural Networks (mahout-342)".
>
> Any suggestions and advice are very welcome. I am still allowed to do
> correction on it before April 9th.
>
> Thank you!
>
> --
> Regards,
>
> Yinghua
>
>
> On Sat, Apr 3, 2010 at 11:37 AM, Robin Anil  wrote:
> > I am having a tough time separating Mahout proposals from rest of Apache
> on
> > gsoc website. So I would request you all to reply to this thread when you
> > have submitted a proposal so that we don't miss out on reading your hard
> > worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
> > proposal. If anyone else have applied do reply back with the title of the
> > proposal.
> >
> > Robin
> >
>


A request for prospective GSOC students

2010-04-03 Thread Robin Anil
I am having a tough time separating Mahout proposals from rest of Apache on
gsoc website. So I would request you all to reply to this thread when you
have submitted a proposal so that we don't miss out on reading your hard
worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
proposal. If anyone else have applied do reply back with the title of the
proposal.

Robin


Re: My ideas for GSoC 2010

2010-03-31 Thread Robin Anil
Why don't you try it on 20 Newsgroups? There are about 17-18 unique topics
and a couple of overlapping ones. You can easily find issues with the
clustering code with that dataset. Once that's done you can try bigger
datasets like Wikipedia.

Robin

On Thu, Apr 1, 2010 at 12:02 PM, Cristian Prodan
wrote:

> Hi,
>
> Can anyone please point me a good data set on which I might try SimHash
> clustering ?
> Thank you,
>
> Cristi
>
> On Tue, Mar 23, 2010 at 10:35 AM, cristi prodan
> wrote:
>
> > Hello again,
> >
> > First of all, thank you all for taking time to answer my ideas. Based on
> > your thoughts, I have been digging around, and the project I would very
> > much like to propose is a MapReduce implementation of the simhash algorithm.
> > Thank you Ted for the great idea! I'm keen on starting to implement a new
> > algorithm for Mahout, which seems manageable and doable during the summer
> > period and has a clear scope. I have noticed quite a few comments
> > advising against taking a project that would be too big, and I intend to
> > follow them.
> >
> > After taking a deep look at the article pointed out by Ted, as well as
> > other articles and similar ideas (Charikar's algorithm, Ankur's patch for
> > MinHash), I've decided to write some notes on the algorithm and a schedule
> > for implementing the idea during the competition. I kindly ask for your
> > feedback on this proposal.
> >
> >
> > 1. Problem and outline of simhash
> > ---
> >
> > Detecting similar files and classifying documents requires complex
> > heuristics and/or O(n^2) pair-wise computations [1]. The simhash algorithm
> > [1] relies on computing a hash function that hashes similar files to similar
> > values. The file similarities would then be determined by comparing the
> > pre-sorted hash key values (reducing the complexity to O(n log n)).
> > To further improve the detection mechanism, the algorithm will also
> > store some auxiliary data used to compute the hash keys. This will be used
> > as a second filter, i.e. after the hash key comparison indicates that two
> > files are potentially similar.
> >
> > Properties for the similarity function and the algorithm:
> > - very similar files map to very similar or even the same hash key;
> > - distance between keys should be some measure of the difference between
> > files. This would lead to keys proportional to file sizes and this would
> > create false positives. Some auxiliary data will provide an easy and
> > efficient way of refining the similarity detection.
> > - the metric used will be a binary metric (simhash operates at byte
> > level).
> > - given a similarity metric, there needs to be a threshold to determine
> how
> > close within the metric files need to be to count as similar.
> >
> > I would like my implementation of the simhash algorithm to answer two
> > questions:
> > - retrieving a set of files, similar to a given file;
> > - retrieving all pairs of similar files.
> >
> >
> > 2. simhash implementation
> > --
> > Choose 16 8-bit tags - the bytes used for pattern matching. The idea is
> > to count the occurrences of these tags within the file being processed.
> >
> >
> > 2.1 Computing the hash function for each file
> > -
> > for each directory
> >   for each file within directory
> >     - scan through the file and look for matches for each tag;
> >     - detection window is moved one bit at a time
> >     - if a match is found
> >       - set a skip_counter and then decrement it for the following bits;
> >     - a count of matches is stored for each tag, and these are stored as
> >       SUM_TABLE_ENTRIES
> >     - the KEY for the current file is computed as a function of the
> >       SUM_TABLE_ENTRIES (a linear combination of the sum) [#1]
> >     - after this processing, we store:
> >       | file_name | file_path | file_size | key | SUM_TABLE_ENTRIES |
> >
> > The key function could also be implemented to take into account file
> > extensions.
> >
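A minimal Java sketch of the 2.1 key computation, under simplifying assumptions (byte-level matching instead of a bit-shifted detection window; the tag set and the linear combination are placeholders, not the constants from [1]):

```java
import java.util.Arrays;

public class SimHashKey {
  // Illustrative tag bytes; a real implementation would use 16 tags
  // chosen for their discriminative power.
  static final byte[] TAGS = {0x20, 0x65, 0x74, 0x61}; // space, 'e', 't', 'a'

  // Count occurrences of each tag in the data: the sum-table entries.
  static int[] sumTable(byte[] data) {
    int[] counts = new int[TAGS.length];
    for (byte b : data) {
      for (int i = 0; i < TAGS.length; i++) {
        if (b == TAGS[i]) counts[i]++;
      }
    }
    return counts;
  }

  // Combine the sum-table entries into a single key with a simple
  // (placeholder) linear combination.
  static long key(int[] sumTable) {
    long k = 0;
    for (int i = 0; i < sumTable.length; i++) {
      k += (long) (i + 1) * sumTable[i];
    }
    return k;
  }

  public static void main(String[] args) {
    byte[] data = "the cat ate the hat".getBytes();
    int[] table = sumTable(data);
    System.out.println(Arrays.toString(table) + " key=" + key(table));
  }
}
```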
> >
> > 2.2 Finding similarities
> > 
> > for each FILE
> >   SIMILAR_FILES = files with keys within a tolerance range of KEY(FILE);
> >   for each SIMILAR_FILE
> >     if file_size(SIMILAR_FILE) differs a lot from file_size(FILE)
> >       discard the query
> >     else
> >       compute the distance between SIMILAR_FILE's and FILE's
> >       SUM_TABLE_ENTRIES
> >       (i.e. the sum of the absolute values of the diffs between
> >       their entries)
> >       if the computed distance is within a tolerance or is equal to
> >       0 then
> >         "The files are identical"
> >
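The second-filter check of step 2.2 can be sketched as follows (the tolerance value is a placeholder, and the names are illustrative):

```java
public class SimHashDistance {
  // Distance between two files' sum tables: the sum of absolute
  // differences of their entries, as described in step 2.2.
  static int distance(int[] a, int[] b) {
    int d = 0;
    for (int i = 0; i < a.length; i++) {
      d += Math.abs(a[i] - b[i]);
    }
    return d;
  }

  // Two files pass the second filter when their sum-table distance
  // is within the chosen tolerance (0 meaning identical tables).
  static boolean similar(int[] a, int[] b, int tolerance) {
    return distance(a, b) <= tolerance;
  }

  public static void main(String[] args) {
    int[] fileA = {4, 3, 5, 3};
    int[] fileB = {4, 2, 5, 4};
    System.out.println("distance=" + distance(fileA, fileB)
        + " similar=" + similar(fileA, fileB, 2));
  }
}
```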
> >
> >
> > 3. Distributing simhash on hadoop
> > -
> > The algorithm above is very suited for a MapReduce implementation.
> >
> > 1. Map
> > In this phase we do the computation of the hash for a file.
> > It outputs (File

Re: GSOC 2010

2010-03-31 Thread Robin Anil
Hi Tanya,
 MAHOUT-328 is just a general stub; there is no detailed project
description other than what is given there. The idea is that you propose a
clustering algorithm to implement in Mahout. Start here:
http://cwiki.apache.org/MAHOUT/gsoc.html. Browse through the wiki. Look at
what Mahout has at the moment: http://cwiki.apache.org/MAHOUT/algorithms.html.
There are a couple of algorithms missing from Mahout, like MinHash,
hierarchical clustering, or even a generic EM framework. I would suggest you
read carefully through the discussions on the mailing list using the
archives, then zero in on the algorithm you would want to implement and
propose to implement it.

Robin


On Wed, Mar 31, 2010 at 10:27 PM, Tanya Gupta  wrote:

> Hi
>
> I would like a detailed project description for MAHOUT-328.
>
> Thanking You
> Tanya Gupta
>


[jira] Commented: (MAHOUT-328) Implement a cool clustering algorithm on map/reduce

2010-03-30 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851742#action_12851742
 ] 

Robin Anil commented on MAHOUT-328:
---

Subscribe to the mahout dev mailing list 
http://lucene.apache.org/mahout/mailinglists.html and post your query there. 
All the would-be mentors are on that list. 

> Implement a cool clustering algorithm on map/reduce
> ---
>
> Key: MAHOUT-328
> URL: https://issues.apache.org/jira/browse/MAHOUT-328
> Project: Mahout
>  Issue Type: New Feature
>  Components: Clustering
>    Reporter: Robin Anil
>
> A student with a good proposal 
> - should be free to work for Mahout in the summer and should be thrilled to 
> work in this area :)
> - should be able to program in Java and be comfortable with data structures 
> and algorithms
> - must be clear about the clustering algorithm, how it works, its strengths, 
> its weaknesses and possible tweaks.
> - must have a plan on making it a map/reduce implementation
> - should have a demo over standard datasets by the end of summer of code
> - must have clear deadlines and pace it evenly across the span of 3 months.
> - may have a background in this area. Past work, a thesis, etc. all count, so 
> show it in the proposal clearly
> If you can do something extra it counts, but make sure the plan is reasonable 
> within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Created: (MAHOUT-345) [GSOC] integrate Mahout with Drupal/PHP

2010-03-22 Thread Robin Anil
This is a very interesting proposal =) I am sure no one would have seen this
coming even a month ago. We will have to see how feasible it is.

Robin


On Mon, Mar 22, 2010 at 7:51 PM, Daniel Xiaodan Zhou (JIRA)  wrote:

> [GSOC] integrate Mahout with Drupal/PHP
> ---
>
> Key: MAHOUT-345
> URL: https://issues.apache.org/jira/browse/MAHOUT-345
> Project: Mahout
>  Issue Type: Task
>  Components: Website
>Reporter: Daniel Xiaodan Zhou
>
>
> Drupal is a very popular open source web content management system. It's
> been widely used in e-commerce sites, media sites, etc. This is a list of
> famous site using Drupal:
> http://socialcmsbuzz.com/45-drupal-sites-which-you-may-not-have-known-were-drupal-based-24092008/
>
> Integrate Mahout with Drupal would greatly increase the impact of Mahout in
> web systems: any Drupal website can easily use Mahout to make content
> recommendations or cluster contents.
>
> I'm a PhD student at University of Michigan, with a research focus on
> recommender systems. Last year I participated GSOC 2009 with Drupal.org, and
> developed a recommender system for Drupal. But that module was not as
> sophisticated as Mahout. And I think it would be nice just to integrate
> Mahout into Drupal rather than developing a separate Mahout-like module for
> Drupal.
>
> Any comments? I can provide more information if people here are interested.
> Thanks.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


Re: Reg. Netflix Prize Apache Mahout GSoC Application

2010-03-22 Thread Robin Anil
Hi Sisir,
  I am currently on vacation, so I won't be able to review your
proposal fully. But from the looks of it, what I would suggest is to
target a somewhat more modest, practical proposal. Trust me, converting these
algorithms to map/reduce is not as easy as it sounds, and most of your time
would be spent debugging your code. Your work history is quite
impressive, but what's more important here is getting your proposal right.
Sean has written most of the recommender code of Mahout and would be best
placed to give you feedback, as he has tried quite a number of approaches to
recommenders on map/reduce and knows very well some of the constraints of
the framework. Feel free to explore the current Mahout recommender code and
ask on the list if you find anything confusing. But remember you are trying
to reproduce some of the cutting-edge work of two years of recommendation
research in a span of 10 weeks :) so stop and ponder the feasibility. If you
still are good to go, then you probably need to demonstrate something in
terms of code during the proposal period (which is optional).

Don't take this the wrong way; it's not meant to demotivate you. If we can
get this into Mahout, I am sure no one here would object. So your
good next step would be: read, explore, think, discuss.

Regards
Robin


On Mon, Mar 22, 2010 at 4:36 PM, Sisir Koppaka wrote:

> Dear Robin & the Apache Mahout team,
> I'm Sisir Koppaka, a third-year student from IIT Kharagpur, India. I've
> contributed to open source projects like FFmpeg earlier(Repository diff
> links are here<
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=16a043535b91595bf34d7e044ef398067e7443e0
> >and
> here<
> http://git.ffmpeg.org/?p=ffmpeg;a=commitdiff;h=9dde37a150ce2e5c53e2295d09efe289cebea9cd
> >
> ), and I am very interested to work on a project for Apache Mahout this
> year(the Netflix algorithms project, to be precise - mentored by Robin).
> Kindly let me explain my background so that I can make myself relevant in
> this context.
>
> I've done research work in meta-heuristics, including proposing the
> equivalents of local search and mutation for quantum-inspired algorithms,
> in
> my paper titled "*Superior Exploration-Exploitation Balance With
> Quantum-Inspired Hadamard Walks*", that was accepted as a late-breaking
> paper at GECCO 2010. We(myself and a friend - it was an independent work),
> hope to send an expanded version of the communication to a journal in the
> near future. For this project, our language of implementation was in
> Mathematica, as we needed the combination of functional paradigms and
> available mathematically sound resources(like biased random number
> generation, simple linear programming functions etc.) as well as rapid
> prototyping ability.
>
> I have earlier interned in GE Research in their Computing and Decision
> Sciences Lab<
> http://ge.geglobalresearch.com/technologies/computing-decision-sciences/
> >last
> year, where I worked on machine learning techniques for large-scale
> databases - specifically on the Netflix Prize itself. Over a 2 month
> internship we rose from 1800 to 409th position on the Leaderboard, and had
> implemented at least one variant of each of the major algorithms. The
> contest ended at the same time as the conclusion of our internship, and the
> winning result was the combination of multiple variants of our implemented
> algorithms.
>
> Interestingly, we did try to use Hadoop and the Map-Reduce model for the
> purpose based on a talk from a person from Yahoo! who visited us during
> that
> time. However, not having access to a cluster proved to be an impedance for
> fast iterative development. We had one machine of 16 cores, so we developed
> a toolkit in C++ that could multiprocess up to 16 threads (data input
> parallelization, rather than modifying the algorithms to suit the
> Map-Reduce
> model), and implemented all our algorithms using the same toolkit.
> Specifically, SVD, kNN Movie-Movie, kNN User-User, NSVD(Bellkor and other
> variants like the Paterek SVD, and the temporal SVD++ too) were the major
> algorithms that we implemented. Some algorithms had readily available open
> source code for the Netflix Prize, like NSVD1, so we used them as well. We
> also worked on certain regression schemes that could improve prediction
> accuracy like kernel-ridge regression, and it's optimization.
>
> Towards the end, we also attempted to verify the results of the infamous
> paper that showed that IMDB-Netflix correlation could destroy privacy, and
> identify users. We would import IMDB datasets, and put them into a database
> and then correlate the IMDB entries to Netflix(we matched double the number
> of movies that the paper mentioned), and then verify the results. We also
> identified genre-wise trends and recorded them as such. Unfortunately, the
> paper resulted in a privacy lawsuit, wherein Netflix surrendered its rights to
> hold future Prizes of this kind in return for withdrawal of charges. The
>

GSOC mentors

2010-03-20 Thread Robin Anil
Grant, Ted and others interested

Please go here to add yourself to the mentors list
http://socghop.appspot.com/gsoc/org/apply_mentor/google/gsoc2010

Robin


Re: [VOTE] Mahout as TLP

2010-03-19 Thread Robin Anil
+1 Its about time :)


[ANN] Apache Mahout 0.3 Released

2010-03-18 Thread Robin Anil
Apache Mahout 0.3 has been released and is now available for public
download at http://www.apache.org/dyn/closer.cgi/lucene/mahout

Up-to-date maven artifacts can be found in the Apache repository at
https://repository.apache.org/content/repositories/releases/org/apache/mahout/

Apache Mahout is a subproject of Apache Lucene with the goal of
delivering scalable machine learning algorithm implementations under
the Apache license. http://www.apache.org/licenses/LICENSE-2.0

Mahout is a machine learning library meant to scale: Scale in terms of
community to support anyone interested in using machine learning.
Scale in terms of business by providing the library under a
commercially friendly, free software license. Scale in terms of
computation to the size of data we manage today.

Built on top of the powerful map/reduce paradigm of the Apache Hadoop
project, Mahout lets you solve popular machine learning problem
settings like clustering, collaborative filtering and classification
over Terabytes of data over thousands of computers.

Implemented with scalability in mind the latest release brings many
performance optimizations so that even in a single node setup the
library performs well.

The complete changelist can be found here:

http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281

New Mahout 0.3 features include:

  * New math and collections modules based on the high performance Colt
library.
  * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  * Parallel Dirichlet process clustering (a model-based clustering
algorithm)
  * Parallel co-occurrence based recommender
  * Parallel text document to vector conversion using LLR based ngram
generation
  * Parallel Lanczos SVD (Singular Value Decomposition) solver
  * Shell scripts for easier running of algorithms, utilities and examples
  * ...and much much more: code cleanup, many bug fixes and
performance improvements

Getting started: New to Mahout?

   * Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
   * Check out the Quick start: http://cwiki.apache.org/MAHOUT
   * Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
   * Join the community by subscribing to mahout-u...@lucene.apache.org
   * Give back: http://www.apache.org/foundation/getinvolved.html
   * Consider adding yourself to the powered-by Wiki
page: http://cwiki.apache.org/MAHOUT/poweredby.html

For more information on Apache Mahout, see http://lucene.apache.org/mahout


[ANNOUNCE] Apache Mahout 0.3 Released

2010-03-18 Thread Robin Anil
> Apache Mahout 0.3 has been released and is now available for public
> download at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>
> Up-to-date maven artifacts can be found in the Apache repository at
> https://repository.apache.org/content/repositories/releases/org/apache/mahout/
>
> Apache Mahout is a subproject of Apache Lucene with the goal of
> delivering scalable machine learning algorithm implementations under
> the Apache license. http://www.apache.org/licenses/LICENSE-2.0
>
> Mahout is a machine learning library meant to scale: Scale in terms of
> community to support anyone interested in using machine learning.
> Scale in terms of business by providing the library under a
> commercially friendly, free software license. Scale in terms of
> computation to the size of data we manage today.
>
> Built on top of the powerful map/reduce paradigm of the Apache Hadoop
> project, Mahout lets you solve popular machine learning problem
> settings like clustering, collaborative filtering and classification
> over terabytes of data over thousands of computers.
>
> Implemented with scalability in mind, the latest release brings many
> performance optimizations so that even in a single-node setup the
> library performs well.
>
> The complete changelist can be found here:
> http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281
>
> New Mahout 0.3 features include:
>
>   * New math and collections modules based on the high performance Colt
>     library.
>   * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
>   * Parallel Dirichlet process clustering (a model-based clustering
>     algorithm)
>   * Parallel co-occurrence based recommender
>   * Parallel text document to vector conversion using LLR based ngram
>     generation
>   * Parallel Lanczos SVD (Singular Value Decomposition) solver
>   * Shell scripts for easier running of algorithms, utilities and examples
>   * ...and much much more: code cleanup, many bug fixes and
>     performance improvements
>
> Getting started: New to Mahout?
>
>    * Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>    * Check out the Quick start: http://cwiki.apache.org/MAHOUT
>    * Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
>    * Join the community by subscribing to mahout-u...@lucene.apache.org
>    * Give back: http://www.apache.org/foundation/getinvolved.html
>    * Consider adding yourself to the Powered By wiki page:
>      http://cwiki.apache.org/MAHOUT/poweredby.html
>
> For more information on Apache Mahout, see http://lucene.apache.org/mahout


Re: Can someone please mark 0.3 release in JIRA?

2010-03-18 Thread Robin Anil
I am sending the announcement in a short while

Robin

On Thu, Mar 18, 2010 at 9:23 AM, Robin Anil  wrote:

> Just woke up. I think no one did the announcement. @Grant. ready to make
> one?
>
> Robin
>
>
>
> On Thu, Mar 18, 2010 at 6:44 AM, Grant Ingersoll wrote:
>
>> Has the release been sent out?  I don't see any announcements in the usual
>> places.
>>
>>
>> On Mar 17, 2010, at 2:19 PM, Robin Anil wrote:
>>
>> > Exported both lucene and mahout sites. Waiting for it to get updated on
>> the
>> > main server
>> >
>> > Robin
>> >
>> >
>> > On Wed, Mar 17, 2010 at 11:45 PM, Robin Anil 
>> wrote:
>> >
>> >> Im back. Looks good. I am going to run Grants script to publish the
>> >> website. We can announce shortly after that.
>> >>
>> >> Robin
>> >>
>> >>
>> >> On Wed, Mar 17, 2010 at 11:01 PM, Drew Farris wrote:
>> >>
>> >>> (How about this: cribbed from the 0.2 release announcement)
>> >>>
>> >>> Apache Mahout 0.3 has been released and is now available for public
>> >>> download at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>> >>>
>> >>> Up-to-date maven artifacts can be found in the Apache repository at
>> >>>
>> >>>
>> https://repository.apache.org/content/repositories/releases/org/apache/mahout/
>> >>>
>> >>> Apache Mahout is a subproject of Apache Lucene with the goal of
>> >>> delivering scalable machine learning algorithm implementations under
>> >>> the Apache license. http://www.apache.org/licenses/LICENSE-2.0
>> >>>
>> >>> Mahout is a machine learning library meant to scale: Scale in terms of
>> >>> community to support anyone interested in using machine learning.
>> >>> Scale in terms of business by providing the library under a
>> >>> commercially friendly, free software license. Scale in terms of
>> >>> computation to the size of data we manage today.
>> >>>
>> >>> Built on top of the powerful map/reduce paradigm of the Apache Hadoop
>> >>> project, Mahout lets you solve popular machine learning problem
>> >>> settings like clustering, collaborative filtering and classification
>> >>> over Terabytes of data over thousands of computers.
>> >>>
>> >>> Implemented with scalability in mind the latest release brings many
>> >>> performance optimizations so that even in a single node setup the
>> >>> library performs well.
>> >>>
>> >>> The complete changelist can be found here:
>> >>>
>> >>> http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281
>> >>>
>> >>> New Mahout 0.3 features include:
>> >>>
>> >>>  * New math and collections modules based on the high performance
>> >>> Colt library.
>> >>>  * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
>> >>>  * Parallel Dirichlet process clustering (a model-based clustering
>> >>> algorithm)
>> >>>  * Parallel co-occurrence based recommender
>> >>>  * Parallel text document to vector conversion using LLR based ngram
>> >>> generation
>> >>>  * Parallel Lanczos SVD (Singular Value Decomposition) solver
>> >>>  * Shell scripts for easier running of algorithms, utilities and
>> examples
>> >>>  * ...and much much more: code cleanup, many bug fixes and
>> >>> performance improvements
>> >>>
>> >>> Getting started: New to Mahout?
>> >>>
>> >>>   * Download Mahout at
>> >>> http://www.apache.org/dyn/closer.cgi/lucene/mahout
>> >>>   * Check out the Quick start: http://cwiki.apache.org/MAHOUT
>> >>>   * Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
>> >>>   * Join the community by subscribing to
>> mahout-u...@lucene.apache.org
>> >>>   * Give back: http://www.apache.org/foundation/getinvolved.html
>> >>>   * Consider adding yourself to the power by Wiki
>> >>> page:http://cwiki.apache.org/MAHOUT/poweredby.html
>> >>>
>> >>> For more information on Apache Mahout, see
>> >>> http://lucene.apache.org/mahout
>> >>>
>> >>> On Wed, Mar 17, 2010 at 12:54 PM, Robin Anil 
>> >>> wrote:
>> >>>> Release Done! Anyone care to write a cool release announcement? I am
>> >>> about
>> >>>> to leave for some urgent work.
>> >>>>
>> >>>> Robin
>> >>>>
>> >>>>
>> >>>> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
>> >>> wrote:
>> >>>>
>> >>>>> Marking 0.3 as released on JIRA
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
>> >>> wrote:
>> >>>>>
>> >>>>>> Yep, Sorry about that. I had somehow thought my commit karma was
>> >>> limited
>> >>>>>> to the mahout folder.
>> >>>>>> I have committed the lucene site with release announcement.
>> >>>>>>
>> >>>>>> Robin
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>>
>>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Just woke up. I think no one did the announcement. @Grant, ready to make
one?

Robin


On Thu, Mar 18, 2010 at 6:44 AM, Grant Ingersoll wrote:

> Has the release been sent out?  I don't see any announcements in the usual
> places.
>
>
> On Mar 17, 2010, at 2:19 PM, Robin Anil wrote:
>
> > Exported both lucene and mahout sites. Waiting for it to get updated on
> the
> > main server
> >
> > Robin
> >
> >
> > On Wed, Mar 17, 2010 at 11:45 PM, Robin Anil 
> wrote:
> >
> >> Im back. Looks good. I am going to run Grants script to publish the
> >> website. We can announce shortly after that.
> >>
> >> Robin
> >>
> >>
> >> On Wed, Mar 17, 2010 at 11:01 PM, Drew Farris wrote:
> >>
> >>> (How about this: cribbed from the 0.2 release announcement)
> >>>
> >>> Apache Mahout 0.3 has been released and is now available for public
> >>> download at http://www.apache.org/dyn/closer.cgi/lucene/mahout
> >>>
> >>> Up-to-date maven artifacts can be found in the Apache repository at
> >>>
> >>>
> https://repository.apache.org/content/repositories/releases/org/apache/mahout/
> >>>
> >>> Apache Mahout is a subproject of Apache Lucene with the goal of
> >>> delivering scalable machine learning algorithm implementations under
> >>> the Apache license. http://www.apache.org/licenses/LICENSE-2.0
> >>>
> >>> Mahout is a machine learning library meant to scale: Scale in terms of
> >>> community to support anyone interested in using machine learning.
> >>> Scale in terms of business by providing the library under a
> >>> commercially friendly, free software license. Scale in terms of
> >>> computation to the size of data we manage today.
> >>>
> >>> Built on top of the powerful map/reduce paradigm of the Apache Hadoop
> >>> project, Mahout lets you solve popular machine learning problem
> >>> settings like clustering, collaborative filtering and classification
> >>> over Terabytes of data over thousands of computers.
> >>>
> >>> Implemented with scalability in mind the latest release brings many
> >>> performance optimizations so that even in a single node setup the
> >>> library performs well.
> >>>
> >>> The complete changelist can be found here:
> >>>
> >>> http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281
> >>>
> >>> New Mahout 0.3 features include:
> >>>
> >>>  * New math and collections modules based on the high performance
> >>> Colt library.
> >>>  * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
> >>>  * Parallel Dirichlet process clustering (a model-based clustering
> >>> algorithm)
> >>>  * Parallel co-occurrence based recommender
> >>>  * Parallel text document to vector conversion using LLR based ngram
> >>> generation
> >>>  * Parallel Lanczos SVD (Singular Value Decomposition) solver
> >>>  * Shell scripts for easier running of algorithms, utilities and
> examples
> >>>  * ...and much much more: code cleanup, many bug fixes and
> >>> performance improvements
> >>>
> >>> Getting started: New to Mahout?
> >>>
> >>>   * Download Mahout at
> >>> http://www.apache.org/dyn/closer.cgi/lucene/mahout
> >>>   * Check out the Quick start: http://cwiki.apache.org/MAHOUT
> >>>   * Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
> >>>   * Join the community by subscribing to mahout-u...@lucene.apache.org
> >>>   * Give back: http://www.apache.org/foundation/getinvolved.html
> >>>   * Consider adding yourself to the power by Wiki
> >>> page:http://cwiki.apache.org/MAHOUT/poweredby.html
> >>>
> >>> For more information on Apache Mahout, see
> >>> http://lucene.apache.org/mahout
> >>>
> >>> On Wed, Mar 17, 2010 at 12:54 PM, Robin Anil 
> >>> wrote:
> >>>> Release Done! Anyone care to write a cool release announcement? I am
> >>> about
> >>>> to leave for some urgent work.
> >>>>
> >>>> Robin
> >>>>
> >>>>
> >>>> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
> >>> wrote:
> >>>>
> >>>>> Marking 0.3 as released on JIRA
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
> >>> wrote:
> >>>>>
> >>>>>> Yep, Sorry about that. I had somehow thought my commit karma was
> >>> limited
> >>>>>> to the mahout folder.
> >>>>>> I have committed the lucene site with release announcement.
> >>>>>>
> >>>>>> Robin
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Exported both the Lucene and Mahout sites. Waiting for them to get updated on the
main server.

Robin


On Wed, Mar 17, 2010 at 11:45 PM, Robin Anil  wrote:

> Im back. Looks good. I am going to run Grants script to publish the
> website. We can announce shortly after that.
>
> Robin
>
>
> On Wed, Mar 17, 2010 at 11:01 PM, Drew Farris wrote:
>
>> (How about this: cribbed from the 0.2 release announcement)
>>
>> Apache Mahout 0.3 has been released and is now available for public
>> download at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>>
>> Up-to-date maven artifacts can be found in the Apache repository at
>>
>> https://repository.apache.org/content/repositories/releases/org/apache/mahout/
>>
>> Apache Mahout is a subproject of Apache Lucene with the goal of
>> delivering scalable machine learning algorithm implementations under
>> the Apache license. http://www.apache.org/licenses/LICENSE-2.0
>>
>> Mahout is a machine learning library meant to scale: Scale in terms of
>> community to support anyone interested in using machine learning.
>> Scale in terms of business by providing the library under a
>> commercially friendly, free software license. Scale in terms of
>> computation to the size of data we manage today.
>>
>> Built on top of the powerful map/reduce paradigm of the Apache Hadoop
>> project, Mahout lets you solve popular machine learning problem
>> settings like clustering, collaborative filtering and classification
>> over Terabytes of data over thousands of computers.
>>
>> Implemented with scalability in mind the latest release brings many
>> performance optimizations so that even in a single node setup the
>> library performs well.
>>
>> The complete changelist can be found here:
>>
>> http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281
>>
>> New Mahout 0.3 features include:
>>
>>   * New math and collections modules based on the high performance
>> Colt library.
>>   * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
>>   * Parallel Dirichlet process clustering (a model-based clustering
>> algorithm)
>>   * Parallel co-occurrence based recommender
>>   * Parallel text document to vector conversion using LLR based ngram
>> generation
>>   * Parallel Lanczos SVD (Singular Value Decomposition) solver
>>   * Shell scripts for easier running of algorithms, utilities and examples
>>   * ...and much much more: code cleanup, many bug fixes and
>> performance improvements
>>
>> Getting started: New to Mahout?
>>
>>* Download Mahout at
>> http://www.apache.org/dyn/closer.cgi/lucene/mahout
>>* Check out the Quick start: http://cwiki.apache.org/MAHOUT
>>* Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
>>* Join the community by subscribing to mahout-u...@lucene.apache.org
>>* Give back: http://www.apache.org/foundation/getinvolved.html
>>* Consider adding yourself to the power by Wiki
>> page:http://cwiki.apache.org/MAHOUT/poweredby.html
>>
>> For more information on Apache Mahout, see
>> http://lucene.apache.org/mahout
>>
>> On Wed, Mar 17, 2010 at 12:54 PM, Robin Anil 
>> wrote:
>> > Release Done! Anyone care to write a cool release announcement? I am
>> about
>> > to leave for some urgent work.
>> >
>> > Robin
>> >
>> >
>> > On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
>> wrote:
>> >
>> >> Marking 0.3 as released on JIRA
>> >>
>> >>
>> >> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
>> wrote:
>> >>
>> >>> Yep, Sorry about that. I had somehow thought my commit karma was
>> limited
>> >>> to the mahout folder.
>> >>> I have committed the lucene site with release announcement.
>> >>>
>> >>> Robin
>> >>>
>> >>
>> >>
>> >
>>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
I'm back. Looks good. I am going to run Grant's script to publish the website.
We can announce shortly after that.

Robin

On Wed, Mar 17, 2010 at 11:01 PM, Drew Farris  wrote:

> (How about this: cribbed from the 0.2 release announcement)
>
> Apache Mahout 0.3 has been released and is now available for public
> download at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>
> Up-to-date maven artifacts can be found in the Apache repository at
>
> https://repository.apache.org/content/repositories/releases/org/apache/mahout/
>
> Apache Mahout is a subproject of Apache Lucene with the goal of
> delivering scalable machine learning algorithm implementations under
> the Apache license. http://www.apache.org/licenses/LICENSE-2.0
>
> Mahout is a machine learning library meant to scale: Scale in terms of
> community to support anyone interested in using machine learning.
> Scale in terms of business by providing the library under a
> commercially friendly, free software license. Scale in terms of
> computation to the size of data we manage today.
>
> Built on top of the powerful map/reduce paradigm of the Apache Hadoop
> project, Mahout lets you solve popular machine learning problem
> settings like clustering, collaborative filtering and classification
> over Terabytes of data over thousands of computers.
>
> Implemented with scalability in mind the latest release brings many
> performance optimizations so that even in a single node setup the
> library performs well.
>
> The complete changelist can be found here:
>
> http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12314281
>
> New Mahout 0.3 features include:
>
>   * New math and collections modules based on the high performance
> Colt library.
>   * Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
>   * Parallel Dirichlet process clustering (a model-based clustering
> algorithm)
>   * Parallel co-occurrence based recommender
>   * Parallel text document to vector conversion using LLR based ngram
> generation
>   * Parallel Lanczos SVD (Singular Value Decomposition) solver
>   * Shell scripts for easier running of algorithms, utilities and examples
>   * ...and much much more: code cleanup, many bug fixes and
> performance improvements
>
> Getting started: New to Mahout?
>
>* Download Mahout at http://www.apache.org/dyn/closer.cgi/lucene/mahout
>* Check out the Quick start: http://cwiki.apache.org/MAHOUT
>* Read the Mahout Wiki: http://cwiki.apache.org/MAHOUT
>* Join the community by subscribing to mahout-u...@lucene.apache.org
>* Give back: http://www.apache.org/foundation/getinvolved.html
>* Consider adding yourself to the power by Wiki
> page:http://cwiki.apache.org/MAHOUT/poweredby.html
>
> For more information on Apache Mahout, see http://lucene.apache.org/mahout
>
> On Wed, Mar 17, 2010 at 12:54 PM, Robin Anil  wrote:
> > Release Done! Anyone care to write a cool release announcement? I am
> about
> > to leave for some urgent work.
> >
> > Robin
> >
> >
> > On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
> wrote:
> >
> >> Marking 0.3 as released on JIRA
> >>
> >>
> >> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil 
> wrote:
> >>
> >>> Yep, Sorry about that. I had somehow thought my commit karma was
> limited
> >>> to the mahout folder.
> >>> I have committed the lucene site with release announcement.
> >>>
> >>> Robin
> >>>
> >>
> >>
> >
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Also need to publish these websites


On Wed, Mar 17, 2010 at 10:24 PM, Robin Anil  wrote:

> Release Done! Anyone care to write a cool release announcement? I am about
> to leave for some urgent work.
>
> Robin
>
>
>
> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil  wrote:
>
>> Marking 0.3 as released on JIRA
>>
>>
>> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil wrote:
>>
>>> Yep, Sorry about that. I had somehow thought my commit karma was limited
>>> to the mahout folder.
>>> I have committed the lucene site with release announcement.
>>>
>>> Robin
>>>
>>
>>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Release Done! Anyone care to write a cool release announcement? I am about
to leave for some urgent work.

Robin


On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil  wrote:

> Marking 0.3 as released on JIRA
>
>
> On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil  wrote:
>
>> Yep, Sorry about that. I had somehow thought my commit karma was limited
>> to the mahout folder.
>> I have committed the lucene site with release announcement.
>>
>> Robin
>>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Marking 0.3 as released on JIRA

On Wed, Mar 17, 2010 at 10:19 PM, Robin Anil  wrote:

> Yep, Sorry about that. I had somehow thought my commit karma was limited to
> the mahout folder.
> I have committed the lucene site with release announcement.
>
> Robin
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Yep, sorry about that. I had somehow thought my commit karma was limited to
the Mahout folder.
I have committed the lucene site with release announcement.

Robin


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
Site updated. The release page says to deploy the Lucene website. Grant?

Robin


On Wed, Mar 17, 2010 at 6:42 PM, Robin Anil  wrote:

> I see the mirrors have all synced(barring the ones which are down). I will
> go and commit the site changes. Anyone willing to write the Release
> announcement for 0.3?
>
> Robin
>
>
> On Wed, Mar 17, 2010 at 3:43 AM, Grant Ingersoll wrote:
>
>> It usually takes 24 hours.  Just follow the release dirs and we'll be
>> good.  Tomorrow is a great day for a Mahout announcement!  Maybe we can
>> change the logo to be green for tomorrow.
>>
>>
>> On Mar 16, 2010, at 5:59 PM, Robin Anil wrote:
>>
>> > http://www.apache.org/mirrors/
>> >
>> > I see 0.3 folder in most of the famous mirrosr. university mirrors are
>> still
>> > not updated.
>>
>>
>>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-17 Thread Robin Anil
I see the mirrors have all synced (barring the ones which are down). I will
go and commit the site changes. Anyone willing to write the release
announcement for 0.3?

Robin

On Wed, Mar 17, 2010 at 3:43 AM, Grant Ingersoll wrote:

> It usually takes 24 hours.  Just follow the release dirs and we'll be good.
>  Tomorrow is a great day for a Mahout announcement!  Maybe we can change the
> logo to be green for tomorrow.
>
>
> On Mar 16, 2010, at 5:59 PM, Robin Anil wrote:
>
> > http://www.apache.org/mirrors/
> >
> > I see 0.3 folder in most of the famous mirrosr. university mirrors are
> still
> > not updated.
>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
http://www.apache.org/mirrors/

I see the 0.3 folder on most of the major mirrors. University mirrors are still
not updated.


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
It's done. Now, wait :). Funnily enough, Sean had put the release date as 18th
March. Looking back, his prediction came true :P

On Tue, Mar 16, 2010 at 11:58 PM, Grant Ingersoll wrote:

> Try now.  I added you to the Lucene group.  You might need to log out and
> back in.
>
> On Mar 16, 2010, at 1:17 PM, Robin Anil wrote:
>
> > the files are in my home
> >
> > ~/0.3 and ~/KEYS
> >
> >
> > Robin
> >
> > On Tue, Mar 16, 2010 at 10:42 PM, Robin Anil 
> wrote:
> >
> >> umm... err. cant copy. I think its because I am not in the lucene group.
> >>
> >> Robin
> >>
> >> On Tue, Mar 16, 2010 at 10:34 PM, Grant Ingersoll wrote:
> >>
> >>> That's all on people.a.o under/www/lucene.apache.org/mahout  Just
> follow
> >>> the docs on http://cwiki.apache.org/MAHOUT/how-to-release.html and
> >>> everything will work.
> >>>
> >>>
> >>>
> >>>
> >>> On Mar 16, 2010, at 12:50 PM, Robin Anil wrote:
> >>>
> >>>> http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step
> >>>>
> >>>> it says you need to login to www. to copy files to dist
> >>>
> >>>
> >>>
> >>
>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
the files are in my home

~/0.3 and ~/KEYS


Robin

On Tue, Mar 16, 2010 at 10:42 PM, Robin Anil  wrote:

> umm... err. cant copy. I think its because I am not in the lucene group.
>
> Robin
>
> On Tue, Mar 16, 2010 at 10:34 PM, Grant Ingersoll wrote:
>
>> That's all on people.a.o under/www/lucene.apache.org/mahout  Just follow
>> the docs on http://cwiki.apache.org/MAHOUT/how-to-release.html and
>> everything will work.
>>
>>
>>
>>
>> On Mar 16, 2010, at 12:50 PM, Robin Anil wrote:
>>
>> > http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step
>> >
>> > it says you need to login to www. to copy files to dist
>>
>>
>>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
Umm... err, can't copy. I think it's because I am not in the Lucene group.

Robin

On Tue, Mar 16, 2010 at 10:34 PM, Grant Ingersoll wrote:

> That's all on people.a.o under/www/lucene.apache.org/mahout  Just follow
> the docs on http://cwiki.apache.org/MAHOUT/how-to-release.html and
> everything will work.
>
>
>
>
> On Mar 16, 2010, at 12:50 PM, Robin Anil wrote:
>
> > http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step
> >
> > it says you need to login to www. to copy files to dist
>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
My bad, the folder is there on people.

I have verified the signatures.

Moving

KEYS
mahout-0.3-*(tar.gz|tar.bz2|zip)
mahout-0.3-*(tar.gz|tar.bz2|zip).asc
mahout-0.3-*(tar.gz|tar.bz2|zip).md5
mahout-0.3-*(tar.gz|tar.bz2|zip).sha1

to the dist folder
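For anyone following along, the checksum leg of the verification described above can be sketched as a small shell check: recompute the digest of an artifact and compare it against the published sidecar file. The file names below are illustrative, not the actual Mahout release layout, and the real process additionally runs `gpg --verify` against the `.asc` signatures after importing the KEYS file.

```shell
#!/bin/sh
# Sketch of checksum verification for a release artifact.
# Artifact name is illustrative, not the real Mahout release layout.
set -e
tmpdir=$(mktemp -d)
artifact="$tmpdir/mahout-0.3-example.tar.gz"
printf 'release bytes' > "$artifact"

# Publishing step: write the digest next to the artifact (the *.sha1 sidecar).
sha1sum "$artifact" | awk '{print $1}' > "$artifact.sha1"

# Verification step: recompute the digest and compare with the sidecar.
expected=$(cat "$artifact.sha1")
actual=$(sha1sum "$artifact" | awk '{print $1}')
if [ "$expected" = "$actual" ]; then
    echo "sha1 OK: $artifact"
else
    echo "sha1 MISMATCH: $artifact" >&2
    exit 1
fi
rm -rf "$tmpdir"
```

The `.md5` sidecars would be checked the same way with `md5sum`, and the signatures with `gpg --verify <artifact>.asc <artifact>`.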



On Tue, Mar 16, 2010 at 10:20 PM, Robin Anil  wrote:

> http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step
>
> it says you need to login to www. to copy files to dist
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step

it says you need to login to www. to copy files to dist


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
Seems my password for people.apache.org doesn't work for www.apache.org.
Grant, any idea why? Do I need to be in the PMC group id or something?

Robin


On Tue, Mar 16, 2010 at 9:55 PM, Robin Anil  wrote:

> There are still steps left in the release. I am pushing the zips and hashes
> to the apache dist. Will have to wait 24 hours after that and some site
> level changes before marking it as released
>
> Robin
>
>
>
> On Tue, Mar 16, 2010 at 9:38 PM, Ted Dunning wrote:
>
>> Benson,
>>
>> Do you need for me to mark 0.3 as released in JIRA?
>>
>> On Tue, Mar 16, 2010 at 8:30 AM, Benson Margulies wrote:
>>
>> > I assume that I don't have admin karma on the JIRA project.
>> >
>>
>
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
There are still steps left in the release. I am pushing the zips and hashes
to the Apache dist. We will have to wait 24 hours after that, plus make some
site-level changes, before marking it as released.

Robin


On Tue, Mar 16, 2010 at 9:38 PM, Ted Dunning  wrote:

> Benson,
>
> Do you need for me to mark 0.3 as released in JIRA?
>
> On Tue, Mar 16, 2010 at 8:30 AM, Benson Margulies wrote:
>
> > I assume that I don't have admin karma on the JIRA project.
> >
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
Show me the how-to for doing this. I can do it now.


Robin


On Tue, Mar 16, 2010 at 9:13 PM, Benson Margulies wrote:

> Hmm. Not a bad idea. I think of maven as done, but that's not right.
> I'll deal with this in the evening.
>
> On Tue, Mar 16, 2010 at 11:33 AM, Robin Anil  wrote:
> > http://www.apache.org/dist/lucene/mahout/
> >
> > I dont see it here yet. Isnt that required of a release?
> >
> > Robin
> >
>


Re: Can someone please mark 0.3 release in JIRA?

2010-03-16 Thread Robin Anil
http://www.apache.org/dist/lucene/mahout/

I don't see it here yet. Isn't that required of a release?

Robin


Re: [VOTE RESULT] Mahout 0.3

2010-03-16 Thread Robin Anil
Have the tarballs been pushed to the mirrors yet? I can see them in the
repository...

Robin

On Tue, Mar 16, 2010 at 1:37 AM, Grant Ingersoll wrote:

> A big thanks to all involved in this.  It is a sure sign of a healthy and
> growing community when new people step in and step up and the load is
> distributed across many helping hands.
>
> Kudos to all,
> Grant
>
> On Mar 15, 2010, at 3:31 PM, Benson Margulies wrote:
>
> > Mahout release 0.3 passed.
> >
> > Ted Dunning, Grant and Sean Owen are PMC members.
> >
> > Here's my tally:
> >
> > +1s:
> > Sean, Grant, Ted, Benson, Jeff, Drew
> >
> > No 0's and no -1's.
>
>
>


Re: [VOTE]: release Mahout 0.3 (resend, I forgot gene...@lucene.apache.org)

2010-03-15 Thread Robin Anil
Any changes needed for the release notes?


Re: svn commit: r923322 - /lucene/mahout/site/src/documentation/content/xdocs/index.xml

2010-03-15 Thread Robin Anil
As I read in the issues closed in this release, generalizing Dirichlet to
n dimensions and to sparse vectors was done in this release. I think we can
call that the complete implementation.


On Tue, Mar 16, 2010 at 1:17 AM, Grant Ingersoll wrote:

>
> On Mar 15, 2010, at 12:14 PM, robina...@apache.org wrote:
>
> > Author: robinanil
> > Date: Mon Mar 15 16:14:01 2010
> > New Revision: 923322
> >
> > URL: http://svn.apache.org/viewvc?rev=923322&view=rev
> > Log:
> > Setting release announcement date as 16th March. Changes in release notes
> >
> > Modified:
> >lucene/mahout/site/src/documentation/content/xdocs/index.xml
> >
> > Modified: lucene/mahout/site/src/documentation/content/xdocs/index.xml
> > URL:
> http://svn.apache.org/viewvc/lucene/mahout/site/src/documentation/content/xdocs/index.xml?rev=923322&r1=923321&r2=923322&view=diff
> >
> ==
> > --- lucene/mahout/site/src/documentation/content/xdocs/index.xml
> (original)
> > +++ lucene/mahout/site/src/documentation/content/xdocs/index.xml Mon Mar
> 15 16:14:01 2010
> > @@ -27,16 +27,18 @@
> >   Mahout News
> >
> > + Parallel Dirichlet process clustering (model-based
> clustering algorithm)
>
> Wasn't Dirichlet in 0.2 and wasn't it already parallel or did I miss
> something new being added?


Re: [NOMINATION] Mahout PMC Chair

2010-03-15 Thread Robin Anil
Does everyone have to nominate? Wouldn't the other thread alone do? Or am I not
understanding an Apache practice here?

Robin

On Tue, Mar 16, 2010 at 1:00 AM, Robin Anil  wrote:

> I am confused.
>
> On Tue, Mar 16, 2010 at 12:55 AM, Ted Dunning wrote:
>
>> I also would like to accept the votes on the other thread as if they were
>> on
>> this thread (or conversely, apply the process from this thread to the
>> other)
>>
>> On Mon, Mar 15, 2010 at 12:23 PM, Grant Ingersoll wrote:
>>
>> >
>> > On Mar 15, 2010, at 3:06 PM, Ted Dunning wrote:
>> >
>> > > I would like to nominate Sean Owen as Chair of the Mahout PMC.
>> > >
>> > > I suggest the following process for this election.
>> > >
>> > > We should allow other nominations for 2 days until noon March 17 PDT.
>>  If
>> > no
>> > > other nominations have been entered by then, we should consider Sean
>> > elected
>> > > by acclamation.  If other nominations have been posted to this list by
>> > then,
>> > > we should immediately hold a 3 day vote by the Mahout committers to
>> > conclude
>> > > Saturday March 20 at noon PDT.   Since this is a procedural vote, a
>> > majority
>> > > vote of committers will suffice for election and only positive votes
>> for
>> > a
>> > > properly nominated candidate will be considered.
>> >
>> > +1
>> >
>> > >
>> > > On Mon, Mar 15, 2010 at 11:20 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>> > >
>> > >>
>> > >> OK, seems I was wrong on this and that we do need to nominate/elect
>> > someone
>> > >> and make it part of the resolution.   Please start a separate thread
>> > with
>> > >> [NOMINATION] X as the subject line.  Please note, I am not
>> > interested in
>> > >> being the PMC Chair at this point, so I will not accept a nomination.
>> >
>> >
>> >
>>
>
>


Re: [NOMINATION] Mahout PMC Chair

2010-03-15 Thread Robin Anil
I am confused.

On Tue, Mar 16, 2010 at 12:55 AM, Ted Dunning  wrote:

> I also would like to accept the votes on the other thread as if they were
> on
> this thread (or conversely, apply the process from this thread to the
> other)
>
> On Mon, Mar 15, 2010 at 12:23 PM, Grant Ingersoll wrote:
>
> >
> > On Mar 15, 2010, at 3:06 PM, Ted Dunning wrote:
> >
> > > I would like to nominate Sean Owen as Chair of the Mahout PMC.
> > >
> > > I suggest the following process for this election.
> > >
> > > We should allow other nominations for 2 days until noon March 17 PDT.
>  If
> > no
> > > other nominations have been entered by then, we should consider Sean
> > elected
> > > by acclamation.  If other nominations have been posted to this list by
> > then,
> > > we should immediately hold a 3 day vote by the Mahout committers to
> > conclude
> > > Saturday March 20 at noon PDT.   Since this is a procedural vote, a
> > majority
> > > vote of committers will suffice for election and only positive votes
> for
> > a
> > > properly nominated candidate will be considered.
> >
> > +1
> >
> > >
> > > On Mon, Mar 15, 2010 at 11:20 AM, Grant Ingersoll wrote:
> > >
> > >>
> > >> OK, seems I was wrong on this and that we do need to nominate/elect
> > someone
> > >> and make it part of the resolution.   Please start a separate thread
> > with
> > >> [NOMINATION] X as the subject line.  Please note, I am not
> > interested in
> > >> being the PMC Chair at this point, so I will not accept a nomination.
> >
> >
> >
>


Re: [NOMINATION] Sean Owen as Mahout PMC Chair

2010-03-15 Thread Robin Anil
I second the nomination.


On Mon, Mar 15, 2010 at 11:52 PM, Grant Ingersoll wrote:

> I hereby nominate Sean Owen to be the Chair of the proposed Mahout PMC.
>
> If we have a second and Sean accepts, then we can add him to the list of
> candidates for a vote on Friday, March 15th.
>
> -Grant

