Re: Dangling collections in front of commons

2010-04-03 Thread Ted Dunning
Sounds good to me.

On Fri, Apr 2, 2010 at 6:42 PM, Benson Margulies bimargul...@gmail.com wrote:

 Does anyone object if I send a suggestion to the commons PMC that
 mahout-collections would make more sense as commons-something-or-another? I
 don't expect to get anywhere, but I thought I'd try.



Re: Dangling collections in front of commons

2010-04-03 Thread Sean Owen
I'm neutral... maybe let it marinate longer in Mahout, prove it's used
and worthwhile and such?

I think the question will be, well, doesn't that conflict with Commons
Collections, and so, are we suggesting pushing into Collections, and
can we make an argument that it complements Collections?

On Sat, Apr 3, 2010 at 2:42 AM, Benson Margulies bimargul...@gmail.com wrote:
 Does anyone object if I send a suggestion to the commons PMC that
 mahout-collections would make more sense as commons-something-or-another? I
 don't expect to get anywhere, but I thought I'd try.



Re: [collections] and what about 'identity'?

2010-04-03 Thread Dawid Weiss
The source code to HPPC is public and accessible, so you are more than
welcome to peek, contribute, or take whatever you want, Benson.

Dawid

On Fri, Apr 2, 2010 at 10:45 PM, Benson Margulies bimargul...@gmail.com wrote:
 Dawid,

 Now I recall why I stopped working on features of Mahout collections :-)
 HPPC.

 We'll see who gets where first.

 --benson


 On Fri, Apr 2, 2010 at 10:06 AM, Dawid Weiss dawid.we...@gmail.com wrote:

  What's the use case for needing to vary the hash function? It's one of
  those things where I assume there are incorrect ways to do it, and
  correct ways, and among the correct ways fairly clear arguments about
  which function will be better -- i.e. the object should provide the
  best function.

 Unfortunately this is not true -- just recently I hit a use case
 where the keys stored were Long values whose distribution had very low
 variance in the lower bits. HPPC implemented open hashing using 2^n
 arrays, and hashes were taken modulo a bitmask... this caused really,
 really long collision chains for values that were actually very
 different. I looked at how the JDK's HashMap solves this problem -- it
 applies a simple internal rehashing scheme (the object's hash is remixed
 in a cascade). I finally decided to allow external hash functions AND
 changed the _default_ hash function used for remixing to murmur hash.
 Performance benchmarks show this yields virtually no degradation in
 execution time (the CPUs seem to spend most of their time waiting on
 cache misses anyway, so the internal rehashing is not an issue).
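The remixing described above can be illustrated with a small sketch. The constants are the well-known MurmurHash3 64-bit finalizer; the class and method names are illustrative, not HPPC's actual API:

```java
public class RemixDemo {
    // MurmurHash3 fmix64 finalizer: cascades the high bits down into the low bits.
    static long mix(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    public static void main(String[] args) {
        int mask = (1 << 4) - 1; // a 2^4-slot open-hashed table
        // Keys that differ only in their upper bits: identical lower bits.
        long[] keys = {1L << 32, 2L << 32, 3L << 32, 4L << 32};
        for (long k : keys) {
            int naive = (int) (k & mask);        // all land in slot 0
            int remixed = (int) (mix(k) & mask); // usually spread across the table
            System.out.println(k + " -> naive=" + naive + ", remixed=" + remixed);
        }
    }
}
```

Without remixing, every key above maps to slot 0, reproducing the long collision chains Dawid mentions; the finalizer mixes the high bits into the masked index.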

 I must also apologize for a bit of inactivity with HPPC... Like I
 said, we have released it internally on our labs Web site here:

 http://labs.carrotsearch.com/hppc.html

 It doesn't mean we are turning our backs on contributing HPPC to Mahout --
 quite the opposite, we would love to do it. But contrary to what I
 originally thought (pushing HPPC to Mahout as soon as possible), I have
 grown reluctant because so many things are still missing (equals/hashCode,
 Java collections adapters) or can be improved (documentation, faster
 iterators).

 So... I'm still going to experiment with HPPC in our labs, especially
 API-wise, release one or two versions in between, and then kindly ask
 you to peek at the final (?) result and consider moving the code under
 the Mahout umbrella. Sounds good?

 Dawid




Re: Dangling collections in front of commons

2010-04-03 Thread Grant Ingersoll

On Apr 3, 2010, at 5:17 AM, Sean Owen wrote:

 I'm neutral... maybe let it marinate longer in Mahout, prove it's used
 and worthwhile and such?

Yeah, I'd tend to agree here.  Let's see if we get some contributions on it and 
how it plays out for us.  

 
 I think the question will be, well, doesn't that conflict with Commons
 Collections, and so, are we suggesting pushing into Collections, and
 can we make an argument that it complements Collections?

I think it does, since it focuses on primitives.

 
 On Sat, Apr 3, 2010 at 2:42 AM, Benson Margulies bimargul...@gmail.com 
 wrote:
 Does anyone object if I send a suggestion to the commons PMC that
 mahout-collections would make more sense as commons-something-or-another? I
 don't expect to get anywhere, but I thought I'd try.
 




[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-04-03 Thread Cristi Prodan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi Prodan updated MAHOUT-344:
-

Status: Patch Available  (was: Open)

Thank you guys for all the encouragement and advices. 

I'm committing my first patch for MinHash clustering. The patch contains the
following things:

- in core - minhash:
 * MinHashMapRed - removed the need for a distributed hash; each mapper
generates the same hash functions using the same seed (as per instructions
from Ankur).
 * RandomLinearHashFunction - added another random linear hash function of
the form h(x) = (a*x + b) mod p. p should be as large as possible (greater
than 1000) and should be prime (not done yet, but committing in this form due
to some time restrictions).
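A universal-hashing sketch of the family described above (the parameter choices and class name here are illustrative, not the patch's actual values):

```java
import java.util.Random;

// h(x) = (a*x + b) mod p, with p prime -- a randomly drawn member of a
// linear hash family, as used for generating independent minhash functions.
public class RandomLinearHash {
    public static final long P = 2147483647L; // a large prime (2^31 - 1), illustrative
    final long a, b;

    public RandomLinearHash(Random rnd) {
        this.a = 1 + (long) (rnd.nextDouble() * (P - 1)); // a in [1, p-1]
        this.b = (long) (rnd.nextDouble() * P);           // b in [0, p-1]
    }

    public long hash(int x) {
        long h = (a * x + b) % P;
        return h < 0 ? h + P : h; // guard in case of a negative input x
    }

    public static void main(String[] args) {
        // Seeding identically yields identical functions -- the property
        // MinHashMapRed relies on so every mapper builds the same family.
        RandomLinearHash h1 = new RandomLinearHash(new Random(42));
        RandomLinearHash h2 = new RandomLinearHash(new Random(42));
        System.out.println(h1.hash(12345) == h2.hash(12345)); // true
    }
}
```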

- in examples - minhash directory:
 * DisplayMinHash - contains an example of running min-hash, with the options
commented. It's basically the main function from MinHashMapRed.
 * PrepareDataset - this class converts the last-fm database suggested above
into a format readable by the MinHash algorithm. It also shows a progress bar
with the percentage done :).
For the future I believe all the code in the algorithm should take a more
generalized form and use the Vector classes from Mahout; users could then
either write their own version against the Vector interface or create a tool
that converts their dataset to the vector format the code understands.
MurmurHash is used by PrepareDataset to hash the strings denoting users
(in the original last_fm dataset) to integers.
 * TestClusterQuality - takes a clustered file generated by the minhash
algorithm and computes the average similarity for each cluster, aggregated
over all clusters.
In each cluster the mean is computed as:
SUM(similarity(item_i, item_j)) / TOTAL_SIMILARITIES, for i != j,
where TOTAL_SIMILARITIES = n! / (k! * (n - k)!) = n * (n - 1) / 2, with
n = total number of items in the cluster and k = 2.
The aggregated mean is the mean of these per-cluster values.
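The per-cluster mean can be sketched as follows, using Jaccard similarity over item sets (TestClusterQuality's actual similarity measure may differ; class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ClusterQuality {
    // Jaccard similarity of two item sets: |A ∩ B| / |A ∪ B|.
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Mean over all n*(n-1)/2 unordered pairs in one cluster (assumes n >= 2).
    static double clusterMean(Set<Integer>[] items) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                sum += jaccard(items[i], items[j]);
                pairs++;
            }
        }
        return sum / pairs;
    }

    public static void main(String[] args) {
        // Users 1 and 2 from the example input in this message.
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 6));
        System.out.println(clusterMean(new Set[]{u1, u2})); // 4/6 ≈ 0.667
    }
}
```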

As an example. Having the following input:

1   1   2   3   4   5
2   1   2   3   4   6
3   7   6   3   8   9
4   7   8   9   6   1
5   5   6   7   8   9
6   8   7   5   6

The first column contains the user IDs. On each line, the remaining columns
are the items preferred (browsed, listened to) by that user. The contents of
each cluster below use the same format.

With the following parameters -- 20 hash functions, 4 keygroups (hash indices
in a bucket), and a minimum of 2 items per cluster -- we get the following
output:

CLUSTER ID -- 2359983695385880352354530253637788 (items=2)  
=
2   1   2   3   4   6
1   1   2   3   4   5
 
CLUSTER ID -- 236643825172184878353970117486898894 (items=2)
=
4   7   8   9   6   1
3   7   6   3   8   9
 
CLUSTER ID -- 35606006580772015548743126287496777 (items=2) 
=
6   8   7   5   6
5   5   6   7   8   9
 
CLUSTER ID -- 38797144231157365543316465389702468 (items=2) 
=
6   8   7   5   6
5   5   6   7   8   9

The aggregated average over these clusters is 0.6793650793650793.

I'm now testing on the last_fm dataset. The problem I currently encounter is
that the size of the clustered file is kind of big, but I'm working on tuning
the params.


 Minhash based clustering 
 -

 Key: MAHOUT-344
 URL: https://issues.apache.org/jira/browse/MAHOUT-344
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur
 Attachments: MAHOUT-344-v1.patch


 Minhash clustering performs probabilistic dimension reduction of high 
 dimensional data. The essence of the technique is to hash each item using 
 multiple independent hash functions such that the probability of collision of 
 similar items is higher. Multiple such hash tables can then be constructed  
 to answer near neighbor type of queries efficiently.
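The technique described above can be sketched as computing, per item, a signature of per-function minima (a minimal illustration, not the patch's implementation; names are hypothetical):

```java
import java.util.Random;

public class MinHashSketch {
    // One signature entry per hash function: the minimum hash over the item's
    // feature set. Two sets agree on an entry with probability equal to their
    // Jaccard similarity, so similar items tend to collide.
    static long[] signature(int[] features, int numHashes, long seed) {
        Random rnd = new Random(seed); // same seed => same hash family everywhere
        long[] sig = new long[numHashes];
        for (int h = 0; h < numHashes; h++) {
            long a = 1 + rnd.nextInt(Integer.MAX_VALUE - 1);
            long b = rnd.nextInt(Integer.MAX_VALUE);
            long min = Long.MAX_VALUE;
            for (int f : features) {
                long hash = (a * f + b) % 2147483647L; // h(x) = (a*x + b) mod p
                min = Math.min(min, hash);
            }
            sig[h] = min;
        }
        return sig;
    }

    public static void main(String[] args) {
        long[] s1 = signature(new int[]{1, 2, 3, 4, 5}, 20, 42);
        long[] s2 = signature(new int[]{1, 2, 3, 4, 6}, 20, 42);
        int agree = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) agree++;
        // Highly similar sets share most signature entries and bucket together.
        System.out.println("agreeing entries: " + agree + "/20");
    }
}
```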

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAHOUT-344) Minhash based clustering

2010-04-03 Thread Cristi Prodan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cristi Prodan updated MAHOUT-344:
-

Attachment: MAHOUT-344-v2.patch

See comment above for this patch. 

 Minhash based clustering 
 -

 Key: MAHOUT-344
 URL: https://issues.apache.org/jira/browse/MAHOUT-344
 Project: Mahout
  Issue Type: Bug
  Components: Clustering
Affects Versions: 0.3
Reporter: Ankur
Assignee: Ankur
 Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch


 Minhash clustering performs probabilistic dimension reduction of high 
 dimensional data. The essence of the technique is to hash each item using 
 multiple independent hash functions such that the probability of collision of 
 similar items is higher. Multiple such hash tables can then be constructed  
 to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Dangling collections in front of commons

2010-04-03 Thread Dawid Weiss
 I'm neutral... maybe let it marinate longer in Mahout, prove it's used
 and worthwhile and such?

 Yeah, I'd tend to agree here.  Let's see if we get some contributions on it 
 and how it plays out for us.

Marination is exactly why I work on HPPC separately from Mahout... Once you
let the API out into the open, it's much more difficult/problematic to
change it.

D.


GSoC - Implementing SOM

2010-04-03 Thread hifsa kazmi
Dear Mahout Developers,

I am an undergraduate student finishing my final year. For my final-year
project I worked with Hadoop MapReduce and HDFS; furthermore, I also used
Mahout's clustering algorithms on some datasets. One of my project mentors
proposed implementing Self-Organizing Maps for that, but SOM has not yet been
implemented in Mahout. So I thought, why not do it myself?

Here I am, open for your comments and suggestions for the scope of this
project.

Thanking you all,
Hifsa Kazmi


A request for prospective GSOC students

2010-04-03 Thread Robin Anil
I am having a tough time separating Mahout proposals from the rest of Apache
on the GSoC website, so I would ask you all to reply to this thread when you
have submitted a proposal, so that we don't miss reading the proposal you
worked hard on. For now I could only find Zhao Zhendong's LIBLINEAR
proposal. If anyone else has applied, do reply with the title of your
proposal.

Robin


Re: A request for prospective GSOC students

2010-04-03 Thread yinghua hu
Dear Robin and other contributors,

Nice to meet you.

I am a PhD student at the University of Central Florida. I submitted a
proposal to Google Summer of Code 2010 titled "Implement
Map/Reduce Enabled Neural Networks" (MAHOUT-342).

Any suggestions and advice are very welcome. I am still allowed to make
corrections to it before April 9th.

Thank you!

-- 
Regards,

Yinghua


On Sat, Apr 3, 2010 at 11:37 AM, Robin Anil robin.a...@gmail.com wrote:
 I am having a tough time separating Mahout proposals from rest of Apache on
 gsoc website. So I would request you all to reply to this thread when you
 have submitted a proposal so that we don't miss out on reading your hard
 worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
 proposal. If anyone else have applied do reply back with the title of the
 proposal.

 Robin



Re: A request for prospective GSOC students

2010-04-03 Thread Robin Anil
Thanks! I just noticed your proposal. My advice to everyone would be to be
clear about what you want to do, rather than reciting related content and
theory about the algorithm. So really expand the design, implementation, and
timeline sections.

Robin

On Sat, Apr 3, 2010 at 9:18 PM, yinghua hu yinghua...@gmail.com wrote:

 Dear Robin and other contributors,

 Nice to meet you.

 I am a PhD student in University of Central Florida. I submitted a
 proposal to Google Summer of Code 2010 with title Implement
 Map/Reduce Enabled Neural Networks (mahout-342).

 Any suggestions and advice are very welcome. I am still allowed to do
 correction on it before April 9th.

 Thank you!

 --
 Regards,

 Yinghua


 On Sat, Apr 3, 2010 at 11:37 AM, Robin Anil robin.a...@gmail.com wrote:
  I am having a tough time separating Mahout proposals from rest of Apache
 on
  gsoc website. So I would request you all to reply to this thread when you
  have submitted a proposal so that we don't miss out on reading your hard
  worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
  proposal. If anyone else have applied do reply back with the title of the
  proposal.
 
  Robin
 



[GSOC] 2010 Timelines

2010-04-03 Thread Grant Ingersoll
http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#timeline

[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Necati Batur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853167#action_12853167
 ] 

Necati Batur commented on MAHOUT-332:
-

I am a senior computer engineering student at IZTECH in Turkey. My areas of
interest are information management, OOP (Object-Oriented Programming), and
currently bioinformatics. I have been working with an Assistant Professor
(Jens Allmer) in the molecular biology and genetics department for one year.
First, we worked on a protein database, 2DB, and we presented the project at
the HIBIT09 conference. The project was database management system
independence, achieved by amending 2DB with a database access layer (written
in Java). Currently I am working on another project (Kerb) as my senior
project: a general sequential task management system intended to reduce
errors and save time in biological experiments. We will present this project
at HIBIT2010 too.

I am confident working with new technologies. I took the Data Structures I
and II courses at university, so I am comfortable with data structures. Most
importantly, I am interested in databases. From my software engineering
courses I know how to work on a project with iterative development and
timelines. In order to add more functionality I need a mentor to contact for
this project.

 Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
 the algorithms to use
 ---

 Key: MAHOUT-332
 URL: https://issues.apache.org/jira/browse/MAHOUT-332
 Project: Mahout
  Issue Type: New Feature
Reporter: Robin Anil

 A student with a good proposal 
 - should be free to work for Mahout in the summer and should be thrilled to 
 work in this area :)
 - should be able to program in Java and be comfortable with datastructures 
 and algorithms
 - must explore SQL and NOSQL implementations, and design a framework with 
 which data from them could be fetched and converted to mahout format or used 
 directly as a matrix transparently
 - should have a plan to make it high performance with ample caching 
 strategies or the ability to use it on a map/reduce job
 should focus more on getting a working version than on implementing all
 functionalities, so it's recommended that you divide features into milestones
 - must have clear deadlines and pace it evenly across the span of 3 months.
 If you can do something extra it counts, but make sure the plan is reasonable 
 within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Dangling collections in front of commons

2010-04-03 Thread Benson Margulies
Commons-collections turns out to be a very specific thing, which this
is not. I have an intermediate proposal that I'll put in a separate
thread.

On Sat, Apr 3, 2010 at 7:06 AM, Dawid Weiss dawid.we...@gmail.com wrote:

  I'm neutral... maybe let it marinate longer in Mahout, prove it's used
  and worthwhile and such?
 
  Yeah, I'd tend to agree here.  Let's see if we get some contributions on it 
  and how it plays out for us.

 Marination is exactly my motive why I work on HPPC in separation
 from Mahout... Once you let the API out in the open, it's much more
 difficult/ problematic to change it.

 D.


Proposal: make collections releases independent of the rest of Mahout

2010-04-03 Thread Benson Margulies
I propose to disconnect collections from the aggregate project and put
it on its own release cycle.

This was originally someone else's idea when we started on it.

Collections is useful in its own right, and I'd like to make fixes to
it available without having the whole rest of Mahout reach a release
point.

I confess that the slf4j dependency in collections is a very strong
local motivation to me, but it also seems right in principle.

When we go TLP, we can organize this more coherently in svn; for now we can
leave it where it is and just fix up the poms.

This strikes me as consistent with the idea of marinating with
possible intent that it would become its own thing some day.


[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853174#action_12853174
 ] 

Robin Anil commented on MAHOUT-332:
---

Hi Necati. Take a look at the matrix and vector classes in Mahout, and read
up on how Mahout converts text into vectors. We need a generic framework
where data from databases can be iterated over as vectors so that algorithms
can use it seamlessly. The current VectorWritable could be extended to, say,
a database-backed vector, which would read each field and convert it to a
vector on the fly using a pre-populated dictionary. This could then be easily
consumed by the Mahout algorithms. The database-backed vector should be
configurable enough that fields can be selected. I am sure there are
frameworks that already do this. Drew Farris is working on a document
structure for Mahout using Avro; I am sure he will have more input on how
these adapters should fit with his structure.
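One way to read this suggestion (all class and method names below are hypothetical sketches, not Mahout's actual API, and the weighting is a plain count):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A hypothetical sketch of a database-backed vector: a row's selected fields
// are mapped to vector indices through a pre-populated dictionary, so an
// algorithm can iterate over rows as if they were sparse vectors.
public class DbBackedVectorSketch {
    // dictionary: field value -> vector index (would normally be built
    // in a first pass over the table).
    private final Map<String, Integer> dictionary;

    DbBackedVectorSketch(Map<String, Integer> dictionary) {
        this.dictionary = dictionary;
    }

    // Convert one row (already restricted to the configured fields)
    // into index -> weight pairs on the fly.
    Map<Integer, Double> toVector(Map<String, String> row) {
        Map<Integer, Double> vector = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : row.entrySet()) {
            Integer index = dictionary.get(field.getValue());
            if (index != null) {
                vector.merge(index, 1.0, Double::sum); // simple count weighting
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        dict.put("rock", 0);
        dict.put("jazz", 1);

        Map<String, String> row = new LinkedHashMap<>();
        row.put("genre", "rock");
        row.put("secondary_genre", "jazz");

        System.out.println(new DbBackedVectorSketch(dict).toVector(row));
        // {0=1.0, 1=1.0}
    }
}
```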

 Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
 the algorithms to use
 ---

 Key: MAHOUT-332
 URL: https://issues.apache.org/jira/browse/MAHOUT-332
 Project: Mahout
  Issue Type: New Feature
Reporter: Robin Anil

 A student with a good proposal 
 - should be free to work for Mahout in the summer and should be thrilled to 
 work in this area :)
 - should be able to program in Java and be comfortable with datastructures 
 and algorithms
 - must explore SQL and NOSQL implementations, and design a framework with 
 which data from them could be fetched and converted to mahout format or used 
 directly as a matrix transparently
 - should have a plan to make it high performance with ample caching 
 strategies or the ability to use it on a map/reduce job
 should focus more on getting a working version than on implementing all
 functionalities, so it's recommended that you divide features into milestones
 - must have clear deadlines and pace it evenly across the span of 3 months.
 If you can do something extra it counts, but make sure the plan is reasonable 
 within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-03 Thread Sean Owen
On Sat, Apr 3, 2010 at 6:47 PM, Benson Margulies bimargul...@gmail.com wrote:
 I confess that the slf4j dependency in collections is a very strong
 local motivation to me, but it also seems right in principle.

I just killed this BTW. (There was one dangling log statement... not
worth a dependency.)

 When we go TLP, we can organize this more coherently in svn, but for
 now we can leave it where it is, but fix up the poms.

Actually it seems like this is a valid subproject of a Mahout TLP in its
own right, if that would be a useful middle-ground status.

 This strikes me as consistent with the idea of marinating with
 possible intent that it would become its own thing some day.

Yes it's already its own module, which helps manage it independently.
At the moment that means anyone can depend on it, and only it, via
Maven, which is 80% of the value.

I think it probably needs a fair bit of API rethinking and cleanup to
truly stand as a general purpose and reusable component, but that can
happen.


Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-03 Thread Benson Margulies
On Sat, Apr 3, 2010 at 2:07 PM, Sean Owen sro...@gmail.com wrote:
 On Sat, Apr 3, 2010 at 6:47 PM, Benson Margulies bimargul...@gmail.com 
 wrote:
 I confess that the slf4j dependency in collections is a very strong
 local motivation to me, but it also seems right in principle.

 I just killed this BTW. (There was one dangling log statement... not
 worth a dependency.)

Yes, thank you.

My selfish short-term goal is to get a release with the log dependency
removed out before Mahout 0.4 :-).


 When we go TLP, we can organize this more coherently in svn, but for
 now we can leave it where it is, but fix up the poms.

 Actually it seems like this a valid subproject of a Mahout TLP in its
 own right, if that would be a useful middle-ground status.

I'm not trying to suggest anything different. I'm opposed to having
'separate committers', but I'm happy to have multiple releasable
components all in the Mahout TLP.


 This strikes me as consistent with the idea of marinating with
 possible intent that it would become its own thing some day.

 Yes it's already its own module, which helps manage it independently.
 At the moment that means anyone can depend on it, and only it, via
 Maven, which is 80% of the value.

 I think it probably needs a fair bit of API rethinking and cleanup to
 truly stand as a general purpose and reusable component, but that can
 happen.


No argument there.

Practical point: it would be, all joking aside, good to make a very
prompt release of this so that the rest of Mahout 0.4-SNAPSHOT could
depend on it.

If no one protests, I'll do the POM surgery in a couple of days.


[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Necati Batur (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853177#action_12853177
 ] 

Necati Batur commented on MAHOUT-332:
-

Well, it will not be too hard to understand the conversion of data into
vectors if there is already source code and an algorithm for it :)
However, could you please give me the necessary links to check out? On the
website there is such an excess of repositories that I can hardly tell what
is where.
Nonetheless, how should I write a proposal, if I am asked to write one?
Thanks

 Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
 the algorithms to use
 ---

 Key: MAHOUT-332
 URL: https://issues.apache.org/jira/browse/MAHOUT-332
 Project: Mahout
  Issue Type: New Feature
Reporter: Robin Anil

 A student with a good proposal 
 - should be free to work for Mahout in the summer and should be thrilled to 
 work in this area :)
 - should be able to program in Java and be comfortable with datastructures 
 and algorithms
 - must explore SQL and NOSQL implementations, and design a framework with 
 which data from them could be fetched and converted to mahout format or used 
 directly as a matrix transparently
 - should have a plan to make it high performance with ample caching 
 strategies or the ability to use it on a map/reduce job
 should focus more on getting a working version than on implementing all
 functionalities, so it's recommended that you divide features into milestones
 - must have clear deadlines and pace it evenly across the span of 3 months.
 If you can do something extra it counts, but make sure the plan is reasonable 
 within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A request for prospective GSOC students

2010-04-03 Thread yinghua hu
Robin,

These are very helpful suggestions! Design and implementation details
are exactly what should be strengthened in my proposal. I will dig
in more if time allows.

Thanks a lot!

On Sat, Apr 3, 2010 at 11:51 AM, Robin Anil robin.a...@gmail.com wrote:
 Thanks! I just noticed your proposal. My advice to everyone would be to be
 clear on what you want to do instead of the related content and theory about
 any algorithm. So really expand the design, implementation and time line
 sections.

 Robin

 On Sat, Apr 3, 2010 at 9:18 PM, yinghua hu yinghua...@gmail.com wrote:

 Dear Robin and other contributors,

 Nice to meet you.

 I am a PhD student in University of Central Florida. I submitted a
 proposal to Google Summer of Code 2010 with title Implement
 Map/Reduce Enabled Neural Networks (mahout-342).

 Any suggestions and advice are very welcome. I am still allowed to do
 correction on it before April 9th.

 Thank you!

 --
 Regards,

 Yinghua


 On Sat, Apr 3, 2010 at 11:37 AM, Robin Anil robin.a...@gmail.com wrote:
  I am having a tough time separating Mahout proposals from rest of Apache
 on
  gsoc website. So I would request you all to reply to this thread when you
  have submitted a proposal so that we don't miss out on reading your hard
  worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
  proposal. If anyone else have applied do reply back with the title of the
  proposal.
 
  Robin
 





-- 
Regards,

Yinghua


[jira] Commented: (MAHOUT-332) Create adapters for MYSQL and NOSQL(hbase, cassandra) to access data for all the algorithms to use

2010-04-03 Thread Robin Anil (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853182#action_12853182
 ] 

Robin Anil commented on MAHOUT-332:
---

Conversion of arbitrary data in a database to vectors would be along the
same lines as how the ARFF format is converted to vectors. You can find the
code under trunk/utils. It treats boolean, enum, numeric, and string
datatypes separately. That code may still need some tweaking so that the
entire ARFF spec is supported, but it's a good starting point for
understanding how data is converted to vectors. Also look at
SparseVectorsFromSequenceFiles to understand how text documents in a
SequenceFile (you need to understand this too) are converted to vectors using
tf-idf based weighting. In short, there could be many weighting strategies.
It would be really nice if you could make this pluggable, so that users of
the library could supply custom weighting techniques for each field.
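The pluggable per-field weighting described above might look like this (interface, class names, and the two strategies are illustrative assumptions, not Mahout's API):

```java
// A hypothetical per-field weighting plug-in: each field of a record can be
// assigned its own strategy when the record is turned into a vector.
public class WeightingDemo {
    interface FieldWeight {
        double weight(double rawValue);
    }

    // Identity weighting for plain numeric fields.
    static final FieldWeight RAW = v -> v;

    // Log scaling, a common damping choice for counts (the tf part of tf-idf
    // is often damped similarly).
    static final FieldWeight LOG = v -> Math.log1p(v);

    // Apply the configured strategy field by field.
    static double[] vectorize(double[] fields, FieldWeight[] weights) {
        double[] out = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            out[i] = weights[i].weight(fields[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] row = {3.0, 100.0};
        double[] v = vectorize(row, new FieldWeight[]{RAW, LOG});
        System.out.println(v[0] + " " + v[1]); // 3.0 and log(101) ≈ 4.615
    }
}
```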

 Create adapters for  MYSQL and NOSQL(hbase, cassandra) to access data for all 
 the algorithms to use
 ---

 Key: MAHOUT-332
 URL: https://issues.apache.org/jira/browse/MAHOUT-332
 Project: Mahout
  Issue Type: New Feature
Reporter: Robin Anil

 A student with a good proposal 
 - should be free to work for Mahout in the summer and should be thrilled to 
 work in this area :)
 - should be able to program in Java and be comfortable with datastructures 
 and algorithms
 - must explore SQL and NOSQL implementations, and design a framework with 
 which data from them could be fetched and converted to mahout format or used 
 directly as a matrix transparently
 - should have a plan to make it high performance with ample caching 
 strategies or the ability to use it on a map/reduce job
 should focus more on getting a working version than on implementing all
 functionalities, so it's recommended that you divide features into milestones
 - must have clear deadlines and pace it evenly across the span of 3 months.
 If you can do something extra it counts, but make sure the plan is reasonable 
 within the specified time frame.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAHOUT-323) Classify new data using Decision Forest

2010-04-03 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12853185#action_12853185
 ] 

Deneche A. Hakim commented on MAHOUT-323:
-

Committed a basic mapreduce version of TestForest. If you pass -mr to
TestForest it will use Hadoop to classify the data. Each input file is
processed by exactly one mapper. For now the mapreduce version does not yet
compute the confusion matrix; this should come in the next commit.

 Classify new data using Decision Forest
 ---

 Key: MAHOUT-323
 URL: https://issues.apache.org/jira/browse/MAHOUT-323
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Affects Versions: 0.4
Reporter: Deneche A. Hakim
Assignee: Deneche A. Hakim
 Attachments: mahout-323.patch


 When building a Decision Forest we should be able to store it somewhere and 
 use it later to classify new datasets

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: A request for prospective GSOC students

2010-04-03 Thread Lukáš Vlček
Hi,

My proposal had the following subject:
Mahout GSoC 2010 proposal: Association Mining

It was missing a time schedule and further implementation details. I can work
on those missing parts, but I was rather expecting some general discussion
about the topic first, before I invest time in planning and other details. I
can see that Mahout is getting a lot of proposals, and I think some of them
will get reasonable interest from the community. That said, I am fine working
on association mining my own way, without being limited/pushed by the GSoC
timeline into compromises that I do not need to make now. However, comments
from the community about my proposal are still warmly welcome.

Regards,
Lukas

On Sat, Apr 3, 2010 at 5:37 PM, Robin Anil robin.a...@gmail.com wrote:

 I am having a tough time separating Mahout proposals from rest of Apache on
 gsoc website. So I would request you all to reply to this thread when you
 have submitted a proposal so that we don't miss out on reading your hard
 worked proposal. For now I could only find Zhao Zhendong's LIBLINEAR
 proposal. If anyone else have applied do reply back with the title of the
 proposal.

 Robin



Re: Proposal: make collections releases independent of the rest of Mahout

2010-04-03 Thread Grant Ingersoll

On Apr 3, 2010, at 2:22 PM, Benson Margulies wrote:

 On Sat, Apr 3, 2010 at 2:07 PM, Sean Owen sro...@gmail.com wrote:
 
 
 
 Actually it seems like this a valid subproject of a Mahout TLP in its
 own right, if that would be a useful middle-ground status.
 
 I'm not trying to suggest anything different. I'm opposed to having
 'separate committers', but I'm happy to have multiple releasable
 components all in the Mahout TLP.

For those following the subproject saga in Lucene, let's not go down that 
road.  +1 to releasable components, though.  We can release what we want when 
we want.  It doesn't have to be the whole thing all the time.  But I'd say no 
to separate committers, etc.