Develop a Hadoop-based toolset to build categorized TF-IDF corpora for 
training document classification models
--------------------------------------------------------------------------------------------------------

                 Key: NXSEM-7
                 URL: http://jira.nuxeo.org/browse/NXSEM-7
             Project: Nuxeo Semantic R&D
          Issue Type: Task
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


The toolset should be packaged for easy deployment on AWS using the 
Cloudera Distribution for Hadoop 2 AMI [1] and the Wikipedia XML dump AWS 
Public Dataset [2].

The Mahout project already has a partial implementation of this (it lacks the 
TF-IDF [3] weighting step). To avoid loading a huge term dictionary in memory, 
we plan to leverage a hashed representation [4] of the term and document 
frequencies.

[1] http://archive.cloudera.com/docs/ec2.html
[2] 
http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html
[3] http://en.wikipedia.org/wiki/Tf-idf
[4] http://hunch.net/~jl/projects/hash_reps/index.html
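The hashed-representation idea above can be sketched in a few lines: instead of keeping a term-to-index dictionary, each term is hashed into a fixed-size bucket space, and TF and DF counts are accumulated per bucket. The sketch below is a minimal single-machine Python illustration, not the planned Hadoop/Mahout implementation; the bucket count, hash function, and IDF formula are illustrative assumptions.

```python
import hashlib
import math

# Illustrative assumption: a fixed bucket space replaces the term dictionary.
N_BUCKETS = 2 ** 18

def bucket(term):
    # Map a term to a stable bucket index via hashing (the "hashing trick").
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_BUCKETS

def hashed_tf(tokens):
    # Sparse term-frequency vector keyed by bucket index, not by term string.
    tf = {}
    for tok in tokens:
        b = bucket(tok)
        tf[b] = tf.get(b, 0) + 1
    return tf

def hashed_tfidf(docs):
    # docs: list of token lists. Returns one sparse TF-IDF vector per document.
    tfs = [hashed_tf(d) for d in docs]
    df = {}  # document frequency per bucket
    for tf in tfs:
        for b in tf:
            df[b] = df.get(b, 0) + 1
    n = len(docs)
    return [{b: f * math.log(n / df[b]) for b, f in tf.items()} for tf in tfs]
```

On Hadoop, the same per-bucket counts would be computed by map-reduce jobs over the corpus partitions; distinct terms may collide into one bucket, which is the accepted trade-off of the hashed representation.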


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://jira.nuxeo.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
ECM-tickets mailing list
[email protected]
http://lists.nuxeo.com/mailman/listinfo/ecm-tickets
