[ 
https://jira.nuxeo.com/browse/NXSEM-7?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=95570#comment-95570
 ] 

Olivier Grisel commented on NXSEM-7:
------------------------------------

Work is underway here: 
https://github.com/ogrisel/pignlproc/tree/master/examples/topic-corpus

The goal is to output topic usage samples as NTriples suitable for indexing by 
the SolrYard of the Stanbol EntityHub component. The SolrYard has been extended 
so that it can perform MoreLikeThis queries without HTTP posting (using the 
default Embedded Solr setup). A new Stanbol engine will be developed to perform 
topic assignment (i.e. text classification) based on this indexed taxonomy.

  https://issues.apache.org/jira/browse/STANBOL-197

> Develop Hadoop-based toolset to build categorized TF-IDF corpora to train 
> document classification models
> --------------------------------------------------------------------------------------------------------
>
>                 Key: NXSEM-7
>                 URL: https://jira.nuxeo.com/browse/NXSEM-7
>             Project: Nuxeo Semantic R&D
>          Issue Type: Task
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>             Fix For: 5.4.2
>
>
> The toolset should be packaged to be easily deployable on AWS using the 
> Cloudera Distribution for Hadoop 2 AMI [1] and the Wikipedia XML dump AWS 
> Dataset [2].
> The Mahout project already has a partial implementation of this (lacking 
> the TF-IDF [3] part). To avoid having to load a huge dictionary in memory, we 
> plan to use a hashed representation [4] of the term and document 
> frequencies.
> [1] http://archive.cloudera.com/docs/ec2.html
> [2] http://aws.typepad.com/aws/2009/09/new-public-data-set-wikipedia-xml-data.html
> [3] http://en.wikipedia.org/wiki/Tf-idf
> [4] http://hunch.net/~jl/projects/hash_reps/index.html
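
As a rough illustration of the hashed representation mentioned in the issue 
description, here is a minimal Python sketch. It is not the pignlproc or 
Mahout code: the bucket count, the CRC32 hash, the smoothed IDF formula and 
all function names are assumptions chosen for the example. The idea is that 
each term is mapped to a fixed-size bucket index, so term and document 
frequencies are kept per bucket and no term dictionary has to be held in 
memory.

```python
import math
import zlib
from collections import Counter

# Hypothetical table size; a real job would tune it against the collision rate.
N_BUCKETS = 2 ** 18


def bucket(term, n_buckets=N_BUCKETS):
    """Map a term to a bucket index with a hash that is stable across runs."""
    return zlib.crc32(term.encode("utf-8")) % n_buckets


def hashed_tf(tokens, n_buckets=N_BUCKETS):
    """Term frequencies of one document, keyed by bucket index."""
    return Counter(bucket(t, n_buckets) for t in tokens)


def hashed_df(docs, n_buckets=N_BUCKETS):
    """Document frequencies over a corpus, keyed by bucket index."""
    df = Counter()
    for tokens in docs:
        # Each document contributes at most once per bucket.
        df.update({bucket(t, n_buckets) for t in tokens})
    return df


def hashed_tfidf(tokens, df, n_docs, n_buckets=N_BUCKETS):
    """Sparse TF-IDF vector of one document using a smoothed IDF (assumed)."""
    tf = hashed_tf(tokens, n_buckets)
    return {
        b: count * (math.log((1 + n_docs) / (1 + df[b])) + 1.0)
        for b, count in tf.items()
    }


docs = [["hadoop", "mahout", "hadoop"], ["hadoop", "solr"]]
df = hashed_df(docs)
vector = hashed_tfidf(docs[0], df, n_docs=len(docs))
```

Distinct terms that collide in the same bucket simply share their counts; 
with enough buckets this is expected to have little effect on 
classification quality, at the cost of no longer being able to map a 
bucket back to a single term.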


        
_______________________________________________
ECM-tickets mailing list
[email protected]
http://lists.nuxeo.com/mailman/listinfo/ecm-tickets

Reply via email to