[ 
https://jira.nuxeo.com/browse/NXSEM-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Grisel resolved NXSEM-8.
--------------------------------

    Resolution: Won't Fix

This feature will be delegated to Stanbol and its EntityHub and ContentHub
related components.

> CoreEventListener + service to build 64-bit semantic hashes of documents with
> text content (PDF, office, xhtml, ...)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NXSEM-8
>                 URL: https://jira.nuxeo.com/browse/NXSEM-8
>             Project: Nuxeo Semantic R&D
>          Issue Type: Task
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>             Fix For: 5.4.2
>
>
> Using stacked denoising autoencoders (SDA) [1], spectral hashing (SH) [2],
> locality sensitive hashing (LSH) [3][4] or binary reconstructive embeddings
> (BRE) [5], build a service that extracts 64-bit colliding hashes of
> documents such that low Hamming distances in the hash space mean highly
> related content in the implicit human semantic space.
> The lshkit [6] project provides SH and LSH implementations. The libsgd project
> [7] should also soon provide an SDA implementation, albeit with a dense
> representation that might not scale to the several tens of hundreds of
> dimensions of the documents' TF-IDF input space. Maybe SDA and libsgd should
> first be tested on picture semantic hashing instead.
> Before starting the implementation of this service, several algorithms /
> implementations should be benchmarked on a small tokenized / TF-IDF'ed
> Wikipedia subset to get a grasp of the performance requirements (CPU time /
> memory usage) of each option.
> The end-user goal of semantic hashing is to complement the fulltext
> indexes with another very scalable implementation of content-based search
> (using keyword queries), or to allow browsing the content of the Nuxeo
> document repository based on document similarities instead of workspace
> location. Such a browsing user interface could be built upon the JS
> InfoViz Toolkit lib [8].
> [1] http://www.cs.toronto.edu/.../aistats_2009_robust_interdependent.pdf
> [2] http://people.csail.mit.edu/torralba/.../spectralhashing.pdf
> [3] http://www.mit.edu/~andoni/LSH/
> [4] http://www.cs.utexas.edu/~grauman/papers/iccv2009_klsh.pdf
> [5] http://www.eecs.berkeley.edu/~kulis/pubs/hashing_bre_tr.pdf
> [6] http://lshkit.sourceforge.net/
> [7] http://bitbucket.org/ogrisel/libsgd/src/ 
> [8] http://thejit.org/
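
To make the "low Hamming distance means related content" idea in the quoted
description concrete, here is a minimal sketch of a 64-bit hash in the LSH
family. It is an assumption-laden illustration, not the service proposed in
the ticket: it uses a SimHash-style sign hash (chosen here only because it
needs nothing beyond the Python standard library) rather than SDA, SH or BRE,
it does not use lshkit or libsgd, and it weights tokens by raw term frequency
instead of TF-IDF. The function names simhash64 and hamming are hypothetical.

```python
import hashlib
from collections import Counter


def token_bits(token):
    """Map a token to a stable pseudo-random 64-bit integer."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash64(tokens):
    """Return a 64-bit SimHash-style fingerprint of a bag of tokens.

    Each token votes +weight or -weight on every bit position, depending on
    the corresponding bit of its own hash. The sign of each accumulated vote
    gives one bit of the document hash. Weights are raw term frequencies
    (an assumption; the ticket talks about TF-IDF).
    """
    votes = [0] * 64
    for token, count in Counter(tokens).items():
        bits = token_bits(token)
        for i in range(64):
            if (bits >> i) & 1:
                votes[i] += count
            else:
                votes[i] -= count
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "semantic hashing of text documents for similarity search".split()
    doc_b = "semantic hashing of text documents for similarity browsing".split()
    doc_c = "quarterly invoice totals and shipping address updates".split()
    h_a, h_b, h_c = simhash64(doc_a), simhash64(doc_b), simhash64(doc_c)
    print("a vs b:", hamming(h_a, h_b))
    print("a vs c:", hamming(h_a, h_c))
```

Documents sharing most of their tokens should usually end up a few bits
apart, while unrelated documents tend to differ in roughly half of the 64
bits. A learned method such as SDA or SH would aim to capture semantic
relatedness beyond shared tokens, which this sketch does not attempt.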

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        