[
https://jira.nuxeo.com/browse/NXSEM-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olivier Grisel resolved NXSEM-8.
--------------------------------
Resolution: Won't Fix
This feature will be delegated to Stanbol and its EntityHub and ContentHub
related components.
> CoreEventListener + service to build 64-bit semantic hashes of documents with
> text content (PDF, office, XHTML, ...)
> ------------------------------------------------------------------------------------------------------------------
>
> Key: NXSEM-8
> URL: https://jira.nuxeo.com/browse/NXSEM-8
> Project: Nuxeo Semantic R&D
> Issue Type: Task
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
> Fix For: 5.4.2
>
>
> Using stacked denoising autoencoders (SDA) [1], spectral hashing (SH) [2],
> locality sensitive hashing (LSH) [3][4] or binary reconstructive embeddings
> (BRE) [5], build a service that is able to extract 64-bit colliding hashes of
> documents such that low Hamming distances in the hash space mean highly
> related content in the implicit human semantic space.
> The lshkit [6] project provides SH and LSH implementations. The libsgd project
> [7] should also soon provide an SDA implementation, albeit with a dense
> representation that might not scale to the several thousand dimensions of
> the documents' TF-IDF input space. Maybe SDA and libsgd should first be
> tested on picture semantic hashing instead.
> Before starting the implementation of this service, several algorithms and
> implementations should be benchmarked on a small tokenized, TF-IDF-weighted
> Wikipedia subset to get a grasp of the performance requirements (CPU time and
> memory usage) of each option.
> The end-user goal of semantic hashing is to complement the fulltext
> indexes with another very scalable implementation of content-based search
> (using keyword queries), or to allow browsing the content of the Nuxeo
> document repository by document similarity instead of by workspace
> location. Such a browsing user interface could be built upon the JS
> InfoViz Toolkit library [8].
> [1] http://www.cs.toronto.edu/.../aistats_2009_robust_interdependent.pdf
> [2] http://people.csail.mit.edu/torralba/.../spectralhashing.pdf
> [3] http://www.mit.edu/~andoni/LSH/
> [4] http://www.cs.utexas.edu/~grauman/papers/iccv2009_klsh.pdf
> [5] http://www.eecs.berkeley.edu/~kulis/pubs/hashing_bre_tr.pdf
> [6] http://lshkit.sourceforge.net/
> [7] http://bitbucket.org/ogrisel/libsgd/src/
> [8] http://thejit.org/
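
To make the hashing idea in the quoted issue concrete, here is a minimal
sketch of one member of the LSH family: a SimHash-style sign random
projection over TF-IDF weights. It is not the SDA, SH or BRE approach, it
does not use lshkit or libsgd, and it is not code from the Nuxeo project.
The function names (tfidf, simhash64, hamming), the smoothed IDF formula and
the toy corpus are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, so that no
    term ends up with a zero weight.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0)
         for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def simhash64(weights):
    """Collapse a sparse {term: weight} vector into a 64-bit integer.

    Each term votes +weight or -weight on every bit position, depending on
    the matching bit of a 64-bit digest of the term. The sign of each
    total becomes one bit of the document hash.
    """
    totals = [0.0] * 64
    for term, w in weights.items():
        digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(64):
            totals[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i, t in enumerate(totals) if t > 0)


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Hypothetical toy corpus, already tokenized.
    corpus = [
        "semantic hashing of text documents".split(),
        "semantic hashing of office documents".split(),
        "workspace permissions and user groups".split(),
    ]
    hashes = [simhash64(w) for w in tfidf(corpus)]
    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            print(i, j, hamming(hashes[i], hashes[j]))
```

With this scheme, documents whose weighted term vectors point in similar
directions tend to agree on more bits, so a small Hamming distance is a
cheap proxy for cosine similarity. On a corpus this small the distances are
noisy. A real benchmark would need the Wikipedia subset mentioned in the
issue.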