Re: Contenthub structure

Ali Anil Sinaci Tue, 04 Oct 2011 02:18:36 -0700

Dear all,

We (SRDC team) have completed an initial version of the implementation of contenthub. I am going to upload a patch reflecting the changes by creating an issue on the Jira server.


In general, we did the following:

   * Contenthub has a new store implementation, which stores the
     documents in a Solr server.
   * Our search component is now integrated into contenthub. In
     addition to the text based search on Solr documents, it tries to
     search over user supplied ontologies as well as the enhancements
     extracted from all documents.
   * Faceted search

Let me go into the details.

Regarding the Solr backend of contenthub,

   * We have used the SolrServerProviderManager and
     SolrDirectoryManager from stanbol.commons.solr to initialize an
     EmbeddedSolrServer.
   * We have created our own core files (indexes) for Solr.
   * Currently, only text files can be submitted to contenthub. The
     whole text content and the supplied constraints are indexed.
   * Content items can be saved and removed. The update function is
     not implemented yet.

Regarding the search component,

   * Our old implementation consisted of five different search
     engines. The approach was to run each of them and merge the
     results of each engine. However, this leads to efficiency problems
     as the volume of data increases. Currently, three
     search engines run and results are merged as they arrive. We are
     trying to come up with a unified approach to overcome the
     efficiency issues. In the end, we are planning to have Solr index
     most of the search resources.
   * In addition to the search on Solr index, we search over all
     enhancements. All enhancements are stored in a single graph on
     TCManager. This graph is indexed and searched through LARQ.
     EnhancementListener mainly handles this job. In the near future,
     we are planning to get rid of this listener approach by storing
     the enhancements in Solr aligned with the content items.
   * If there is any user ontology in the system, and if the user wants
     to include that ontology in the search operation, we index that
     ontology and perform a search through LARQ. Matching ontology
     resources give us new keywords. This approach will also be
     improved as we unify our solution.
   * In a search using multiple keywords, currently we do not take the
     relation between the keywords into account.
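
To illustrate the "merged as they arrive" idea, here is a minimal Python sketch. It is not our actual implementation: the three engine functions, their scores, and the choice to sum scores per document are all assumptions made only for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for the three search engines. Each one returns
# a mapping of {document_id: score}; the values are made up.
def solr_engine(keywords):
    return {"doc1": 0.9, "doc2": 0.4}

def enhancement_engine(keywords):
    return {"doc2": 0.3, "doc3": 0.5}

def ontology_engine(keywords):
    return {"doc1": 0.2}

def merge_as_they_arrive(engines, keywords):
    """Run all engines concurrently and fold each result set into the
    merged ranking as soon as that engine finishes."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(engine, keywords) for engine in engines]
        for future in as_completed(futures):
            for doc_id, score in future.result().items():
                # Assumption: scores from different engines are simply summed.
                merged[doc_id] = merged.get(doc_id, 0.0) + score
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

results = merge_as_they_arrive(
    [solr_engine, enhancement_engine, ontology_engine],
    ["cms", "workshop"],
)
```

Because every engine still has to finish before the final ranking is known, this only hides latency; it does not remove the cost of running several engines, which is why a unified Solr-based index is the longer-term plan.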

Regarding the faceted search,

   * Contenthub enables storage of field:[value,] pairs through the
     Solr faceted search mechanism. The user is allowed to save any
     constraint (field:[value,] pair) along with the content item.
   * Our Solr index makes use of dynamic fields to index any value
     carried with the content item.
   * In the first search, facets are constructed from the fields of
     the resulting documents. Later on, the user is allowed to make
     use of the faceted search features.
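
A rough Python sketch of the idea follows. It does not talk to Solr; it only mimics how constraints could be mapped onto dynamic fields and how facets could be derived from the fields of the resulting documents. The "_s" suffix, the "id" and "content" field names, and all function names are assumptions for the example, not our actual schema.

```python
from collections import Counter, defaultdict

# Assumption: the Solr schema declares a dynamic field pattern such as "*_s",
# so any constraint name can be indexed without changing the schema.
DYNAMIC_SUFFIX = "_s"

def to_solr_document(content_id, text, constraints):
    """Build an index document from the text and the field:[value,] pairs."""
    doc = {"id": content_id, "content": text}
    for field, values in constraints.items():
        doc[field + DYNAMIC_SUFFIX] = list(values)
    return doc

def build_facets(documents):
    """Construct facets (field -> value -> count) from the result documents."""
    facets = defaultdict(Counter)
    for doc in documents:
        for field, values in doc.items():
            if field.endswith(DYNAMIC_SUFFIX):
                facets[field[:-len(DYNAMIC_SUFFIX)]].update(values)
    return {field: dict(counts) for field, counts in facets.items()}

def refine(documents, field, value):
    """Narrow a result list to the documents carrying the chosen facet value."""
    return [d for d in documents if value in d.get(field + DYNAMIC_SUFFIX, [])]

docs = [
    to_solr_document("doc1", "CMS workshop in Paris",
                     {"author": ["anil"], "tag": ["cms", "event"]}),
    to_solr_document("doc2", "Solr notes",
                     {"author": ["suat"], "tag": ["cms"]}),
]
facets = build_facets(docs)
events = refine(docs, "tag", "event")
```

In a real deployment Solr computes the facet counts itself; the sketch only shows which information the first search has to return so that the later refinements become possible.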

We are planning to continue with implementing new search engines through a unified approach, to increase the semantic capabilities of search. For example, we plan to analyze city-country, person-organization, and person-birthplace relations. Apart from that, we also plan to integrate the latest version of Wordnet to extend the search facilities with external resources.

Regarding LMF, up to now, we have not considered any collaboration. However, from now on we will try not to duplicate efforts and focus on divergent parts of contenthub.


Kind regards,

Anil.




-------- Original Message --------
Subject:        Fwd: Re: Contenthub structure
Date:   Thu, 18 Aug 2011 15:01:01 +0300
From:   Suat Gonul <[email protected]>
To:     [email protected]





-------- Original Message --------
Subject:        Re: Contenthub structure
Date:   Thu, 2 Jun 2011 10:54:15 +0200
From:   Rupert Westenthaler <[email protected]>
Reply-To:       [email protected]
To:     [email protected]



Hi all

I will try to create a small usage scenario here:

A user posts a query for "CMS workshops in France" to the Contenthub:

The semantic Search component of the Contenthub uses several
SearchEngines (like EnhancementEngines in the Enhancer).

1. OntologySearcher: It tries to identify Concepts mentioned in the
Search. For the example it will find the Concept "Workshop"
2. EntitySearcher: It tries to find Entities for words used in the
Query. For the example it will find "France"
3. Faceted Search engine: It will compose a Lucene type search for
Documents with
 * a reference Workshop
 * a reference to France
 * the text "CMS"
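
A minimal Python sketch of this three-step flow is given below. The lookup tables stand in for the ontology and the Entityhub, and the Lucene field names "reference" and "text" are hypothetical; none of this is taken from an existing implementation.

```python
# Hypothetical vocabularies standing in for the ontology and the Entityhub.
CONCEPTS = {"workshop": "Workshop", "workshops": "Workshop"}
ENTITIES = {"france": "France"}
STOPWORDS = {"in", "the", "of"}

def preprocess(query):
    """Split the query into concepts, entities and remaining free text."""
    concepts, entities, text = [], [], []
    for token in query.split():
        key = token.lower()
        if key in CONCEPTS:            # step 1: OntologySearcher
            concepts.append(CONCEPTS[key])
        elif key in ENTITIES:          # step 2: EntitySearcher
            entities.append(ENTITIES[key])
        elif key not in STOPWORDS:     # everything else stays plain text
            text.append(token)
    return concepts, entities, text

def compose_lucene_query(concepts, entities, text):
    """Step 3: compose a Lucene-style query requiring every clause."""
    clauses = ['+reference:"%s"' % ref for ref in concepts + entities]
    clauses += ['+text:"%s"' % term for term in text]
    return " ".join(clauses)

query = compose_lucene_query(*preprocess("CMS workshops in France"))
print(query)
```

For the example query this yields one required clause per reference (Workshop, France) plus one for the text "CMS", which matches the three bullet points above.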

If there were another Search engine that could understand the internal
structure of the query, one could even search for things
* with the type Workshop
* located within Paris
* the text "CMS"
and because Workshops are events one could activate Facets for
* Location
* Time
* Participants
* facets explicitly requested with the query (e.g. Tags, Creator ...)

So the Idea is to use

* Ontologies (CMS-Adapter & Kres)
* Entityhub
* maybe neural networks with learned query patterns??
* other stuff??

for query preprocessing and

* full text indices over Documents
* full text indices over Facts (like the Workshop)
* SPARQL endpoints over Enhancements
* other things??

for the execution of the enhanced query.

Joining results from the different sources (Documents, Facts,
Enhancements) would be challenging. However I think this feature would
not be necessary for a first version.

I would also like to consider this
[Screencast](http://www.srdc.com.tr/iks/2ndyear/DemoVideo.htm) in the
context of this Usage Scenario.

WDYT
Rupert

On Wed, Jun 1, 2011 at 10:26 AM, Olivier Grisel
<[email protected]>  wrote:

 2011/6/1 Suat Gonul<[email protected]>:

 Hi everybody,

 After discussing with Rupert yesterday, we have come up with a basic design
 for the Contenthub component.

 It will provide two main RESTful interfaces to:

 1) Upload (register) content and metadata (Available in current
 implementation)
 2) Search for registered content

 There would be Indexing Engines for (1) and Search Engines for (2). The
 Contenthub implementation would then implement Indexing Engines to store the
 enhancements in a triple store and Search Engines to search enhancements and
 content items in triple store.

 There is also an already started implementation for the search part in
 the Google Code base of the IKS project at [1]. It will be integrated
 into the Contenthub component.

 What do you think?


 I think the default search implementation for content should be based
 on fulltext indexing using the EntityHub's SolrYard extended with
 faceted search.

 I find the combination of fulltext search and facet-based structured
 refinements much more intuitive than the traditional multi-field
 form-based search interface.

 --
 Olivier
 http://twitter.com/ogrisel - http://github.com/ogrisel




--
| Rupert [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
