Dear all,
We (SRDC team) have completed an initial version of the implementation
of contenthub. I am going to upload a patch reflecting the changes by
creating an issue on the Jira server.
In general, we did the following:
* Contenthub has a new store implementation, which stores the
documents in a Solr server.
* Our search component is now integrated into contenthub. In
addition to the text based search on Solr documents, it tries to
search over user supplied ontologies as well as the enhancements
extracted from all documents.
* Faceted search is now supported.
Let me go into the details.
Regarding the Solr backend of contenthub,
* We have used the SolrServerProviderManager and
SolrDirectoryManager from stanbol.commons.solr to initialize an
EmbeddedSolrServer.
* We have created our own core files (indexes) for Solr.
* Currently, only text files can be submitted to contenthub. The whole
text content and the supplied constraints are indexed.
* Content items can be saved and removed. An update function is not
implemented yet.
Regarding the search component,
* Our old implementation consisted of five different search
engines. The approach was to run each of them and merge the
results of each engine. However, this leads to efficiency problems
as the size and amount of data increase. Currently, three
search engines run and results are merged as they arrive. We are
trying to come up with a unified approach to overcome the
efficiency issues. In the end, we are planning to have Solr index
most of the search resources.
* In addition to the search on Solr index, we search over all
enhancements. All enhancements are stored in a single graph on
TCManager. This graph is indexed and searched through LARQ.
EnhancementListener mainly handles this job. In the near future,
we are planning to get rid of this listener approach by storing
the enhancements in Solr aligned with the content items.
* If there is any user ontology in the system, and if the user wants
to include that ontology in the search operation, we index that
ontology and perform a search through LARQ. Matching ontology
resources give us new keywords. This approach will also be
improved as we unify our solution.
* In a search using multiple keywords, currently we do not take the
relation between the keywords into account.
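To illustrate what "results are merged as they arrive" can look like, here is a minimal sketch. It is not the contenthub code: the three engine functions are hypothetical stand-ins that return fixed scores, and summing the scores per document is only one possible merge strategy, assumed here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for the three search engines.
# Each returns a mapping of document id -> relevance score.
def solr_text_search(keyword):
    return {"doc1": 0.5, "doc2": 0.25}

def enhancement_graph_search(keyword):
    return {"doc2": 0.5, "doc3": 0.25}

def ontology_search(keyword):
    return {"doc3": 0.125}

ENGINES = [solr_text_search, enhancement_graph_search, ontology_search]

def merged_search(keyword, engines=ENGINES):
    """Run all engines in parallel and merge each result set as soon as it is ready."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(engine, keyword) for engine in engines]
        # as_completed yields futures in completion order, not submission order,
        # so a slow engine does not block the merging of the faster ones.
        for future in as_completed(futures):
            for doc_id, score in future.result().items():
                # Assumed merge strategy: add up the scores per document.
                merged[doc_id] = merged.get(doc_id, 0.0) + score
    # Highest score first; ties are broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Because the merge is a per-document sum, the final ranking does not depend on which engine answers first.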
Regarding the faceted search,
* Contenthub enables storage of field:[value,] pairs through Solr's
faceted search mechanism. The user is allowed to save any constraint
(field:[value,] pair) along with the content item.
* Our Solr index makes use of dynamic fields to index any value
carried with the content item.
* In the first search, facets are constructed from the fields of the
resulting documents. Later on, the user can make use of the
faceted search features.
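As a rough illustration of the two ideas above (constraints stored in dynamic fields, facets built from the fields of the resulting documents), here is a small sketch. The `_t` suffix and the `id` and `content` field names are assumptions made for the example; they are not taken from our actual Solr schema.

```python
from collections import Counter, defaultdict

def to_solr_document(content_id, text, constraints, suffix="_t"):
    """Map field:[value,] constraints onto dynamic field names, e.g. author -> author_t.

    The suffix and the "id"/"content" field names are hypothetical.
    """
    doc = {"id": content_id, "content": text}
    for field, values in constraints.items():
        doc[field + suffix] = list(values)
    return doc

def build_facets(result_docs, suffix="_t"):
    """Collect the facet fields found in the result documents and count their values."""
    facets = defaultdict(Counter)
    for doc in result_docs:
        for field, values in doc.items():
            if field.endswith(suffix):
                # Strip the dynamic-field suffix to get back the user's field name.
                facets[field[:-len(suffix)]].update(values)
    return {field: dict(counts) for field, counts in facets.items()}
```

For two documents that both carry `author:[anil]`, `build_facets` reports the facet `author` with a count of 2 for `anil`, which is the kind of information the user would then use to narrow the search.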
We are planning to continue with implementing new search engines through
a unified approach, to increase the semantic capabilities of search. For
example, we plan to analyze city-country, person-organization and
person-birthplace relations. Apart from that, we also plan to integrate
the latest version of WordNet to increase the search facilities with
external resources.
Regarding LMF, up to now, we have not considered any collaboration.
However, from now on we will try not to duplicate efforts and focus on
divergent parts of contenthub.
Kind regards,
Anil.
-------- Original Message --------
Subject: Fwd: Re: Contenthub structure
Date: Thu, 18 Aug 2011 15:01:01 +0300
From: Suat Gonul <[email protected]>
To: [email protected]
-------- Original Message --------
Subject: Re: Contenthub structure
Date: Thu, 2 Jun 2011 10:54:15 +0200
From: Rupert Westenthaler <[email protected]>
Reply-To: [email protected]
To: [email protected]
Hi all
I will try to create a small usage scenario here:
A user posts a query for "CMS workshops in France" to the Contenthub:
The semantic Search component of the Contenthub uses several
SearchEngines (like EnhancementEngines in the Enhancer).
1. OntologySearcher: It tries to identify Concepts mentioned in the
Search. For the example it will find the Concept "Workshop"
2. EntitySearcher: It tries to find Entities for words used in the
Query. For the example it will find "France"
3. Faceted Search engine: It will compose a Lucene type search for
Documents with
* a reference to Workshop
* a reference to France
* the text "CMS"
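The three steps above could be sketched roughly as follows. This is only an illustration: the lookup tables stand in for the OntologySearcher and EntitySearcher, and the field names `concept_ref`, `entity_ref` and `text` are hypothetical, not taken from any existing index.

```python
# Hypothetical lookup tables standing in for the OntologySearcher and EntitySearcher.
CONCEPTS = {"workshop": "Workshop", "workshops": "Workshop"}
ENTITIES = {"france": "France"}
STOPWORDS = {"in", "the", "of"}

def compose_query(user_query):
    """Turn a keyword query into a Lucene-style query string."""
    clauses = []
    for word in user_query.split():
        key = word.lower()
        if key in CONCEPTS:
            # Step 1: the word names a known Concept.
            clauses.append('concept_ref:"%s"' % CONCEPTS[key])
        elif key in ENTITIES:
            # Step 2: the word names a known Entity.
            clauses.append('entity_ref:"%s"' % ENTITIES[key])
        elif key not in STOPWORDS:
            # Step 3: everything else is searched as plain text.
            clauses.append('text:"%s"' % word)
    return " AND ".join(clauses)
```

With these tables, `compose_query("CMS workshops in France")` yields `text:"CMS" AND concept_ref:"Workshop" AND entity_ref:"France"`.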
If there were another search engine that could understand the internal
structure of the query, one could even search for things
* with the type Workshop
* located within Paris
* the text "CMS"
and because Workshops are events one could activate Facets for
* Location
* Time
* Participants
* facets explicitly requested with the query (e.g. Tags, Creator ...)
So the Idea is to use
* Ontologies (CMS-Adapter & Kres)
* Entityhub
* maybe neural networks with learned query patterns??
* other stuff??
for query preprocessing and
* full text indices over Documents
* full text indices over Facts (like the Workshop)
* SPARQL endpoints over Enhancements
* other things??
for the execution of the enhanced query.
Joining results from the different sources (Documents, Facts,
Enhancements) would be challenging. However, I think this feature would
not be necessary for a first version.
I would also like to consider this
[Screencast](http://www.srdc.com.tr/iks/2ndyear/DemoVideo.htm) in the
context of this Usage Scenario.
WDYT
Rupert
On Wed, Jun 1, 2011 at 10:26 AM, Olivier Grisel
<[email protected]> wrote:
2011/6/1 Suat Gonul<[email protected]>:
Hi everybody,
After discussing with Rupert yesterday, we have come up with a basic design
for the Contenthub component.
It will provide two main RESTful interfaces to:
1) Upload (register) content and metadata (Available in current
implementation)
2) Search for registered content
There would be Indexing Engines for (1) and Search Engines for (2). The
Contenthub implementation would then implement Indexing Engines to store the
enhancements in a triple store and Search Engines to search enhancements and
content items in triple store.
There is also an already started implementation for the search part in
the Google Code base of the IKS project at [1]. It will be integrated into the
Contenthub component.
What do you think?
I think the default search implementation for content should be based
on fulltext indexing using the EntityHub's SolrYard extended with
faceted search.
I find the combination of fulltext search and facet-based structured
refinements much more intuitive than the traditional multi-field,
form-based search interface.
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
--
| Rupert [email protected]
| Bodenlehenstraße 11 ++43-699-11108907
| A-5500 Bischofshofen