Re: CV Mining (Early adopter program)

Rupert Westenthaler Thu, 01 Mar 2012 14:15:10 -0800

Hi Luca

A really interesting scenario.

On Thu, Mar 1, 2012 at 3:44 PM, Luca Dini <[email protected]> wrote:
>
>    The provision to Stanbol of classes allowing the connection with
> Linguagrid (www.linguagrid.org) and possibly LanguageGrid
> (http://langrid.org/en/index.html).
>    The verification of the extensibility of Stanbol to languages other than
> English (The project will concern CVs written in French).
>

OK, this answers the question from my other email. Could you provide
some additional information (links) about these services? What is the
license of Language Grid? I was not able to find any information about
that.

> The basic goal is to provide them with an open
> source document management system able to deal in an intelligent way with
> non structured CV (or "resumes"), i.e. CVs which comes in Microsoft Word,
> pdf, Open Office etc.

Apache Stanbol now has two EnhancementEngines for processing documents
that are not plain text:

* MetaxaEngine (mainly based on aperture.sourceforge.net)
* TikaEngine (Apache Tika)

Therefore the kind of documents you mentioned should be supported by Stanbol.
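
As a rough illustration of how such a document could be handed to the
Enhancer over its RESTful interface, here is a minimal Python sketch.
It assumes a Stanbol launcher running locally with the Enhancer
available at http://localhost:8080/enhancer; the URL and the helper
name build_enhancer_request are assumptions for illustration, so adjust
them to your installation. The sketch only prepares the request and
does not send it.

```python
from urllib.request import Request

# Assumption: a local Stanbol launcher exposing the Enhancer at /enhancer.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(content: bytes, mime_type: str,
                           url: str = ENHANCER_URL) -> Request:
    """Prepare (but do not send) a POST of a document to the Enhancer."""
    return Request(
        url,
        data=content,
        headers={
            # MIME type of the uploaded CV, e.g. application/pdf
            "Content-Type": mime_type,
            # Ask for the enhancement results serialised as RDF/XML
            "Accept": "application/rdf+xml",
        },
        method="POST",
    )
```

With a running server, the prepared request could be sent with
urllib.request.urlopen(request) and the RDF read from the response.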

>
> This might represent:
>
>    experiences of the candidate
>    skills of the candidate
>    Education level
>    reference data (name, address etc.)
>    contact data
>
> Some of these data might be slightly more structured than just named
> entities, but definitely in the representation power of rdf. Some of them
> could be even more semantically enriched, by providing external information
> on companies, places, specific technologies etc.
>

It is very easy to import data that is available as RDF into Stanbol
and use it for entity extraction and linking. There is also support
for importing existing vCard files. Such data is converted to RDF
using the schema.org schema.

> As a result of this personnel at the HR department would be able to
> formulate queries such as (just an exemplification):
>
>    All CVs of people living in Paris older than 27 years
>    All CVs of people with skills in SQL Server and Java
>    All people who have worked in a high-tech company since November 2011.
>

Do you plan to use the Stanbol Contenthub for semantic search, or does
the CMS you use already support this kind of search?

>
> Challenges
>
> From a technical point of view the most interesting challenge consists in
> integrating the set of Stanbol enhancer, with the semantic web services
> provided at www.linguagrid.org. In principle it should not be a different
> integration than what has already been made with OpenCalais WS and Zemanta
> WS. However there are at least two major challenges:
>
>    Multilinguality. The extraction will consider French documents rather
> than English ones. Moreover, in a second phase (not covered by the present
> project), the whole system could be extended to Italian and French.

Stanbol already supports multilingual scenarios well. The LangId
engine can be used to detect the language of a document (internally
it uses Apache Tika) and stores the detected language in the metadata.
Other engines can use this language for further processing.

When dealing with French you might want to update the configuration of
the SolrCore used to store the controlled vocabulary with
French-specific settings such as stop words and stemmers. This will
improve the results of the NamedEntityTaggingEngine and the
KeywordLinkingEngine.

>    Ontological extension. While CVs typically contains quite a lot of named
> entities which are already covered by Stanbol (e.g. geographical names, time
> expressions, Company names, person names) there are entities which will need
> some ontology extension such as skills and education.
>    Structural Complexity. In a CV instances of entities are linked to each
> other in a structurally complex way. For instance places are not just a flat
> list of geographical entities, but they are likely to be connected with
> periods, with job types, with companies, etc. Handling this structural
> complexity represents an important challenge.
>

This might indeed be a challenge. I would start by splitting the
content into smaller pieces (e.g. sentences) and try to group the
entities extracted from each piece.
If you then build a semantic index that stores such pieces as separate
documents, even searches for a job type at a specific company could
work quite nicely.

Such a system would not really "understand" the structural complexity,
but it should still be able to present users with good search results.
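
To make the idea of grouping entities by sentence a bit more concrete,
here is a small Python sketch. It is not Stanbol code: the entity
tuples (label, type, character offset) and both function names are
hypothetical, and the sentence splitter is deliberately naive (it
splits after ".", "!" or "?" followed by whitespace, so abbreviations
will confuse it). Each resulting group could then be indexed as its
own small document.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Return (start, end) character spans of naively detected sentences."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?]+\s+", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        # Trailing text without sentence-final whitespace
        spans.append((start, len(text)))
    return spans


def group_entities(text, entities):
    """Group (label, type, offset) entities by the sentence containing them."""
    spans = split_sentences(text)
    groups = defaultdict(list)
    for label, etype, offset in entities:
        for index, (start, end) in enumerate(spans):
            if start <= offset < end:
                groups[index].append((label, etype))
                break
    return dict(groups)
```

Entities that end up in the same group (say a company, a place and a
job type) can be treated as related without modelling the structure of
the CV explicitly.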

best
Rupert

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
