On 02.03.2012, at 10:53, Luca Dini wrote:
> The reason is that licensing varies according to the service provider. As you 
> have seen, we are not the only providers via linguagrid. As far as our 
> services are concerned, they are open access but not open source. In short, 
> this means:
> 1) unlimited access for research/educational purposes, with support for 
> integration etc.
> 2) free access for "commercial purposes", with no service level guarantee.
> 3) paid access (subscription or pay per use) if some service level 
> guarantee is needed. Prices vary of course depending on volumes, constraints, 
> response time, etc.
> 
> Concerning Stanbol, as IKS is a research project we are willing to give 
> unlimited access to all Stanbol instances. Of course the limitation is 
> represented by the computational power of the Amazon WS instances where 
> linguagrid and related services are hosted. In the case of massive adoption 
> and the need to activate many instances (they have a cost) we would be 
> forced to impose some kind of fee. But this is a future scenario, as 
> currently linguagrid seems to scale rather well.

Thanks for the clarification! This would be very similar to already existing 
engines such as Zemanta and genomes.org.
For Stanbol users who do not have a problem with sending content to external 
services, linguagrid would then be a real alternative to OpenNLP.

>> Stanbol already nicely supports multilingual scenarios. The LangId
>> engine can be used to detect the language of a document (internally
>> using Apache Tika) and stores the detected language in the metadata.
>> Other engines can use this language for further processing.
> That's great: probably my consideration of multilinguality as a challenge was 
> due to the fact that most integrated linguistic engines were dealing with 
> English. I was also wondering if the strategies for matching a given 
> named entity with e.g. a DBpedia URL are completely language independent.
Matching entities uses:

* labels with the given language
* labels without any defined language

In addition you can configure a "default matching language". This is useful for 
datasets like DBpedia where all string values get the language of the 
extracted dataset.

So these engines are language independent. The only thing you need to ensure is 
that labels in the target language are included in the index.
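To illustrate the label-selection strategy described above, here is a minimal, hypothetical sketch (the function and data are illustrations only, not the actual Stanbol API): labels in the content's language are considered, plus labels without a language tag, plus labels in the configured default matching language.

```python
# Hypothetical sketch of the matching strategy: an entity's labels are
# matched using (1) labels in the content's language, (2) labels without
# any language tag, and (3) labels in a configured "default matching
# language" (useful for datasets like DBpedia where every literal
# carries the language of the extracted dataset).

def candidate_labels(labels, content_lang, default_lang="en"):
    """Return the labels considered for matching.

    `labels` is a list of (text, lang) pairs; lang may be None for
    plain literals without a language tag.
    """
    wanted = {content_lang, None, default_lang}
    return [text for text, lang in labels if lang in wanted]

labels = [
    ("Munich", "en"),
    ("München", "de"),
    ("Monaco di Baviera", "it"),
    ("MUC", None),          # plain literal, no language tag
]

# For German content: German labels, untagged labels, and the
# default matching language "en" are all considered.
print(candidate_labels(labels, content_lang="de"))
# → ['Munich', 'München', 'MUC']
```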

>> When dealing with French you might want to update the Configuration of
>> the SolrCore used to store the Controlled vocabulary with French
>> specific configurations such as stop words, stemmers ... This will
>> improve the results for the NamedEntityTaggingEngine and
>> KeywordLinkingEngine engine.
> I understand this for the KeywordLinkingEngine, but not completely for the 
> NamedEntityTaggingEngine. In our view we will have to integrate a new 
> French/Italian NamedEntityTaggingEngine which will handle stop words and all 
> other language related aspects internally. But this belief might just be due 
> to the fact that our knowledge of the whole system is still limited.
> 

In principle you are right.
Stop words have no influence on the NamedEntityTaggingEngine. However, stemmers 
might improve linking results (depending on the language). Also things like 
folding [1], word delimiter rules [2], or more advanced things like phonetic 
matching [3] could be realized by adjusting the Solr configuration.

[1] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
 
[2] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
[3] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
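As a rough illustration, a French-specific Solr field type combining the filters above could look like the following sketch. The exact filter selection and the stopword/elision file names are assumptions for illustration, not the actual Stanbol defaults:

```xml
<!-- Hypothetical schema.xml field type for French labels; file names
     and filter selection are illustrative assumptions. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- strip l', d', ... elisions before stop word removal -->
    <filter class="solr.ElisionFilterFactory" articles="lang/contractions_fr.txt"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_fr.txt"
            ignoreCase="true"/>
    <!-- accent/case folding [1] -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- word delimiter rules [2] -->
    <filter class="solr.WordDelimiterFilterFactory"/>
    <!-- a light French stemmer -->
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>
```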

>> This might be indeed a challenge. I would start to split up the
>> content in smaller pieces (e.g. sentences) and try to group Entities
>> extracted from such parts.
>> If you then build a semantic index that stores such pieces as their own
>> documents, even searches for a job type at a specific company could
>> work quite nicely.
> We will follow the approach you describe: if I understand correctly, you 
> propose to make use of atomic information (e.g. an experienceLine) as a 
> kind of document, in such a way that it is possible to formulate queries 
> such as "all documents of type experienceLine which contain a job X and a 
> company Y", right?

exactly. 
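The idea above can be sketched as follows; the field names ("type", "entities") and the in-memory index are hypothetical illustrations, not a Stanbol API:

```python
# Hypothetical sketch: each extracted piece (e.g. an experienceLine)
# is indexed as its own document, so a query can require that several
# entities co-occur within the same piece rather than merely within
# the same CV.

def search(index, doc_type, required_entities):
    """Return documents of `doc_type` containing all `required_entities`."""
    return [doc for doc in index
            if doc["type"] == doc_type
            and required_entities <= doc["entities"]]

index = [
    {"type": "experienceLine",
     "text": "Software engineer at Acme Corp (2008-2010)",
     "entities": {"software engineer", "Acme Corp"}},
    {"type": "experienceLine",
     "text": "Project manager at Beta Ltd (2010-2012)",
     "entities": {"project manager", "Beta Ltd"}},
]

# Only the piece where both entities co-occur matches.
hits = search(index, "experienceLine", {"software engineer", "Acme Corp"})
print([doc["text"] for doc in hits])
# → ['Software engineer at Acme Corp (2008-2010)']
```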

best
Rupert
