Re: PDF Description Extraction For Linked data

Maatari Daniel Okouya Wed, 28 May 2014 17:29:18 -0700

Rafa, 

Many thanks for your detailed answer.


Judging from your detailed answer, it seems I had not completely grasped the 
concepts behind Stanbol. Its primary purpose is to semantically annotate the 
content of a file for semantic search. One could repurpose the enhancement 
infrastructure to get the generated description and then apply some SPARQL 
rules to bring it into the desired format, but Stanbol is not geared toward 
Linked Data out of the box. By that I mean generating a description that you 
could publish as is, which is what I was looking for. As you say, the best 
match here is the description returned by the Topic Annotation Engine, plus 
maybe a few things extracted by Tika. 
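
To make the "reuse the enhancements and reshape them" idea concrete, here is a 
minimal sketch in plain Python rather than SPARQL. It assumes the enhancer 
output has already been parsed into (subject, predicate, object) tuples, and 
that topic annotations carry the type fise:TopicAnnotation and point to their 
concept with fise:entity-reference. The namespace, the function name and the 
choice of dcterms:subject are my assumptions, to be checked against real 
enhancer output.

```python
# Assumed vocabulary names; verify them against actual enhancer output.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FISE = "http://fise.iks-project.eu/ontology/"
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def topics_as_subjects(triples, document_uri):
    """Turn topic annotations into (document, dcterms:subject, topic) triples.

    `triples` is an iterable of (subject, predicate, object) string tuples.
    This helper is hypothetical and not part of Stanbol.
    """
    triples = list(triples)
    # Collect the annotation nodes that are typed as topic annotations.
    topic_annotations = {
        s for s, p, o in triples
        if p == RDF_TYPE and o == FISE + "TopicAnnotation"
    }
    # Keep only the concepts referenced by those annotations.
    return sorted(
        (document_uri, DCTERMS_SUBJECT, o)
        for s, p, o in triples
        if s in topic_annotations and p == FISE + "entity-reference"
    )
```

The resulting triples use the document's own URI as subject, so they could be 
published without the intermediate annotation nodes.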

I still need to read a bit more, but this is what I understand for now from 
your explanation and my own reading. 

Am I close?

Best, 
-M-
-- 
Maatari Daniel Okouya
Sent with Airmail

On 28 May 2014 at 13:46:00, Rafa Haro ([email protected]) wrote:

Hi Maatari,  

On 27/05/14 21:05, Maatari Daniel Okouya wrote:  
> Hi ,  
>  
> To complete my previous question, I think it would be better for me to give 
> the bigger picture of what I’m trying to achieve.  
>  
>  
> I have been charged with helping to disseminate the content of my 
> organisation's publications. Most of them are in PDF.  
>  
> Therefore, I need a process to produce a meaningful RDF description of our 
> content that links as much as possible to the LOD cloud and LOV (Linked Open 
> Vocabularies). Hence I need to use common core vocabularies as much as I can, 
> i.e. Dublin Core, schema.org, BIBO, FOAF, etc., and reference entities from 
> DBpedia, for instance.  
>  
> Searching around the web for how to automatically generate these 
> descriptions, which would include creator, publisher, primaryTopic, subject, 
> thematic, etc., it seems to me that Apache Stanbol is the best match.  
With Stanbol you can enrich your content with your own vocabularies or  
datasets from the LOD cloud, as long as you first import them as a site.  
Let's say that the "out of the box" enrichment process consists of linking  
pieces of text (like entity/concept names or labels) with entities  
within your datasets.  
>  
> So that’s it: in the first place, I would like to automatically generate a 
> description of my PDF publications, though not a very rich one. We are not 
> yet planning to provide semantic search. It will probably come in the future.  
I would say that what you need is not related to Entity Linking for now.  
The closest resource that you can use in Stanbol for categorizing your  
content in that way is the Topic Annotation Engine, which is able to  
classify your content according to a pre-trained model using a certain  
set of categories. Those categories should correspond to concepts from a  
Stanbol site. Please note that things like primaryTopic, subject,  
thematic... usually cannot be extracted without first training a  
model on already annotated content. There are, of course,  
unsupervised alternatives like Latent Semantic Analysis or Latent  
Dirichlet Allocation that can be used to extract main terms as topics  
for your content, but currently there is no support for those in Stanbol.  
>  
> However, for now I’m interested in providing some bibliographic data and 
> stating the main topics of the publication, i.e. what it talks about, 
> generally speaking.  
If the PDFs have correct metadata, you can use Tika to extract it.  
Someone on the list may correct me but, as far as I know, the  
current Tika engine in Stanbol is used to extract the content so that it  
can be enriched later, and it does not map the extracted metadata to RDF.  
I'm not 100% sure about this but, in any case, implementing it shouldn't  
be complex.  
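
As a rough illustration of how small such a mapping could be, here is a  
sketch in plain Python, independent of Stanbol and Tika. It assumes the  
extracted metadata is already available as a dictionary; the key names, the  
function names and the choice of Dublin Core terms are hypothetical and  
would need to be adapted to what your PDFs actually expose.  

```python
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical metadata keys mapped to Dublin Core properties.
KEY_TO_PROPERTY = {
    "title": DCTERMS + "title",
    "creator": DCTERMS + "creator",
    "publisher": DCTERMS + "publisher",
    "created": DCTERMS + "created",
}


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def metadata_to_ntriples(document_uri, metadata):
    """Return N-Triples lines describing document_uri from a metadata dict."""
    lines = []
    for key, prop in KEY_TO_PROPERTY.items():
        value = metadata.get(key)
        if not value:
            continue  # skip missing or empty fields
        lines.append('<%s> <%s> "%s" .'
                     % (document_uri, prop, escape_literal(value)))
    return lines
```

The document URI is passed in explicitly, so the description can use  
whatever URL scheme you decide to publish under.  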
>  
> I will then deploy those descriptions in a SPARQL endpoint, use a frontend 
> like Pubby, and do some content negotiation to redirect to my PDF when 
> requested. This also means that my descriptions need to have specific URLs 
> that I provide.  
In the 0.12 branch of Stanbol, there is a component called ContentHub  
which is able to automatically store the content metadata as RDF along  
with the enhancements, and which also provides a SPARQL endpoint. If you  
are planning to store huge volumes of data, the best idea is probably  
to take the RDF response of the enhancer and store it in your own triple  
store.  
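
Fetching the RDF response of the enhancer could look roughly like the  
sketch below, using only the Python standard library. It assumes a Stanbol  
launcher running locally with the enhancer at http://localhost:8080/enhancer  
and plain text as input; the function names are hypothetical. Loading the  
result into a triple store is left out, since that depends on the store.  

```python
import urllib.request

# Assumed address of a locally running Stanbol launcher.
ENHANCER_URL = "http://localhost:8080/enhancer"


def build_enhancer_request(text, url=ENHANCER_URL, rdf_format="text/turtle"):
    """Build a POST request asking the enhancer for an RDF serialization."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": rdf_format,
        },
        method="POST",
    )


def fetch_enhancements(text, url=ENHANCER_URL):
    """Send the text to the enhancer and return the RDF response as a string."""
    with urllib.request.urlopen(build_enhancer_request(text, url)) as response:
        return response.read().decode("utf-8")
```

Calling fetch_enhancements("some extracted text") requires a running  
server; the returned string can then be written to a file or posted to  
your own store.  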
>  
>  
> Can anyone give me some pointers? Is it possible to do that with Stanbol? If 
> so, how should I go about it? How should the enhancer be configured for that?  
>  
>  
> Many thanks,  
>  
> -M-  
>  
>  
> --  
> Maatari Daniel Okouya  
> Sent with Airmail
