Formatted Content Extraction and Title Detection

imyuka Thu, 09 Oct 2014 05:24:08 -0700

Hi all,


    Here is my problem: I have extracted plain text from a series of doc(x) 
documents, along with their titles via the "dc:title" metadata field, but I'm 
not sure this is the right way to obtain a document's title. In many cases the 
title inside a document is set in the largest font size and in bold, and I 
would like to use that to extract the actual title. However, I have no idea how 
to get formatted content or how to detect font size and bold style. Please let 
me know if I am missing something.
    Thank you very much!
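
To make the idea concrete, here is a rough sketch of the "largest and bold" 
heuristic, not a tested solution. It assumes .docx input only (the binary .doc 
format is not handled) and reads word/document.xml directly with the Python 
standard library. The function names read_paragraphs and guess_title are made 
up for illustration. It only looks at formatting set directly on a run 
(<w:sz>, which is in half-points, and <w:b>), so sizes that come from a 
paragraph or heading style would be missed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _toggle_on(elem):
    """A toggle such as <w:b/> is on unless w:val says otherwise."""
    if elem is None:
        return False
    return elem.get(W + "val", "true") not in ("0", "false", "off")


def read_paragraphs(docx_file):
    """Return a list of (text, size_in_points, bold) per non-empty paragraph.

    Assumption: only direct run formatting is inspected; formatting
    inherited from styles is ignored, so size may be 0.0 for such text.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    result = []
    for p in root.iter(W + "p"):
        parts, size, bold = [], 0.0, False
        for r in p.iter(W + "r"):
            run_text = "".join(t.text or "" for t in r.iter(W + "t"))
            parts.append(run_text)
            rpr = r.find(W + "rPr")
            if rpr is None or not run_text.strip():
                continue  # no direct formatting, or whitespace-only run
            sz = rpr.find(W + "sz")
            val = sz.get(W + "val") if sz is not None else None
            if val and val.isdigit():
                size = max(size, int(val) / 2)  # w:sz is in half-points
            bold = bold or _toggle_on(rpr.find(W + "b"))
        text = "".join(parts).strip()
        if text:
            result.append((text, size, bold))
    return result


def guess_title(docx_file):
    """Pick the paragraph with the largest font size, preferring bold on ties."""
    paragraphs = read_paragraphs(docx_file)
    if not paragraphs:
        return None
    # max() keeps the first paragraph when several share the same key
    return max(paragraphs, key=lambda p: (p[1], p[2]))[0]
```

Calling guess_title("report.docx") (a hypothetical file name) would return the 
text of the candidate paragraph, or None for a document with no text. A 
sensible fallback might be to use the "dc:title" value whenever no paragraph 
carries an explicit size.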
