Re:Re: Formatted Content Extraction and Title Detection

imyuka Thu, 09 Oct 2014 05:58:31 -0700

Thanks Nick, I really appreciate it. In that case, does this mean that 
formatted content extraction is only possible by producing the corresponding 
XHTML as output? I had a quick look at the book "Tika in Action" and found 
instructions for converting a document to XHTML from the command line, but 
I have no idea how to do the same thing in Java code. Are there any 
instructions or tutorials I can refer to?



Thanks!








At 2014-10-09 20:46:01, "Nick Burch" <apa...@gagravarr.org> wrote:
>On Thu, 9 Oct 2014, imyuka wrote:
>> Here is my problem: I have extracted the plain text from a series of 
>> doc(x) documents, and their titles via the "dc:title" metadata field, 
>> but I'm not sure this is the right way to obtain the title of a document. 
>> In many cases, the title inside a document is the text with the largest 
>> font size and bold style, which I want to use to extract the actual 
>> title. However, I have no idea how to get formatted content or how to 
>> detect font size and bold style.
>
>If it's been styled as a heading, then you'll be able to get that from the 
>html contents. If in Word it's styled as normal body text, but manually 
>set to a larger font size, then there's nothing in Tika to help with that.
>
>Nick
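
A rough sketch of the Java side, following Nick's point that text styled as a heading shows up in the XHTML output. The class name HeadingCollector and the fromXhtml helper are made up for this example. The handler is a plain SAX DefaultHandler, so it is demonstrated here on a small hand-written XHTML string using only the JDK. With Tika on the classpath, the same handler could presumably be passed to a parser instead, for example new AutoDetectParser().parse(stream, collector, new Metadata(), new ParseContext()). Which heading elements a given Word style maps to is an assumption to verify against real documents.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every h1..h6 element
// from a stream of XHTML SAX events.
public class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading

    // Accepts both namespace-aware events (localName set) and
    // non-namespace-aware events (only qName set).
    private static boolean isHeading(String localName, String qName) {
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name != null && name.matches("[hH][1-6]");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && isHeading(localName, qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text inside inline children such as <b> is collected as well.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isHeading(localName, qName)) {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                headings.add(text);
            }
            current = null;
        }
    }

    public List<String> headings() {
        return headings;
    }

    // Demo entry point that parses an XHTML string with the JDK SAX parser.
    public static List<String> fromXhtml(String xhtml) throws Exception {
        HeadingCollector collector = new HeadingCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.headings();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body><h1>Annual Report</h1>"
                + "<p>Body text</p><h2>Summary</h2></body></html>";
        System.out.println(fromXhtml(sample));
    }
}
```

The first collected heading could then serve as a title candidate when "dc:title" is empty or unreliable. As Nick says, this will not catch body text that was only enlarged or bolded by hand.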
