Thanks Nick, I really appreciate it. In that case, does this mean that formatted content can only be extracted by producing a corresponding XHTML file as output? I had a quick look at the book "Tika in Action" and found instructions for transforming a document into an XHTML file from the command line, but I have no idea how to do the same thing in Java code. Are there any instructions or tutorials I can refer to?
Thanks!

At 2014-10-09 20:46:01, "Nick Burch" <apa...@gagravarr.org> wrote:
>On Thu, 9 Oct 2014, imyuka wrote:
>> Here is my problem: I have extracted plain text from a series of
>> doc(x) documents and their titles via the "dc:title" label of metadata,
>> but I'm not sure this is the right way to obtain the title of a document.
>> In many cases, the title inside a document is the text with the largest
>> font size and bold style, which I want to use to extract the actual
>> title. However, I have no idea how to get the formatted content and
>> detect font size/bold style.
>
>If it's been styled as a heading, then you'll be able to get that from the
>html contents. If in Word it's styled as normal body text, but manually
>set to a larger font size, then there's nothing in Tika to help with that.
>
>Nick
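
For reference, here is a minimal Java sketch of the "get it from the html contents" step Nick describes. It assumes the XHTML has already been produced as a string, for example by passing an org.apache.tika.sax.ToXMLContentHandler to AutoDetectParser.parse(...) and calling toString() on the handler afterwards. The class name HeadingExtractor and the method extractHeadings are made up for illustration and are not part of Tika. The code uses only the JDK's SAX parser and collects the text of every h1 to h6 element, so it only finds titles that were styled as headings in the source document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: pulls heading text out of XHTML.
public class HeadingExtractor {

    /** Returns the trimmed text of every h1..h6 element, in document order. */
    public static List<String> extractHeadings(String xhtml) throws Exception {
        final List<String> headings = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            // Non-null only while the parser is inside a heading element.
            private StringBuilder current;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.matches("h[1-6]")) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Also picks up text of nested inline elements such as <b>.
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.matches("h[1-6]")) {
                    headings.add(current.toString().trim());
                    current = null;
                }
            }
        };

        // The default factory is not namespace-aware, so qName is the plain
        // tag name ("h1") even when the root declares the XHTML namespace.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return headings;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input; real input would come from Tika's XHTML output.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Annual Report</h1><p>Body text.</p></body></html>";
        System.out.println(extractHeadings(sample));
    }
}
```

The first entry of the returned list is a reasonable candidate for the document title, with "dc:title" from the metadata as a fallback when the list is empty. As Nick notes, text that is only enlarged or bolded manually will not show up as a heading and will not be found this way.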