Hello Nicholas,

The PDF you link to has a decent title in its metadata. When the metadata title is missing, though, I would not rely on the first N characters of the content, as that is very unreliable: you can find all kinds of bad markup right at the start of PDFs.
There is another option: you can still use the raw filename, which is fine in most cases and usually easier to read than the first N characters. Another trick is to use the most common hyperlink anchor text, which is usually very readable and descriptive.

Regards,
Markus

On Wed, 21 Apr 2021 at 18:02, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> Hi Tika Users:
>
> Does Tika have any built-in title extraction logic?
>
> I am currently using a simple algorithm that:
>
> 1) Checks metadata for a title. Use that if there.
> 2) If no title metadata, then use the body text. Extract the first line of the body text and use that as the title.
>
> Let's take this PDF for example:
> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>
> That results in
>
> - 4 -
>
> as a title. Not great, right? Ha!
>
> So then I add something like:
>
> 3) If the first line has < 5 alphanumeric characters, go to the next line until you find a title.
>
> That works in this case but doesn't work for many other cases.
>
> What are others doing for title extraction? I would imagine there's no perfect solution here. Just curious what y'all are doing to troubleshoot this stuff.
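
The fallback order suggested in the reply (metadata title, then hyperlink anchor text, then filename) could be sketched roughly as below. This is only an illustration in plain Python, not anything built into Tika: the function names are hypothetical, the anchor texts are assumed to come from your own crawl or link data, and preferring anchors over the filename is an assumption rather than something the reply prescribes.

```python
import re
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def title_from_filename(url_or_path):
    """Derive a readable title from the file name of a URL or POSIX-style path."""
    path = unquote(urlparse(url_or_path).path)
    stem = PurePosixPath(path).stem  # file name without its last extension
    # Treat dashes, underscores and dots as word separators.
    return re.sub(r"[-_.]+", " ", stem).strip()


def most_common_anchor(anchor_texts):
    """Return the most frequent non-empty anchor text, or None if there is none."""
    # Collapse runs of whitespace so near-identical anchors count together.
    cleaned = [" ".join(a.split()) for a in anchor_texts if a and a.strip()]
    if not cleaned:
        return None
    return Counter(cleaned).most_common(1)[0][0]


def choose_title(metadata_title, url_or_path, anchor_texts=()):
    """Pick a title: metadata first, then anchor text, then the file name."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    return most_common_anchor(anchor_texts) or title_from_filename(url_or_path)
```

For the PDF from the original question, `title_from_filename` on its URL gives `icicibank 165 1612`, which is not pretty but is arguably more useful than `- 4 -`.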