Re: Title extract logic

Markus Jelsma Thu, 22 Apr 2021 03:44:57 -0700

Hello Nicholas,

The PDF you link to has a decent title in its metadata. If the title
isn't there, though, I would not rely on the first N characters of the
content, as that is very unreliable: you can find all kinds of bad markup
right at the start of PDFs.


There are other options, however. You can fall back to the raw filename,
which is fine in most cases and usually prettier to read than the first N
characters. Another trick is to use the most common hyperlink anchor text,
which most of the time is very readable and descriptive.
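
As a rough sketch of that fallback chain, something like the following
could work. Everything here is hypothetical and not part of Tika: the class
and method names are made up, the metadata title is assumed to have been
read already (for example via TikaCoreProperties.TITLE from the Metadata
object), the anchor texts are assumed to come from your own crawl data,
and the order "metadata, then anchor, then filename" is just one possible
choice.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not a Tika API: picks a title from several sources.
final class TitleFallback {

    // Most frequent non-blank anchor text; ties go to the one seen first.
    static Optional<String> mostCommonAnchor(List<String> anchors) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (anchors != null) {
            for (String a : anchors) {
                if (a == null || a.trim().isEmpty()) continue;
                counts.merge(a.trim(), 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    // Cosmetic cleanup (an assumption): drop directory and extension,
    // turn '-' and '_' runs into spaces.
    static String prettyFileName(String fileName) {
        String name = fileName;
        int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        if (slash >= 0) name = name.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        if (dot > 0) name = name.substring(0, dot);
        return name.replaceAll("[_-]+", " ").trim();
    }

    // Assumed order: metadata title, then most common anchor, then filename.
    static String pickTitle(String metadataTitle, List<String> anchors, String fileName) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        return mostCommonAnchor(anchors).orElseGet(() -> prettyFileName(fileName));
    }
}
```

The filename comes last here only because it is always available; if your
anchor data is noisy you may prefer to swap the last two steps.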

Regards,
Markus

On Wed, 21 Apr 2021 at 18:02, Nicholas DiPiazza <
[email protected]> wrote:

> Hi Tika Users:
>
> Does Tika have any built-in Title extract logic?
>
> I am currently using a simple algorithm that:
>
> 1) Checks metadata for a title. Use that if there.
> 2) If no title metadata, then use the body text. Extract the first line of
> the body text and use that as the title.
>
> Let's take this PDF for example:
> https://www.fdic.gov/regulations/reform/resplans/plans/icicibank-165-1612.pdf
>
> That results in
>
> - 4 -
>
> as a title. Not great, right? Ha!
>
> So then I add something like:
>
> 3) If the first line has fewer than 5 alphanumeric characters, go to the
> next line until you find a title.
>
> That works in this case but doesn't work for many other cases.
>
> What are others doing for title extraction? I would imagine there's no
> perfect solution here. Just curious what y'all are doing to troubleshoot
> this stuff.
>
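
The three steps quoted above could be sketched roughly as follows. This is
only an illustration of the heuristic as described in the question, not
something Tika provides: the class name, method names and the MIN_ALNUM
constant are all hypothetical, and the metadata title and body text are
assumed to have been extracted beforehand.

```java
// Hypothetical sketch of the quoted heuristic; not a Tika API.
final class FirstLineTitle {

    // Step 3 threshold from the question: skip lines with fewer than 5
    // alphanumeric characters.
    static final int MIN_ALNUM = 5;

    // Counts letters and digits by code point, so non-ASCII text counts too.
    static int countAlphanumeric(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) n++;
            i += Character.charCount(cp);
        }
        return n;
    }

    // Returns the metadata title if present, else the first body line that
    // passes the threshold, else null when nothing qualifies.
    static String extractTitle(String metadataTitle, String bodyText) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();               // step 1
        }
        if (bodyText == null) return null;
        for (String line : bodyText.split("\\R")) {    // steps 2 and 3
            String candidate = line.trim();
            if (countAlphanumeric(candidate) >= MIN_ALNUM) return candidate;
        }
        return null;
    }
}
```

With a made-up body such as "- 4 -" followed by a line "Example Resolution
Plan", the page marker is skipped and the second line is returned. As the
thread notes, this still fails on many real documents.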
