My first question is: how is text stored in a PDF? I think there are two ways to store text in a PDF:
a) Vector PDF: the PDF contains an instruction telling the viewer to draw a word in a specific font at a specific location.
b) OCR text has been added on top of the image as an extra layer (I think this is called the XMP metadata).

Is this correct? And when PDFBox extracts text from a PDF, where does it get the data from: the XMP metadata or the vector details? Could any developer help me with this?
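
To make option (a) concrete, here is a minimal sketch of what I understand such an "instruction" to look like. The fragment in `SAMPLE` is hand-written for illustration, not taken from a real file. It uses the PDF text operators `BT`/`ET` (begin/end text), `Tf` (font and size), `Td` (position) and `Tj` (show string). The class name `ContentStreamSketch` and the method `extractShownText` are my own hypothetical names, and this is plain Java, not PDFBox's API. It assumes an uncompressed stream with simple literal strings, so it ignores escapes, hex strings, `TJ` arrays and font encodings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentStreamSketch {

    // Hypothetical, hand-written content stream fragment (an assumption for
    // illustration). BT/ET delimit a text object, Tf selects the font and size,
    // Td moves the text position, Tj shows a string.
    static final String SAMPLE =
        "BT\n" +
        "/F1 12 Tf\n" +
        "72 720 Td\n" +
        "(Hello) Tj\n" +
        "0 -14 Td\n" +
        "(World) Tj\n" +
        "ET\n";

    // Matches a simple literal string followed by the Tj operator.
    // Assumption: no escape sequences or nested parentheses inside the string.
    private static final Pattern SHOW_TEXT =
        Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Collects the strings passed to Tj, in the order they appear in the stream.
    static List<String> extractShownText(String contentStream) {
        List<String> result = new ArrayList<>();
        Matcher m = SHOW_TEXT.matcher(contentStream);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractShownText(SAMPLE)); // prints [Hello, World]
    }
}
```

A real extractor would also have to decompress the stream, map character codes through the font's encoding, and use the positions to rebuild words and lines, so this only shows the basic idea behind my question.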

