Re: [Tutor] How to Scrape Text from PDFs
This isn't a Python-specific response, sorry (I'm still learning Python myself); it's more a set of questions about the nature of the PDF and where I would start looking if the problem were mine.

The URLs that you are intending to match - are they themselves clickable when you open the PDF in another reader? If so, then you might have better luck looking for the PDF element that provides that capability rather than trying to text-scrape to recover them.

Although it is unlikely to happen inside a URL, text in a PDF can be laid out on the page in a completely arbitrary manner, and to do PDF-to-text conversion properly you may need to track the position on the page of each glyph as well as the font mapping vector - a glyph of an 'A', for instance, might not actually be mapped to the ASCII/Unicode code for 'A' ... all of which can make this a complete nightmare for the unwary.

So - when I last looked at generating a PDF with a live link element, this was implemented as blue underlined text (to make it look like a link) with an invisible box placed over the top, which contained the PDF magic to make it do what I wanted when the user clicked on it.

I would suspect that what you want is a Python library that can pull apart a PDF into its structural elements, and then hunt through there for the appropriate "URL box" or whatever it's called ...

Hope that helps,
Malcolm

--
Malcolm Herbert
m...@mjch.net
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
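The "URL box" hunt Malcolm describes can be sketched roughly as follows. In PDF terms the invisible box is a link annotation (/Subtype /Link) whose action dictionary (/A) carries the target under /URI. This is only a sketch under assumptions: it uses the third-party PyPDF2 package (3.x, or its successor pypdf, which exposes the same PdfReader name), and the function names uris_on_page and uris_in_pdf are made up for illustration.

```python
def uris_on_page(page):
    """Return the URI targets of the link annotations on one page object."""
    uris = []
    if "/Annots" not in page:
        return uris
    for ref in page["/Annots"]:
        # Annotations are usually stored as indirect references.
        annot = ref.get_object()
        if annot.get("/Subtype") != "/Link" or "/A" not in annot:
            continue  # not a link, or an internal link with no action
        action = annot["/A"]
        if "/URI" in action:
            target = action["/URI"]
            if isinstance(target, bytes):
                target = target.decode("latin-1")
            uris.append(str(target))
    return uris


def uris_in_pdf(path):
    """Collect link targets from every page of the PDF at `path`."""
    # Imported here so uris_on_page can be tried without the library installed.
    from PyPDF2 import PdfReader  # third-party; assumed to be version 3.x

    found = []
    for page in PdfReader(path).pages:
        found.extend(uris_on_page(page))
    return found
```

Calling uris_in_pdf("report.pdf") (a placeholder file name) should return the link targets as a list of strings. Links that jump to another page of the same document have no /URI entry and are skipped.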
Re: [Tutor] How to Scrape Text from PDFs
On 17/06/2019 06:30, Cem Vardar wrote:

> some PDF files that have links for some websites and I need to extract these
> links

There is a module that may help: PyPDF2.

Here is a post showing how to extract the text from a PDF, which should include the links:

https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

There may even be more specific extraction tools if you look more closely...

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
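Alan's text-extraction route might look roughly like the sketch below. It assumes PyPDF2 3.x (PdfReader, reader.pages and page.extract_text(); older releases used different names) and a deliberately simple URL pattern. The names find_urls and urls_in_pdf_text are illustrative only. Extracted text can break a long URL across lines, so results may come back truncated.

```python
import re

# Deliberately simple: a scheme, then everything up to whitespace,
# a quote, or a closing bracket.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def find_urls(text):
    """Return the URLs found in already-extracted text."""
    # Trim sentence punctuation that tends to cling to the end of a URL.
    return [url.rstrip(".,;:") for url in URL_RE.findall(text)]


def urls_in_pdf_text(path):
    """Extract the text of each page and search it for URLs."""
    # Imported here so find_urls can be tried without the library installed.
    from PyPDF2 import PdfReader  # third-party; assumed to be version 3.x

    urls = []
    for page in PdfReader(path).pages:
        urls.extend(find_urls(page.extract_text() or ""))
    return urls
```

For example, find_urls("See https://www.python.org/doc/, please") returns ['https://www.python.org/doc/'].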
Re: [Tutor] How to Scrape Text from PDFs
> On Jun 17, 2019, at 1:30 AM, Cem Vardar wrote:
>
> Hello,
>
> I have been working on an assignment that was described to me as “fairly
> trivial” for a couple of days now. I have some PDF files that have links to
> some websites and I need to extract these links from these files by using
> Python. I would be very glad if someone could point me in the direction of
> some resources that would give me the essential skills specific to this task.
>

Unfortunately, a PDF can contain anything from almost PostScript to a bitmap. But let's assume your PDFs are of the almost-PostScript flavor. In that case you can simply read them as text, and then use Python's standard string searching to look for http:// or https://. Each time you find one, stop and parse the URL (again with string handling), looking for one of the typical terminators (e.g. .com, .net, .org etc.).

It might help to cheat a bit: open one of the PDFs in a standard text editor, search for http://, and see what turns up. I'll bet it will be fairly clear.

Bill

> Sincerely,
> Cem
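Bill's read-it-and-search approach can be tried with the standard library alone. The sketch below makes a few assumptions: the file is read as bytes (a PDF is a binary format, so opening it in text mode may fail), and instead of looking for .com or .net, which would drop any path after the domain, the match stops at whitespace, a parenthesis, an angle bracket, or a backslash. It can only find URLs that are stored uncompressed; many PDFs compress their contents, in which case nothing will turn up. The names urls_in_raw_bytes and urls_in_file are made up for illustration.

```python
import re

# A URL runs from the scheme up to whitespace or a character that
# commonly ends a PDF string: ( ) < > or a backslash escape.
URL_BYTES_RE = re.compile(rb"https?://[^\s()<>\\]+")


def urls_in_raw_bytes(data):
    """Return the distinct URLs found in raw PDF bytes, in order of appearance."""
    seen = []
    for match in URL_BYTES_RE.findall(data):
        url = match.decode("latin-1")  # latin-1 maps every byte, so this never fails
        if url not in seen:
            seen.append(url)
    return seen


def urls_in_file(path):
    """Read the file at `path` in binary mode and search it for URLs."""
    with open(path, "rb") as fh:
        return urls_in_raw_bytes(fh.read())
```

A quick check on a hand-written fragment: urls_in_raw_bytes(b"<< /URI (http://example.com/a) >>") returns ['http://example.com/a'].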
[Tutor] How to Scrape Text from PDFs
Hello,

I have been working on an assignment that was described to me as “fairly trivial” for a couple of days now. I have some PDF files that have links to some websites and I need to extract these links from these files by using Python. I would be very glad if someone could point me in the direction of some resources that would give me the essential skills specific to this task.

Sincerely,
Cem