Re: [Tutor] How to Scrape Text from PDFs

2019-06-19 Thread Malcolm Herbert
This isn't a Python-related response, sorry (I'm still learning Python 
myself); it's more a set of questions about the nature of the PDF and where I 
would start looking to solve the problem, were it mine.

The URLs that you are intending to match - are they themselves clickable when 
you open the PDF in another reader?  If so, then you might have better luck 
looking for the PDF element that provides that capability rather than trying to 
text-scrape to recover them.

Although unlikely inside a URL, text in a PDF can be laid out on the page in a 
completely arbitrary manner and to properly do PDF-to-text conversion you may 
need to track position on the page for each glyph as well as the font mapping 
vector - a glyph of an 'A' for instance might not actually be mapped to the 
ASCII/Unicode for 'A' ... all of which can make this a complete nightmare for 
the unwary.

So - when I last looked at generating a PDF with a live link element, this was 
implemented as blue underlined text (to make it look like a link) with an 
invisible box placed over the top which contained the PDF magic to make that do 
what I wanted when the user clicked on it.

I would suspect that what you might want would be a Python library that can 
pull apart a PDF into its structural elements and then hunt through there for 
the appropriate "URL box" or whatever it's called ...
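
For what it's worth, the "URL box" idea can be sketched roughly as below. In
PDF terms these boxes are Link annotations on a page, and a clickable web link
normally carries a URI action. This is only a sketch under assumptions: it
assumes the third-party pypdf package (the successor to PyPDF2) is installed,
and the function names are made up for illustration. Links that only jump
within the document have no URI and are skipped.

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers one; pass plain values through."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def uris_from_annotations(annotations):
    """Collect the /URI targets from a page's list of annotation dictionaries."""
    uris = []
    for entry in annotations or []:
        annot = _resolve(entry)
        # Only Link annotations are clickable regions.
        if _resolve(annot.get("/Subtype")) != "/Link":
            continue
        # Internal links use /Dest and have no action dictionary.
        action = _resolve(annot.get("/A"))
        if action is None:
            continue
        uri = _resolve(action.get("/URI"))
        if uri is not None:
            uris.append(uri.decode("latin-1") if isinstance(uri, bytes) else str(uri))
    return uris


def link_uris_in_pdf(path):
    """Return the URI of every link annotation in the PDF at `path` (hypothetical helper)."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    uris = []
    for page in PdfReader(path).pages:
        uris.extend(uris_from_annotations(_resolve(page.get("/Annots"))))
    return uris
```

Something like link_uris_in_pdf("report.pdf") (the file name is a placeholder)
should then give a list of the link targets, without having to reconstruct
the visible text at all.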

Hope that helps,
Malcolm

-- 
Malcolm Herbert
m...@mjch.net
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] How to Scrape Text from PDFs

2019-06-17 Thread Alan Gauld via Tutor
On 17/06/2019 06:30, Cem Vardar wrote:
> some PDF files that have links for some websites and I need to extract these 
> links 

There is a module that may help: PyPDF2

Here is a post showing how to extract the text from a PDF which should
include the links.

https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

There may even be more specific extraction tools if you look more closely...
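
As a rough sketch of that approach (not a tested recipe): extract the text of
each page and then search it with a regular expression. The code assumes the
pypdf package, which is what PyPDF2 has since become, and the pattern and
function names are my own choices for illustration. Note that a URL which
wraps across two lines in the PDF will come out truncated.

```python
import re

# Deliberately simple pattern: stops at whitespace, quotes, and brackets,
# so URLs that legitimately contain parentheses will be cut short.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'()\[\]]+")


def find_urls(text):
    """Return the URLs found in `text`, in order of appearance, without duplicates."""
    found = []
    for match in URL_PATTERN.finditer(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:")
        if url not in found:
            found.append(url)
    return found


def urls_from_pdf_text(path):
    """Extract the text of every page and search it for URLs (hypothetical helper)."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return find_urls(text)
```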




-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] How to Scrape Text from PDFs

2019-06-17 Thread William Ray Wing via Tutor


> On Jun 17, 2019, at 1:30 AM, Cem Vardar wrote:
> 
> Hello,
> 
> I have been working on an assignment that was described to me as “fairly 
> trivial” for a couple of days now. I have some PDF files that have links for 
> some websites and I need to extract these links from these files by using 
> Python. I would be very glad if someone could point me in the direction of 
> some resources that would give me the essential skills specific for this task.
> 

Unfortunately, a PDF can contain anything from almost PostScript to a bitmap.  
But let's assume your PDFs are of the almost-PostScript flavor.  In that case 
you can simply read them as text, and then use Python's standard string 
searching for http:// or https://.  Each time you find one, stop and 
parse (again with string handling) the URL, looking for one of the typical 
terminators (e.g. .com, .net, .org etc.).

It might help to cheat a bit: open one of the PDFs with a standard text 
editor and use it to search for http:// and see what turns up.  I’ll bet it 
will be fairly clear.
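
A minimal sketch of that raw search, under the assumption that the link
objects in your PDFs are stored uncompressed (many PDFs compress their
contents, in which case this finds little or nothing). It reads the file as
bytes rather than text, since a PDF is not guaranteed to decode cleanly. In
an uncompressed PDF a link usually looks like /URI (http://...), so instead
of looking for .com or .net the pattern just stops at the closing
parenthesis or whitespace. The names here are made up for illustration.

```python
import re

# Bytes pattern: a URL runs until whitespace, a parenthesis, or an angle bracket.
RAW_URL = re.compile(rb"https?://[^\s()<>]+")


def scan_raw_pdf_bytes(data):
    """Return every http:// or https:// string found in the raw bytes."""
    return [m.group(0).decode("ascii", errors="replace") for m in RAW_URL.finditer(data)]


def scan_pdf_file(path):
    """Read the PDF at `path` as bytes and scan it (hypothetical helper)."""
    with open(path, "rb") as handle:
        return scan_raw_pdf_bytes(handle.read())
```

Expect some false positives: the metadata block in many PDFs contains
http:// namespace URLs that are not links on any page.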

Bill

> Sincerely,
> Cem


[Tutor] How to Scrape Text from PDFs

2019-06-17 Thread Cem Vardar
Hello,

I have been working on an assignment that was described to me as “fairly trivial” 
for a couple of days now. I have some PDF files that have links for some 
websites and I need to extract these links from these files by using Python. I 
would be very glad if someone could point me in the direction of some resources 
that would give me the essential skills specific for this task.

Sincerely,
Cem