Re: [Tutor] How to Scrape Text from PDFs

2019-06-17 Thread Alan Gauld via Tutor
On 17/06/2019 06:30, Cem Vardar wrote:
> some PDF files that have links for some websites and I need to extract these 
> links 

There is a module that may help: PyPDF2

Here is a post showing how to extract the text from a PDF which should
include the links.

https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

There may even be more specific extraction tools if you look more closely...
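
As a rough sketch of the text-extraction route (not taken from the linked
post), something like the following could work. It assumes a recent PyPDF2
release that provides PdfReader and extract_text(); older releases spelled
these PdfFileReader and extractText(). The names find_urls, pdf_text and
URL_RE are made up for illustration, and the URL pattern is only a simple
guess at where a link ends.

```python
import re

# Assumed pattern: a URL runs until whitespace or a common delimiter.
URL_RE = re.compile(r"https?://[^\s<>()\"']+")


def find_urls(text):
    """Return the URLs found in text, in order, without duplicates."""
    seen = []
    for match in URL_RE.findall(text):
        # Drop sentence punctuation that often trails a URL in running text.
        url = match.rstrip(".,;:")
        if url not in seen:
            seen.append(url)
    return seen


def pdf_text(path):
    """Join the extracted text of every page (requires PyPDF2)."""
    # Imported here so find_urls() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    # extract_text() may yield nothing for image-only pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Calling find_urls(pdf_text("some_file.pdf")), with your own file name in
place of the hypothetical one, would then give a list of candidate links.
Links that exist only as clickable annotations, with no visible URL text,
would not show up this way.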




-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Installing Python v3 on a laptop Windows 10 (SOLVED)

2019-06-17 Thread Ken Green

On 15/06/2019 22:23, Ken Green wrote:


I understand there is a preferred method
of installing Python on Windows. Pray
tell how to go about it, gentlemen.



Thank you, gentlemen, for the prompt responses to
my inquiry. I believe it would be best for me to use
the ActiveState installation for my laptop.

I like that Microsoft is trying to make it easy to download
Python, but I am not sure it has been fully implemented
yet. Again, thanks guys.

Ken Green


Re: [Tutor] How to Scrape Text from PDFs

2019-06-17 Thread William Ray Wing via Tutor


> On Jun 17, 2019, at 1:30 AM, Cem Vardar  wrote:
> 
> Hello,
> 
> I have been working on an assignment that was described to me as “fairly 
> trivial” for a couple of days now. I have some PDF files that have links for 
> some websites and I need to extract these links from these files by using 
> Python. I would be very glad if someone could point me in the direction of 
> some resources that would give me the essential skills specific for this task.
> 

Unfortunately, a PDF can contain anything from almost-PostScript to a bitmap.
But let's assume your PDFs are of the almost-PostScript flavor. In that case
you can simply read them as text and then use Python's standard string
searching to look for http:// or https://. Each time you find one, stop and
parse the URL (again with string handling), looking for one of the typical
terminators (e.g. .com, .net, .org, etc.).

It might help to cheat a bit: open one of the PDFs in a standard text
editor and search for http:// to see what turns up. I'll bet it
will be fairly clear.
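
A minimal sketch of that kind of search, using only the standard library,
might look like this. It reads the file as raw bytes and, rather than
hunting for .com or .net, it assumes a URL ends at whitespace or at one of
the PDF delimiters ( ) < > [ ]. The names urls_in_bytes, urls_in_file and
RAW_URL_RE are invented for the example. It will only find URLs that are
stored uncompressed in the file.

```python
import re

# Assumed pattern: a URL runs until whitespace or a PDF delimiter.
RAW_URL_RE = re.compile(rb"https?://[^\s()<>\[\]]+")


def urls_in_bytes(data):
    """Return the distinct URLs that appear literally in data, in order."""
    found = []
    for raw in RAW_URL_RE.findall(data):
        url = raw.decode("ascii", errors="replace")
        if url not in found:
            found.append(url)
    return found


def urls_in_file(path):
    """Scan a file's raw bytes for literal http:// and https:// URLs."""
    with open(path, "rb") as handle:
        return urls_in_bytes(handle.read())
```

If urls_in_file() comes back empty for a PDF that clearly has links, the
content is probably compressed, and a PDF library would be the next thing
to try.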

Bill

> Sincerely,
> Cem



[Tutor] How to Scrape Text from PDFs

2019-06-17 Thread Cem Vardar
Hello,

I have been working on an assignment that was described to me as “fairly trivial” 
for a couple of days now. I have some PDF files that have links for some 
websites and I need to extract these links from these files by using Python. I 
would be very glad if someone could point me in the direction of some resources 
that would give me the essential skills specific for this task.

Sincerely,
Cem