Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-15 Thread Dale H. Cook
At 09:44 PM 3/14/2013, tamouse mailing lists wrote:

If you are delivering files to a (human) user via their browser, by whatever 
mechanism, that means someone can write a script to scrape them.

That script, however, would have to be running on my host system in order to 
access the script which actually delivers the file, as the latter script is 
located outside of the web root.

Dale H. Cook, Market Chief Engineer, Centennial Broadcasting, 
Roanoke/Lynchburg, VA
http://plymouthcolony.net/starcityeng/index.html  


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Accessing Files Outside the Web Root

2013-03-15 Thread Stuart Dallas
On 15 Mar 2013, at 13:11, Dale H. Cook webmas...@plymouthcolony.net wrote:

 At 09:44 PM 3/14/2013, tamouse mailing lists wrote:
 
 If you are delivering files to a (human) user via their browser, by whatever 
 mechanism, that means someone can write a script to scrape them.
 
 That script, however, would have to be running on my host system in order to 
 access the script which actually delivers the file, as the latter script is 
 located outside of the web root.

If a browser can get at it then a spider can get at it. All this talk of the 
web root is daft. Unless your site is being specifically targeted (highly 
unlikely), it's automated systems that are downloading your content and 
offering it on other websites. The only way such a system can discover content 
is if it's linked from somewhere. Whether that link uses a script that is 
inside or outside the web root is completely irrelevant.

Since copies of the content are now out there, anything you add now to protect 
your content is not going to get it back. You'll have to pursue legal avenues 
to prevent it being made available, and that's usually prohibitively expensive.

Based on your description of your users, you have the age-old dilemma of 
balancing ease of use and security. The more you try to protect the content 
from these spiders, the harder you'll make it for users.

Here's what I'd do: make sure your details and your website info are plastered 
across every page of the PDF files. Make sure that where copies exist it's 
going to be obvious where the content came from. It sounds like you don't 
charge for the content (this problem wouldn't exist if you did), so you have 
nothing financial to gain from controlling these external copies, other than 
wanting it to be clear from whence it came and where to find more.

At the end of the day the question is this: would you rather control access to 
your creation (in which case charge a nominal fee for it), or would you prefer 
that it (and your name/cause) gets into as many hands as possible? As a 
professional photographer I made the latter choice a long time ago and haven't 
looked back since.

-Stuart

-- 
Stuart Dallas
3ft9 Ltd
http://3ft9.com/
--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Accessing Files Outside the Web Root

2013-03-15 Thread Dale H. Cook
At 09:27 AM 3/15/2013, Stuart Dallas wrote:

You'll have to pursue legal avenues to prevent it being made available, and 
that's usually prohibitively expensive.

Not necessarily. Most of the host systems for the scraper sites are responsive 
to my complaints. Even if a site owner will not respond to a DMCA takedown 
notice, the host system will often honor that notice, and other site owners and 
hosts will back down when notified of my royalty rates for the use of my files 
by a commercial site.

At the end of the day the question is this: would you rather control access to 
your creation (in which case charge a nominal fee for it), or would you prefer 
that it (and your name/cause) gets into as many hands as possible?

I merely wish to try to prevent commercial sites from profiting from my work 
without my permission. I am in the process of registering the copyright for my 
files with LOC, as my attorneys have advised. That will give my attorneys 
ammunition.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-15 Thread Ashley Sheridan
On Fri, 2013-03-15 at 09:11 -0400, Dale H. Cook wrote:

 At 09:44 PM 3/14/2013, tamouse mailing lists wrote:
 
 If you are delivering files to a (human) user via their browser, by whatever 
 mechanism, that means someone can write a script to scrape them.
 
 That script, however, would have to be running on my host system in order to 
 access the script which actually delivers the file, as the latter script is 
 located outside of the web root.
 
 Dale H. Cook, Market Chief Engineer, Centennial Broadcasting, 
 Roanoke/Lynchburg, VA
 http://plymouthcolony.net/starcityeng/index.html  
 
 


Not really.

Your script is web-accessible, right? It just opens a file and delivers it
to the browser of your visitor. It's easy to make a script that pretends
to be a browser and makes the same request of your script to grab the
file.
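
To illustrate, here is a minimal sketch of such a pretender; the URL,
parameter name and User-Agent string are all made up for illustration:

<?php
// Fetch a file through the delivery script while claiming to be Firefox.
$ch = curl_init('http://example.com/getfile.php?file=ancestors.pdf');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // hand back the body
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0');
$pdf = curl_exec($ch);
curl_close($ch);
file_put_contents('scraped-copy.pdf', $pdf);     // save the scraped file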

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-14 Thread tamouse mailing lists
On Mar 13, 2013 7:06 PM, David Robley robl...@aapt.net.au wrote:

 Dale H. Cook wrote:

  At 05:04 PM 3/13/2013, Dan McCullough wrote:
 Web bots can ignore the robots.txt file; most scrapers would.
 
  and at 05:06 PM 3/13/2013, Marc Guay wrote:
 
 These don't sound like robots that would respect a txt file to me.
 
  Dan and Marc are correct. Although I used the terms spiders and
  pirates I believe that the correct term, as employed by Dan, is
  scrapers, and that term might be applied to either the robot or the
  site which displays its results. One blogger has called scrapers the
  arterial plaque of the Internet. I need to implement a solution that
  allows humans to access my files but prevents scrapers from accessing
  them. I will undoubtedly have to implement some type of
  challenge-and-response in the system (such as a captcha), but as long as
  those files are stored below the web root a scraper that has a valid URL
  can probably grab them. That is part of what the public in public_html
  implies.
 
  One of the reasons why this irks me is that the scrapers are all
  commercial sites, but they haven't offered me a piece of the action for
  the use of my files. My domain is an entirely non-commercial domain,
and I
  provide free hosting for other non-commercial genealogical works,
  primarily pages that are part of the USGenWeb Project, which is perhaps
  the largest of all non-commercial genealogical projects.
 

 readfile() is probably where you want to start, in conjunction with a
 captcha or similar

 --
 Cheers
 David Robley

 Catholic (n.) A cat with a drinking problem.


 --
 PHP General Mailing List (http://www.php.net/)
 To unsubscribe, visit: http://www.php.net/unsub.php


If the files are delivered via the web, by php or some other means, even if
located outside webroot, they'd still be scrapeable.


Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-14 Thread Dale H. Cook
At 04:06 AM 3/14/2013, tamouse mailing lists wrote:

If the files are delivered via the web, by php or some other means, even if
located outside webroot, they'd still be scrapeable.

Bots, however, being mechanical (i.e., hard-wired or programmed), behave 
differently from humans, and that difference can be exploited in a script.

Part of the rationale in putting the files outside the root is that they have 
no URLs, eliminating one vulnerability (you can't scrape the URL of a file if 
it has no URL). Late last night I figured out why I was having trouble 
accessing those external files from my script, and now I'm working out the 
parsing details that enable one script to access multiple external files. My 
approach probably won't defeat all bad bots, but it will likely defeat most of 
them. You can't make code bulletproof, but you can wrap it in Kevlar.
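
For the curious, here is the rough shape of one way to exploit that
difference (purely illustrative, not my production code; every name below
is invented): the HTML page plants a short-lived token in the session, and
the delivery script refuses requests that don't present it.

<?php
// download-page.php (illustrative): runs when a human loads the HTML page.
session_start();
$_SESSION['dl_token'] = md5(uniqid(mt_rand(), true)); // short-lived token
echo '<a href="getfile.php?file=guide.pdf&amp;t='
   . $_SESSION['dl_token'] . '">Download the PDF</a>';

<?php
// getfile.php (illustrative): a bot that hits this URL directly never
// loaded the page above, so it has no matching token in its session.
session_start();
if (!isset($_GET['t'], $_SESSION['dl_token'])
        || $_GET['t'] !== $_SESSION['dl_token']) {
    header('HTTP/1.0 403 Forbidden');
    exit('Please request the file from the download page.');
}
// ... build the path outside the web root and readfile() it here ...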

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Accessing Files Outside the Web Root

2013-03-14 Thread Ravi Gehlot
Hello Dale,

The spiders are not the only problem. The issue here is that anyone can
download your files from your website and then make them available
elsewhere. To address the problem, you could create a Members Restricted
Area from which only members can download your files. You can then make
your PDF directory reachable only through that Members Restricted Area,
keeping the directory itself invisible to the web. In some Linux distros,
if a file or directory does not belong to www-data it is not visible
online, but you can still serve the files through your PHP page.
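
A bare-bones sketch of that gate (the session flag, paths and page names
are hypothetical):

<?php
// members-download.php (sketch): only a logged-in member reaches readfile().
session_start();
if (empty($_SESSION['member_id'])) {   // hypothetical flag set at login
    header('Location: /login.php');    // hypothetical login page
    exit;
}
$path = '/home/account/private_pdfs/guide.pdf'; // outside public_html
header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($path));
readfile($path);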

Ravi.

On Wed, Mar 13, 2013 at 4:38 PM, Dale H. Cook radiot...@plymouthcolony.net wrote:

 Let me preface my question by noting that I am virtually a PHP novice.
 Although I am a long-time webmaster, and have used PHP for some years to
 give visitors access to information in my SQL database, this is my first
 attempt to use it for another purpose. I have browsed the mailing list
 archives and have searched online but have not yet succeeded in teaching
 myself how to do what I want to do. This need not provoke a lengthy
 discussion or involve extensive hand-holding - if someone can point to an
 appropriate code sample or online tutorial that might do the trick.

 I am the author of a number of PDF files that serve as genealogical
 reference works. My problem is that there are a number of sites which are
 posing as search engines and which display my PDF files in their entirety
 on their own sites. These pirate sites are not simply opening a window that
 displays my files as they appear on my site. They are using Google Docs to
 display copies of my files that are cached or stored elsewhere online. The
 proof of that is that I can modify one of my files and upload it to my
 site. The file, as seen on my site, immediately displays the modification.
 The same file, as displayed on the pirate sites, is unmodified and may
 remain unmodified for weeks.

 It is obvious that my files, which are stored under public_html, are being
 spidered and then stored or cached. This displeases me greatly. I want my
 files, some of which have cost an enormous amount of work over many years,
 to be available only on my site. Legitimate search engines, such as Google,
 may display a snippet, but they do not display the entire file - they link
 to my site so the visitor can get the file from me.

 A little study has indicated to me that if I store those files in a folder
 outside the web root and use PHP to provide access they will not be
 spidered. Writing a PHP script to provide access to the files in that
 folder is what I need help with. I have experimented with a number of code
 samples but have not been able to make things work. Could any of you point
 to code samples or tutorials that might help me? Remember that, aside from
 the code I have written to handle my SQL database I am a PHP novice.

 Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
 Plymouth Co. MA Coordinator for the USGenWeb Project
 Administrator of http://plymouthcolony.net


 --
 PHP General Mailing List (http://www.php.net/)
 To unsubscribe, visit: http://www.php.net/unsub.php




[PHP] Accessing Files Outside the Web Root - Progress Report 1

2013-03-14 Thread Dale H. Cook
I have made some progress. It occurred to me that the problem that I had in 
accessing files outside the web root could be a pathing problem, and that was 
the case. I finally ran phpinfo() and examined $_SERVER['DOCUMENT_ROOT'] to see 
what the correct path should be. I then figured that for portability my script 
should build the paths for desired files beginning with a truncated 
$_SERVER['DOCUMENT_ROOT'] and concatenating the external folder and the 
filename, and that is working fine. I now have a script that will give the 
visitor access to a PDF file stored outside the web root and whose filename is 
hard-coded in the script.

The next step is to create the mechanism that lets one of my HTML pages pass 
the desired filename to the script and have the script retrieve that file for 
the visitor. That should be simple enough since it is just string manipulation 
(once I get the hang of some additional PHP string manipulation functions).
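
For anyone following along, the skeleton looks roughly like this; the
folder and file names are placeholders, not my real ones:

<?php
// getfile.php (skeleton): an HTML page links to getfile.php?file=NAME.pdf
$file = isset($_GET['file']) ? basename($_GET['file']) : ''; // no dir parts
$base = dirname($_SERVER['DOCUMENT_ROOT']);  // one level above the web root
$path = $base . '/pdf_store/' . $file;       // placeholder folder name

// Accept only plain .pdf names that actually exist in that folder.
if (!preg_match('/^[A-Za-z0-9_-]+\.pdf$/', $file) || !is_file($path)) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
header('Content-Type: application/pdf');
header('Content-Length: ' . filesize($path));
readfile($path);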

Then I can move on to making my script bot-resistant before implementing it on 
my site.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Accessing Files Outside the Web Root

2013-03-14 Thread Dale H. Cook
At 12:40 PM 3/14/2013, Ravi Gehlot wrote:

In order to address the problem, you should create a Members Restricted Area

You need to understand my target demo to understand my approach. Much of my 
target audience for those files is elderly and not very computer savvy. Having 
to register for access would discourage some of them. I prefer to keep their 
access as simple as possible and as close as possible to the way in which I 
have always provided it. Some of those files have been available (and have been 
updated quarterly) for nearly a decade, and many visitors are used to 
downloading and perusing some of those files quarterly.

Registration poses its own problems, as some miscreants will attempt to 
register in order to usurp my resources. It probably wouldn't be as much of a 
nuisance as it is when running a phpBB board (I run two of those), but I'd 
rather avoid dealing with registration.

All in all, I'd rather use a server-side approach incorporating methods to 
differentiate between human visitors and bad bots.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Dale H. Cook
Let me preface my question by noting that I am virtually a PHP novice. Although 
I am a long-time webmaster, and have used PHP for some years to give visitors 
access to information in my SQL database, this is my first attempt to use it 
for another purpose. I have browsed the mailing list archives and have searched 
online but have not yet succeeded in teaching myself how to do what I want to 
do. This need not provoke a lengthy discussion or involve extensive 
hand-holding - if someone can point to an appropriate code sample or online 
tutorial that might do the trick.

I am the author of a number of PDF files that serve as genealogical reference 
works. My problem is that there are a number of sites which are posing as 
search engines and which display my PDF files in their entirety on their own 
sites. These pirate sites are not simply opening a window that displays my 
files as they appear on my site. They are using Google Docs to display copies 
of my files that are cached or stored elsewhere online. The proof of that is 
that I can modify one of my files and upload it to my site. The file, as seen 
on my site, immediately displays the modification. The same file, as displayed 
on the pirate sites, is unmodified and may remain unmodified for weeks.

It is obvious that my files, which are stored under public_html, are being 
spidered and then stored or cached. This displeases me greatly. I want my 
files, some of which have cost an enormous amount of work over many years, to 
be available only on my site. Legitimate search engines, such as Google, may 
display a snippet, but they do not display the entire file - they link to my 
site so the visitor can get the file from me.

A little study has indicated to me that if I store those files in a folder 
outside the web root and use PHP to provide access they will not be spidered. 
Writing a PHP script to provide access to the files in that folder is what I 
need help with. I have experimented with a number of code samples but have not 
been able to make things work. Could any of you point to code samples or 
tutorials that might help me? Remember that, aside from the code I have written 
to handle my SQL database I am a PHP novice.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Jen Rasmussen
-Original Message-
From: Dale H. Cook [mailto:radiot...@plymouthcolony.net] 
Sent: Wednesday, March 13, 2013 3:38 PM
To: php-general@lists.php.net
Subject: [PHP] Accessing Files Outside the Web Root

Let me preface my question by noting that I am virtually a PHP novice.
Although I am a long-time webmaster, and have used PHP for some years to
give visitors access to information in my SQL database, this is my first
attempt to use it for another purpose. I have browsed the mailing list
archives and have searched online but have not yet succeeded in teaching
myself how to do what I want to do. This need not provoke a lengthy
discussion or involve extensive hand-holding - if someone can point to an
appropriate code sample or online tutorial that might do the trick.

I am the author of a number of PDF files that serve as genealogical
reference works. My problem is that there are a number of sites which are
posing as search engines and which display my PDF files in their entirety on
their own sites. These pirate sites are not simply opening a window that
displays my files as they appear on my site. They are using Google Docs to
display copies of my files that are cached or stored elsewhere online. The
proof of that is that I can modify one of my files and upload it to my site.
The file, as seen on my site, immediately displays the modification. The
same file, as displayed on the pirate sites, is unmodified and may remain
unmodified for weeks.

It is obvious that my files, which are stored under public_html, are being
spidered and then stored or cached. This displeases me greatly. I want my
files, some of which have cost an enormous amount of work over many years,
to be available only on my site. Legitimate search engines, such as Google,
may display a snippet, but they do not display the entire file - they link
to my site so the visitor can get the file from me.

A little study has indicated to me that if I store those files in a folder
outside the web root and use PHP to provide access they will not be
spidered. Writing a PHP script to provide access to the files in that folder
is what I need help with. I have experimented with a number of code samples
but have not been able to make things work. Could any of you point to code
samples or tutorials that might help me? Remember that, aside from the code
I have written to handle my SQL database I am a PHP novice.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Have you tried keeping all of your documents in one directory and blocking
that directory via a robots.txt file?
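
For example, with all the documents gathered under one directory (the
directory name here is hypothetical), the robots.txt entry would just be:

User-agent: *
Disallow: /pdfs/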

Jen





-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Dan McCullough
Web bots can ignore the robots.txt file; most scrapers would.
On Mar 13, 2013 4:59 PM, Jen Rasmussen j...@cetaceasound.com wrote:

 -Original Message-
 From: Dale H. Cook [mailto:radiot...@plymouthcolony.net]
 Sent: Wednesday, March 13, 2013 3:38 PM
 To: php-general@lists.php.net
 Subject: [PHP] Accessing Files Outside the Web Root

 Let me preface my question by noting that I am virtually a PHP novice.
 Although I am a long-time webmaster, and have used PHP for some years to
 give visitors access to information in my SQL database, this is my first
 attempt to use it for another purpose. I have browsed the mailing list
 archives and have searched online but have not yet succeeded in teaching
 myself how to do what I want to do. This need not provoke a lengthy
 discussion or involve extensive hand-holding - if someone can point to an
 appropriate code sample or online tutorial that might do the trick.

 I am the author of a number of PDF files that serve as genealogical
 reference works. My problem is that there are a number of sites which are
 posing as search engines and which display my PDF files in their entirety
 on
 their own sites. These pirate sites are not simply opening a window that
 displays my files as they appear on my site. They are using Google Docs to
 display copies of my files that are cached or stored elsewhere online. The
 proof of that is that I can modify one of my files and upload it to my
 site.
 The file, as seen on my site, immediately displays the modification. The
 same file, as displayed on the pirate sites, is unmodified and may remain
 unmodified for weeks.

 It is obvious that my files, which are stored under public_html, are being
 spidered and then stored or cached. This displeases me greatly. I want my
 files, some of which have cost an enormous amount of work over many years,
 to be available only on my site. Legitimate search engines, such as Google,
 may display a snippet, but they do not display the entire file - they link
 to my site so the visitor can get the file from me.

 A little study has indicated to me that if I store those files in a folder
 outside the web root and use PHP to provide access they will not be
 spidered. Writing a PHP script to provide access to the files in that
 folder
 is what I need help with. I have experimented with a number of code samples
 but have not been able to make things work. Could any of you point to code
 samples or tutorials that might help me? Remember that, aside from the code
 I have written to handle my SQL database I am a PHP novice.

 Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
 Plymouth Co. MA Coordinator for the USGenWeb Project
 Administrator of http://plymouthcolony.net


 --
 PHP General Mailing List (http://www.php.net/)
 To unsubscribe, visit: http://www.php.net/unsub.php


 Have you tried keeping all of your documents in one directory and blocking
 that directory via a robots.txt file?

 Jen





 --
 PHP General Mailing List (http://www.php.net/)
 To unsubscribe, visit: http://www.php.net/unsub.php




Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Marc Guay
 Have you tried keeping all of your documents in one directory and blocking
 that directory via a robots.txt file?

These don't sound like robots that would respect a txt file to me.

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Dale H. Cook
At 04:58 PM 3/13/2013, Jen Rasmussen wrote:

Have you tried keeping all of your documents in one directory and blocking
that directory via a robots.txt file?

A spider used by a pirate site does not have to honor robots.txt, just as a 
non-Adobe PDF utility does not have to honor security settings imposed by 
Acrobat Pro. The use of robots.txt would succeed mainly in blocking major 
search engines, which are not the problem.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net  


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread Dale H. Cook
At 05:04 PM 3/13/2013, Dan McCullough wrote:
Web bots can ignore the robots.txt file; most scrapers would.

and at 05:06 PM 3/13/2013, Marc Guay wrote:

These don't sound like robots that would respect a txt file to me.

Dan and Marc are correct. Although I used the terms spiders and pirates I 
believe that the correct term, as employed by Dan, is scrapers, and that 
term might be applied to either the robot or the site which displays its 
results. One blogger has called scrapers the arterial plaque of the Internet. 
I need to implement a solution that allows humans to access my files but 
prevents scrapers from accessing them. I will undoubtedly have to implement 
some type of challenge-and-response in the system (such as a captcha), but as 
long as those files are stored below the web root a scraper that has a valid 
URL can probably grab them. That is part of what the public in public_html 
implies.

One of the reasons why this irks me is that the scrapers are all commercial 
sites, but they haven't offered me a piece of the action for the use of my 
files. My domain is an entirely non-commercial domain, and I provide free 
hosting for other non-commercial genealogical works, primarily pages that are 
part of the USGenWeb Project, which is perhaps the largest of all 
non-commercial genealogical projects.

Dale H. Cook, Member, NEHGS and MA Society of Mayflower Descendants;
Plymouth Co. MA Coordinator for the USGenWeb Project
Administrator of http://plymouthcolony.net 


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: FW: [PHP] Accessing Files Outside the Web Root

2013-03-13 Thread David Robley
Dale H. Cook wrote:

 At 05:04 PM 3/13/2013, Dan McCullough wrote:
Web bots can ignore the robots.txt file; most scrapers would.
 
 and at 05:06 PM 3/13/2013, Marc Guay wrote:
 
These don't sound like robots that would respect a txt file to me.
 
 Dan and Marc are correct. Although I used the terms spiders and
 pirates I believe that the correct term, as employed by Dan, is
 scrapers, and that term might be applied to either the robot or the
 site which displays its results. One blogger has called scrapers the
 arterial plaque of the Internet. I need to implement a solution that
 allows humans to access my files but prevents scrapers from accessing
 them. I will undoubtedly have to implement some type of
 challenge-and-response in the system (such as a captcha), but as long as
 those files are stored below the web root a scraper that has a valid URL
 can probably grab them. That is part of what the public in public_html
 implies.
 
 One of the reasons why this irks me is that the scrapers are all
 commercial sites, but they haven't offered me a piece of the action for
 the use of my files. My domain is an entirely non-commercial domain, and I
 provide free hosting for other non-commercial genealogical works,
 primarily pages that are part of the USGenWeb Project, which is perhaps
 the largest of all non-commercial genealogical projects.
 

readfile() is probably where you want to start, in conjunction with a 
captcha or similar
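
Something along these lines, as a sketch (the paths and page names are
invented, and the captcha form is assumed to set the session flag on a
correct answer):

<?php
// Deliver the PDF only after the visitor has passed the captcha page.
session_start();
if (empty($_SESSION['captcha_passed'])) {  // set by the captcha form
    header('Location: /captcha.php');      // hypothetical challenge page
    exit;
}
header('Content-Type: application/pdf');
readfile('/home/account/pdf_store/plymouth.pdf'); // outside the web root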

-- 
Cheers
David Robley

Catholic (n.) A cat with a drinking problem.


-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php