[CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Boheemen, Peter van
Regarding Google's policy of treating NAT calls as spyware activity: I am
now proxying Google for the JSON call, so it will see the IP address of our
web server instead of the Network Address Translator's. It helps for the
moment. Now I wonder whether our own calls will make Google complain as
well; we will see.

One thing I noticed when encountering the "we're sorry ... you have
spyware" message from Google is the following link in that message:

http://www.google.com/support/bin/answer.py?answer=86640

Very interesting! I guess many of you are using a metasearch engine to
search Google. This message explicitly states that doing so is against
Google's terms of service (http://www.google.com/terms_of_service.html).

Peter


Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Jonathan Rochkind

Wait, now ALL of your clients' calls are coming from one single IP?
Surely that will trigger Google's detectors, if the NAT did. Keep us
updated, though.

Jonathan


--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Joe Hourcle

On Tue, 18 Mar 2008, Jonathan Rochkind wrote:

> Wait, now ALL of your clients' calls are coming from one single IP?
> Surely that will trigger Google's detectors, if the NAT did. Keep us
> updated, though.


I don't know what Peter's exact implementation is, but they might relax
the limits when they see an 'X-Forwarded-For' header, or something else
that suggests the traffic is coming through a proxy.  It used to be pretty
common when writing rate-limiting code to use X-Forwarded-For in place of
REMOTE_ADDR so you didn't accidentally ban whole groups behind proxies.
(Of course, I don't know whether the X-Forwarded-For value is something
that's not routable (in 10/8) or the NAT IP, so it might still look like
one IP address behind a proxy.)
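The rate-limiting idea described above can be sketched roughly as follows. This is an illustrative sketch, not Google's actual logic; the window and threshold values, and the function names, are invented for the example:

```python
import time
from collections import defaultdict

# Key the request counter on the first X-Forwarded-For address when the
# header is present, falling back to the socket address (REMOTE_ADDR),
# so a whole campus behind one proxy is not treated as a single client.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # illustrative threshold

_hits = defaultdict(list)  # client key -> list of request timestamps

def client_key(remote_addr, xff_header=None):
    """Prefer the originating client from X-Forwarded-For, else REMOTE_ADDR."""
    if xff_header:
        return xff_header.split(",")[0].strip()
    return remote_addr

def allow_request(remote_addr, xff_header=None, now=None):
    """Return True if this client is still under the rate limit."""
    now = time.time() if now is None else now
    key = client_key(remote_addr, xff_header)
    recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
    recent.append(now)
    _hits[key] = recent
    return len(recent) <= MAX_REQUESTS
```

As Joe notes, this only helps if the forwarded-for value is a real, distinct client address rather than a NAT-internal one.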

Also, using a caching proxy (if the responses are cacheable) might reduce
the total number of requests actually going to Google.

I would assume they need some consideration for proxies; I remember the
days when AOL's proxy servers funneled all requests through fewer than a
dozen unique IP addresses.  (Or at least, those were the only ones hitting
my servers.)

-Joe


Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Jonathan Rochkind

Nice. X-Forwarded-For would also allow Google to deliver availability
information suitable for the actual location of the end user, if their
software chooses to pay attention to it. That location problem was the
objection to server-side API requests voiced to me by a Google person.
(By proxying everything through the server, you are essentially doing what
I wanted to do in the first place but Google told me they would not allow.
Ironic if you have more luck with that than with the actual client-side
AJAX requests that Google said they required!)

Thanks for alerting us to X-Forwarded-For; that's a good idea.

Jonathan



Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Boheemen, Peter van
I don't think I do anything sophisticated like X-Forwarded-For. I just have a
ProxyPass directive in the Apache configuration telling it to reverse proxy a
directory to Google:

ProxyPass /googlebooks http://books.google.com/books
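For reference, a fuller version of this reverse-proxy setup might look like the sketch below. The cache directives and paths are illustrative additions, not Peter's actual configuration, and assume mod_proxy, mod_proxy_http, and (for the optional caching) mod_cache/mod_disk_cache are loaded:

```apache
# Reverse proxy a local path to Google Books.
ProxyRequests Off
ProxyPass        /googlebooks http://books.google.com/books
ProxyPassReverse /googlebooks http://books.google.com/books

# Optionally cache responses locally to cut the number of requests that
# actually reach Google (only helps if the responses are cacheable).
CacheEnable disk /googlebooks
CacheRoot /var/cache/apache2/googlebooks
```

ProxyPassReverse rewrites Location headers in redirects so clients stay on the proxy path.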
 
But even if Google did something with an X-Forwarded-For header, it could not
see where the actual user is located. Behind a NAT, 10.0.0.0/8 addresses are
usually used. In fact, which IP addresses are used behind the NAT hardly
matters: since they are not exposed to the outside world, they only need to be
unique within the network behind the NAT.
 
Anyway, since we only hit Google Books from the server when a user asks for
display of a full record, I hardly expect that will trip Google's defenses.
I suspect that the few thousand PCs on the university campus hitting Google
cause the problem, and that Google Books in particular reacts to it. (I can
still search Google when Google Books rejects access from my IP address.)
I'll keep you informed.
 
Peter

 
Drs. P.J.C. van Boheemen
Hoofd Applicatieontwikkeling en beheer - Bibliotheek Wageningen UR
Head of Application Development and Management - Wageningen University and
Research Library
tel. +31 317 48 25 17
http://library.wur.nl
Please consider the environment before printing this e-mail





Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Jonathan Rochkind

I believe Apache's ProxyPass _will_ send an X-Forwarded-For header for
you. But you're right that the forwarded-for IP address will, in your
case, be an internal-only IP that means nothing to Google, if it's there
at all. But who knows what Google's traffic-defender routines do; maybe
the mere presence of X-Forwarded-For tells them not to limit you, even
though the forwarded-for IP is meaningless.

Who knows, and Google probably won't say (because they don't want to give
extra information to people maliciously trying to get around it).

Do please keep us updated on whether this new solution works and prevents
the traffic-limiting defense you were hitting before. If it does, the
question will be why, but X-Forwarded-For (which I _think_ ProxyPass
sends) may indeed be the answer.
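Why the forwarded-for IP would be "meaningless" can be made concrete: RFC 1918 addresses (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are not routable on the public internet, so they identify no location outside the campus. A small sketch (the function name is invented for illustration):

```python
import ipaddress

def usable_for_geolocation(forwarded_for):
    """True only if the first X-Forwarded-For hop is a public, routable IP."""
    first_hop = forwarded_for.split(",")[0].strip()
    return ipaddress.ip_address(first_hop).is_global

# A NAT-internal 10/8 address carries no usable location information,
# while a public address at least identifies a routable network.
usable_for_geolocation("10.0.0.42")    # NAT-internal, not usable
usable_for_geolocation("137.224.1.1")  # public address, usable
```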

Jonathan



Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-03-18 Thread Kent Fitch
I'd be very surprised if Google _automatically_ took any notice of
anything in an HTTP header to relax protection against what they consider
harvesting of data, because all HTTP headers can be set to anything: that
is, if I wanted to suck Google dry of bib data, I could simply pretend to
be forwarding requests for "real" clients behind a NAT barrier.
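Kent's point takes only a few lines to demonstrate. This sketch builds (but never sends) a request with a forged forwarded-for chain; the URL parameters follow the shape of the Book Search calls discussed in this thread, and the client addresses are illustrative TEST-NET values:

```python
import urllib.request

# Any HTTP client can claim to be a proxy: nothing stops a harvester from
# inventing an X-Forwarded-For chain of supposed "real" clients.
req = urllib.request.Request(
    "http://books.google.com/books?bibkeys=ISBN:0471958697&jscmd=viewapi",
    headers={"X-Forwarded-For": "192.0.2.10, 192.0.2.11"},  # forged chain
)
# The request now carries the forged header exactly as a real proxy would
# send it; the server has no way to verify any hop in the list.
```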

But they may well investigate such cases and configure their
traffic-monitoring software to allow known legitimate proxies.

Kent Fitch



Re: [CODE4LIB] Restricted access to free covers from Google :)

2008-04-08 Thread Boheemen, Peter van
Hi Jonathan,

It is indeed working with the ProxyPass directive in Apache. Now Google sees
the IP address of the server and apparently this does not create too much
traffic. However, at busy times users might still see the "we're sorry" page
when they click on the link the API creates and travel through the NAT
gateway. At least they can then enter the captcha and continue.
Today I scaled the service up to cover books that do not have an ISBN. Google
also accepts LCCNs and OCLC numbers as book identifiers, but neither number is
present in our catalog. All of our books are in WorldCat, so there must be a
link. I asked OCLC PICA (the Dutch branch of OCLC) to provide me with a
service that returns an OCLC number when I present it our national catalog
number. They were very cooperative
(http://webquery.blogspot.com/2008/03/hooray-for-oclc-pica-customer-response.html)
and built this service (first they answered with a plain-text response, but at
my request they changed it into a true XML service). So now I call this
service for the OCLC number and use that to invoke the Google Books API when
an ISBN is missing from the catalog record. It works fine.

I only find that Google's service is very slow. When I watch the responses
with Firebug, I see that the Google API takes about 10-20 times as long
(130-250 ms) as the local parts of the page, and twice as long as an Amazon
book-cover lookup. And that is when all goes well: around midday, response
times slow down to 1.6 seconds and at some moments to over 30 seconds; five
minutes later they can be back to normal. I checked books.google.com at such
a moment and it did not respond at all. I guess they have heavily underpowered
Google Books. Have you noticed this as well?
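One way to keep such slow lookups from dragging down the full-record page is to time them and cap them with a timeout. This is a hedged sketch of that idea, not Peter's actual code; the function name and timeout value are invented for illustration:

```python
import time
import urllib.request

def timed_lookup(url, timeout=2.0):
    """Fetch url, but give up after `timeout` seconds.

    Returns (body, elapsed_seconds); body is None on timeout or error,
    so a slow or dead backend degrades to "no cover" instead of
    stalling the whole page.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except OSError:
        body = None  # timeouts and connection errors both mean "skip the cover"
    return body, time.monotonic() - start
```

Logging the elapsed time per lookup would also make slowdowns like the midday spikes described above easy to chart.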

Google got in touch with me about the problem and asked me where they could
see the service. That won't help, since they will not get past our NAT
gateway. However, I will contact them, also about the poor response times.
The way we have implemented it now, our full-record presentation performance
is heavily influenced by the Google Books response times.
I haven't had time to get back to them, because I have been busy organizing
the yearly European Library Automation Group (ELAG) meeting, which we will
host next week: http://library.wur.nl/elag2008

I'll CC this message to the list; it may be of use to others, and I wonder
how others experience the Google Books performance.

Peter




From: Jonathan Rochkind [mailto:[EMAIL PROTECTED]]
Sent: Tue 2008-04-08 18:33
To: Boheemen, Peter van
Subject: Re: [CODE4LIB] Restricted access to free covers from Google :)



Hi Pete, I'd be interested in an update on this. Is your ProxyPass with
Apache to access the Google Books API still working well for you, and not
running into Google's traffic limiters? You haven't actually communicated
with Google to set up something special, have you?

I'm interested in trying a similar thing here.

Jonathan
