Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Ted Zlatanov
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

> Have you ever noticed a Google result-set entry that didn't have a
> cache link?  I don't know if it is something that a publisher can set
> programmatically or if it is a business arrangement.

Pages are cached by default.  To get removed, you have to request it.

http://www.google.com/help/features.html#cached
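
(For what it's worth, Google also documents a per-page opt-out: a
publisher can suppress the cache link programmatically by serving a
robots meta tag, e.g.

    <meta name="robots" content="noarchive">

which removes the Cached link without blocking indexing, so no
business arrangement is needed.)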

> Advertising-based news sites will probably be even less appreciative
> of mirroring and caching as more and more of them turn into
> registration-based sites.

You misunderstand.  If registration is required, a crawler will fail
anyway, and I don't mean anything but crawler-like behavior with depth
1.  I'm talking about common-sense caching, not "how can we defeat
this site so all their content is mirrored."  You could even have a
redirect link that sends you to the cached version iff the original
site is unresponsive, so normal users never know what happened.
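
A minimal CGI sketch of that redirect (parameter names are
hypothetical; assumes LWP is installed):

    #!/usr/bin/perl
    # Sketch: bounce the visitor to the original URL if it answers a
    # quick HEAD request, otherwise to our cached copy.
    use strict;
    use warnings;
    use CGI qw(param);
    use LWP::UserAgent;

    my $orig   = param('url');      # original article URL
    my $cached = param('cached');   # URL of our cached copy

    my $ua  = LWP::UserAgent->new(timeout => 5);
    my $res = $ua->head($orig);

    # A real version should check $orig against the list of sites we
    # actually cache, or this becomes an open redirect.
    print "Location: ", ($res->is_success ? $orig : $cached), "\r\n\r\n";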

Ted
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Chris Devers
On Fri, 6 Aug 2004, Andrew M. Langmead wrote:
> On Aug 5, 2004, at 10:01 PM, Ted Zlatanov wrote:
>> I meant it in the sense of the Google cache, where you have an
>> alternative in case the main one goes down, but the main link is
>> prominent and obviously the one to follow.
> Have you ever noticed a Google result-set entry that didn't have a
> cache link?  I don't know if it is something that a publisher can set
> programmatically or if it is a business arrangement.
That, or something simpler like...
    <meta http-equiv="Pragma" content="no-cache">
The fact that it's an 'http-equiv' attribute suggests to me that if Apache (or
whatever) were adjusted to emit a 'Pragma: no-cache' header with all
responses, then you'd get the same result at the server level.

My assumption has always been that Google just honors this directive.
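
If you wanted the server-level version, mod_headers can do it
declaratively (Header set Pragma "no-cache"), or -- this being a Perl
list -- a mod_perl 1.x fixup handler could add it to every response.
A sketch, assuming mod_perl is loaded:

    package My::NoCache;
    # Enable with "PerlFixupHandler My::NoCache" in httpd.conf; adds
    # Pragma: no-cache to the outgoing headers of every response.
    use strict;
    use Apache::Constants qw(OK);

    sub handler {
        my $r = shift;
        $r->header_out('Pragma' => 'no-cache');
        return OK;
    }
    1;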
> Advertising-based news sites will probably be even less appreciative
> of mirroring and caching as more and more of them turn into
> registration-based sites.
How about if the page content is cached by Slashdot, but the images -- 
and in particular, the advertising graphics -- are passed through to the 
original site? That way, Slashdot takes the bandwidth hit and the 
original site doesn't miss out on the advertising impressions.

Of course, implementing this might be a pain. You'd probably want to 
cache the main graphics -- any photos with the article, any page 
furniture & logos, etc. -- while passing the ad graphics back to the 
original publisher.

To do this, every site would have to be a special case -- NYT ads come 
from www.nytimes.com [and so filtering them from other links may be 
tricky for the caching site to capture], while Boston.com ads come from 
rmedia.boston.com [and so would be easy for caching sites to capture -- 
but you'd have to figure out this "easy" special case for every news 
site you'd want to link to...].
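
A rough sketch of that rewriting pass (regex-based and fragile --
HTML::Parser would be sturdier -- with the ad patterns standing in for
exactly those per-site special cases):

    #!/usr/bin/perl
    # Sketch: point <img> tags at our mirror, except images matching a
    # known ad-server pattern, which pass through to the publisher.
    use strict;
    use warnings;
    use URI::Escape qw(uri_escape);

    # One entry per news site we link to; patterns are illustrative.
    my @ad_patterns = (
        qr{^http://rmedia\.boston\.com/}i,   # Boston.com ad server
        qr{/RealMedia/|/adx/}i,              # guessed ad-path fragments
    );

    sub rewrite_images {
        my ($html, $mirror) = @_;
        $html =~ s{(<img\b[^>]*\bsrc=["']?)([^"'\s>]+)}{
            my ($pre, $src) = ($1, $2);
            (grep { $src =~ $_ } @ad_patterns)
                ? $pre . $src                               # ad: pass through
                : $pre . "$mirror?img=" . uri_escape($src); # ours: mirror it
        }gie;
        return $html;
    }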

Also, for anyone with a browser set to reject images from servers other 
than the originating one (I think Mozilla has had this for a while), 
this would break the whole scheme.

So maybe that wouldn't work.
But at least it tries to address the publisher's needs...
--
Chris Devers
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Dan Sugalski
At 5:34 PM -0400 8/5/04, Aaron Sherman wrote:
> On Thu, Aug 05, 2004 at 02:38:50PM -0400, Dan Sugalski wrote:
>> This is a very, *very* good thing. It had a link to my blog on it --
>> a blog which lives on my server, which is in a closet in my
>> daughter's bedroom, behind a DSL line with a 256kbit upstream rate.
> Sorry about that, I've been visiting your blog for a long while, and
> always thought it was part of a commercial service. Nicely done ;-)
Always give the illusion of competence, that's what I say! :)
>> If it hit the main page my server'd have burst into flames, which
>> would've been no fun at all. (Slashdot really ought to check with
>> linkees before posting stories but alas they don't)
> Apologies for overstepping; I just thought that someone who cared
> about the project and its goals should submit the story before it showed
> up as "Perl's Parrot Loses Performance Benchmark, Gets Pied," and I
> admit I rushed it a bit for that reason (I'd word the submission a bit
> better next time around as well).
Ah, I'm not mad. I just got that moment of terror when I popped over 
to the screen session running apachetop and saw slashdot start 
showing up in the referrers. That was... not fun.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Wiki

2004-08-06 Thread John Saylor
hi

( 04.08.05 19:31 -0400 ) Uri Guttman:
> at least we should have meeting info and directions, talk subjects,
> a who's who page of members, job stuff?, boston perl things (what??),
> etc.

maybe some of those things go on the main web site- don't need to
duplicate lists. wiki is better for putting things up to share [and
modify], little one-off projects, [buzzword alert] collaboration.

why not just run it up the flagpole and see who salutes!

-- 
\js oblique strategy: the inconsistency principle
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Ted Zlatanov
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

> On Aug 6, 2004, at 6:14 AM, Ted Zlatanov wrote:
>
>> You misunderstand.  If registration is required, a crawler will fail
>> anyway,
>
> Unless the crawler is itself registered.  If I wrote a crawler, I'd
> keep a database of usernames and passwords for this purpose.

That's not a typical web crawler, and obviously not what I meant.
Such databases already exist (e.g. bugmenot) but using them to rip a
page is definitely abusive.  Think Google, not rip-off.

Ted
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Uri Guttman
>>>>> "TZ" == Ted Zlatanov <[EMAIL PROTECTED]> writes:

  >>> You misunderstand.  If registration is required, a crawler will fail
  >>> anyway,

  >> Unless the crawler is itself registered.  If I wrote a crawler, I'd
  >> keep a database of usernames and passwords for this purpose.

  TZ> That's not a typical web crawler, and obviously not what I meant.
  TZ> Such databases already exist (e.g. bugmenot) but using them to rip a
  TZ> page is definitely abusive.  Think Google, not rip-off.

i wrote a crawler for a client that did just that (it even had a paid
registration for the wall street journal). it was specifically crawling
newspapers and publications, so it had to register for some. it was
not meant for archiving or public use (commercial use only). hard to say
whether it violated any policies, but that was their problem, not
mine. and i don't think they ever went live.
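
the login step was roughly this shape (a from-scratch sketch, not the
client's code -- site names and form fields are made up):

    #!/usr/bin/perl
    # sketch: log in with stored credentials, let the cookie jar keep
    # the session, then fetch pages as a registered user.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Cookies;

    # one entry per registration-required site (the real thing kept
    # these in a database)
    my %login = (
        'online.wsj.com' => {
            url  => 'http://online.wsj.com/login',   # made-up form URL
            form => { user => 'someuser', password => 'sekrit' },
        },
    );

    my $ua = LWP::UserAgent->new(
        agent      => 'NewsCrawler/0.1',
        cookie_jar => HTTP::Cookies->new(file => 'cookies.txt', autosave => 1),
    );

    sub fetch {
        my ($url) = @_;
        my ($host) = $url =~ m{^https?://([^/]+)}i;
        if (my $site = $login{$host || ''}) {
            $ua->post($site->{url}, $site->{form});  # session cookie saved
        }
        return $ua->get($url);
    }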

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Ron Newman
> That's not a typical web crawler, and obviously not what I meant.
> Such databases already exist (e.g. bugmenot) but using them to rip a
> page is definitely abusive.

Not abusive at all.  It's a public service.

> Think Google, not rip-off.

Go to news.google.com and you will see many results that say things like

    "Kansas City Star (subscription)"

So the Google crawler does indeed subscribe to some registration-required
sites and crawl them.

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


[Boston.pm] possibly off-topic: html, metadata, and a perl script???

2004-08-06 Thread Greg London
OK, so this is off-topic in context, but the solution may be a perl script, so
it may be on-topic in solution...

I have some short stories, excerpts, and poems that I want to put under a
Creative Commons (CC) license, CC-BY-NC. When you select a CC license, they
give you a snippet of metadata to put into the html code of the content so
that search engines will know it's a funky license and allow people to search
for CC-NC works, whatever.

the metadata is here:
http://creativecommons.org/license/work-html-popup?lang=en&license_code=by-nc

the html content is here:
http://www.greglondon.com/hunger/givemestorm.html

I usually use OpenOffice to create my html. So, where do I put this metadata?
Is there somewhere in OO that I can put it so that it ends up in the right
place?

Do I need a perl script that takes my html and inserts the metadata into it?
Or can I do it in OO?

I know basically zip about HTML.

Any help would be appreciated...

Thanks,
Greg

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Wiki

2004-08-06 Thread Gyepi SAM
On Fri, Aug 06, 2004 at 10:43:20AM -0400, Uri Guttman wrote:
>   JS> why not just run it up the flagpole and see who salutes!
>
> i won't touch that with a 10 foot flagpole!

A 3 foot flagczech might be easier to manage, I'd say.

-Gyepi




___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] Tech Meeting Followup

2004-08-06 Thread Ted Zlatanov
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

>> That's not a typical web crawler, and obviously not what I meant.
>> Such databases already exist (e.g. bugmenot) but using them to rip a
>> page is definitely abusive.
>
> Not abusive at all.  It's a public service.

It's abusive to the content provider, who pays the network connectivity
bills and expects ad revenue, regardless of how you or anyone else
feels.  Note that the context is a major site's ripping of a page so
that visitors never see the original site, NOT individual web visitors.
I'm not interested in discussing the latter's attitude towards web
registrations, because that's completely irrelevant to Slashdot
caching.

>> Think Google, not rip-off.
>
> Go to news.google.com and you will see many results that say things like
>
> "Kansas City Star (subscription)"
>
> So the Google crawler does indeed subscribe to some
> registration-required sites and crawl them.

I'm not sure how that matters.  We're talking about Google's HTTP
caching of ANY page, not their news items; furthermore the focus is on
*intent* and not on *mechanism*.  Google's intent with cached HTTP
crawling is clearly not to rip off advertisers.

Ted
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm


Re: [Boston.pm] possibly off-topic: html, metadata, and a perl script???

2004-08-06 Thread John Saylor
hi

( 04.08.06 11:11 -0400 ) Greg London:
> I usually use OpenOffice to create my html.

hmm ...

> Do I need a perl script that takes my html and inserts the metadata
> into it?  Or can I do it in OO?

just use a text editor [vim, notepad, the emacs operating system,
whatever] and cut and paste.
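
or, if you end up with a pile of files, a throwaway perl script does
the same paste (a sketch -- it assumes you saved the block from the
popup as cc-license.html and that it belongs just before </body>):

    #!/usr/bin/perl -i.bak
    # usage: perl add-cc.pl givemestorm.html [more .html files]
    # splices the saved license block in before each file's </body>.
    use strict;
    use warnings;

    my $snippet = do {
        open my $fh, '<', 'cc-license.html' or die "cc-license.html: $!";
        local $/;          # slurp the whole block
        <$fh>;
    };

    while (<>) {
        s{(</body>)}{$snippet\n$1}i;  # paste just before </body>
        print;                        # -i.bak edits files in place
    }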

> I know basically zip about HTML.

well, this might be a good opportunity to learn something. because it's
so ubiquitous, i'm sure your efforts in learning some parts of it won't
be wasted.

-- 
\js oblique strategy: straight into his lap)
___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm