Re: [Boston.pm] Tech Meeting Followup
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

> Have you ever noticed a google resultset entry that didn't have a
> cache link? I don't know if it is something that a publisher can set
> programmatically or if it is a business arrangement.

Pages are cached by default. To get removed, you have to request it:

http://www.google.com/help/features.html#cached

> Advertising based news sites will probably be even less appreciative
> of mirroring and caching as more and more of them turn into
> registration based sites.

You misunderstand. If registration is required, a crawler will fail
anyway, and I don't mean anything but crawler-like behavior with depth
1. I'm talking about common-sense caching, not "how can we defeat this
site so all their content is mirrored". You could even have a redirect
link that sends you to the cached version iff the original site is
unresponsive, so normal users never know what happened.

Ted

___
Boston-pm mailing list
[EMAIL PROTECTED]
http://mail.pm.org/mailman/listinfo/boston-pm
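[Editorial sketch] Ted's "send people to the cache only if the original
site is unresponsive" idea could look roughly like this in Perl. This is
a sketch, not a working service: the URLs are invented, and the network
probe (which needs LWP) only runs when RUN_PROBE is set in the
environment.

```perl
#!/usr/bin/perl
# Sketch of the fallback-redirect idea: send the visitor to the
# original URL, and only fall back to the cached copy when the origin
# doesn't answer. URLs below are hypothetical.
use strict;
use warnings;

# Pure decision: given whether the origin answered, pick a target URL.
sub pick_target {
    my ($origin_ok, $original, $cached) = @_;
    return $origin_ok ? $original : $cached;
}

# Network probe: a quick HEAD request with a short timeout.
# Requires LWP, loaded lazily so the rest of the script runs without it.
sub origin_alive {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 3);
    return $ua->head($url)->is_success;
}

if ($ENV{RUN_PROBE}) {
    my $original = 'http://www.example.com/story.html';   # assumed
    my $cached   = 'http://cache.example.org/story.html'; # assumed
    my $target = pick_target(origin_alive($original), $original, $cached);
    print "Location: $target\r\n\r\n";   # CGI-style redirect
}
```

The decision logic is split out from the probe so the redirect policy
itself is trivial to test; a real deployment would also want to cache
the liveness result briefly instead of probing the origin on every hit.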
Re: [Boston.pm] Tech Meeting Followup
On Fri, 6 Aug 2004, Andrew M. Langmead wrote:

> On Aug 5, 2004, at 10:01 PM, Ted Zlatanov wrote:
>
>> I meant it in the sense of the Google cache, where you have an
>> alternative in case the main one goes down, but the main link is
>> prominent and obviously the one to follow.
>
> Have you ever noticed a google resultset entry that didn't have a
> cache link? I don't know if it is something that a publisher can set
> programmatically or if it is a business arrangement.

That, or something simpler like...

    <meta http-equiv="Pragma" content="no-cache">

The fact that it's an 'http-equiv' call suggests to me that if Apache
(or whatever) was adjusted to emit a 'Pragma: no-cache' header with
all responses, then you'd get the same result at the server level. My
assumption has always been that Google just honors this directive.

> Advertising based news sites will probably be even less appreciative
> of mirroring and caching as more and more of them turn into
> registration based sites.

How about if the page content is cached by Slashdot, but the images --
and in particular, the advertising graphics -- are passed through to
the original site? That way, Slashdot takes the bandwidth hit and the
original site doesn't miss out on the advertising impressions.

Of course, implementing this might be a pain. You'd probably want to
cache the main graphics -- any photos with the article, any page
furniture, logos, etc. -- while passing the ad graphics back to the
original publisher. To do this, every site would have to be a special
case: NYT ads come from www.nytimes.com (so filtering them from other
links may be tricky for the caching site), while Boston.com ads come
from rmedia.boston.com (so they would be easy for a caching site to
capture -- but you'd have to figure out this easy special case for
every news site you'd want to link to...). Also, for anyone with a
browser set to reject images coming from an alternate server (I think
Mozilla has had this for a while), this would break the whole scheme.

So maybe that wouldn't work. But at least it tries to address the
publisher's needs...

--
Chris Devers
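[Editorial note] The server-level variant Chris speculates about can be
done with Apache's mod_headers; the directive below assumes that module
is loaded. For what it's worth, Google's documented way to suppress the
cache link is a robots meta tag rather than Pragma.

```apache
# Emit "Pragma: no-cache" with every response (requires mod_headers)
Header set Pragma "no-cache"
```

```html
<!-- Google's documented opt-out for the cached-copy link -->
<meta name="robots" content="noarchive">
```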
Re: [Boston.pm] Tech Meeting Followup
At 5:34 PM -0400 8/5/04, Aaron Sherman wrote:

> On Thu, Aug 05, 2004 at 02:38:50PM -0400, Dan Sugalski wrote:
>
>> This is a very, *very* good thing. It had a link to my blog on it
>> -- a blog which lives on my server, which is in a closet in my
>> daughter's bedroom, behind a DSL line with a 256kbit upstream rate.
>
> Sorry about that. I've been visiting your blog for a long while, and
> always thought it was part of a commercial service. Nicely done ;-)

Always give the illusion of competence, that's what I say! :)

>> If it hit the main page, my server'd have burst into flames, which
>> would've been no fun at all. (Slashdot really ought to check with
>> linkees before posting stories, but alas they don't.)
>
> Apologies for over-stepping. I just thought that someone who cared
> about the project and its goals should submit the story before it
> showed up as "Perl's Parrot Loses Performance Benchmark, Gets Pied",
> and I admit I rushed it a bit for that reason (I'd word the
> submission a bit better next time around as well).

Ah, I'm not mad. I just got that moment of terror when I popped over
to the screen session running apachetop and saw slashdot start showing
up in the referrers. That was... not fun.

--
Dan

--------------------------------it's like this--------------------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
Re: [Boston.pm] Wiki
hi ( 04.08.05 19:31 -0400 ) Uri Guttman:

> at least we should have meeting info and directions, talk subjects,
> a who's who page of members, job stuff?, boston perl things
> (what??), etc.

maybe some of those things go on the main web site -- don't need to
duplicate lists. a wiki is better for putting things up to share [and
modify], little one-off projects, [buzzword alert] collaboration.

why not just run it up the flagpole and see who salutes!

--
\js
oblique strategy: the inconsistency principle
Re: [Boston.pm] Tech Meeting Followup
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

> On Aug 6, 2004, at 6:14 AM, Ted Zlatanov wrote:
>
>> You misunderstand. If registration is required, a crawler will fail
>> anyway,
>
> Unless the crawler is itself registered. If I wrote a crawler, I'd
> keep a database of usernames and passwords for this purpose.

That's not a typical web crawler, and obviously not what I meant. Such
databases already exist (e.g. bugmenot), but using them to rip a page
is definitely abusive. Think Google, not rip-off.

Ted
Re: [Boston.pm] Tech Meeting Followup
>>>>> "TZ" == Ted Zlatanov [EMAIL PROTECTED] writes:

  >> You misunderstand. If registration is required, a crawler will
  >> fail anyway,

  > Unless the crawler is itself registered. If I wrote a crawler,
  > I'd keep a database of usernames and passwords for this purpose.

  TZ> That's not a typical web crawler, and obviously not what I
  TZ> meant. Such databases already exist (e.g. bugmenot) but using
  TZ> them to rip a page is definitely abusive. Think Google, not
  TZ> rip-off.

i wrote a crawler for a client that did just that (it even had a paid
registration for the wall street journal). it was specifically
crawling newspapers and publications only, so it had to register for
some. it was not meant for archiving or public use (commercial use
only). hard to say whether it violated any policies, but that was
their problem, not mine. and i don't think they ever went live.

uri

--
Uri Guttman  --  [EMAIL PROTECTED]  --  http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs  --  http://jobs.perl.org
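[Editorial sketch] The registered-crawler approach uri describes --
per-site credentials, log in once, keep the session cookie -- would
conventionally be built on LWP. Everything below is hypothetical: the
site name, login URL, form field names, and credentials are invented,
and the network part only runs when RUN_CRAWLER is set.

```perl
#!/usr/bin/perl
# Sketch of a crawler that logs in before fetching registration-walled
# pages. All names and credentials here are made up.
use strict;
use warnings;

# The "database of usernames and passwords" (here just a hash).
my %accounts = (
    'news.example.com' => { user => 'crawler1', pass => 'secret' },
);

# Build login form data for a site. Field names are assumptions;
# a real crawler would inspect each site's actual login form.
sub login_form {
    my ($site) = @_;
    my $acct = $accounts{$site} or return;
    return [ username => $acct->{user}, password => $acct->{pass} ];
}

if ($ENV{RUN_CRAWLER}) {
    require LWP::UserAgent;
    require HTTP::Cookies;
    my $ua = LWP::UserAgent->new(
        agent      => 'pm-crawler/0.1',
        cookie_jar => HTTP::Cookies->new,  # session cookie survives login
    );
    my $site = 'news.example.com';
    # POST the login, then fetch pages with the resulting session cookie.
    $ua->post("http://$site/login", login_form($site));
    my $page = $ua->get("http://$site/todays-paper.html");
    print $page->decoded_content if $page->is_success;
}
```

The cookie jar is the important part: the login response sets a session
cookie, and LWP then sends it automatically on every later request to
that site.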
Re: [Boston.pm] Tech Meeting Followup
> That's not a typical web crawler, and obviously not what I meant.
> Such databases already exist (e.g. bugmenot) but using them to rip a
> page is definitely abusive.

Not abusive at all. It's a public service.

> Think Google, not rip-off.

Go to news.google.com and you will see many results that say things
like "Kansas City Star (subscription)". So the Google crawler does
indeed subscribe to some registration-required sites and crawl them.
[Boston.pm] possibly off-topic: html, metadata, and a perl script???
OK, so this is off-topic in context, but the solution may be a Perl
script, so it may be on-topic in solution...

I have some short stories, excerpts, and poems that I want to put
under a Creative Commons (CC) license, CC-BY-NC. When you select a CC
license, they give you an excerpt of metadata to put into the HTML
code of the content so that search engines will know it's a funky
license and allow people to search for CC-NC works, whatever.

The metadata is here:
http://creativecommons.org/license/work-html-popup?lang=enlicense_code=by-nc

The HTML content is here:
http://www.greglondon.com/hunger/givemestorm.html

I usually use OpenOffice to create my HTML. So, where do I put this
metadata? Is there something in OO that I can put it in so it will
put it in the right place? Do I need a Perl script that takes my HTML
and inserts the metadata into it? Or can I do it in OO?

I know basically zip about HTML. Any help would be appreciated...

Thanks,
Greg
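[Editorial sketch] The Perl-script route Greg asks about can be quite
short: read the page, splice the CC metadata block in just before
</body> (an assumption about where the block belongs; adjust if yours
goes in the head), and print the result. Both file names are
hypothetical and come from the command line.

```perl
#!/usr/bin/perl
# Insert a Creative Commons metadata block into an HTML file.
# Usage (hypothetical): cc-insert.pl page.html cc-metadata.txt > out.html
use strict;
use warnings;

# Splice $metadata in just before the closing </body> tag; if the
# page has no </body>, append the block at the end instead.
sub insert_metadata {
    my ($html, $metadata) = @_;
    unless ($html =~ s{(</body>)}{$metadata\n$1}i) {
        $html .= "\n$metadata\n";
    }
    return $html;
}

if (@ARGV == 2) {
    my ($html_file, $meta_file) = @ARGV;
    local $/;   # slurp whole files at once
    open my $hf, '<', $html_file or die "$html_file: $!";
    open my $mf, '<', $meta_file or die "$meta_file: $!";
    print insert_metadata(scalar <$hf>, scalar <$mf>);
}
```

Note the /i on the substitution: OpenOffice may emit `</BODY>` in upper
case, and the match should catch either.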
Re: [Boston.pm] Wiki
On Fri, Aug 06, 2004 at 10:43:20AM -0400, Uri Guttman wrote:

> JS> why not just run it up the flagpole and see who salutes!
>
> i won't touch that with a 10 foot flagpole!

A 3 foot flagczech might be easier to manage, I'd say.

-Gyepi
Re: [Boston.pm] Tech Meeting Followup
On Fri, 6 Aug 2004, [EMAIL PROTECTED] wrote:

>> That's not a typical web crawler, and obviously not what I meant.
>> Such databases already exist (e.g. bugmenot), but using them to rip
>> a page is definitely abusive.
>
> Not abusive at all. It's a public service.

It's abusive to the content provider, who pays the network
connectivity bills and expects ad revenue, regardless of how you or
anyone else feels. Note that the context is a major site's ripping of
a page so visitors never see the original site, NOT general web
visitors. I'm not interested in discussing the latter's attitude
towards web registrations, because that's completely irrelevant to
Slashdot caching.

>> Think Google, not rip-off.
>
> Go to news.google.com and you will see many results that say things
> like "Kansas City Star (subscription)". So the Google crawler does
> indeed subscribe to some registration-required sites and crawl them.

I'm not sure how that matters. We're talking about Google's HTTP
caching of ANY page, not their news items; furthermore, the focus is
on *intent* and not on *mechanism*. Google's intent with cached HTTP
crawling is clearly not to rip off advertisers.

Ted
Re: [Boston.pm] possibly off-topic: html, metadata, and a perl script???
hi ( 04.08.06 11:11 -0400 ) Greg London:

> I usually use OpenOffice to create my html.

hmm ...

> Do I need a perl script that takes my html and inserts the metadata
> into it? Or can I do it in OO?

just use a text editor [vim, notepad, the emacs operating system,
whatever] and cut and paste.

> I know basically zip about HTML.

well, this might be a good opportunity to learn something. because
it's so ubiquitous, i'm sure your efforts in learning some parts of
it won't be wasted.

--
\js
oblique strategy: straight into his lap)
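[Editorial note] To make "cut and paste" concrete, here is a bare-bones
page showing where the block from the CC popup would land. The page
structure is invented; the actual metadata block is whatever the
creativecommons.org page Greg linked hands out.

```html
<html>
  <head>
    <title>Give Me Storm</title>
  </head>
  <body>
    <p>...story text...</p>
    <!-- paste the Creative Commons license/metadata block here,
         just before the closing body tag -->
  </body>
</html>
```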