Re: Help on Plucker-izing a newsbot

David A. Desrosiers Thu, 23 Jan 2003 05:54:37 -0800

> I have developed a newsbot that collects, categorizes and ranks news
> stories:


        Have you seen NewsBlaster[1]? I've been using that for quite a while
to scrape my news for me, and I pluck it every 4 hours or so, to keep it
fresh.

> One feature I have added recently is to have memigo grab the
> printer-friendly version of each article.  That version should be better
> suited to PDAs (thereby allowing me to fulfill my original goal), correct?

        It depends. Sites like Register.com, News.com, etc. happen to have
both a PDA-sized version of their site and a syndicated version in RDF/XML
format, which you can probably grab and parse easily as well.
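
        As a rough illustration of parsing such a syndicated feed, here is a
minimal Python sketch that assumes the feed is RSS 1.0 (RDF Site Summary).
The function name parse_rss10 and the sample feed are made up for the
example and do not come from any of the sites mentioned; a real feed may use
a different format or extra namespaces.

```python
import xml.etree.ElementTree as ET

# Namespaces used by RSS 1.0 (RDF Site Summary) documents.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}


def parse_rss10(xml_text):
    """Return a list of (title, link) tuples from an RSS 1.0 document."""
    root = ET.fromstring(xml_text)
    stories = []
    # In RSS 1.0 the <item> elements are direct children of <rdf:RDF>.
    for item in root.findall("rss:item", NS):
        title = item.findtext("rss:title", default="", namespaces=NS)
        link = item.findtext("rss:link", default="", namespaces=NS)
        stories.append((title.strip(), link.strip()))
    return stories


# Hypothetical sample feed, not taken from any real site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.com/">
    <title>Example</title>
    <link>http://example.com/</link>
  </channel>
  <item rdf:about="http://example.com/story1">
    <title>First story</title>
    <link>http://example.com/story1</link>
  </item>
</rdf:RDF>"""

for title, link in parse_rss10(SAMPLE_FEED):
    print(title, link)
```

        The (title, link) pairs could then be fed to whatever ranks or
fetches the stories.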

> I now serve the top-ranked articles in their "lite" versions (if
> available), here: http://memigo.com/now

        Are you storing this content locally on your server? I'd check your
local laws to make sure this is legal. Reproduction of a company's
content without their consent (especially if you're taking out banner ads
and such) may not be legal. It certainly isn't legal to do that in the U.S.

> * It seems to me that some sites switch to their full version if the
> referrer of their "printer-friendly" version is not the full version.
> Workarounds?

        Pass the referer when you fetch the content. I do this all the time
with my Perl spider, and it works well, but be careful not to abuse the
site's rules (e.g. do NOT ignore robots.txt).
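
        For what it's worth, here is a minimal Python sketch of the same
idea (my own spider is Perl, so this is not that code). It assumes the
standard urllib modules; the bot name and the helper names
allowed_by_robots and build_request are hypothetical.

```python
import urllib.request
import urllib.robotparser

USER_AGENT = "example-newsbot/0.1"  # hypothetical bot name


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, referer, user_agent=USER_AGENT):
    """Build a request that carries a Referer header.

    The referer would normally be the URL of the full article that links
    to the printer-friendly page.
    """
    headers = {"Referer": referer, "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)
```

        A caller would check allowed_by_robots() first and only then pass
the result of build_request() to urllib.request.urlopen().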

> * New York Times.  Need I say more?  how can I (never mind whether I
> should) get around their restrictions?

        Which restrictions are they using that you can't compensate for?
Cookies? Referers? User-Agent? Just pass those all back to the server when
you fetch it, and you should be all set. Cookies are easy to support, and
the rest are fairly easily done as well, though I'm not sure how Python
handles these things.
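
        One way this might look in Python, as a sketch only: the standard
library's http.cookiejar and urllib.request can keep cookies between
requests and send a fixed User-Agent. The helper name make_opener is made
up for the example, and nothing here is specific to any particular site.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent):
    """Return (opener, jar): an opener that remembers cookies in jar.

    Cookies set by one response are stored in the jar and sent back
    automatically on later requests made through the same opener.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

        Calling opener.open(url) for a login or landing page first, and
then for the article itself, would pass back whatever cookies the site
handed out on the first request.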

> However, I don't know how to discover those (as AvantGo hides the Channel
> URLs, as I found out just recently).

        But I do ;)

        Seriously though, I've been collecting these URLs for quite some
time. Check out the wiki I created to capture them all at:

        http://openurls.plkr.org/

        Please try to help us sort and categorize the existing URLs found
there. Add the ones you feel are missing or correct the ones you feel are
wrong.

> * Memigo can customize pages to each user with a simple GET (not yet; more
> on this below).  What do you think are meaningful customizations for
> Plucker clients?

        Please use POST, not GET. GET parameters are carried in the URL, so
they are easy to spoof and to sniff (they show up in logs and referers).

> * Pre-built Plucker DBs.  I am not sure about this one.  Memigo is 100%
> Python (yeah! I kept that last one for the end) and could integrate nicely
> with the Plucker code, when I get around to figuring it out :-)  However,
> I don't know if there's enough added value for this, much less if it's
> legally/ethically advisable.

        Depending on where you are, it is probably not legal. Just fetch the
content live when the user requests it; don't store it on your server in a
batched mode.

> This is my pet project and I'd really appreciate any thoughtful feedback.

        Sounds like a fun project. Keep us on the list updated when you add
new things. I'm sure many users would be interested in the features you add.



d.


[1] http://www.cs.columbia.edu/nlp/newsblaster/frame_content.html


_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
