Re: Plucking meta news sites??

Greg Copeland Wed, 22 Sep 2004 07:15:41 -0700

Thanks for the replies.  I'll look into the solutions offered.
Having said that, I think the answers focused too much on /.
specifically, rather than on a generic solution to the issue.

On Tue, 2004-09-21 at 19:26, David A. Desrosiers wrote:
> > Let's say I want to pluck slashdot.  If I set max depth to 2, while 
> > making it stay on host, I won't get the article, rather, only the 
> > article talking about the article.  That's about worthless.
> 
>       Have you tried using http://slashdot.org/palm/ ?
> 

I use this already.  But you're focusing too specifically on slashdot
rather than a generic solution.

> > If I remove the stayonhost restriction, it quickly spiders far too much 
> > stuff, mostly, away from the meta site, which makes the pdb grow far too 
> > large and simply wastes time spidering it.
> 
>       Try using staybelow="http://slashdot.org/palm/", but realize 
> you'll miss the top header image on the main page of that site. If you 
> want the main image, try using stayonhost with a lower maxdepth. If that 
> doesn't work, try stayondomain.
> 

How will staybelow grab the article in question?  I don't think this
satisfies the problem of meta news sites very well.  Perhaps I'm not
using it correctly?  Perhaps I misunderstand what it does?

> > --maxdepth 5
> 
>       A depth of 5 is extremely excessive. 3 is the most I've ever seen 
> anyone require for a site like Slashdot.
> 

The two options (one of which does not currently exist, and has been
removed from the quoted context) were strictly illustrative.

> > Is there any way to do what I'm wanting to do without modifying plucker?
> 
>       Sure, dozens of ways. Each site requires custom treatment, on a 
> site-by-site basis. Just be careful with a site like Slashdot. If you 
> spider it too much, or too often, they'll ban your IP from being able to 
> reach the site again.
> 

Hmm.  Okay, please let me know a couple of ways.  So far I haven't seen
a valid option other than the way I'm doing it now, which is less than
satisfactory because it simply grabs too much off-site junk.

Let me explain the concept again, because it appears that I didn't
explain it very well.  I would like to spider meta-news sites.
Meta-news sites differ from normal news sites in that each item is a
story about a story; the real article lives on another host.  If I use
options such as staybelow or stayonhost, then it will not spider the
actual article, merely the meta-article.  Right?  From my testing here,
neither appears to be an acceptable solution.

So it seems to me that what I'm really trying to do is support two
different spider depths, which was the concept I attempted to explain
in my original message.  Beyond that, it seems like an excellent feature
to have, for a variety of reasons.  The feature is to allow two
distinct depths to be specified: one depth for the homeurl and another
depth for off-host urls.  For meta-news sites, it seems like this would
be a good generic solution without requiring custom spiders/parsers for
various meta sites.
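
To make the two-depth idea concrete, here is a minimal Python sketch.
It is not Plucker code and does not use any existing Plucker option:
the function name crawl_two_depths, both depth parameters, and the
example.* URLs are made up for illustration, and an in-memory link
table stands in for actually fetching and parsing pages.  Depths here
count link hops from the home URL, which is an assumption of the sketch.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_two_depths(home_url, links, on_host_depth, off_host_depth):
    """Breadth-first walk with separate hop budgets on and off the home host.

    `links` maps a URL to the URLs it links to (hypothetical stand-in for
    fetching pages). Returns the set of URLs that would be gathered.
    """
    home_host = urlparse(home_url).netloc
    # Each queue entry: (url, hops spent on the home host, hops spent off it)
    queue = deque([(home_url, 0, 0)])
    seen = {home_url}
    while queue:
        url, on_hops, off_hops = queue.popleft()
        for target in links.get(url, []):
            if target in seen:
                continue  # first visit wins; good enough for a sketch
            # Once we have left the home host, every further hop is
            # charged to the off-host budget, even if it links back.
            off_host = off_hops > 0 or urlparse(target).netloc != home_host
            if off_host:
                state = (target, on_hops, off_hops + 1)
                allowed = off_hops + 1 <= off_host_depth
            else:
                state = (target, on_hops + 1, off_hops)
                allowed = on_hops + 1 <= on_host_depth
            if allowed:
                seen.add(target)
                queue.append(state)
    return seen


if __name__ == "__main__":
    # Hypothetical meta-news site: front page -> blurb -> real article.
    example_links = {
        "http://meta.example/": ["http://meta.example/story1"],
        "http://meta.example/story1": ["http://news.example/article"],
        "http://news.example/article": ["http://news.example/other"],
    }
    for page in sorted(crawl_two_depths("http://meta.example/",
                                        example_links, 1, 1)):
        print(page)
```

With an on-host depth of 1 and an off-host depth of 1, the walk picks
up the blurb and the article it points to, but does not keep following
links on the article's own site.  Off-host links on the front page
itself would still be followed one hop, so this limits the junk rather
than eliminating it.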

From what I'm hearing, it sounds like a custom spider/parser and/or
modification to plucker is the only viable solution?

Thanks,

-- 
Greg Copeland <[EMAIL PROTECTED]>

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
