Re: best storage method for change detection

Travis Leleu Tue, 27 Jan 2015 18:41:23 -0800

Are you planning on storing just the diffs, or full versions?  If you're
just storing the diffs, I'd use something flexible and queryable.  I like
JSON, but with flat files it's sometimes hard to get at the section you need.
Therefore, I use Mongo -- schemaless, easy to set up.  I think it's a great
data storage layer despite its other flaws.
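
To make "flexible and queryable" a bit more concrete, here is a rough sketch
of what one stored change record might look like.  The field names (url,
crawled_at, content_hash, diff) and the make_change_record helper are
hypothetical, not from any library; a dict like this could be written as one
line of a JSON file or stored as a document in a schemaless store.

```python
import json
from datetime import datetime, timezone


def make_change_record(url, content_hash, diff_lines):
    """Build one change record (hypothetical field names, for illustration)."""
    return {
        "url": url,
        # UTC timestamp of the crawl, as an ISO 8601 string
        "crawled_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": content_hash,
        # the diff against the previous crawl, one string per line
        "diff": diff_lines,
    }


record = make_change_record(
    "http://example.com/page",
    "abc123",
    ["-old line", "+new line"],
)
# Serialize to a single JSON line, e.g. for appending to a flat file.
json_line = json.dumps(record)
```

Keeping the URL and timestamp as top-level fields is what makes it easy to
query "all changes for this page, newest first" later on.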

Elasticsearch isn't great as a primary data store.  I usually couple it (via
streams or other connectors) with a primary store (usually MySQL, sometimes
Mongo), and set up a "river" (I think that's what Elastic calls it) from
the primary to ES.  I query structured records on the primary, and run
searches on the ES instance.

If all you're trying to do is detect whether a page has changed (rather than
computing the diff), and space is at a premium, you could just hash the
HTML (or parts of the HTML -- I recommend identifying the areas you
want to track changes in, and hashing those).
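
A minimal sketch of the hashing idea, using only the standard library.  It
assumes you have already extracted the fragment you care about as a string
(for example with a selector in your spider); the function names and the
whitespace normalization step are my own choices, not anything Scrapy
provides.

```python
import hashlib
import re


def fingerprint(html_fragment):
    """Return a SHA-256 hex digest of the fragment.

    Runs of whitespace are collapsed first so that pure formatting
    changes (reindentation, trailing newlines) do not count as changes.
    """
    normalized = re.sub(r"\s+", " ", html_fragment).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_digest, html_fragment):
    """Compare a stored digest against the freshly crawled fragment."""
    return previous_digest != fingerprint(html_fragment)


# Store only the digest from the previous crawl...
stored = fingerprint("<div id='price'>10 USD</div>")
# ...and compare it against what the next crawl returns.
changed = has_changed(stored, "<div id='price'>12 USD</div>")
```

With this approach you only keep one 64-character digest per page (or per
tracked section) between crawls, at the cost of not knowing what changed.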

Finally, if you are trying to point out the sections where the page
changed, I'd use a prebuilt Python diff library rather than rolling your
own.  I don't have any advice on which one to use.
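
One prebuilt option is difflib, which ships with Python's standard library.
This is only a sketch of how it could be used here, not a recommendation
over other libraries; the changed_lines helper and the "previous"/"current"
labels are made up for the example.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return a unified diff between two crawls as a list of lines.

    The list is empty when the two texts are identical.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",  # inputs have no trailing newlines, so add none
    )
    return list(diff)


old_page = "<h1>Title</h1>\n<p>Price: 10</p>\n<p>Footer</p>"
new_page = "<h1>Title</h1>\n<p>Price: 12</p>\n<p>Footer</p>"
diff = changed_lines(old_page, new_page)
```

Lines starting with "-" were removed and lines starting with "+" were added,
so the output can be stored as-is or filtered down to just the changed lines.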

On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:

> Hi,
>
> I would like to crawl a particular set of websites every hour to detect
> content changes, but I'm not sure what storage method would be best for my
> use case.  I could potentially store crawl results in JSON or CSV files,
> use MongoDB, or some other solution like Elasticsearch (if it supports
> historical records).  But I'm not sure which path is the best option.
> Is anyone currently storing and keeping a historical record of crawled
> content?  If so, what strategy are you using?
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>

