Are you planning on storing just the diffs, or full versions? If you're only storing the diffs, I'd use something flexible and queryable. I like JSON, but with flat files it is sometimes hard to get at the section you need. Therefore, I use Mongo -- schemaless, easy to set up. I think it's a great data storage layer despite its other flaws.
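To make the "store just the diffs" option concrete, here is a rough sketch of what one stored record could look like. It is only an illustration under assumptions: it uses Python's standard-library `difflib` (picked for convenience, not a recommendation over other diff libraries), and the function name `make_diff_record` and the field names are hypothetical. The resulting dict is plain JSON, so it could go into a flat file or a schemaless store.

```python
import difflib
import json
from datetime import datetime, timezone


def make_diff_record(url, old_text, new_text, fetched_at=None):
    """Build a JSON-serializable record of how a page changed.

    Hypothetical helper: the name and the field layout are assumptions
    made for illustration, not an established schema.
    """
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)

    # unified_diff yields nothing at all when the two inputs are identical.
    diff_lines = list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
    return {
        "url": url,
        "fetched_at": fetched_at.isoformat(),
        "changed": bool(diff_lines),
        "diff": diff_lines,
    }


if __name__ == "__main__":
    record = make_diff_record(
        "http://example.com/",
        "<p>old price</p>\n<p>footer</p>",
        "<p>new price</p>\n<p>footer</p>",
    )
    print(json.dumps(record, indent=2))
```

Keeping the diff as a list of lines keeps the record readable and still queryable by `url` and `fetched_at`; whether that is enough depends on how you plan to query it later.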
Elastic isn't great as a primary data store. I usually couple it (via streams or other connectors) with a primary store (usually MySQL, sometimes Mongo), and set up a "river" (I think that's what Elastic calls it) from the primary to ES. I query structured records on the primary, and search on the ES instance.
If all you're trying to do is detect whether a page has changed (rather than computing the diff), and space is at a premium, you could just hash the HTML (or parts of the HTML -- I recommend identifying the areas you want to follow changes on, and hashing those).
Finally, if you are trying to point out the sections where the page changed, I'd use a prebuilt Python diff library rather than rolling your own. I don't have any advice on which one to use.

On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:
> Hi,
>
> I would like to crawl a particular set of websites every hour to detect
> content changes, but i'm not sure what storage method would be best for my
> use case. I could potentially store crawl results in json or csv files,
> use mongodb, or some other solution like elasticsearch (if it supports
> historical records). But I'm not sure which pathway is the best option.
> Is anyone currently storing and keeping a historical record of crawled
> content? If so, what strategy are you using?
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
