I am getting ready to start my first Scrapy project. I've done the dmoz tutorial and looked through the docs. My project is to obtain the data, run a large number of search-and-replace operations plus any other needed cleanup, and save the result to Postgres. I was wondering what the 'best practices' are for putting together the pipeline.
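To make that concrete, here is roughly the kind of cleanup pipeline component I have in mind. This is just a sketch, not working project code; the field name ('body') and the substitution patterns are placeholders I made up.

import re

# Sketch of the cleanup stage: run a list of regex substitutions over one
# field of each scraped item. Field name and patterns are placeholders.
CLEANUP_PATTERNS = [
    (re.compile(r'\s+'), ' '),      # collapse runs of whitespace
    (re.compile(r'&nbsp;'), ' '),   # strip a leftover HTML entity
]

class CleanupPipeline(object):
    def process_item(self, item, spider):
        text = item.get('body', '')
        for pattern, replacement in CLEANUP_PATTERNS:
            text = pattern.sub(replacement, text)
        item['body'] = text
        return item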
If I am understanding things correctly, most pipelines write the results of the scrape to plain text files for processing, regardless of the specific tools used in that processing, and then bulk-load the finished product into the database.

However, I am seriously considering going straight to the database with DjangoItem, so that I can calculate the urls and then incorporate that data into my search and replace. I suspect that trying to do all of this text processing in the db is a bad idea, but I don't know that for sure. Maybe it makes no difference?

Another option might be to tee the scrape into both the db and text files (I've sketched roughly what I mean below). That way I could still use the db to calculate the urls, even if that is all I do with those results. Then I could process the text files and INSERT/UPDATE the final result back into Postgres, overwriting the original raw scrape content. But then I wondered about keeping track of all the changes. Has anyone used Git in a situation like this?

Thanks for sharing.
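P.S. To make the 'tee' idea concrete, here is a rough sketch of what I am picturing: one pipeline that writes each item to a JSON-lines file and also inserts the raw row into Postgres in the same pass. The connection details, the table name ('raw_items'), and the item fields ('url', 'body') are placeholders, not a real schema.

import json
import psycopg2

class TeePipeline(object):
    def open_spider(self, spider):
        # One flat file and one db connection per crawl; details are made up.
        self.file = open('items.jl', 'w')
        self.conn = psycopg2.connect(dbname='scrape', user='scrapy')
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.file.close()
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Tee: the same item goes to the text file and to the raw table.
        self.file.write(json.dumps(dict(item)) + '\n')
        self.cur.execute(
            "INSERT INTO raw_items (url, body) VALUES (%s, %s)",
            (item.get('url'), item.get('body')),
        )
        return item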
