Hi Malik, I speak only for myself, but I always thought the emphasis on CSV item exporting was there to lower the barrier to entry for getting a scrape up and running. If you can export to CSV, you can open the file in Excel, and anybody reading the Scrapy tutorial can do that.
I'm not 100% clear on your objectives, but I generally do string manipulation in my scrapers, using the db to handle deduplication. Then I write processing scripts from there to normalize, fuzzy dedupe, etc. It sounds a little like you're overthinking it -- I'd recommend just letting it fly: grab the data you want in the scraper, and save it as a DjangoItem. You can, and will, rescrape later.

On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:

> I am getting ready to work on my first scrapy project. I've done the dmoz
> tutorial and looked at the docs. My project is to obtain the data, do a
> large number of search and replaces, any other needed clean up, and save it
> to Postgres. I was wondering what 'best practices' are for putting together
> the pipeline?
>
> If I am understanding things correctly, most pipelines write the results
> of the scrape into plain text files for processing, regardless of the
> specific tools to be used in that processing, and then bulk upload the
> finished product to the database.
>
> However, I am seriously considering going straight to the database with
> DjangoItem, so that I can calculate the urls and then incorporate that data
> in my search and replace. But I suspect trying to do all this text
> processing in the db is a bad idea, but I don't know that for sure. Maybe
> it makes no difference?
>
> Another option might be to Tee the scrape into both the db and text files.
> This way I can still use the db to calculate the urls, even if that is all
> I do with those results. Then I could process the text files and
> INSERT/UPDATE the final result back into Postgres, overwriting the original
> raw scrape content. But then I wondered about keeping track of all the
> changes. Has anyone used Git in a situation like this?
>
> Thanks for sharing.
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
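
P.S. For what it's worth, the approach in the reply above (do the string cleanup in the scraper, let the db handle deduplication) could look roughly like the sketch below. This is only an illustration under some assumptions: it uses sqlite3 from the standard library as a stand-in for Postgres and a plain dict as a stand-in for a DjangoItem, and the names clean_text, CleanAndStorePipeline, and the pages table with its url/title/body columns are all made up for the example. The class only follows the shape of a Scrapy item pipeline (open_spider, process_item, close_spider); it does not import Scrapy.

```python
import re
import sqlite3


def clean_text(value):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


class CleanAndStorePipeline:
    """Hypothetical pipeline, shaped like a Scrapy item pipeline.

    sqlite3 stands in for Postgres here; the table and field names
    are assumptions made for this sketch.
    """

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        # The primary key on url is what lets the database deduplicate.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT PRIMARY KEY, title TEXT, body TEXT)"
        )

    def process_item(self, item, spider):
        # String manipulation happens in the scraper, before storage.
        row = (
            item["url"].strip(),
            clean_text(item["title"]),
            clean_text(item["body"]),
        )
        # A repeated url is silently skipped (SQLite syntax). In Postgres
        # 9.5+, INSERT ... ON CONFLICT DO NOTHING plays a similar role.
        self.conn.execute(
            "INSERT OR IGNORE INTO pages (url, title, body) VALUES (?, ?, ?)",
            row,
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Normalizing and fuzzy deduping can then run as separate scripts against the stored rows, and a later rescrape just goes through the same pipeline again.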
