Well, I have been accused of overthinking things before, so you might be onto something. All right, I will let fly.
On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>
> Hi Malik,
>
> I speak only for myself, but I always thought the emphasis on csv item
> exporting was to eliminate barriers to entry to get a scrape up and
> running. If you can export to csv, you can open it in Excel, and anybody
> reading the scrapy tutorial can do that.
>
> I'm not 100% clear on your objectives, but I generally do string
> manipulation in my scrapers, using the db to handle deduplication. Then I
> write processing scripts from there to normalize, fuzzy dedupe, etc.
>
> It sounds a little like you're overthinking it -- I'd recommend just
> letting it fly: grab the data you want in the scraper and save it as a
> DjangoItem. You can, and will, rescrape later.
>
> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:
>
>> I am getting ready to work on my first scrapy project. I've done the dmoz
>> tutorial and looked at the docs. My project is to obtain the data, do a
>> large number of search-and-replaces and any other needed clean-up, and
>> save it to Postgres. I was wondering what 'best practices' are for
>> putting together the pipeline?
>>
>> If I am understanding things correctly, most pipelines write the results
>> of the scrape into plain text files for processing, regardless of the
>> specific tools to be used in that processing, and then bulk upload the
>> finished product to the database.
>>
>> However, I am seriously considering going straight to the database with
>> DjangoItem, so that I can calculate the urls and then incorporate that
>> data in my search and replace. I suspect trying to do all this text
>> processing in the db is a bad idea, but I don't know that for sure.
>> Maybe it makes no difference?
>>
>> Another option might be to tee the scrape into both the db and text
>> files. This way I can still use the db to calculate the urls, even if
>> that is all I do with those results. Then I could process the text files
>> and INSERT/UPDATE the final result back into Postgres, overwriting the
>> original raw scrape content. But then I wondered about keeping track of
>> all the changes. Has anyone used Git in a situation like this?
>>
>> Thanks for sharing.
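P.S. To put some flesh on "letting it fly", here is roughly the pipeline I have in mind. It's just an untested sketch and the names are made up: it assumes a Django app called myapp with a ScrapedPage model (url and body fields), the scrapy-djangoitem package (DjangoItem moved out of Scrapy core into that package), and a stand-in REPLACEMENTS list in place of my real search-and-replace table. The spider yields DjangoItems, the pipeline does the string clean-up in Python, and the Django ORM writes the finished row to Postgres:

# items.py -- wrap the Django model in a DjangoItem
from scrapy_djangoitem import DjangoItem
from myapp.models import ScrapedPage  # hypothetical app/model names

class ScrapedPageItem(DjangoItem):
    django_model = ScrapedPage


# pipelines.py -- clean the text in Python, then save through the ORM
class CleanAndSavePipeline(object):

    # stand-in for the real (much longer) search-and-replace list
    REPLACEMENTS = [
        ('&nbsp;', ' '),
        ('\r\n', '\n'),
    ]

    def process_item(self, item, spider):
        text = item.get('body', '')
        for old, new in self.REPLACEMENTS:
            text = text.replace(old, new)
        item['body'] = text
        item.save()  # DjangoItem.save() writes the row to Postgres via the ORM
        return item

Plus ITEM_PIPELINES in settings.py pointing at CleanAndSavePipeline, and the usual DJANGO_SETTINGS_MODULE / django.setup() business so the spider process can import the model. If I do end up wanting the tee, I figure a second pipeline class could dump the same items to a plain text file with scrapy's built-in exporters before this one saves them.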
