I am getting ready to start my first scrapy project. I've done the dmoz 
tutorial and looked through the docs. My project is to obtain the data, do 
a large number of search-and-replace operations and any other needed 
cleanup, and save the result to Postgres. I was wondering what the 'best 
practices' are for putting together the pipeline.


If I am understanding things correctly, most pipelines write the results of 
the scrape to plain text files for processing, regardless of the specific 
tools used in that processing, and then bulk-load the finished product into 
the database.
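
Just so I'm sure I'm picturing that right, is the usual shape of such a 
pipeline roughly the following? This is only a rough sketch; 
JsonLinesPipeline and the items.jl filename are names I made up, and I 
realize scrapy's built-in feed exports may already cover this.

    import json

    class JsonLinesPipeline(object):

        def open_spider(self, spider):
            # One JSON object per line keeps the file easy to process
            # with ordinary text tools before the bulk load.
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + '\n')
            return item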


However, I am seriously considering going straight to the database with 
DjangoItem, so that I can calculate the URLs and then incorporate that data 
into my search and replace. I suspect trying to do all this text processing 
in the db is a bad idea, but I don't know that for sure. Maybe it makes no 
difference?
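
For the straight-to-the-database option, this is roughly what I have in 
mind, sketched here with plain psycopg2 rather than DjangoItem just to show 
the shape of it (the raw_pages table and the url/body fields are invented):

    import psycopg2

    class PostgresPipeline(object):

        def open_spider(self, spider):
            self.conn = psycopg2.connect(dbname='scrape', user='scrapy')
            self.cur = self.conn.cursor()

        def close_spider(self, spider):
            self.conn.commit()
            self.cur.close()
            self.conn.close()

        def process_item(self, item, spider):
            # Raw content goes in as-is; the search-and-replace passes
            # would then run either in SQL or in a later Python step.
            self.cur.execute(
                "INSERT INTO raw_pages (url, body) VALUES (%s, %s)",
                (item['url'], item['body']))
            return item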


Another option might be to tee the scrape into both the db and text files. 
That way I can still use the db to calculate the URLs, even if that is all 
I do with those results. Then I could process the text files and 
INSERT/UPDATE the final result back into Postgres, overwriting the original 
raw scrape content. But then I wondered about keeping track of all the 
changes. Has anyone used Git in a situation like this?
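
For the tee itself, I assume the normal way is simply to enable both 
pipelines in settings.py and let scrapy pass each item through them in 
order, something like this (the myproject.pipelines module path is a 
placeholder):

    # settings.py
    ITEM_PIPELINES = {
        'myproject.pipelines.JsonLinesPipeline': 300,
        'myproject.pipelines.PostgresPipeline': 400,
    }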


Thanks for sharing.
