Well, I have been accused of overthinking things before, so you might be onto something. All right, I will let fly.
On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>
> Hi Malik,
>
> I speak only for myself, but I always thought the emphasis on csv item
> exporting was to eliminate barriers to entry to get a scrape up and
> running. If you can export to csv, you can open it in Excel, and anybody
> reading the scrapy tutorial can do that.
>
> I'm not 100% clear on your objectives, but I generally do string
> manipulation in my scrapers, using the db to handle deduplication. Then I
> write processing scripts from there to normalize, fuzzy dedupe, etc.
>
> It sounds a little like you're overthinking it -- I'd recommend just
> letting it fly: grab the data you want in the scraper and save it as a
> DjangoItem. You can, and will, rescrape later.
>
> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:
>
>> I am getting ready to work on my first scrapy project. I've done the dmoz
>> tutorial and looked at the docs. My project is to obtain the data, do a
>> large number of search-and-replaces and any other needed clean-up, and
>> save it to Postgres. I was wondering what 'best practices' are for
>> putting together the pipeline?
>>
>> If I am understanding things correctly, most pipelines write the results
>> of the scrape into plain text files for processing, regardless of the
>> specific tools to be used in that processing, and then bulk upload the
>> finished product to the database.
>>
>> However, I am seriously considering going straight to the database with
>> DjangoItem, so that I can calculate the urls and then incorporate that
>> data in my search and replace. I suspect trying to do all this text
>> processing in the db is a bad idea, but I don't know that for sure.
>> Maybe it makes no difference?
>>
>> Another option might be to tee the scrape into both the db and text
>> files. This way I can still use the db to calculate the urls, even if
>> that is all I do with those results. Then I could process the text files
>> and INSERT/UPDATE the final result back into Postgres, overwriting the
>> original raw scrape content. But then I wondered about keeping track of
>> all the changes. Has anyone used Git in a situation like this?
>>
>> Thanks for sharing.
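P.S. To put some flesh on "letting it fly", here is roughly the pipeline I have in mind. It's just an untested sketch and the names are made up: it assumes a Django app called myapp with a ScrapedPage model (url and body fields), the scrapy-djangoitem package (DjangoItem moved out of Scrapy core into that package), and a stand-in REPLACEMENTS list in place of my real search-and-replace table. The spider yields DjangoItems, the pipeline does the string clean-up in Python, and the Django ORM writes the finished row to Postgres:

# items.py -- wrap the Django model in a DjangoItem
from scrapy_djangoitem import DjangoItem
from myapp.models import ScrapedPage  # hypothetical app/model names

class ScrapedPageItem(DjangoItem):
    django_model = ScrapedPage


# pipelines.py -- clean the text in Python, then save through the ORM
class CleanAndSavePipeline(object):

    # stand-in for the real (much longer) search-and-replace list
    REPLACEMENTS = [
        ('&nbsp;', ' '),
        ('\r\n', '\n'),
    ]

    def process_item(self, item, spider):
        text = item.get('body', '')
        for old, new in self.REPLACEMENTS:
            text = text.replace(old, new)
        item['body'] = text
        item.save()  # DjangoItem.save() writes the row to Postgres via the ORM
        return item

Plus ITEM_PIPELINES in settings.py pointing at CleanAndSavePipeline, and the usual DJANGO_SETTINGS_MODULE / django.setup() business so the spider process can import the model. If I do end up wanting the tee, I figure a second pipeline class could dump the same items to a plain text file with scrapy's built-in exporters before this one saves them.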
