Hi Malik,

I speak only for myself, but I always thought the emphasis on CSV item
exporting was to eliminate barriers to entry for getting a scrape up and
running.  If you can export to CSV, you can open it in Excel, and anybody
reading the scrapy tutorial can do that.
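
For reference, the built-in feed export does that straight from the
command line, e.g. with the tutorial's dmoz spider (the output format is
picked up from the file extension):

    scrapy crawl dmoz -o items.csv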

I'm not 100% clear on your objectives, but I generally do string
manipulation in my scrapers, using the db to handle deduplication.  Then I
write processing scripts from there to normalize, fuzzy dedupe, etc.
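
To make that concrete, here's a rough sketch (the field names 'title' and
'url' are placeholders, and in my projects the dedup really happens via a
unique constraint in the db rather than an in-memory set):

    # pipelines.py -- illustrative sketch only
    from scrapy.exceptions import DropItem

    class CleanupPipeline(object):
        def process_item(self, item, spider):
            # string manipulation at scrape time
            item['title'] = ' '.join(item['title'].split())  # collapse whitespace
            return item

    class DedupePipeline(object):
        def __init__(self):
            self.seen = set()

        def process_item(self, item, spider):
            # stand-in for the db's unique constraint
            if item['url'] in self.seen:
                raise DropItem('duplicate item: %s' % item['url'])
            self.seen.add(item['url'])
            return item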

It sounds a little like you're overthinking it -- I'd recommend just
letting it fly: grab the data you want in the scraper and save it as a
DjangoItem.  You can, and will, rescrape later.
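
A minimal sketch of that last step, assuming the scrapy_djangoitem package
(DjangoItem was split out of scrapy core) and a made-up Django model called
Article:

    # items.py -- Article stands in for whatever Django model you define
    from scrapy_djangoitem import DjangoItem
    from myapp.models import Article

    class ArticleItem(DjangoItem):
        django_model = Article

    # pipelines.py
    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            item.save()  # writes straight into the Django-managed Postgres table
            return item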


On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:

> I am getting ready to work on my first scrapy project. I've done the dmoz
> tutorial and looked at the docs. My project is to obtain the data, do a
> large number of search-and-replaces and any other needed cleanup, and save it
> to Postgres. I was wondering what 'best practices' are for putting together
> the pipeline?
>
>
> If I am understanding things correctly, most pipelines write the results
> of the scrape into plain text files for processing, regardless of the
> specific tools to be used in that processing, and then bulk upload the
> finished product to the database.
>
>
> However, I am seriously considering going straight to the database with
> DjangoItem, so that I can calculate the urls and then incorporate that data
> in my search and replace. I suspect trying to do all this text processing
> in the db is a bad idea, but I don't know that for sure. Maybe it makes no
> difference?
>
>
> Another option might be to Tee the scrape into both the db and text files.
> This way I can still use the db to calculate the urls, even if that is all
> I do with those results. Then I could process the text files and
> INSERT/UPDATE the final result back into Postgres, overwriting the original
> raw scrape content. But then I wondered about keeping track of all the
> changes. Has anyone used Git in a situation like this?
>
>
> Thanks for sharing.
>
