I work for an automotive parts manufacturer. Part of my job is writing
spiders to crawl our suppliers' sites. It is true that each site will need
a custom scraper. Theoretically it is possible to write a general-purpose
crawler, but it is not practical. You would need to do a lot of natural
language processing on each document, and even then you would likely end
up with garbage more often than not, since English is a difficult language
to comprehend from the computer's point of view. And to do this you would
really need to download each page and run the NLP (natural language
processing) offline.

Since blogs usually have an RSS feed, which uses a fairly standard format,
writing a general blog crawler is a bit easier. However, if you try to
read the content from the blog pages themselves, you'll have the same
issue as with product information, since no two blogs will be exactly
alike.
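
To give a feel for why the feed side is the easy part: RSS 2.0 has a fixed
channel/item structure, so a reader for it can be written once with just
the standard library. This is only a sketch (the field names are from the
RSS spec, the sample data is made up, and real feeds would be fetched over
HTTP first):

```python
# Minimal RSS 2.0 reader using only the standard library.
# Real-world use would download the feed XML first; here we take it as text.
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items
```

The same few lines work against almost any blog's feed, which is exactly
what you cannot say about the HTML of the blog pages themselves.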

My approach for our suppliers was to write generalized code for the
aspects common to every project. This reduces the amount of code needed
for each additional project and gives a basic API to work with for each
new one. Even then, we end up doing regular maintenance on several
projects because those suppliers constantly change their site layouts.
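
In rough terms, the shape of that shared code is a base class that owns
the common plumbing, with one small subclass per supplier holding only the
site-specific extraction. The names below are hypothetical, not from our
actual codebase, and the "extraction" is a toy stand-in for real selector
logic:

```python
# Sketch of the shared-base-class approach: common plumbing lives in the
# base class, and each supplier subclass only overrides the extraction.

class SupplierSpider:
    """Base class: fetching, retries, and output handling would live here."""
    start_url = None  # each subclass points at its supplier's catalog

    def parse_product(self, html):
        raise NotImplementedError("each supplier overrides this")

    def run(self, html):
        # A real run() would download start_url; here we take HTML directly.
        return self.parse_product(html)


class AcmeSpider(SupplierSpider):
    """Per-site subclass: only the extraction logic is custom."""
    start_url = "http://example.com/products"  # hypothetical URL

    def parse_product(self, html):
        # Toy extraction; real spiders would use XPath or CSS selectors.
        name = html.split("<h1>")[1].split("</h1>")[0]
        return {"name": name}
```

When a supplier changes their layout, only that supplier's parse_product
needs to be touched, which is what keeps the maintenance burden tolerable.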

To put it plainly, there is no silver bullet that will extract the desired
data from all sites. If you need good results, you'll need a custom crawler
for each site.

Hope this helps. If I can be of further help, let me know.


On Tue, Mar 11, 2014 at 12:54 PM, Joseph Piscal <[email protected]> wrote:

> Greetings Scrapy Enthusiasts,
>
> I have been searching for a reliable and experienced developer to develop
> a web crawler/data scraper for some time. I work for a marketing company
> and have a list of 200+ consumer product URLs (blogs, e-commerce stores,
> home shopping networks, & big sites like Amazon/Pinterest) that I would be
> interested in scraping. Information I desire is product image, product
> price, product description, link to purchase, etc. I would like the
> information presented to me in a web RSS feed type format (attached) for
> ease of sifting through products quickly. After the base program is built I
> would also be interested in making it more intelligent. Utilizing a keyword
> or weighting system to filter out products I don't want to see.
>
> I spoke to one developer who claimed scraping blogs would be easy since
> they mostly run on WordPress, or at least provide an RSS feed so monitoring
> would be simple. He also mentioned that it would be difficult to do the
> e-commerce sites because almost all of them utilize a different platform.
> So we would pretty much require custom code for every single site. Not sure
> how true this is but it made sense.
>
> The goal is for me to see new data daily. After a URL has been scraped the
> first time, I would assume all the products (data) would be stored in a
> database. So the next time that URL is scraped, it only collects the NEW
> products (data) for my viewing pleasure.
>
> I am looking for any and all advice or suggestions on this project. If you
> are interested please feel free to respond to this post or contact me
> directly:
>
> [email protected]
>
> Thank you in advance!
>
>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
If you ask me if it can be done, the answer is YES, it can always be done.
The correct questions, however, are: what will it cost, and how long will
it take?
