Hello,

I've just realised that some of you might find the Vagrant box from my 
book quite useful. It works on Windows, Mac and Linux and contains (see 
the diagram below) a few scrapyd servers, Elasticsearch, Redis, MySQL and 
Apache Spark. The download is ~2.6GB, and once you have it, it starts 
within seconds. Note that this box shouldn't be used in production: it 
deliberately skips several best practices in order to provide significant 
functionality while remaining easy to use and able to run on a typical 
laptop (the VM setup requires 2GB of RAM). 

Here's how to use it:

1. Download and add it (this assumes you already have VirtualBox and 
Vagrant installed):

$ wget http://scrapybook.com/scrapybook.box
$ vagrant box add scrapybook scrapybook.box

2. Get rid of the book-specific stuff (run these inside the book's code 
directory):

$ mv Vagrantfile.dockerhost.boxed Vagrantfile.dockerhost
$ ls -A1 | egrep -v "(Vagrant|insecure_key)" | xargs rm -r

That's it! At the end of this process you should be left with just these 
three files:
 
$ ls
Vagrantfile   Vagrantfile.dockerhost  insecure_key

3. You can start the system by doing:

$ vagrant up --no-parallel

It takes just a few seconds. This is the system the command sets up (click 
the link below for a larger version):

<https://lh3.googleusercontent.com/-JE-esOSQnsU/Vv_qDiWW9eI/AAAAAAAABsg/NnA_vICFn_w0GpS4cT5nQD696GRdlXLjg/s1600/generic-system.png>

It gives you 8 independent servers, each running different software. That's 
much more realistic than the typical dev environment that has everything in 
a single box. You can connect to the dev machine:

$ vagrant ssh
$ cd book

Whatever you do in the book directory is reflected in a directory on your 
host, and vice versa. This means you can edit your files with your native 
IDE, even if you are on Windows, and still run/test your code in Vagrant's 
Linux box. You might find that running Scrapy code in Ubuntu 14.04 is 
smoother and faster than developing in your host environment, despite the 
fact that it's running in a VM. From the dev machine you have all the 
tools you need to use MySQL, Redis and Elasticsearch:

$ mysql -h mysql -uroot -ppass
$ redis-cli -h redis
$ curl http://es:9200
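
If you'd rather talk to these services from Python, here's a minimal 
sketch. It assumes the pymysql, redis and requests packages are available 
in the dev box (pip install them if they aren't); the hostnames and 
credentials are the same as in the commands above:

import pymysql
import redis
import requests

# MySQL: same host/credentials as the mysql command above
db = pymysql.connect(host="mysql", user="root", password="pass")
with db.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())

# Redis: a plain key/value round-trip
r = redis.StrictRedis(host="redis")
r.set("hello", "world")
print(r.get("hello"))

# Elasticsearch: the same info endpoint the curl command hits
print(requests.get("http://es:9200").json())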

You can also access them from your host machine, e.g. by opening 
http://localhost:9200 in your web browser. You can also install whatever 
you like with the usual sudo apt-get etc. You can access the Spark server 
directly. As an example of what you can do with it, open another terminal 
(command prompt) and connect to the Spark server: 

$ vagrant ssh spark

Then you can type in a minimal Spark Streaming application that monitors 
the special directory /root/items, which I've set up to be written to by 
an FTP service running on the same server:

$ pyspark
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)   # sc is the shell's SparkContext; 5-second batches
>>> ssc.textFileStream("file:///root/items").pprint()   # print the lines of any new files
>>> ssc.start()   # from now on, new items get printed every 5 seconds

This is an easy way to connect your dev and scrapyd machines without using 
infrastructure that requires tons of CPU and RAM. In production it's 
trivial to replace this functionality with e.g. S3 or Kafka. If you go 
back to your dev terminal, you can use it with a trivial Scrapy application:

$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
$ echo '        return {"foo": "bar"}' >> tutorial/spiders/example.py
$ scrapy crawl example -o ftp://anonymous@spark/foobar.$RANDOM.jl
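
For reference, after the echo the spider file should look roughly like 
this (assuming the default genspider template; the exact boilerplate 
varies a little between Scrapy versions):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        pass
        # the line appended by the echo above; Scrapy treats the dict as an item
        return {"foo": "bar"}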

When you run the scrapy crawl command, you should see a {"foo": "bar"} 
item printed on Spark's side.
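
If you want Spark to do a bit more than echo the raw lines, here's a small 
variation you could type into a fresh pyspark session (transformations 
have to be wired up before ssc.start(), so restart the shell first). It 
parses each JSON line and counts the items in every batch:

>>> import json
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)
>>> lines = ssc.textFileStream("file:///root/items")
>>> items = lines.map(json.loads)   # each line of a .jl feed is one JSON item
>>> items.count().pprint()          # how many items arrived in each 5s batch
>>> items.pprint()                  # and the parsed items themselves
>>> ssc.start()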

I hope you find this Vagrant box useful.

Cheers,
Dimitris
