Hi, Scrapy community!
Relatively new to Scrapy, I am confused by its behavior. As stated in the documentation ( http://doc.scrapy.org/en/1.0/faq.html#does-scrapy-crawl-in-breadth-first-or-depth-first-order ), Scrapy is meant to crawl in depth-first order by default. But, as far as I understand, this is not how it actually behaves.

For a given scraping project, I need Scrapy to crawl URLs in depth-first order, and I observed that this does not happen as expected. To be sure, or at least to illustrate the behavior I observed, I created a small scraping project here:
https://github.com/vincent-ferotin/scraping-github

This project crawls GitHub and the trees of some given projects, and records the order in which requests and responses are processed. For details, please refer to its README (directly readable at the project's URL above).

I illustrate the results as images, for both request and response orders, with two configurations (the default one, said to give "depth-first" order, and another one for "breadth-first").

For "depth-first" order, the request order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-requests-depth_priority_0.png
and the response order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-responses-depth_priority_0.png

For "breadth-first" order, the request order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-requests-depth_priority_1.png
and the response order is:
https://raw.githubusercontent.com/vincent-ferotin/scraping-github/master/tree/github-tree-responses-depth_priority_1.png

What I understand (please correct me if needed) is that crawling is done in *breadth-first* order in both configurations. What changes is that with the "breadth-first" configuration, the crawl respects the left-to-right order of the graph being crawled, whereas with the "depth-first" configuration, left-to-right order is not respected (which I do not understand either). Please verify this yourself by running the project's code.
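To make sure we use the same terms, here is a minimal sketch of what I mean by the two orders. It is independent of Scrapy and of my project: the tree `TREE` and the functions `depth_first` and `breadth_first` are made-up names for illustration only, and they say nothing about how Scrapy's scheduler is actually implemented.

```python
from collections import deque

# Hypothetical toy tree (parent -> children, listed left to right).
# This is NOT the GitHub tree crawled by the project.
TREE = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def depth_first(tree, start):
    """Visit order when pending nodes are kept in a LIFO stack."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the leftmost child is popped first.
        stack.extend(reversed(tree[node]))
    return order


def breadth_first(tree, start):
    """Visit order when pending nodes are kept in a FIFO queue."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree[node])
    return order


print(depth_first(TREE, "root"))    # ['root', 'a', 'a1', 'a2', 'b', 'b1']
print(breadth_first(TREE, "root"))  # ['root', 'a', 'b', 'a1', 'a2', 'b1']
```

On this toy tree, a depth-first crawl finishes the whole `a` branch before touching `b`, whereas a breadth-first crawl visits `a` and `b` before any of their children. The first pattern is the one I expected from the default configuration.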
(The project pretty-prints the request and response orders through logging at the end of the crawl.)

So, my very first question would be: am I right? (Or: where do I misunderstand?) If so, should the documentation regarding "depth-first" vs. "breadth-first" order be rewritten? And is there a way to obtain a true depth-first crawl order?

Thanks,
--
Vincent
