I am working on a fairly large research project, so I am in the process
of retrieving the most recent 200 tweets for each of 400,000 users. This
didn't seem like a problem, because individual queries took about 1
second to return. Split across 5 machines, that is 80,000 requests per
machine, which at 1 request per second should take about 22.2 hours.

After 24 hours, I have retrieved only 25,000 users. Of course, I
realize there is variance in my 1-user-per-second estimate, but this
seems quite slow: I am only retrieving between 10 and 80 users per
minute. I was expecting to be blocked by rate limiting every hour, but
I am nowhere even close to hitting the 20,000-requests/hour whitelist
limit.

Might it be better to parallelize this process, map/reduce-style, so
that several requests are in flight simultaneously? Or does the Twitter
API block additional HTTP requests while it is still serving the first
one?
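To be concrete, something like the following is what I had in mind (a
rough Python sketch, not my actual fetch code; the endpoint, parameters,
and worker count are just placeholders):

    # Sketch of "several requests simultaneously": a thread pool issuing
    # user_timeline requests concurrently instead of one at a time.
    # Error handling and paging are omitted.
    import concurrent.futures
    import requests

    TIMELINE_URL = "http://api.twitter.com/1/statuses/user_timeline.json"

    def fetch_timeline(screen_name):
        # One blocking HTTP request per user, same as my current loop does.
        resp = requests.get(TIMELINE_URL,
                            params={"screen_name": screen_name, "count": 200},
                            timeout=10)
        return screen_name, resp.status_code

    def fetch_all(screen_names, workers=10):
        # While one request is waiting on the network, the other workers
        # keep going, so wall-clock time is bounded by the slowest requests
        # rather than the sum of all of them.
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fetch_timeline, screen_names))

In other words, would ten workers like this actually keep ten requests
in flight, or does something on Twitter's end serialize them per client?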
