[twitter-dev] Re: how are people collecting spritzer/gardenhose?
I'm just using a realtime JSON parser in Ruby written as a native C extension (http://github.com/brianmario/yajl-ruby/tree/master). It's really simple to use and well documented. I'm storing everything in a Postgres database, then using other scripts to query it.

Note: with gardenhose at least, you get a LOT of data fast. In just a few days I have a 4GB+ database.

On Jun 11, 5:10 pm, "M. Edward (Ed) Borasky" wrote:
> Right now, I'm collecting spritzer data with a simple shell script
> "curl | bzip2 -c > .bz2". A cron job checks every minute and restarts
> the script if it crashes. The rest is simple ETL. :)
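The parse-a-line, insert-a-row loop described above can be sketched in Python; this is a hedged stand-in, not the poster's code: stdlib `json` replaces the yajl-ruby streaming parser, `sqlite3` replaces Postgres, and the `tweets(id, text)` schema is invented for illustration.

```python
import json
import sqlite3

def store_stream(lines, db_path=":memory:"):
    """Parse one JSON document per line and persist selected fields.

    A sketch of the parser + database pattern from the post above;
    the schema and field names here are illustrative assumptions.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER, text TEXT)")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # streams often emit blank keep-alive lines
        try:
            doc = json.loads(line)
        except ValueError:
            continue  # skip truncated documents (e.g. a cut-off final line)
        conn.execute("INSERT INTO tweets VALUES (?, ?)",
                     (doc.get("id"), doc.get("text")))
    conn.commit()
    return conn
```

Other scripts can then query the table directly, which matches the "store everything, query later" approach the post describes.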
[twitter-dev] Re: how are people collecting spritzer/gardenhose?
Right now, I'm collecting spritzer data with a simple shell script "curl | bzip2 -c > .bz2". A cron job checks every minute and restarts the script if it crashes. The rest is simple ETL. :)
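The cron watchdog described above (check every minute, restart the collector if it died) could look roughly like this in Python; a minimal sketch, assuming a PID-file convention, since the original is just cron plus a shell one-liner, and `pidfile`/`start_cmd` are names invented here.

```python
import os
import subprocess

def ensure_running(pidfile, start_cmd):
    """Restart start_cmd if the process recorded in pidfile is gone.

    A sketch of the cron watchdog from the post above; the PID-file
    mechanism is an illustrative assumption, not the original script.
    """
    try:
        pid = int(open(pidfile).read())
        os.kill(pid, 0)  # signal 0 only checks that the process exists
        return False     # collector still running, nothing to do
    except (FileNotFoundError, ValueError, ProcessLookupError):
        proc = subprocess.Popen(start_cmd)
        with open(pidfile, "w") as f:
            f.write(str(proc.pid))
        return True      # (re)started the collector
```

Run from cron every minute (`* * * * *`), this gives the same "restart on crash" behavior with at most a one-minute gap in the capture.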
[twitter-dev] Re: how are people collecting spritzer/gardenhose?
On Tue, May 26, 2009 at 10:38 AM, pplante wrote:
> You are essentially doing the same thing via some bash scripts and
> flatfiles. How are you parsing and indexing the data once it's
> collected?

python simplejson, custom tokenizer & other text analysis, then lots of Tokyo Cabinet/Tyrant.

-- Brendan O'Connor - http://anyall.org
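The parse/tokenize/index pipeline mentioned above could be sketched like this; hedged heavily: stdlib `json` stands in for simplejson, `TOKEN_RE` is a toy tokenizer (the poster's was a custom analyzer), and a plain dict stands in for the Tokyo Cabinet/Tyrant key-value store.

```python
import json
import re

# Toy tokenizer: URLs, then @mentions/#hashtags/plain words.
TOKEN_RE = re.compile(r"https?://\S+|[@#]?\w+")

def index_tweet(line, store):
    """Parse one json-per-line tweet, tokenize its text, and index it.

    A rough sketch of the pipeline in the post above; the tokenizer
    and the dict-backed inverted index are illustrative stand-ins.
    """
    doc = json.loads(line)
    tokens = [t.lower() for t in TOKEN_RE.findall(doc.get("text", ""))]
    for tok in tokens:
        store.setdefault(tok, []).append(doc["id"])  # token -> tweet ids
    return tokens
```

Swapping the dict for a Tokyo Tyrant client would give the same token-to-ids mapping persisted out of process.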
[twitter-dev] Re: how are people collecting spritzer/gardenhose?
I am using Python to implement a process which listens to the stream and places all incoming data onto a message queue. A few other worker processes in the background work off the queue and store the data. The message queue is not fault tolerant at this time; however, with a simple switch to an enterprise-grade MQ service that could be achieved.

You are essentially doing the same thing via some bash scripts and flatfiles. How are you parsing and indexing the data once it's collected?

On May 25, 5:02 pm, "Brendan O'Connor" wrote:
> spritzer is great! well done folks.
> I'm wondering how other people are collecting the data. I'm saving the
> json-per-line raw output to a flatfile, just using a restarting curl, then
> processing later.
>
> Something as simple as this seems to work for me:
>
> while true; do
>   date; echo "starting curl"
>   curl -s -u user:pass http://stream.twitter.com/spritzer.json >> tweets.$(date --iso)
>   sleep 1
> done |& tee curl.log
>
> ... and also, to force file rotation once in a while:
>
> while true; do
>   date; echo "forcing curl restart"
>   killall curl
>   sleep $((60*60*5))
> done |& tee kill.log
>
> anyone else?
>
> -Brendan
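The listener/queue/worker architecture described above can be sketched in a single process with Python's standard library; this is an assumption-laden miniature, using `queue.Queue` and threads in place of the separate worker processes and MQ service, and, like the original, it is not fault tolerant (a crash loses whatever is queued).

```python
import json
import queue
import threading

def run_pipeline(lines, num_workers=2):
    """Listener pushes raw lines onto a queue; workers parse and store.

    A minimal in-process sketch of the architecture from the post
    above; queue.Queue stands in for an external message queue.
    """
    q = queue.Queue()
    stored = []
    lock = threading.Lock()

    def worker():
        while True:
            line = q.get()
            if line is None:      # sentinel: shut this worker down
                q.task_done()
                return
            doc = json.loads(line)
            with lock:            # stored is shared across workers
                stored.append(doc)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for line in lines:            # the "listener" feeding the queue
        q.put(line)
    for _ in threads:             # one sentinel per worker
        q.put(None)
    for t in threads:
        t.join()
    return stored
```

The "simple switch" the post mentions would amount to replacing `q.put`/`q.get` with calls to a durable broker, leaving the worker logic unchanged.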