[twitter-dev] Re: illegal unicode character \uffff

2009-12-04 Thread braver
Mark, great to see you here!  Now I trust the platform is in the right
hands.  :)

Cheers,
Alexy


[twitter-dev] Re: illegal unicode character \uffff

2009-12-03 Thread braver
On Dec 1, 10:49 pm, John Kalucki jkalu...@gmail.com wrote:
 Perhaps someone from Platform could weigh in on this?

In [vulgar] Russian, I'd say it seems Platform retracted its tongue
into a [bodily cavity].  :)  Platform, hey! :)

Cheers,
Alexy


[twitter-dev] Re: illegal unicode character \uffff

2009-12-01 Thread braver
Gardenhose apparently returns illegal Unicode, as confirmed by
PostgreSQL and Perl's Encode, which is very trusted, high-mileage code.
We can certainly trap illegal-Unicode errors, but we need to know
whether you're aware of the problem, the rationale, and the plan of
action, if any. -- Alexy
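
For anyone who wants to trap this on the client side, here is a minimal
Java sketch, not a definitive fix. It assumes two separate checks are
needed: a strict UTF-8 decode that reports malformed bytes, and a scan
for Unicode noncharacters such as U+FFFF. As far as I can tell, the JDK
decoder accepts the well-formed bytes for U+FFFF, so the first check
alone would not catch it. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class UnicodeCheck {
    // Strictly decode bytes as UTF-8: malformed input raises an exception
    // instead of being silently replaced with U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code
    // points of every plane (those ending in FFFE or FFFF).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Index of the first noncharacter in s, or -1 if there is none.
    static int firstNoncharacter(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isNoncharacter(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A loader could call decodeStrict on the raw line, parse the JSON, and
then run firstNoncharacter on the text field to decide whether to skip,
log, or clean the tweet before it reaches the database.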

On Nov 21, 5:10 pm, braver delivera...@gmail.com wrote:
 I've tried loading the gardenhose via Perl's JSON, and it fails on
 quite a few Asian tweets with \u in them, e.g. the tweet id
 5277460813:

 {text:RT @RealLamarOdom \uIf you haven't heard it, go 
 towww.richsoilclothing.comand look under \updates\. Tell me what you
 think. It's hot!,...}

 Is it an artifact of downloading, or does Twitter serve illegal UTF-8?
 Here's an example of what Perl says about it, for another tweet:

 *** json ENCODING error: malformed or illegal unicode character in
 string [ Artest l], cannot convert to JSON at /home/alexyk/twitter/
 loader/jwilter.pl line 30,  line 44817003.

  {in_reply_to_screen_name:null,text:RT @TheLakersNation
 \uArtest looked great. Lamar dominated the boards. Kobe is Kobe.
 And most importantly, the Lakers take the WIN!,source:a href=
 \http://mobileways.de/gravity\; rel=\nofollow\Gravity/
 a,in_reply_to_user_id:null,in_reply_to_status_id:null,truncated:fal 
 se,geo:null,created_at:Mon
 Nov 02 05:55:49 + 2009,user:
 {profile_background_tile:false,profile_sidebar_border_color:BDDCAD,f 
 ollowing:null,statuses_count:
 243,followers_count:33,profile_image_url:http://a3.twimg.com/
 profile_images/406146987/Real_Force_normal.jpg,friends_count:
 93,description:My Love:Kobe Bryant,Los Angeles
 Lakers,NBA,Twitter,Music,Movie.I Love This Game.Determination:Let's
 again!,location:CN,geo_enabled:false,profile_background_color:9AE 
 4E8,screen_name:Real_Force,favourites_count:
 4,verified:false,notifications:null,profile_text_color:33,time 
 _zone:Beijing,protected:false,url:http://
 hi.baidu.com/real_force/,created_at:Wed Sep 09 12:41:22 +
 2009,profile_link_color:0084B4,name:Zhang
 Yuhao,profile_background_image_url:http://a1.twimg.com/
 profile_background_images/36003404/
 photo_manipulation_photo_art_the_mansion.jpg,id:
 72842359,utc_offset:
 28800,profile_sidebar_fill_color:DDFFCC},favorited:false,id:
 5357163705}

 PostgreSQL raises a similar complaint on its UTF8 text field.  Please
 clarify what you do to Unicode here!
 Cheers,
 Alexy


[twitter-dev] Re: illegal unicode character \uffff

2009-12-01 Thread braver
John -- thanks for the clarification!  Certainly it's the data in
Twitter's database as a whole, not just the Streaming API.  One
question is whether you should accept illegal Unicode at all.
Accepting it is probably the safer thing to do, to avoid scaring the
clients, but maybe you'd want to apply some filter before sticking it
into the database?  I.e., is it reasonable to have a policy of
accepting or storing only legal Unicode?  I know some folks use
Twitter for machine/sensor data, but perhaps that's not intended?  I
can envision Twitter allowing non-Unicode data if marked as such,
perhaps on a closed stream for machines talking to each other, but
not for humans.

Cheers,
Alexy


[twitter-dev] Re: historical trends

2009-11-05 Thread braver

Well, trends shown on Twitter itself have a self-reinforcing effect:
once a trend breaks into the Top 10, it snowballs after that.
Thus, it's not sufficient to just study tweets when identifying
trends.  Breaking into the Top 10 is a major event.

Thus I suggest Twitter carefully record when it changes the Top 10
display and provide that via an API!  This is a separate computational
process that affects almost every Twitter user's behavior, and is
thus important to preserve and study.

Cheers,
Alexy


[twitter-dev] Re: The Gardenhose Cooperative

2009-07-22 Thread braver

I don't see anything vulnerable in a reasonably done verification --
e.g., I'll ask you to grep for a word in a day of data you have and
tell me the count.  I'll google you, and preferably see you here or on
Twitter.  Heck, Twitter, I'll pay you guys $1/day for a backup fetch!
Preferably then to the starting point of the hoses.

Cheers,
Alexy


[twitter-dev] updating follow/shadow/birddog list of users

2009-07-08 Thread braver

If you have thousands of users, do you really have to cook up a
following file with, say, 100,000 comma-separated user IDs?  Should it
all be on one line?  And what happens if we want to drop some IDs and
add others -- do we have to restart and re-upload the whole list?
I see that when the curl -d @following ... starts up, it does that.
Restarting with huge lists sounds like data loss...

Cheers,
Alexy
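
For what it's worth, regenerating the file is cheap to script. Below is
a rough Java sketch that applies drops and adds to the current ID set
and renders it as a single line. It assumes the file passed to
curl -d @following holds one form parameter named follow; that
parameter name, and the class and method names, are assumptions for
illustration and not confirmed anywhere in this thread. It does not
answer whether a restart can be avoided.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

final class FollowList {
    // Render the IDs as one line, e.g. "follow=12,34,56".
    // The "follow=" prefix is an assumed parameter name.
    static String toFormBody(Collection<Long> ids) {
        return ids.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "follow=", ""));
    }

    // Return a new set with the dropped IDs removed and the added IDs
    // appended, preserving the original order and avoiding duplicates.
    static Set<Long> update(Set<Long> current,
                            Collection<Long> drop,
                            Collection<Long> add) {
        Set<Long> next = new LinkedHashSet<>(current);
        next.removeAll(drop);
        next.addAll(add);
        return next;
    }
}
```

Writing the result of toFormBody to the following file before each
reconnect keeps the list on one line no matter how many IDs it holds.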


[twitter-dev] Illegal byte sequence 0x00 in UTF8

2009-07-08 Thread braver

I'm loading twits into PostgreSQL, and get a few hundred errors
for illegal sequence 0x00 in UTF8, e.g. (each leading . is 10,000
gardenhose twits):

.org.postgresql.util.PSQLException: ERROR: invalid byte sequence for
encoding UTF8: 0x00 [loving the weather here in sunny birmingham uk
at the moment but its hard to sleep in when imfeeling lazy lol]
com.tfitter.db.DBError: CANNOT PUT TWIT 2283513311
ROLLBACK uid=21490127 tid=2283513311
org.postgresql.util.PSQLException: ERROR: invalid byte sequence
for encoding UTF8: 0x00 F?9H^f'??%???p?{^]
com.tfitter.db.DBError: CANNOT PUT TWIT 2283842814
ROLLBACK uid=30029372 tid=2283842814
...org.postgresql.util.PSQLException: ERROR: invalid byte sequence for
encoding UTF8: 0x00 [...@andycrofford  まだ脱ぐな。そろそろこのこと考えるのは最後にされると、5エ譛ォ遶ッ
縺ョ譁ケ縺ッ蜃コ鬘後&繧後腟蕭⒢㎢⒢⒢]

Does anybody know how to get rid of those 0x00s cleanly in Scala/Java?
Cheers,
Alexy
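
One possible approach in Java, as a minimal sketch: PostgreSQL text
fields cannot hold the NUL character, so the simplest fix is to drop
U+0000 from every string just before it is bound to the statement. This
assumes the NULs carry no meaning and can be discarded rather than
replaced. The class and method names are made up for illustration.

```java
final class NulStripper {
    // Remove every NUL (U+0000) character from s.
    // Returns s itself when there is nothing to strip, so the common
    // case allocates nothing.
    static String stripNul(String s) {
        if (s == null || s.indexOf('\0') < 0) {
            return s;
        }
        return s.replace("\0", "");
    }
}
```

Stripping after JSON parsing, rather than on the raw line, also covers
NULs that arrive as an escaped sequence inside the JSON string. The same
String.replace call works unchanged from Scala.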


[twitter-dev] catching up with gardenhose

2009-07-07 Thread braver

We've lost gardenhose data for 6/28-7/7; if anybody could share it,
we'd appreciate it very much!  I'm @khrabrov, authorized for it.

Cheers,
Alexy


[twitter-dev] length limits for all fields

2009-06-18 Thread braver

In designing an SQL schema for statuses as returned by the Streaming
API, we need to know the length limits for all strings.  Is there a
single table with such lengths, and/or can you guys please specify
them here?

Cheers,
Alexy


[twitter-dev] all conversations

2009-06-14 Thread braver

What percentage of all tweets are replies to others, i.e. contain
@nick?  We do research on dialogue and I'd like to get as many
conversations as possible.  So far the only reliable way I see to do
it is to crawl.  Even with the /gardenhose I'm not sure that I'm
capturing enough of each conversation.  Perhaps /follow can be
harnessed for it, but then I'd have to determine the people of
interest who converse enough.  Perhaps you honorable Twitter Spirits
can add a replies stream?

Cheers,
Alexy
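
In the meantime, the share of @nick tweets can at least be estimated on
a locally captured sample. Here is a minimal Java sketch, assuming the
tweet texts have already been extracted into a list and that a nick is
a run of letters, digits, and underscores. The pattern and the names are
illustrative assumptions, and a sample from /gardenhose would only
approximate the figure for all tweets.

```java
import java.util.List;
import java.util.regex.Pattern;

final class ReplyStats {
    // An '@' that does not follow a letter, digit, or underscore (so
    // e-mail addresses are skipped), followed by at least one such
    // character.
    private static final Pattern MENTION =
            Pattern.compile("(?<![A-Za-z0-9_])@[A-Za-z0-9_]+");

    // True if the text contains at least one @nick.
    static boolean hasMention(String text) {
        return MENTION.matcher(text).find();
    }

    // Fraction of texts containing an @nick; 0.0 for an empty sample.
    static double mentionFraction(List<String> texts) {
        if (texts.isEmpty()) {
            return 0.0;
        }
        long hits = texts.stream().filter(ReplyStats::hasMention).count();
        return (double) hits / texts.size();
    }
}
```

Note that this counts any tweet mentioning a nick, including retweets
such as "RT @nick ...", so it gives an upper bound on actual replies.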