[twitter-dev] Re: illegal unicode character \uffff
Mark, great to see you here! Now I trust the platform is in the right hands. :) Cheers, Alexy
[twitter-dev] Re: illegal unicode character \uffff
On Dec 1, 10:49 pm, John Kalucki jkalu...@gmail.com wrote: Perhaps someone from Platform could weigh in on this? In [vulgar] Russian, I'd say it seems Platform retracted its tongue into a [bodily cavity]. :) Platform, hey! :) Cheers, Alexy
[twitter-dev] Re: illegal unicode character \uffff
Gardenhose apparently returns illegal Unicode, as confirmed by PostgreSQL and by Perl's Encode, which is very trusted, high-mileage code. We can certainly trap illegal Unicode errors, but we need to know whether you're aware of the problem, the rationale, and the plan of action, if any. -- Alexy On Nov 21, 5:10 pm, braver delivera...@gmail.com wrote: I've tried loading the gardenhose via Perl's JSON, and it fails on quite a few Asian ones with \u in them, e.g. the tweet with id 5277460813: {text:RT @RealLamarOdom \uIf you haven't heard it, go to www.richsoilclothing.com and look under \updates\. Tell me what you think. It's hot!,...} Is it an artifact of downloading, or does Twitter serve illegal UTF-8? Here's an example of what Perl says about it, for another tweet: *** json ENCODING error: malformed or illegal unicode character in string [ Artest l], cannot convert to JSON at /home/alexyk/twitter/loader/jwilter.pl line 30, line 44817003. {in_reply_to_screen_name:null,text:RT @TheLakersNation \uArtest looked great. Lamar dominated the boards. Kobe is Kobe. 
And most importantly, the Lakers take the WIN!,source:a href= \http://mobileways.de/gravity\; rel=\nofollow\Gravity/ a,in_reply_to_user_id:null,in_reply_to_status_id:null,truncated:false,geo:null,created_at:Mon Nov 02 05:55:49 + 2009,user: {profile_background_tile:false,profile_sidebar_border_color:BDDCAD,following:null,statuses_count: 243,followers_count:33,profile_image_url:http://a3.twimg.com/profile_images/406146987/Real_Force_normal.jpg,friends_count: 93,description:My Love:Kobe Bryant,Los Angeles Lakers,NBA,Twitter,Music,Movie.I Love This Game.Determination:Let's again!,location:CN,geo_enabled:false,profile_background_color:9AE4E8,screen_name:Real_Force,favourites_count: 4,verified:false,notifications:null,profile_text_color:33,time_zone:Beijing,protected:false,url:http://hi.baidu.com/real_force/,created_at:Wed Sep 09 12:41:22 + 2009,profile_link_color:0084B4,name:Zhang Yuhao,profile_background_image_url:http://a1.twimg.com/profile_background_images/36003404/photo_manipulation_photo_art_the_mansion.jpg,id: 72842359,utc_offset: 28800,profile_sidebar_fill_color:DDFFCC},favorited:false,id: 5357163705} PostgreSQL shows similar annoyance on its text field in UTF8. Please clarify what you do to Unicode here! Cheers, Alexy
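One way to trap this on the client side is to replace offending code points after JSON parsing and before the text reaches a strict consumer. The following is a minimal sketch, assuming the trouble comes from Unicode noncharacters such as U+FFFF or from lone surrogates. The helper names (is_illegal, sanitize, load_tweet) are hypothetical, and substituting U+FFFD is one possible policy, not something the thread prescribes.

```python
import json


def is_illegal(ch: str) -> bool:
    """True for lone surrogates and Unicode noncharacters."""
    cp = ord(ch)
    return (
        0xD800 <= cp <= 0xDFFF          # surrogate code points
        or 0xFDD0 <= cp <= 0xFDEF       # noncharacter block
        or (cp & 0xFFFE) == 0xFFFE      # U+FFFE/U+FFFF in every plane
    )


def sanitize(text: str, replacement: str = "\ufffd") -> str:
    """Replace each offending code point (assumed policy: U+FFFD)."""
    return "".join(replacement if is_illegal(ch) else ch for ch in text)


def load_tweet(line: str) -> dict:
    """Parse one JSON line and clean its top-level 'text' field only."""
    tweet = json.loads(line)
    if isinstance(tweet.get("text"), str):
        tweet["text"] = sanitize(tweet["text"])
    return tweet
```

This cleans only the top-level text field. Nested strings such as the user description would need the same treatment.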
[twitter-dev] Re: illegal unicode character \uffff
John -- thanks for the clarification! Certainly the issue is with the data in Twitter's database as a whole, not just the Streaming API. One question is whether you should accept illegal Unicode at all. Accepting it is probably the safer choice, to avoid scaring the clients, but maybe you'd want to apply some filter before sticking it into the database? I.e., is it reasonable to have a policy of accepting or storing only legal Unicode? I know some folks use Twitter for machine/sensor data, but perhaps that's not intended? I can envision Twitter allowing non-Unicode data if marked as such, perhaps on a closed stream, for machines talking to each other, but not for humans. Cheers, Alexy
[twitter-dev] Re: historical trends
Well, trends shown on Twitter itself have a self-reinforcing effect: once a trend breaks into the Top 10, it snowballs from there. Thus, it's not sufficient to study only tweets when identifying trends. Breaking into the Top 10 is a major event. So I suggest that Twitter carefully record when it changes the Top 10 display and provide that via an API! This is a separate computational process which affects almost every Twitter user's behavior, and is thus important to preserve and study. Cheers, Alexy
[twitter-dev] Re: The Gardenhose Cooperative
I don't see anything vulnerable in a reasonably done verification -- e.g., I'll ask you to grep for a word in a day's data you have and tell me the count. I'll google you, and preferably see you here or on Twitter. Heck, Twitter, I'll pay you guys $1/day for a backup fetch! Preferably going back to the starting point of the hoses. Cheers, Alexy
[twitter-dev] updating follow/shadow/birddog list of users
If you have thousands of users, do you really have to cook up a following file with, say, 100,000 comma-separated user IDs? Should it all be on one line? And what happens if we want to drop some IDs and add others -- do we have to restart and re-upload the whole list? I see that when curl -d @following ... starts up, it does exactly that. Restarting with huge lists sounds like data loss... Cheers, Alexy
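For the file-building part of the question, here is a minimal sketch that regenerates the one-line body from a set of IDs after adds and drops. The parameter name follow and the helper names are assumptions, not confirmed in this thread. It does not answer whether a reconnect is required; it only rebuilds the file that curl -d @following would send.

```python
def update_ids(current, add=(), drop=()):
    """Return the new ID set after adding some IDs and dropping others."""
    return (set(current) | set(add)) - set(drop)


def build_follow_body(user_ids, param="follow"):
    """Build a single-line POST body such as 'follow=1,2,3'.

    'follow' is an assumed parameter name; IDs are deduplicated and sorted
    so that regenerated files are easy to diff.
    """
    unique = sorted({int(u) for u in user_ids})
    return f"{param}=" + ",".join(str(u) for u in unique)


def write_follow_file(path, user_ids):
    """Write the body on one line, with no trailing newline."""
    with open(path, "w", encoding="ascii") as f:
        f.write(build_follow_body(user_ids))
```

Usage would look like write_follow_file("following", update_ids(old_ids, add=new_ids, drop=gone_ids)), where the three ID collections are whatever your own bookkeeping holds.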
[twitter-dev] Illegal byte sequence 0x00 in UTF8
I'm loading twits into PostgreSQL and get a few hundred errors for the illegal byte sequence 0x00 in UTF8, e.g. (each leading . is 10,000 gardenhose twits): .org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding UTF8: 0x00 [loving the weather here in sunny birmingham uk at the moment but its hard to sleep in when imfeeling lazy lol] com.tfitter.db.DBError: CANNOT PUT TWIT 2283513311 ROLLBACK uid=21490127 tid=2283513311 org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding UTF8: 0x00 F?9H^f'??%???p?{^] com.tfitter.db.DBError: CANNOT PUT TWIT 2283842814 ROLLBACK uid=30029372 tid=2283842814 ...org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding UTF8: 0x00 [...@andycrofford まだ脱ぐな。そろそろこのこと考えるのは最後にされると、5エ譛ォ遶ッ 縺ョ譁ケ縺ッ蜃コ鬘後&繧後腟蕭⒢㎢⒢⒢] Does anybody know how to get rid of those 0x00s cleanly in Scala/Java? Cheers, Alexy
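PostgreSQL text columns cannot hold the NUL character, so one straightforward approach is to strip it from every string before the insert. The sketch below shows the idea in Python rather than Scala/Java; the helper names are hypothetical. The same plain replacement of the NUL character should carry over to Java's String.replace, though that variant is not shown or tested here.

```python
def strip_nul(text: str) -> str:
    """Remove every NUL (0x00) character from a string."""
    return text.replace("\x00", "")


def strip_nul_fields(record: dict) -> dict:
    """Return a copy with NULs removed from top-level string values.

    Non-string values (ids, counts, nested objects) pass through unchanged.
    """
    return {
        key: strip_nul(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
```

Dropping the NULs silently loses the fact that the status was malformed, so it may be worth logging the status id whenever strip_nul changes the text.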
[twitter-dev] catching up with gardenhose
We've lost gardenhose data for 6/28-7/7; if anybody could share it, we'd appreciate it very much! I'm @khrabrov, authorized for it. Cheers, Alexy
[twitter-dev] length limits for all fields
In designing an SQL schema for statuses as returned by the Streaming API, we need to know the length limits for all string fields. Is there a single table with such lengths, and/or can you guys please specify them here? Cheers, Alexy
[twitter-dev] all conversations
What percentage of all tweets are replies to others, i.e. contain @nick? We do research on dialogue and I'd like to get as many conversations as possible. So far the only reliable way I see to do it is to crawl. Even with the /gardenhose I'm not sure that I'm capturing enough of each conversation. Perhaps /follow can be harnessed for it, but then I'd have to determine the people of interest who converse enough. Perhaps you honorable Twitter Spirits can add a replies stream? Cheers, Alexy
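Until there is an official number, the fraction can at least be estimated from whatever sample is on hand. This is a minimal sketch, assuming a tweet counts as a reply or mention when its text contains @nick not preceded by a word character (so e-mail addresses are skipped). The pattern and the function name are illustrative, and the result describes only the sample, not all of Twitter.

```python
import re

# '@' followed by word characters, not preceded by a word character,
# so "a@b.com" does not count as a mention.
MENTION = re.compile(r"(?<!\w)@\w+")


def mention_fraction(texts):
    """Fraction of texts containing at least one @nick; 0.0 if empty."""
    texts = list(texts)
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if MENTION.search(text))
    return hits / len(texts)
```

Note that this also counts retweets of the form RT @nick; separating true replies would need the in_reply_to fields seen in the status JSON.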