GitHub user chenlica closed a discussion: Data Sets (from old wiki)
Content from https://github.com/apache/texera/wiki/Data-Sets (may be dangling)

=====

Authors: Chen Li

### Medline Abstracts

* Uploader: Zuozhi Wang and Chen Li
* Data size: 100K docs - 47MB (zipped), 1M docs - 531MB (zipped)
* Number of records: 100K docs, 1M docs
* Download URL: https://drive.google.com/drive/u/0/folders/0Bxp0qxtbSGxYd0s3NXZPQTJtUkE
* Sample records: Each line is one separate record in JSON format. A sample record is as follows:

```
{
  "pmid":"19866847",
  "affiliation":"Surgeon, U. S. A.",
  "article_title":"ON THE APPEARANCE ......",
  "authors":"W Reed",
  "journal_issue":"2-5 Sep 1, 1897",
  "journal_title":"The Journal of experimental medicine",
  "keywords":"",
  "mesh_headings":"",
  "abstract":"1. The claim of L. Pfeiffer that ........",
  "zipf_score":0.019866847
}
```

### Twitter Data

* Uploader: Zuozhi Wang and Jianfeng Jia
* Data size: 10K tweets: 4MB (zipped), 30MB (unzipped); 200K tweets: 100MB (zipped), 700MB (unzipped)
* Number of records: 10K tweets, 200K tweets
* Download URL: https://drive.google.com/drive/folders/0Bxp0qxtbSGxYVWlENVVOTzA3QUE?usp=sharing
* Sample records: Each line is one tweet and its information in JSON format.
A sample record is as follows:

```
{
  "created_at": "Tue Nov 17 21:33:18 +0000 2015",
  "id": 666730857898508288,
  "id_str": "666730857898508288",
  "text": "Get a major thrill out of ditching something\/someone way before they even get the opportunity to pull some shady activity.",
  "source": "\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e",
  "truncated": false,
  "in_reply_to_status_id": null,
  "in_reply_to_status_id_str": null,
  "in_reply_to_user_id": null,
  "in_reply_to_user_id_str": null,
  "in_reply_to_screen_name": null,
  "user": {
    "id": 329833893,
    "id_str": "329833893",
    "name": "Jade Castillo",
    "screen_name": "RealJadeMarie",
    "location": "Los Angeles, CA",
    "url": "http:\/\/Instagram.com\/realjademarie",
    "description": null,
    "protected": false,
    "verified": false,
    "followers_count": 1423,
    "friends_count": 557,
    "listed_count": 11,
    "favourites_count": 35345,
    "statuses_count": 40776,
    "created_at": "Tue Jul 05 17:56:10 +0000 2011",
    "utc_offset": -21600,
    "time_zone": "Central Time (US & Canada)",
    "geo_enabled": true,
    "lang": "en",
    "contributors_enabled": false,
    "is_translator": false,
    "profile_background_color": "FF6699",
    "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif",
    "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif",
    "profile_background_tile": true,
    "profile_link_color": "B40B43",
    "profile_sidebar_border_color": "CC3366",
    "profile_sidebar_fill_color": "E5507E",
    "profile_text_color": "362720",
    "profile_use_background_image": true,
    "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/665437037613457408\/8CCCd9iG_normal.jpg",
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/665437037613457408\/8CCCd9iG_normal.jpg",
    "profile_banner_url": "https:\/\/pbs.twimg.com\/profile_banners\/329833893\/1445677148",
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null
  },
  "geo": null,
  "coordinates": null,
  "place": {
    "id": "fbd6d2f5a4e4a15e",
    "url": "https:\/\/api.twitter.com\/1.1\/geo\/id\/fbd6d2f5a4e4a15e.json",
    "place_type": "admin",
    "name": "California",
    "full_name": "California, USA",
    "country_code": "US",
    "country": "United States",
    "bounding_box": {
      "type": "Polygon",
      "coordinates": [ [ [-124.482003, 32.528832], [-124.482003, 42.009519], [-114.131212, 42.009519], [-114.131212, 32.528832] ] ]
    },
    "attributes": {}
  },
  "contributors": null,
  "is_quote_status": false,
  "retweet_count": 0,
  "favorite_count": 0,
  "entities": {
    "hashtags": [],
    "urls": [],
    "user_mentions": [],
    "symbols": []
  },
  "favorited": false,
  "retweeted": false,
  "filter_level": "low",
  "lang": "en",
  "timestamp_ms": "1447795998440"
}
```

### Proposal Data (in Chinese)

* Uploader: Qinhua Huang
* Data size: 150KB
* Number of records: N/A
* Download URL: https://drive.google.com/drive/folders/0B-xzsxV4BxGeQTdNQkRRek4xQ00?usp=sharing
* Sample records: Records are not strictly structured. Each record contains roughly five sections: project title, project source, project period, project leaders, and project description. A sample record is as follows:

```
(六)河南黄淮海平原中低产地区农业投资问题研究
项目来源:中国农科院农业经济研究所
起止年限:1988-1990年
主持人:卑圣模、王蕴娴
该项目系中国农科院农经所“七五”重点科研项目“黄淮海平原中低产地区综合治理开发与农业投资问题研究”的子课题......
```

GitHub link: https://github.com/apache/texera/discussions/3956

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
