Hi,

as you know, Freebase already publishes RDF views of their 'topics' [1]. However, if you want a consistent chunk of it, crawling or using the Freebase APIs isn't the most comfortable way. Fortunately, they also share their data dumps [2], here: [3]. The dumps are not RDF, though (the format is very similar).

A trivial conversion of the Freebase data dump into RDF is possible. It makes no attempt at extracting any schema/vocabulary (although that would be interesting, if at all possible; any ideas?).

Code is here: https://github.com/castagna/freebase2rdf

Example log from a run:

Mar 7 09:47:57 ip-10-54-167-166 build: Converting Freebase data dump into RDF...
Mar 7 09:48:00 ip-10-54-167-166 build: 09:48:00 INFO Freebase2RDF :: Add: 100,000 lines (Batch: 37,921 / Avg: 37,921)
Mar 7 09:48:02 ip-10-54-167-166 build: 09:48:02 INFO Freebase2RDF :: Add: 200,000 lines (Batch: 41,220 / Avg: 39,502)
Mar 7 09:48:04 ip-10-54-167-166 build: 09:48:04 INFO Freebase2RDF :: Add: 300,000 lines (Batch: 46,339 / Avg: 41,545)
[...]
Mar 7 13:10:39 ip-10-54-167-166 build: 13:10:39 INFO Freebase2RDF :: Add: 618,400,000 lines (Batch: 46,728 / Avg: 50,846)
[...]

During the conversion you'll see some warnings (that's normal with large datasets, nobody is perfect ;-)):

[...]
Mar 7 11:27:42 ip-10-54-167-166 build: 11:27:42 WARN Freebase2RDF :: Line 313904881 has only 2 tokens: /m/09r_f91#011/location/postal_code/postal_code#011#011
[...]
Mar 7 13:08:55 ip-10-54-167-166 build: 13:08:55 WARN Freebase2RDF :: Line 613767705 has one or more empty tokens: /m/026hkm5#011/common/topic/alias#011/lang/en#011
Mar 7 13:10:06 ip-10-54-167-166 build: 13:10:06 WARN Freebase2RDF :: Line 616940966 has only 2 tokens: /m/0cp4rt9#011/location/postal_code/postal_code#011#011
[...]

Those lines are skipped. Total triple count: 618,465,279.

There is plenty of room for improvement: forks and pull requests are welcome! ;-)

A MapReduce version should be possible, since bzip2 should be splittable... but I've not tried it yet. Anyone interested?

Cheers,
Paolo

PS: I used this dataset to test tdbloader2 and tdbloader3 loading speeds on EC2 m1.xlarge instances (i.e. 15 GB memory); a separate email will follow.
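
For anyone who wants to see the shape of such a conversion without reading the repository, here is a minimal sketch in Python. It is not the freebase2rdf code and makes several assumptions: that each dump line has four tab-separated fields (subject, property, destination, value), that the #011 in the warnings above is syslog's octal escape for a tab, that ids map to URIs under a hypothetical http://rdf.freebase.com/ns/ namespace with slashes turned into dots, and that a /lang/xx destination marks a language-tagged literal. The names to_uri, escape, convert and convert_dump are made up for illustration. A line with both a non-language destination and a value is simplified to a plain literal here, which drops the destination.

```python
import bz2

# Hypothetical namespace, for illustration only; the real tool may mint URIs differently.
NS = "http://rdf.freebase.com/ns/"


def to_uri(freebase_id):
    """Turn an id such as /m/09r_f91 into <NS + 'm.09r_f91'> (assumed mapping)."""
    return "<" + NS + freebase_id.strip("/").replace("/", ".") + ">"


def escape(text):
    """Escape a string for use inside an N-Triples literal (backslash first)."""
    for raw, esc in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"),
                     ("\r", "\\r"), ("\t", "\\t")):
        text = text.replace(raw, esc)
    return text


def convert(line):
    """Return one N-Triples statement for a dump line, or None to skip it."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        return None
    subject, prop, dest, value = fields
    if not subject or not prop:
        return None
    if dest.startswith("/lang/"):
        if not value:
            return None  # language given but no text, like the alias warning
        obj = '"%s"@%s' % (escape(value), dest[len("/lang/"):])
    elif value:
        obj = '"%s"' % escape(value)  # simplification: destination is dropped
    elif dest:
        obj = to_uri(dest)  # plain link between two topics
    else:
        return None  # no destination and no value, like the postal_code warnings
    return "%s %s %s ." % (to_uri(subject), to_uri(prop), obj)


def convert_dump(src_path, dst_path):
    """Stream a bzip2-compressed dump into an N-Triples file; return skipped count."""
    skipped = 0
    with bz2.open(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            triple = convert(line)
            if triple is None:
                skipped += 1
            else:
                dst.write(triple + "\n")
    return skipped
```

Under these assumptions, the three lines quoted in the warnings above (with #011 read as a tab) all come back as None from convert and would be counted as skipped, which matches the behaviour described, although the real tool's tokenisation rules may well differ.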

[1] http://rdf.freebase.com/
[2] http://wiki.freebase.com/wiki/Data_dumps
[3] http://download.freebase.com/datadumps/
