Hi,
as you know, Freebase already publishes RDF views of its 'topics' [1].
However, if you want a consistent chunk of the data, crawling or using the
Freebase APIs isn't the most comfortable way to get it.
Fortunately, they also share their data dumps [2], here: [3]. However, the
dumps are not RDF (though the format is very similar).

A trivial conversion of the Freebase data dump into RDF is possible. It makes
no attempt at extracting any schema/vocabulary (although that would be
interesting, if at all possible; any ideas?).
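
To give an idea of what such a conversion involves, here is a minimal Python
sketch. It is not the code from the repository, only an illustration under
these assumptions: each dump line has four tab-separated fields (source,
property, destination, value), ids map into the http://rdf.freebase.com/ns/
namespace by replacing '/' with '.', and a /lang/xx destination marks a
language-tagged literal. The names NS, to_iri, escape_literal and
quad_to_ntriple are made up for this example.

```python
NS = "http://rdf.freebase.com/ns/"  # assumed namespace, not confirmed by the tool


def to_iri(fb_id):
    # "/m/09r_f91" -> "<http://rdf.freebase.com/ns/m.09r_f91>" (assumed mapping)
    return "<" + NS + fb_id.strip("/").replace("/", ".") + ">"


def escape_literal(text):
    # Escape the characters that N-Triples does not allow raw in a literal.
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t"))


def quad_to_ntriple(line):
    # Raises ValueError if the line does not have exactly four fields.
    source, prop, dest, value = line.rstrip("\n").split("\t")
    s, p = to_iri(source), to_iri(prop)
    if value == "":
        # No literal value: the destination is another Freebase id.
        o = to_iri(dest)
    elif dest.startswith("/lang/"):
        # Language-tagged literal, e.g. destination "/lang/en" -> "@en".
        o = '"%s"@%s' % (escape_literal(value), dest[len("/lang/"):])
    else:
        # Plain literal; this sketch ignores any other destination.
        o = '"%s"' % escape_literal(value)
    return "%s %s %s ." % (s, p, o)
```

For example, quad_to_ntriple("/m/a\t/p/q\t/m/b\t") yields a triple whose
subject, predicate and object are all IRIs in the assumed namespace.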

Code is here:
https://github.com/castagna/freebase2rdf

Example log from a run:
Mar  7 09:47:57 ip-10-54-167-166 build: Converting Freebase data dump into RDF...
Mar  7 09:48:00 ip-10-54-167-166 build: 09:48:00 INFO  Freebase2RDF              :: Add: 100,000 lines (Batch: 37,921 / Avg: 37,921)
Mar  7 09:48:02 ip-10-54-167-166 build: 09:48:02 INFO  Freebase2RDF              :: Add: 200,000 lines (Batch: 41,220 / Avg: 39,502)
Mar  7 09:48:04 ip-10-54-167-166 build: 09:48:04 INFO  Freebase2RDF              :: Add: 300,000 lines (Batch: 46,339 / Avg: 41,545)
[...]
Mar  7 13:10:39 ip-10-54-167-166 build: 13:10:39 INFO  Freebase2RDF              :: Add: 618,400,000 lines (Batch: 46,728 / Avg: 50,846)
[...]
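
The log does not state the unit of the Batch/Avg figures, but a quick check
suggests they are lines per second. The Python sketch below divides the last
line count shown by the time elapsed since the first log message; the
function name average_rate is made up for this example, and it assumes both
timestamps fall on the same day.

```python
from datetime import datetime


def average_rate(start, end, lines, fmt="%H:%M:%S"):
    # Elapsed wall-clock time in seconds between two same-day timestamps.
    elapsed = (datetime.strptime(end, fmt)
               - datetime.strptime(start, fmt)).total_seconds()
    return lines / elapsed


# Timestamps and line count taken from the first and last log lines above.
rate = average_rate("09:47:57", "13:10:39", 618_400_000)
print(round(rate))
```

This comes out at about 50,847, in line with the logged Avg of 50,846.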

During the conversion you'll see some warnings (that's normal with large
datasets; nobody is perfect ;-)):
[...]
Mar  7 11:27:42 ip-10-54-167-166 build: 11:27:42 WARN  Freebase2RDF              :: Line 313904881 has only 2 tokens: /m/09r_f91#011/location/postal_code/postal_code#011#011
[...]
Mar  7 13:08:55 ip-10-54-167-166 build: 13:08:55 WARN  Freebase2RDF              :: Line 613767705 has one or more empty tokens: /m/026hkm5#011/common/topic/alias#011/lang/en#011
Mar  7 13:10:06 ip-10-54-167-166 build: 13:10:06 WARN  Freebase2RDF              :: Line 616940966 has only 2 tokens: /m/0cp4rt9#011/location/postal_code/postal_code#011#011
[...]

Those lines are skipped.
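
A check of this kind could look like the Python sketch below. The exact rules
the tool applies are not spelled out here, so this only mirrors the three
warnings shown: it rejects lines with neither a destination nor a value, and
lines that name a language but carry no text. It also assumes that "#011" in
the log is how syslog prints a tab character (octal 011). The function name
check_line and the reason strings are made up for this example.

```python
def check_line(line):
    """Return None if the line looks usable, otherwise a reason to skip it."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return "expected 4 tab-separated fields, found %d" % len(fields)
    source, prop, dest, value = fields
    if not source or not prop:
        return "missing source or property"
    if not dest and not value:
        # e.g. the /location/postal_code/postal_code lines in the log
        return "no destination and no value"
    if dest.startswith("/lang/") and not value:
        # e.g. the /common/topic/alias line in the log
        return "language given but value is empty"
    return None


# A line as it appears in the log, with "#011" turned back into tabs.
logged = "/m/09r_f91#011/location/postal_code/postal_code#011#011"
print(check_line(logged.replace("#011", "\t")))
```

A converter would call this on every line, log the reason when it is not
None, and move on to the next line.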

Total triple count: 618,465,279.

There is plenty of room for improvement: forks and pull requests are
welcome! ;-)
A MapReduce version should be possible, since bzip2 files should be
splittable... but I haven't tried it yet. Anyone interested?

Cheers,
Paolo

PS:
I used this dataset to test tdbloader2 and tdbloader3 loading speeds on EC2
m1.xlarge instances (i.e. 15 GB of memory); a separate email will follow.


 [1] http://rdf.freebase.com/
 [2] http://wiki.freebase.com/wiki/Data_dumps
 [3] http://download.freebase.com/datadumps/
