Hi Rupert,

Thanks a lot for the reply.

In the meantime, I tried adding the following line to namespaceprefix.mappings, and it does seem to change things a bit.

<space> http://rdf.freebase.com/ns/

After the above change, I kicked off the run again, and I am now seeing the following lines at the end of the run. I also see some data in the tdb directory.

Is it doing the right thing, or should I restart the process per your suggestion?

Thanks in advance,
Rajan

..................................................Log.....................................
Batch: 57,045 slots/s / Avg: 89,972 slots/s)
07:23:27,329 [Thread-3] INFO jenatdb.RdfResourceImporter - Index SPO->POS: 356,000,000 slots (Batch: 43,782 slots/s / Avg: 89,945 slots/s)
07:23:27,329 [Thread-3] INFO jenatdb.RdfResourceImporter - Elapsed: 83,935.45 seconds [2015/05/23 07:23:27 EDT]
07:23:28,485 [Thread-3] INFO jenatdb.RdfResourceImporter - Index SPO->POS: 356,100,000 slots (Batch: 86,505 slots/s / Avg: 89,944 slots/s)
07:23:31,334 [Thread-3] INFO jenatdb.RdfResourceImporter - Index SPO->POS: 356,200,000 slots (Batch: 35,100 slots/s / Avg: 89,904 slots/s)
07:23:32,966 [Thread-3] INFO jenatdb.RdfResourceImporter - Index SPO->POS: 356,300,000 slots (Batch: 61,274 slots/s / Avg: 89,893 slots/s)
07:23:34,348 [Thread-3] INFO jenatdb.RdfResourceImporter - Index SPO->POS: 356,400,000 slots (Batch: 72,358 slots/s / Avg: 89,887 slots/s)
07:23:36,850 [Thread-3] INFO jenatdb.R
................................................................................................................

---------------- files in tdb directory .....................
-rw-r--r-- 1 8388608 May 22 07:49 GOSP.dat
-rw-r--r-- 1 8388608 May 22 07:49 GOSP.idn
-rw-r--r-- 1 8388608 May 22 07:49 GPOS.dat
-rw-r--r-- 1 8388608 May 22 07:49 GPOS.idn
-rw-r--r-- 1 8388608 May 22 07:49 GSPO.dat
-rw-r--r-- 1 8388608 May 22 07:49 GSPO.idn
-rw-r--r-- 1 8388608 May 22 07:49 OSP.dat
-rw-r--r-- 1 8388608 May 22 07:49 OSP.idn
-rw-r--r-- 1 8388608 May 22 07:49 OSPG.dat
-rw-r--r-- 1 8388608 May 22 07:49 OSPG.idn
-rw-r--r-- 1 16399728640 May 23 07:24 POS.dat
-rw-r--r-- 1 117440512 May 23 07:24 POS.idn
-rw-r--r-- 1 8388608 May 22 07:49 POSG.dat
-rw-r--r-- 1 8388608 May 22 07:49 POSG.idn
-rw-r--r-- 25954353152 May 23 07:24 SPO.dat
-rw-r--r-- 192937984 May 23 07:17 SPO.idn
-rw-r--r-- 1 8388608 May 22 07:49 SPOG.dat
-rw-r--r-- 1 8388608 May 22 07:49 SPOG.idn
-rw-r--r-- 8808038400 May 23 07:15 node2id.dat
-rw-r--r-- 1 50331648 May 23 06:50 node2id.idn
-rw-r--r-- 1 18672429231 May 23 06:17 nodes.dat
-rw-r--r-- 1 8388608 May 22 07:49 prefix2id.dat
-rw-r--r-- 1 8388608 May 22 07:49 prefix2id.idn
-rw-r--r-- 1 8388608 May 22 07:49 prefixIdx.dat
-rw-r--r-- 1 8388608 May 22 07:49 prefixIdx.idn
-rw-r--r-- 1 0 May 22 07:49 prefixes.dat

On Sat, May 23, 2015 at 12:12 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:

> Hi
>
> you need to enable 'id-namespace' in the iditerator.properties file
> and set the value to 'http://rdf.freebase.com/ns/' (the same value as
> defined by http://prefix.cc/fb)
>
> This will ensure that the indexing tool is looking for the correct
> Entity URIs (e.g. 'http://rdf.freebase.com/ns/m.0kpv11' for '10888430
> m.0kpv11', the first line in the incoming_links.txt file)
>
> best
> Rupert
>
>
> On Fri, May 22, 2015 at 3:43 PM, Rajan Shah <raja...@gmail.com> wrote:
> > Hi Rupert,
> >
> > Thanks for the quick turnaround, I really appreciate your prompt response.
> >
> > Please find the requested details included at the end.
> >
> > Thanks in advance,
> > Rajan
> >
> > *a. iditerator.properties*
> >
> > #NOTES:
> > # Lines in this file start with spaces in cases the score is lower than one
> > # million. because of that we need to trim leading spaces
> > trimLine
> > # after trimming the lines the
> > # -> first position is always an empty string
> > # -> score should be at the first position
> > score-pos=1
> > # -> second position should be the local name of the entity
> > id-pos=2
> > #the file needs to be in the source (default="/indexing/resource") folder!
> > source=incoming_links.txt
> > encodeIds=false
> > charset=UTF-8
> > # set the separator to ' '
> > separator=
> > # and URLdecode the IDs
> > decodeIds=false
> >
> > # freebase uses namespace prefixes for IDs, because of that we do not need
> > # the id-namespace parameter. NOTE that the 'ns' prefix need to be set to
> > # http://www.
> > #id-namespace=http://freebase.com/
> > ns-prefix-state=false
> >
> > *b. incoming_links.txt*
> >
> > Some lines are as follows:
> >
> > 10888430 m.0kpv11
> > 3741261 m.019h
> > 2667858 m.0775xx5
> > 2667804 m.0775xvm
> > 1875352 m.01xryvm
> > 1739262 m.05zppz
> > 1369590 m.01xrzlb
> > 1336481 m.0g4g
> > 1202333 m.04l
> > 1093642 m.01xryw5
> > 1079153 m.09gn
> > 1070544 m.0kpv17
> > 1066210 m.09c7w0
> > 925879 m.01x32j1
> > 922312 m.0jst35z
> > 921239 m.08x8
> > 864526 m.02nsjl9
> > 832558 m.01xlj26
> > 769191 m.02lx2r
> > 736892 m.04m8
> >
> > On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >
> >> Hi Rajan,
> >>
> >> > *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
> >> > impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >>
> >> You have not indexed a single entity. So something in your indexing
> >> configuration is wrong. Most likely you are not correctly building the
> >> URIs of the entities from the incoming_links.txt file.
> >> Can you provide me an example line of the 'incoming_links.txt' file and
> >> the contents of the 'iditerator.properties' file? Those specify how Entity
> >> URIs are built.
> >>
> >> Short answers to the other questions
> >>
> >>
> >> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> > it ran for almost 3 days and generated index.
> >>
> >> That's good. It means you now have the Freebase dump in your Jena
> >> TDB triple store. You will not need to repeat this (until you want to
> >> use a newer dump). On the next call to the indexing tool it will
> >> immediately start with the indexing step.
> >>
> >> >
> >> > Couple questions come to mind:
> >> >
> >> > a. Is there any particular log/error file the process generates besides
> >> > printing out on stdout/stderr?
> >>
> >> The indexer writes a zip archive with the IDs of all the indexed
> >> entities. It's in the indexing/destination folder.
> >>
> >> > b. Is it a must-have to have stanbol full launcher running all the time
> >> > while indexing is going on?
> >>
> >> No Stanbol instance is needed by the indexing process.
> >>
> >> > c. Is it possible that, if the machine is not connected to internet for
> >> > couple minutes could cause some issues?
> >>
> >> No Internet connectivity is needed during indexing. Only if you want
> >> to use the namespace prefix mappings of prefix.cc do you need to have
> >> internet connectivity when starting the indexing tool.
> >>
> >> best
> >> Rupert
> >>
> >> >
> >> > I would really appreciate it if you can shed some light on "what could be
> >> > wrong" or "potential approach to nail down this issue". If you need, I am
> >> > happy to share any additional logs/properties.
> >> >
> >> > With best regards,
> >> > Rajan
> >> >
> >> > *1. Configuration changes*
> >> >
> >> > a. set ns-prefix-state=false *[within /indexing/config/iditerator.properties]*
> >> > b. add empty space mapping to http://rdf.freebase.com/ns/ *[within namespaceprefix.mappings]*
> >> > c. enable bunch of properties within mappings.txt such as following
> >> >
> >> > fb:music.artist.genre
> >> > fb:music.artist.label
> >> > fb:music.artist.album
> >> >
> >> > *2. Contents of indexing/dist directory*
> >> >
> >> > -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
> >> > -rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
> >> >
> >> > *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >> >
> >> > -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz
> >> >
> >> > *4. Contents of /tmp/freebase/indexing/resources directory*
> >> >
> >> > -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt
> >> >
> >> > *5. The indexer log*
> >> >
> >> > *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >> > *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 2429800000 triples (80.97554853864854%)*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples index phase*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples index phase*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load*
> >> > *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Completed: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> > 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving imported file freebase.nt.gz to imported/freebase.nt.gz
> >> > 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in 157675 seconds
> >> > 04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >> > 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File {} because of unknown extension
> >> > 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 seconds
> >> > 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files imported in 157675 seconds
> >> > 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ...
> >> > 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files imported in 0 seconds
> >> > 04:32:56,971 [main] INFO impl.IndexerImpl - ... delete existing IndexedEntityId file /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >> > 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed
> >> > 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation completed
> >> > 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ...
> >> > 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ...
> >> >
> >> >
> >> >
> >> > 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >> > 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> > 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >> > 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
> >> > 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >> > 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > current sequence : 0
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > new sequence: 1
> >> > 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 1
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Indexing: Entity Processor Deamon completed (sequence=1) ...
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > current sequence : 1
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > new sequence: 2
> >> > 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 2
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > current sequence : 2
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > new sequence: 3
> >> > 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 3
> >> > *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): processing: -1.000ms/item | queue: -1.000ms*
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - source : -1.000ms/item
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - processing: -1.000ms/item
> >> > 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - store : -1.000ms/item
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed (sequence=3) ...
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > current sequence : 3
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > new sequence: 4
> >> > 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >> > 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >> > 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - > current sequence : 4
> >> > 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed
> >> > 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ...
> >> > 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ...
> >> > 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
> >> > 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....
> >> >
> >> >
> >> >
> >> > On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >
> >> >> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >> > Hi Rupert and Antonio,
> >> >> >
> >> >> > Thanks a lot for the reply.
> >> >> >
> >> >> > I started to follow Rupert's suggestion, however it failed again at
> >> >> >
> >> >> > 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal
> >> >> > escape sequence value: $ (0x24) -- Is there any way it can be resolved for
> >> >> > the entire file?
> >> >> >
> >> >>
> >> >> The indexing tool uses Apache Jena, and those are Jena parsing errors.
> >> >> So the Jena mailing lists would be the better place to look for
> >> >> answers.
> >> >> This specific issue looks like an invalid URI that is not fixed by the
> >> >> fixit script.
> >> >>
> >> >>
> >> >> > I requested access to the latest BaseKB bucket, as it doesn't seem to be
> >> >> > open.
> >> >> >
> >> >> > s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
> >> >> > ERROR: Access to bucket 'basekb-now' was denied
> >> >> >
> >> >> >
> >> >> > *Couple additional questions:*
> >> >> >
> >> >> > *1. indexing enhancements:*
> >> >> > What settings/properties can one tweak to gain the most out of the indexing?
> >> >> >
> >> >>
> >> >> In general you only want the information needed for your application
> >> >> case in the index.
> >> >> For EntityLinking only labels and type are required.
> >> >> Additional properties will only be used for dereferencing Entities. So
> >> >> this will depend on your application needs (your dereferencing
> >> >> configuration).
> >> >>
> >> >> In general I try to exclude as much information as possible from the
> >> >> index to keep the size of the Solr Index as small as possible.
> >> >>
> >> >> > a. for ex. domain specific such as Pharmaceutical, Law etc... within
> >> >> > freebase
> >> >> > b. potential optimizations to speed up the overall indexing
> >> >>
> >> >> Most of the time will be needed to load the Freebase dump into Jena
> >> >> TDB. Even with an SSD-equipped server this will take several days.
> >> >> Assigning more RAM will speed up this process as Jena TDB can cache
> >> >> more things in RAM.
> >> >>
> >> >> Usually it is a good idea to cancel the indexing process after the
> >> >> importing of the RDF data has finished (and the indexing of the
> >> >> Entities has started). This is because after importing all the RAM will
> >> >> be used by Jena TDB for caching stuff that is no longer needed in the
> >> >> read-only operations during indexing. So a fresh start can speed up
> >> >> the indexing part of the process.
> >> >>
> >> >> Also have a look at the Freebase Indexing Tool Readme
> >> >>
> >> >> >
> >> >> > *2. demo:*
> >> >> > I see that, in recent github commit(s) the eHealth and other demos have
> >> >> > been commented out. How can I get demo source code and other components for
> >> >> > these demos? I prefer to build it myself to see the power of stanbol.
> >> >> >
> >> >>
> >> >> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >> >> compatible with the trunk version.
> >> >>
> >> >> > *3. custom vocabulary:*
> >> >> > Suppose, I have custom vocabulary in CSV format. Is there a preferred way
> >> >> > to upload it to Stanbol and have it recognize my entities?
> >> >>
> >> >> Google Refine [2] with the RDF extension [3]. You can also try to use
> >> >> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> >> >> but AFAIK this combination is not so stable and might not work at all.
> >> >>
> >> >> * Google Refine allows you to import your CSV file.
> >> >> * Clean it up (if necessary)
> >> >> * The RDF extension allows you to map your CSV data to RDF
> >> >> * based on this mapping you can save your data as RDF
> >> >> * after that you can import the RDF data to Apache Stanbol
> >> >>
> >> >> hope this helps
> >> >> best
> >> >> Rupert
> >> >>
> >> >> >
> >> >> > Thanks in advance,
> >> >> > Rajan
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >> >> [2] https://code.google.com/p/google-refine/
> >> >> [3] http://refine.deri.ie/
> >> >> [4] http://openrefine.org/
> >> >>
> >> >> > On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >> >
> >> >> >> Hi Rajan,
> >> >> >>
> >> >> >> I think this is because you named your file
> >> >> >> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >> >> >> is not provided by the file extension. Renaming the file to
> >> >> >> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >> >> >>
> >> >> >> The suggestion of Antonio to use BaseKB is also a valid option.
> >> >> >>
> >> >> >> best
> >> >> >> Rupert
> >> >> >>
> >> >> >> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
> >> >> >> <ape...@zaizi.com> wrote:
> >> >> >> > Hi Rajan
> >> >> >> >
> >> >> >> > The Freebase dump contains some things that do not fit very well with the
> >> >> >> > indexer.
> >> >> >> > I advise you to use the dump provided by BaseKB (http://basekb.com),
> >> >> >> > which is a curated Freebase dump.
> >> >> >> > I did not have any problem indexing it using that dump.
> >> >> >> >
> >> >> >> > Regards
> >> >> >> >
> >> >> >> > On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >> >> >
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> I am working on indexing Freebase data within EntityHub and observed
> >> >> >> >> the following issue:
> >> >> >> >>
> >> >> >> >> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
> >> >> >> >>
> >> >> >> >> I would appreciate any help pertaining to this issue.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >> Rajan
> >> >> >> >>
> >> >> >> >> *Steps followed:*
> >> >> >> >>
> >> >> >> >> *1. Initialization:*
> >> >> >> >> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
> >> >> >> >>
> >> >> >> >> *2. Download the data:*
> >> >> >> >> Download the data from https://developers.google.com/freebase/data
> >> >> >> >>
> >> >> >> >> *3. Performed execution of fbrankings-uri.sh*
> >> >> >> >> It generated incoming_links.txt under resources directory as follows
> >> >> >> >>
> >> >> >> >> 10888430 m.0kpv11
> >> >> >> >> 3741261 m.019h
> >> >> >> >> 2667858 m.0775xx5
> >> >> >> >> 2667804 m.0775xvm
> >> >> >> >> 1875352 m.01xryvm
> >> >> >> >> 1739262 m.05zppz
> >> >> >> >> 1369590 m.01xrzlb
> >> >> >> >>
> >> >> >> >> *4. Performed execution of fixit script*
> >> >> >> >>
> >> >> >> >> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >> >> >> >>
> >> >> >> >> *5. Rename the fixed file to freebase.rdf.gz and copy it to indexing/resources/rdfdata*
> >> >> >> >>
> >> >> >> >> *6. config/iditer.properties file has following setting*
> >> >> >> >> #id-namespace=http://freebase.com/
> >> >> >> >> ns-prefix-state=false
> >> >> >> >>
> >> >> >> >> *7. Performed run of following command:*
> >> >> >> >> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
> >> >> >> >>
> >> >> >> >> The error dump on stdout is as follows:
> >> >> >> >>
> >> >> >> >> 01:37:32,884 [Thread-0] INFO solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >> >> >> >> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
> >> >> >> >> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples data phase
> >> >> >> >> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Load empty triples table
> >> >> >> >> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
> >> >> >> >> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase
> >> >> >> >> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load
> >> >> >> >> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
> >> >> >> >>
> >> >> >> >> Additional Reference Point:
> >> >> >> >>
> >> >> >> >> *Original Freebase dump size:* 31025015397 May 14 18:10 freebase-rdf-latest.gz
> >> >> >> >> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
> >> >> >> >> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
> >> >> >> >>
> >> >> >> >
> >> >> >> --
> >> >> >> | Rupert Westenthaler rupert.westentha...@gmail.com
> >> >> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> >> >> | A-5500 Bischofshofen
> >> >> >> | REDLINK.CO ..........................................................................
> >> >> >> | http://redlink.co/
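
For reference, the URI construction Rupert describes (id-namespace plus the local name from each line of incoming_links.txt) can be sketched in a few lines of Python. This is only an illustration of the idea and not the indexing tool's actual code: the function name entity_uri is hypothetical, and treating score-pos and id-pos as 1-based positions over the whitespace-separated fields of the trimmed line is an assumption based on the comments in iditerator.properties.

```python
# Minimal sketch, NOT the Stanbol implementation.
# Assumption: score-pos / id-pos are 1-based positions in the trimmed,
# whitespace-separated line, as the iditerator.properties comments suggest.

ID_NAMESPACE = "http://rdf.freebase.com/ns/"  # value suggested for id-namespace


def entity_uri(line, namespace=ID_NAMESPACE, score_pos=1, id_pos=2):
    """Return (uri, score) for one line of incoming_links.txt (hypothetical helper)."""
    fields = line.strip().split()       # corresponds to trimLine + separator ' '
    score = int(fields[score_pos - 1])  # number of incoming links
    local_name = fields[id_pos - 1]     # e.g. 'm.0kpv11'
    return namespace + local_name, score


if __name__ == "__main__":
    # Sample lines taken from the incoming_links.txt excerpt in this thread;
    # the leading space on the second one mimics a low-score line.
    for sample in ["10888430 m.0kpv11", " 3741261 m.019h"]:
        print(entity_uri(sample))
```

If the id-namespace is left commented out and no prefix mapping resolves the local names, the IDs read from incoming_links.txt would not match the entity URIs stored in Jena TDB, which would be consistent with the "Indexed 0 items" line in the earlier log.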