Hi,

I accidentally wiped out the logs for a clean start. I am also planning to run on a higher-end AWS instance, so I will keep you posted.
Thanks again for your continuous help.

With best regards,
Rajan

Sent from my iPhone

> On May 26, 2015, at 8:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>
> Hi
>
>> On Tue, May 26, 2015 at 2:13 PM, <raja...@gmail.com> wrote:
>>
>> Hi Rupert,
>>
>> After the last failure, I am only using language=en and it still fails.
>
> Can you provide some lines of the logging before the OOM? I would like
> to be sure that it really happens during the Solr optimization phase.
>
>> Thanks for the timely answer. Just to double-confirm: if I restarted the
>> index command this morning with a higher -Xmx option, is it too late to
>> run finalise?
>
> If the OOM exception really happened during the Solr optimization, calling
>
> java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
>
> will use the data of the previous indexing call and just repeat the
> finalization steps.
>
> best
> Rupert
>
>> With best regards,
>> Rajan
>>
>> Sent from my iPhone
>>
>>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>
>>> Hi Rajan
>>>
>>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com> wrote:
>>>>
>>>> Hi Rupert,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> As per your suggestion, I made the necessary changes; however, it failed
>>>> with "OutOfMemory" errors. At present, I am running with -Xmx48g, but at
>>>> this point it is a trial-and-error approach, with several days of effort
>>>> being wasted.
>>>
>>> I guess you are getting the OutOfMemory while optimizing the Solr
>>> Index (right?). The README [1] explicitly notes that a high amount of
>>> memory is needed by exactly this step of the indexing process.
>>>
>>> If the indexing fails at this step you can call the indexing tool with
>>> the `finalise` command (instead of `indexing`); see STANBOL-1047 [2]
>>> for details.
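The recovery call described above can be sketched as a small shell helper. `build_finalise_cmd` and the 64g heap value are illustrative; the jar name and the `finalise` argument come from the thread:

```shell
# Assemble the finalise invocation (heap size and jar path are caller-supplied),
# so the same call can be retried with progressively larger heaps.
build_finalise_cmd() {
  local heap="$1" jar="$2"
  printf 'java -Xmx%s -jar %s finalise\n' "$heap" "$jar"
}

# Example: re-run only the finalization steps with a larger heap.
build_finalise_cmd 64g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
# → java -Xmx64g -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
```

Because `finalise` reuses the data of the previous indexing call, only the Solr optimization and the freebase.solrindex.zip creation are repeated.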
>>> This will prevent the indexing from being repeated and will only
>>> execute the finalization steps (optimizing the Solr Index and creating
>>> the freebase.solrindex.zip file).
>>>
>>>> I am just throwing out an idea, but wanted to see
>>>>
>>>> a. Is it possible to publish a set of constraints and required
>>>> parameters, i.e. with a minimal set of entities within mappings.txt,
>>>> one needs to set these parameters?
>>>
>>> I do not understand this question. Do you want to filter entities
>>> based on their information? If so you might want to have a look at
>>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
>>> The generic RDF indexing tool has an example of how to use this
>>> processor to filter entities based on their rdf:type values.
>>>
>>> See also the "Entity Filters" section of [3].
>>>
>>>> b. Is it possible to split the file based on subject, generate a smaller
>>>> index for each part, and merge afterwards?
>>>
>>> Yes. You can split up the dump (by subject) and import the parts in
>>> different Indexing Tool instances (meaning different Jena TDB
>>> instances). Importing 4 x 500 million triples to Jena TDB is supposed
>>> to be much faster than 1 x 2 billion.
>>>
>>> If you still want to have all data in a single Entityhub Site you need
>>> to script the indexing process:
>>>
>>> * call indexing for the first part
>>> * after this finishes, link the {part1}/indexing/destination/indexes
>>>   folder to {part2..n}/indexing/destination/indexes
>>> * call indexing for the 2..n parts
>>>
>>> As the indexing tool only adds information to the Solr Index, you will
>>> get the union over all parts at the end of the process. All parts need
>>> to use the full incoming_links.txt file, because otherwise the rankings
>>> would not be correct.
>>>
>>> The "Indexing Datasets separately" section of [3] describes a similar
>>> trick for creating a union index over multiple datasets.
>>>
>>> best
>>> Rupert
>>>
>>>> c.
>>>> Work with the BaseKB guys to also make it available at a nominal charge?
>>>>
>>>> d. Maybe apply some Map/Reduce - an extension of idea b
>>>>
>>>> With best regards,
>>>> Rajan
>>>
>>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
>>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
>>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
>>>
>>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>
>>>>> Hi Rajan,
>>>>>
>>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):*
>>>>>
>>>>> You have not indexed a single entity, so something in your indexing
>>>>> configuration is wrong. Most likely you are not correctly building the
>>>>> URIs of the entities from the incoming_links.txt file. Can you provide
>>>>> me an example line of the 'incoming_links.txt' file and the contents
>>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
>>>>> built.
>>>>>
>>>>> Short answers to the other questions:
>>>>>
>>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>
>>>>>> it ran for almost 3 days and generated the index.
>>>>>
>>>>> That's good. It means you now have the Freebase dump in your Jena
>>>>> TDB triple store. You will not need to repeat this (until you want to
>>>>> use a newer dump). On the next call to the indexing tool it will
>>>>> immediately start with the indexing step.
>>>>>
>>>>>> A couple of questions come to mind:
>>>>>>
>>>>>> a. Is there any particular log/error file the process generates besides
>>>>>> printing out on stdout/stderr?
>>>>>
>>>>> The indexer writes a zip archive with the IDs of all the indexed
>>>>> entities. It is in the indexing/destination folder.
>>>>>
>>>>>> b.
>>>>>> Is it a must-have to keep the full Stanbol launcher running all the
>>>>>> time while indexing is going on?
>>>>>
>>>>> No Stanbol instance is needed by the indexing process.
>>>>>
>>>>>> c. Is it possible that the machine not being connected to the internet
>>>>>> for a couple of minutes could cause some issues?
>>>>>
>>>>> No internet connectivity is needed during indexing. Only if you want
>>>>> to use the namespace prefix mappings of prefix.cc do you need internet
>>>>> connectivity when starting the indexing tool.
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>>> I would really appreciate it if you could shed some light on "what
>>>>>> could be wrong" or a "potential approach to nail down this issue". If
>>>>>> needed, I am happy to share any additional logs/properties.
>>>>>>
>>>>>> With best regards,
>>>>>> Rajan
>>>>>>
>>>>>> *1. Configuration changes*
>>>>>>
>>>>>> a. set ns-prefix-state=false
>>>>>> *[within /indexing/config/iditerator.properties]*
>>>>>> b. add an empty-prefix mapping to http://rdf.freebase.com/ns/
>>>>>> *[within namespaceprefix.mappings]*
>>>>>> c. enable a bunch of properties within mappings.txt, such as the following:
>>>>>>
>>>>>> fb:music.artist.genre
>>>>>> fb:music.artist.label
>>>>>> fb:music.artist.album
>>>>>>
>>>>>> *2. Contents of indexing/dist directory*
>>>>>>
>>>>>> -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
>>>>>> -rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
>>>>>>
>>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
>>>>>>
>>>>>> -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz
>>>>>>
>>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
>>>>>>
>>>>>> -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt
>>>>>>
>>>>>> *5.
The indexer log* >>>>>> >>>>>> *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: >>>>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)* >>>>>> *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: >>>>>> 2429800000 triples (80.97554853864854%)* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples data phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>>>> second]* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start >>>>>> triples index phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples index phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples load* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>>>> Completed: >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>>>> second]* >>>>>> 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving >>>>>> imported file freebase.nt.gz to imported/freebase.nt.gz >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in >>>>>> 157675 seconds >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading >>>>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ... >>>>>> 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File >>>>> {} >>>>>> because of unknown extension >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 >>>>>> seconds >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files >>>>> imported >>>>>> in 157675 seconds >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ... >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files >>>>> imported >>>>>> in 0 seconds >>>>>> 04:32:56,971 [main] INFO impl.IndexerImpl - ... 
delete existing >>>>>> IndexedEntityId file >>>>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation >>>>> completed >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ... >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ... >>>>>> >>>>>> >>>>>> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace ' >>>>>> http://prefix.cc/nsogi:' invalid -> mapping ignored! >>>>>> 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'category' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace ' >>>>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace ' >>>>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace ' >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace ' >>>>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored! 
>>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace ' >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'call' valid , namespace ' >>>>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace ' >>>>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace ' >>>>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace ' >>>>>> http://www.kinjal.com/condition:' invalid -> mapping ignored! >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ... 
>>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - > current sequence : 0 >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - > new sequence: 1 >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - Send end-of-queue to Deamons with Sequence 1 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>> Indexing: Entity Processor Deamon completed (sequence=1) ... >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>>> current sequence : 1 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>>> new sequence: 2 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>> Send end-of-queue to Deamons with Sequence 2 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ... 
>>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>>> current sequence : 2 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>>> new sequence: 3 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>> Send end-of-queue to Deamons with Sequence 3 >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): >>>>>> processing: -1.000ms/item | queue: -1.000ms* >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - source : -1.000ms/item >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - processing: -1.000ms/item >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - store : -1.000ms/item >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed >>>>>> (sequence=3) ... >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - > current sequence : 3 >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - > new sequence: 4 >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4 >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>>>> impl.IndexerImpl >>>>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ... >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>>>> impl.IndexerImpl >>>>>> - > current sequence : 4 >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ... >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ... 
>>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
>>>>>> 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....
>>>>>>
>>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Rupert and Antonio,
>>>>>>>>
>>>>>>>> Thanks a lot for the reply.
>>>>>>>>
>>>>>>>> I started to follow Rupert's suggestion; however, it failed again at
>>>>>>>>
>>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal
>>>>>>>> escape sequence value: $ (0x24) -- Is there any way it can be resolved for
>>>>>>>> the entire file?
>>>>>>>
>>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors,
>>>>>>> so the Jena mailing lists would be the better place to look for
>>>>>>> answers.
>>>>>>> This specific issue looks like an invalid URI that is not fixed by the
>>>>>>> fixit script.
>>>>>>>
>>>>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to
>>>>>>>> be open.
>>>>>>>>
>>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
>>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
>>>>>>>>
>>>>>>>> *Couple additional questions:*
>>>>>>>>
>>>>>>>> *1. indexing enhancements:*
>>>>>>>> What settings/properties can one tweak to get the most out of the indexing?
>>>>>>>
>>>>>>> In general you only want the information needed for your application
>>>>>>> in the index.
>>>>>>> For EntityLinking only labels and types are required.
>>>>>>> Additional properties will only be used for dereferencing Entities, so
>>>>>>> this will depend on your application needs (your dereferencing
>>>>>>> configuration).
>>>>>>> In general I try to exclude as much information as possible from the
>>>>>>> index to keep the size of the Solr Index as small as possible.
>>>>>>>
>>>>>>>> a. e.g. domain-specific, such as Pharmaceutical, Law, etc., within
>>>>>>>> freebase
>>>>>>>> b. potential optimizations to speed up the overall indexing
>>>>>>>
>>>>>>> Most of the time is needed to load the Freebase dump into Jena
>>>>>>> TDB. Even with an SSD-equipped server this will take several days.
>>>>>>> Assigning more RAM will speed up this process, as Jena TDB can cache
>>>>>>> more things in RAM.
>>>>>>>
>>>>>>> Usually it is a good idea to cancel the indexing process after the
>>>>>>> import of the RDF data has finished (and the indexing of the
>>>>>>> Entities has started). This is because after the import all the RAM
>>>>>>> will be used by Jena TDB for caching things that are no longer needed
>>>>>>> for the read-only operations during indexing. So a fresh start can
>>>>>>> speed up the indexing part of the process.
>>>>>>>
>>>>>>> Also have a look at the Freebase Indexing Tool README.
>>>>>>>
>>>>>>>> *2. demo:*
>>>>>>>> I see that in recent GitHub commit(s) the eHealth and other demos have
>>>>>>>> been commented out. How can I get the demo source code and other
>>>>>>>> components for these demos? I prefer to build it myself to see the
>>>>>>>> power of Stanbol.
>>>>>>>
>>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
>>>>>>> compatible with the trunk version.
>>>>>>>
>>>>>>>> *3. custom vocabulary:*
>>>>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred
>>>>>>>> way to upload it to Stanbol and have it recognize my entities?
>>>>>>>
>>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
>>>>>>> the (newer) OpenRefine [4] with the RDF Refine 0.9.0 Alpha version,
>>>>>>> but AFAIK this combination is not so stable and might not work at all.
>>>>>>> * Google Refine allows you to import your CSV file.
>>>>>>> * Clean it up (if necessary).
>>>>>>> * The RDF extension allows you to map your CSV data to RDF.
>>>>>>> * Based on this mapping you can save your data as RDF.
>>>>>>> * After that you can import the RDF data to Apache Stanbol.
>>>>>>>
>>>>>>> hope this helps
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Rajan
>>>>>>>
>>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
>>>>>>> [2] https://code.google.com/p/google-refine/
>>>>>>> [3] http://refine.deri.ie/
>>>>>>> [4] http://openrefine.org/
>>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Rajan,
>>>>>>>>>
>>>>>>>>> I think this is because you named your file
>>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF
>>>>>>>>> format is not provided by the file extension. Renaming the file to
>>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
>>>>>>>>>
>>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales <ape...@zaizi.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rajan
>>>>>>>>>>
>>>>>>>>>> The Freebase dump contains some things that do not fit very well
>>>>>>>>>> with the indexer.
>>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
>>>>>>>>>> which is a curated Freebase dump.
>>>>>>>>>> I did not have any problem indexing it using that dump.
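If access to the BaseKB bucket is denied, a supplementary stopgap (not part of the thread's fixit script; the function name and file names are illustrative) is to drop the N-Triples lines that Jena rejects, such as those containing the illegal `\$` escape reported earlier in the thread:

```shell
# Drop N-Triples lines containing a '\$' escape, which Jena's parser rejects
# ("illegal escape sequence value: $ (0x24)"). This discards the offending
# triples rather than repairing them.
strip_bad_escapes() {
  grep -v '\\\$' || true   # '|| true': an input with only bad lines is not an error
}

# Intended use (file names illustrative):
#   gunzip -c freebase-rdf-latest.gz | strip_bad_escapes | gzip > freebase.nt.gz
printf '<s> <p> "ok" .\n<s> <p> "bad\\$" .\n' | strip_bad_escapes
# → <s> <p> "ok" .
```

This trades a small amount of data loss for a parse that completes, so a curated dump such as BaseKB remains the cleaner option when it is available.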
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am working on indexing Freebase data within the Entityhub and
>>>>>>>>>>> observed the following issue:
>>>>>>>>>>>
>>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or
>>>>>>>>>>> attribute do not match QName production: QName::=(NCName':')?NCName.
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate any help pertaining to this issue.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rajan
>>>>>>>>>>>
>>>>>>>>>>> *Steps followed:*
>>>>>>>>>>>
>>>>>>>>>>> *1. Initialization:*
>>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
>>>>>>>>>>>
>>>>>>>>>>> *2. Download the data:*
>>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
>>>>>>>>>>>
>>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
>>>>>>>>>>> It generated incoming_links.txt under the resources directory, as follows:
>>>>>>>>>>>
>>>>>>>>>>> 10888430 m.0kpv11
>>>>>>>>>>> 3741261 m.019h
>>>>>>>>>>> 2667858 m.0775xx5
>>>>>>>>>>> 2667804 m.0775xvm
>>>>>>>>>>> 1875352 m.01xryvm
>>>>>>>>>>> 1739262 m.05zppz
>>>>>>>>>>> 1369590 m.01xrzlb
>>>>>>>>>>>
>>>>>>>>>>> *4. Performed execution of the fixit script*
>>>>>>>>>>>
>>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
>>>>>>>>>>>
>>>>>>>>>>> *5. Renamed the fixed file to freebase.rdf.gz and copied it*
>>>>>>>>>>> to indexing/resources/rdfdata
>>>>>>>>>>>
>>>>>>>>>>> *6. config/iditer.properties has the following settings*
>>>>>>>>>>> #id-namespace=http://freebase.com/
>>>>>>>>>>> ns-prefix-state=false
>>>>>>>>>>>
>>>>>>>>>>> *7.
Performed run of following command:* >>>>>>>>>>> java -jar -Xmx32g >>>>>>>>>>> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar >>>>>>> index >>>>>>>>>>> >>>>>>>>>>> The error dump on stdout is as follows: >>>>>>>>>>> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO >>>>> solryard.SolrYardIndexingDestination - >>>>>>>>> ... >>>>>>>>>>> copy Solr Configuration form >>>>>>>>> /private/tmp/freebase/indexing/config/freebase >>>>>>>>>>> to >>>>>>> /private/tmp/freebase/indexing/destination/indexes/default/freebase >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - >>>>>>> bulk >>>>>>>>>>> loading File freebase.rdf.gz using Format Lang:RDF/XML >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Start >>>>>>>>>>> triples data phase >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>>>> Load >>>>>>>>> empty >>>>>>>>>>> triples table >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] >>>>>>> Element or >>>>>>>>>>> attribute do not match QName production: >>>>> QName::=(NCName':')?NCName.* >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Finish >>>>>>>>>>> triples data phase >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Finish >>>>>>>>>>> triples load >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error >>>>>>> for >>>>>>>>> File >>>>>>>>>>> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz >>>>> and >>>>>>>>>>> continue >>>>>>>>>>> >>>>>>>>>>> Additional Reference Point: >>>>>>>>>>> >>>>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 >>>>>>>>>>> freebase-rdf-latest.gz >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 >>>>>>>>>>> freebase-rdf-latest-fixed.gz >>>>>>>>>>> *Incoming Links size: *1206745360 May 17 00:42 incoming_links.txt >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> ------------------------------ >>>>>>>>>> This message 
should be regarded as confidential. If you have received this email in error
please notify the sender and destroy it immediately. Statements of intent
shall only become binding when confirmed in hard copy by an authorised
signatory.

Zaizi Ltd is registered in England and Wales with the registration number
6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
London W6 7AN.

>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>> | REDLINK.CO
>>>>>>>>> ..........................................................................
>>>>>>>>> | http://redlink.co/