Hi,

Thanks again for your help.
Finally, I have the Freebase index and I can use it. I really appreciate your continuous help.

With best regards,
Rajan

On Thu, May 28, 2015 at 5:42 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:

> Hi,
>
> Please have a look at the stanbol log file (./stanbol/log/error.log).
> The schema.xml of the freebase indexing tool uses Solr Analyzers that
> are not included by all Stanbol Launchers. If you are missing some
> things you will see the corresponding exceptions in the log.
>
> Installation will extract the index from the archive and copy it to
> the ./stanbol/indexes. So depending on the size the installation may
> take some time.
>
> best
> Rupert
>
>
> On Wed, May 27, 2015 at 3:01 PM, Rajan Shah <raja...@gmail.com> wrote:
> > Hi Rupert,
> >
> > Finally, I got the freebase index after a 2-day run. For the English language
> > only, the size is roughly 28G.
> >
> > Surprisingly, after I installed it via the OSGI console it created the Referenced
> > Site and Solr Yard. However, it's not visible within entityhub sites. I did
> > configure the following parameters within SolrYard
> >
> > a. "Allow Initialization" - checked
> > b. Index configuration: freebase.solrindex.zip
> >
> > I also re-started a couple of times but no luck.
> >
> > Does it require any additional special configuration? i.e. do I need to
> > have a higher -Xmx parameter setting or something else
> >
> > With best regards,
> > Rajan
> >
> > On Tue, May 26, 2015 at 9:06 AM, <raja...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Accidentally, I wiped out logs for a clean start. At the same time, I am
> >> planning to run on a higher end AWS instance as well, so will keep you
> >> posted.
> >>
> >> Thanks again for your continuous help.
> >>
> >> With best regards,
> >> Rajan
> >>
> >> Sent from my iPhone
> >>
> >> > On May 26, 2015, at 8:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >
> >> > HI
> >> >
> >> >> On Tue, May 26, 2015 at 2:13 PM, <raja...@gmail.com> wrote:
> >> >> Hi Rupert,
> >> >>
> >> >> After last failure, I am only using language=en and it still fails.
> >> >
> >> > Can you provide some lines of logging from before the OOM? I would like
> >> > to be sure that it really happens during the Solr optimization phase.
> >> >
> >> >> Thanks for the timely answer. Just to double confirm, if I re-started
> >> >> the index command this am again with higher -Xmx option, is it too late to
> >> >> run finalise, correct?
> >> >
> >> > If the OOM exception really happened during the Solr optimization, calling
> >> >
> >> > java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
> >> >
> >> > will use the data of the previous indexing call and just repeat the
> >> > finalization steps
> >> >
> >> > best
> >> > Rupert
> >> >
> >> >
> >> >> With best regards,
> >> >> Rajan
> >> >>
> >> >> Sent from my iPhone
> >> >>
> >> >>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >>>
> >> >>> Hi Rajan
> >> >>>
> >> >>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>> Hi Rupert,
> >> >>>>
> >> >>>> Thanks for the reply.
> >> >>>>
> >> >>>> As per your suggestion, I made necessary changes however it failed with
> >> >>>> "OutOfMemory" errors. At present, I am running with -Xmx48g however at this
> >> >>>> point it's a trial and error approach with several days effort being
> >> >>>> wasted.
> >> >>>
> >> >>> I guess you are getting the OutOfMemory while optimizing the Solr
> >> >>> Index (right?). The README [1] explicitly notes that a high amount of
> >> >>> memory is needed by exactly this step of the indexing process.
> >> >>>
> >> >>> If the indexing fails at this step you can call the indexing tool with
> >> >>> the `finalise` command (instead of `index`) (see STANBOL-1047 [2]
> >> >>> for details). This will prevent the indexing from being repeated and only
> >> >>> execute the finalization steps (optimizing the Solr Index and creating
> >> >>> the freebase.solrindex.zip file).
> >> >>>
> >> >>>
> >> >>>> I am just throwing out an idea, but wanted to see
> >> >>>>
> >> >>>> a. Is it possible to publish set of constraints and required parameters.
> >> >>>> i.e. with minimal set of entities within mappings.txt, one need to set
> >> >>>> these parameters?
> >> >>>
> >> >>> I do not understand this question. Do you want to filter entities
> >> >>> based on their information? If so you might want to have a look at the
> >> >>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
> >> >>> The generic RDF indexing tool has an example on how to use this
> >> >>> processor to filter entities based on their rdf:type values.
> >> >>>
> >> >>> See also the "Entity Filters" section of [3]
> >> >>>
> >> >>>>
> >> >>>> b. Is it possible to split the file based on subject? generate smaller
> >> >>>> index for each subject and merge afterwards?
> >> >>>
> >> >>> Yes. You can split up the dump (by subject). Import those parts in
> >> >>> different Indexing Tool instances (meaning different Jena TDB
> >> >>> instances). Importing 4*500million triples to Jena TDB is supposed to
> >> >>> be much faster than 1*2Billion.
> >> >>>
> >> >>> If you still want to have all data in a single Entityhub Site you need
> >> >>> to script the indexing process.
> >> >>>
> >> >>> * call indexing for the first part
> >> >>> * after this finishes link the {part1}/indexing/destination/indexes
> >> >>>   folder to {part2..n}/indexing/destination/indexes
> >> >>> * call indexing for the 2..n parts.
> >> >>>
> >> >>> As the indexing tool only adds additional information to the Solr
> >> >>> Index you will get the union over all parts at the end of the process.
> >> >>> All parts need to use the full incoming_links.txt file because
> >> >>> otherwise the rankings would not be correct.
> >> >>>
> >> >>> The "Indexing Datasets separately" section of [3] describes a similar
> >> >>> trick of creating a union index over multiple datasets.
> >> >>>
> >> >>>
> >> >>> best
> >> >>> Rupert
> >> >>>
> >> >>>> c. Work with BaseKB guys to also make it available at nominal charge?
> >> >>>>
> >> >>>> d. Maybe apply some Map/Reduce - extension of idea b
> >> >>>>
> >> >>>> With best regards,
> >> >>>> Rajan
> >> >>>
> >> >>>
> >> >>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
> >> >>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
> >> >>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
> >> >>>
> >> >>>>
> >> >>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >>>>
> >> >>>>> Hi Rajan,
> >> >>>>>
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):
> >> >>>>>
> >> >>>>> You have not indexed a single entity. So something in your indexing
> >> >>>>> configuration is wrong. Most likely you are not correctly building the
> >> >>>>> URIs of the entities from the incoming_links.txt file. Can you provide
> >> >>>>> me an example line of the 'incoming_links.txt' file and the contents
> >> >>>>> of the 'iditerator.properties' file.
> >> >>>>> Those specify how Entity URIs are built.
> >> >>>>>
> >> >>>>> Short answers to the other questions
> >> >>>>>
> >> >>>>>
> >> >>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>> it ran for almost 3 days and generated index.
> >> >>>>>
> >> >>>>> That's good. It means you now have the Freebase dump in your Jena
> >> >>>>> TDB triple store. You will not need to repeat this (until you want to
> >> >>>>> use a newer dump). On the next call to the indexing tool it will
> >> >>>>> immediately start with the indexing step.
> >> >>>>>
> >> >>>>>
> >> >>>>>>
> >> >>>>>> Couple questions come to mind:
> >> >>>>>>
> >> >>>>>> a. Is there any particular log/error file the process generates besides
> >> >>>>>> printing out on stdout/stderr?
> >> >>>>>
> >> >>>>> The indexer writes a zip archive with the IDs of all the indexed
> >> >>>>> entities. It's in the indexing/destination folder.
> >> >>>>>
> >> >>>>>> b. Is it a must-have to have stanbol full launcher running all the time
> >> >>>>>> while indexing is going on?
> >> >>>>>
> >> >>>>> No Stanbol instance is needed by the indexing process.
> >> >>>>>
> >> >>>>>> c. Is it possible that, if the machine is not connected to internet for
> >> >>>>>> couple minutes could cause some issues?
> >> >>>>>
> >> >>>>> No Internet connectivity is needed during indexing. Only if you want
> >> >>>>> to use the namespace prefix mappings of prefix.cc you need to have
> >> >>>>> internet connectivity when starting the indexing tool.
> >> >>>>>
> >> >>>>> best
> >> >>>>> Rupert
> >> >>>>>
> >> >>>>>>
> >> >>>>>> I would really appreciate, if you can shed some light on "what could be
> >> >>>>>> wrong" or "potential approach to nail down this issue"? If you need, I am
> >> >>>>>> happy to share any additional logs/properties.
> >> >>>>>>
> >> >>>>>> With best regards,
> >> >>>>>> Rajan
> >> >>>>>>
> >> >>>>>> *1. Configuration changes*
> >> >>>>>>
> >> >>>>>> a. set ns-prefix-state=false
> >> >>>>>>    [within /indexing/config/iditerator.properties]
> >> >>>>>> b. add empty space mapping to http://rdf.freebase.com/ns/
> >> >>>>>>    [within namespaceprefix.mappings]
> >> >>>>>> c. enable bunch of properties within mappings.txt such as following
> >> >>>>>>
> >> >>>>>> fb:music.artist.genre
> >> >>>>>> fb:music.artist.label
> >> >>>>>> fb:music.artist.album
> >> >>>>>>
> >> >>>>>> *2. Contents of indexing/dist directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
> >> >>>>>> -rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
> >> >>>>>>
> >> >>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz
> >> >>>>>>
> >> >>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
> >> >>>>>>
> >> >>>>>> -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt
> >> >>>>>>
> >> >>>>>> *5. The indexer log*
> >> >>>>>>
> >> >>>>>> *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: 570,850,000 triples (Batch: 2,604 / Avg: 3,621)*
> >> >>>>>> *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: 2429800000 triples (80.97554853864854%)*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples index phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples index phase*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load*
> >> >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Completed: 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per second]*
> >> >>>>>> 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving imported file freebase.nt.gz to imported/freebase.nt.gz
> >> >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in 157675 seconds
> >> >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - loading '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ...
> >> >>>>>> 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File {} because of unknown extension
> >> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 seconds
> >> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files imported in 157675 seconds
> >> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ...
> >> >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files imported in 0 seconds
> >> >>>>>> 04:32:56,971 [main] INFO impl.IndexerImpl - ... delete existing IndexedEntityId file /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip
> >> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed
> >> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation completed
> >> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ...
> >> >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ...
> >> >>>>>>
> >> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'nsogi' valid , namespace 'http://prefix.cc/nsogi:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'category' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'chebi' valid , namespace 'http://bio2rdf.org/chebi:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'hgnc' valid , namespace 'http://bio2rdf.org/hgnc:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'pubmed' valid , namespace 'http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbt' valid , namespace 'http://dbpedia.org/resource/Template:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbrc' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'call' valid , namespace 'http://webofcode.org/wfn/call:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'dbcat' valid , namespace 'http://dbpedia.org/resource/Category:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace 'http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'bgcat' valid , namespace 'http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored!
> >> >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - Invalid Namespace Mapping: prefix 'condition' valid , namespace 'http://www.kinjal.com/condition:' invalid -> mapping ignored!
> >> >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Indexing: Entity Source Reader Deamon completed (sequence=0) ...
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > current sequence : 0
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - > new sequence: 1
> >> >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 1
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Indexing: Entity Processor Deamon completed (sequence=1) ...
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > current sequence : 1
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - > new sequence: 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Indexing: Entity Perstisting Deamon completed (sequence=2) ...
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > current sequence : 2
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - > new sequence: 3
> >> >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 3
> >> >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): processing: -1.000ms/item | queue: -1.000ms*
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - source : -1.000ms/item
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - processing: -1.000ms/item
> >> >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - - store : -1.000ms/item
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed (sequence=3) ...
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > current sequence : 3
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - > new sequence: 4
> >> >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4
> >> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - Indexer: Entity Error Logging Daemon completed (sequence=4) ...
> >> >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO impl.IndexerImpl - > current sequence : 4
> >> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed
> >> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ...
> >> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ...
> >> >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
> >> >>>>>> 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>>>> Hi Rupert and Antonio,
> >> >>>>>>>>
> >> >>>>>>>> Thanks a lot for the reply.
> >> >>>>>>>>
> >> >>>>>>>> I started to follow Rupert's suggestion, however it failed again at
> >> >>>>>>>>
> >> >>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal escape sequence value: $ (0x24)
> >> >>>>>>>>
> >> >>>>>>>> Is there any way it can be resolved for the entire file?
> >> >>>>>>>
> >> >>>>>>> The indexing tool uses Apache Jena. And those are Jena parsing errors.
> >> >>>>>>> So the Jena Mailing lists would be the better place to look for
> >> >>>>>>> answers.
> >> >>>>>>> This specific issue looks like an invalid URI that is not fixed by the
> >> >>>>>>> fixit script.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>> I requested an access to latest BaseKB bucket, as it doesn't seem to be
> >> >>>>>>>> open.
> >> >>>>>>>>
> >> >>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
> >> >>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> *Couple additional questions:*
> >> >>>>>>>>
> >> >>>>>>>> *1. indexing enhancements:*
> >> >>>>>>>> What settings/properties one can tweak to gain most out of the indexing.
> >> >>>>>>>
> >> >>>>>>> In general you only want the information needed for your application
> >> >>>>>>> case in the index.
> >> >>>>>>> For EntityLinking only labels and type are required.
> >> >>>>>>> Additional properties will only be used for dereferencing Entities. So
> >> >>>>>>> this will depend on your application needs (your dereferencing
> >> >>>>>>> configuration).
> >> >>>>>>>
> >> >>>>>>> In general I try to exclude as much information as possible from the
> >> >>>>>>> index to keep the size of the Solr Index as small as possible.
> >> >>>>>>>
> >> >>>>>>>> a. for ex. domain specific such as Pharmaceutical, Law etc... within
> >> >>>>>>>> freebase
> >> >>>>>>>> b. potential optimizations to speed up the overall indexing
> >> >>>>>>>
> >> >>>>>>> Most of the time will be needed to load the Freebase dump into Jena
> >> >>>>>>> TDB. Even with an SSD equipped Server this will take several days.
> >> >>>>>>> Assigning more RAM will speed up this process as Jena TDB can cache
> >> >>>>>>> more things in RAM.
> >> >>>>>>>
> >> >>>>>>> Usually it is a good idea to cancel the indexing process after the
> >> >>>>>>> importing of the RDF data has finished (and the indexing of the
> >> >>>>>>> Entities has started). This is because after importing all the RAM will
> >> >>>>>>> be used by Jena TDB for caching stuff that is no longer needed in the
> >> >>>>>>> read-only operations during indexing. So a fresh start can speed up
> >> >>>>>>> the indexing part of the process.
> >> >>>>>>>
> >> >>>>>>> Also have a look at the Freebase Indexing Tool Readme
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> *2. demo:*
> >> >>>>>>>> I see that, in recent github commit(s) the eHealth and other demos have
> >> >>>>>>>> been commented out. How can I get demo source code and other components for
> >> >>>>>>>> these demos. I prefer to build it myself to see the power of stanbol.
> >> >>>>>>>
> >> >>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
> >> >>>>>>> compatible with the trunk version.
> >> >>>>>>>
> >> >>>>>>>> *3. custom vocabulary:*
> >> >>>>>>>> Suppose, I have custom vocabulary in CSV format. Is there a preferred way
> >> >>>>>>>> to upload it to Stanbol and have it recognize my entities?
> >> >>>>>>>
> >> >>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
> >> >>>>>>> the (newer) Open Refine [4] with the RDF Refine 0.9.0 Alpha version
> >> >>>>>>> but AFAIK this combination is not so stable and might not work at all.
> >> >>>>>>>
> >> >>>>>>> * Google Refine allows you to import your CSV file.
> >> >>>>>>> * Clean it up (if necessary)
> >> >>>>>>> * The RDF extension allows you to map your CSV data to RDF
> >> >>>>>>> * based on this mapping you can save your data as RDF
> >> >>>>>>> * after that you can import the RDF data to Apache Stanbol
> >> >>>>>>>
> >> >>>>>>> hope this helps
> >> >>>>>>> best
> >> >>>>>>> Rupert
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Thanks in advance,
> >> >>>>>>>> Rajan
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
> >> >>>>>>> [2] https://code.google.com/p/google-refine/
> >> >>>>>>> [3] http://refine.deri.ie/
> >> >>>>>>> [4] http://openrefine.org/
> >> >>>>>>>
> >> >>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>> Hi Rajan,
> >> >>>>>>>>>
> >> >>>>>>>>> I think this is because you named your file
> >> >>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF format
> >> >>>>>>>>> is not provided by the file extension. Renaming the file to
> >> >>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
> >> >>>>>>>>>
> >> >>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
> >> >>>>>>>>>
> >> >>>>>>>>> best
> >> >>>>>>>>> Rupert
> >> >>>>>>>>>
> >> >>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales
> >> >>>>>>>>> <ape...@zaizi.com> wrote:
> >> >>>>>>>>>> Hi Rajan
> >> >>>>>>>>>>
> >> >>>>>>>>>> The Freebase dump contains some things that do not fit very well with the
> >> >>>>>>>>>> indexer.
> >> >>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com) which
> >> >>>>>>>>>> is a curated Freebase dump.
> >> >>>>>>>>>> I did not have any problem indexing it using that dump.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Regards
> >> >>>>>>>>>>
> >> >>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Hi,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I am working on indexing Freebase data within EntityHub and observed
> >> >>>>>>>>>>> following issue:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I would appreciate any help pertaining to this issue.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks,
> >> >>>>>>>>>>> Rajan
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *Steps followed:*
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *1. Initialization:*
> >> >>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *2. Download the data:*
> >> >>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
> >> >>>>>>>>>>> It generated incoming_links.txt under resources directory as follows
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 10888430 m.0kpv11
> >> >>>>>>>>>>> 3741261 m.019h
> >> >>>>>>>>>>> 2667858 m.0775xx5
> >> >>>>>>>>>>> 2667804 m.0775xvm
> >> >>>>>>>>>>> 1875352 m.01xryvm
> >> >>>>>>>>>>> 1739262 m.05zppz
> >> >>>>>>>>>>> 1369590 m.01xrzlb
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *4. Performed execution of fixit script*
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *5. Rename the fixed file to freebase.rdf.gz and copy it to indexing/resources/rdfdata*
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *6. config/iditer.properties file has following setting*
> >> >>>>>>>>>>> #id-namespace=http://freebase.com/
> >> >>>>>>>>>>> ns-prefix-state=false
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *7. Performed run of following command:*
> >> >>>>>>>>>>> java -jar -Xmx32g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar index
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> The error dump on stdout is as follows:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO solryard.SolrYardIndexingDestination - ... copy Solr Configuration form /private/tmp/freebase/indexing/config/freebase to /private/tmp/freebase/indexing/destination/indexes/default/freebase
> >> >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - bulk loading File freebase.rdf.gz using Format Lang:RDF/XML
> >> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start triples data phase
> >> >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Load empty triples table
> >> >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or attribute do not match QName production: QName::=(NCName':')?NCName.*
> >> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples data phase
> >> >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish triples load
> >> >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error for File /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz and continue
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Additional Reference Point:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 freebase-rdf-latest.gz
> >> >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 freebase-rdf-latest-fixed.gz
> >> >>>>>>>>>>> *Incoming Links size:* 1206745360 May 17 00:42 incoming_links.txt
> >> >>>>>>>>>>
> >> >>>>>>>>>> --
> >> >>>>>>>>>>
> >> >>>>>>>>>> ------------------------------
> >> >>>>>>>>>> This message should be regarded as confidential.
> >> >>>>>>>>>> If you have received this
> >> >>>>>>>>>> email in error please notify the sender and destroy it immediately.
> >> >>>>>>>>>> Statements of intent shall only become binding when confirmed in hard copy
> >> >>>>>>>>>> by an authorised signatory.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Zaizi Ltd is registered in England and Wales with the registration number
> >> >>>>>>>>>> 6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
> >> >>>>>>>>>> London W6 7AN.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> | Rupert Westenthaler             rupert.westentha...@gmail.com
> >> >>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
> >> >>>>>>>>> | A-5500 Bischofshofen
> >> >>>>>>>>> | REDLINK.CO ..........................................................................
> >> >>>>>>>>> | http://redlink.co/
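
A rough sketch of the split-indexing procedure described in this thread (index the first part, link its indexing/destination/indexes folder into each later part, then index the later parts), written as a POSIX shell script. It is not taken from the Stanbol documentation. The function names, the DRY_RUN switch, the default heap size, and the assumption that every part directory was prepared with `init` and already holds its slice of the dump plus the full incoming_links.txt are all illustrative. Only the jar name, the `index` command, and the indexing/destination/indexes path come from the messages in the thread.

```shell
#!/bin/sh
# Sketch only: build one union Solr index from several parts of a dump.
# Hypothetical assumptions: every part directory was prepared with `init`,
# already contains its slice of the dump plus the full incoming_links.txt,
# and is passed as an absolute path.

JAR="${JAR:-org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar}"
XMX="${XMX:-32g}"        # illustrative heap size, tune for your machine
DRY_RUN="${DRY_RUN:-}"   # non-empty: print the commands instead of running them

# Run the indexing tool inside one part directory.
# A relative JAR path is resolved against that directory.
index_part() {
    if [ -n "$DRY_RUN" ]; then
        echo "(cd $1 && java -Xmx$XMX -jar $JAR index)"
    else
        (cd "$1" && java "-Xmx$XMX" -jar "$JAR" index)
    fi
}

# Make a later part write into the Solr index of the first part.
# Assumes the later part has no indexes folder of its own yet.
link_indexes() {
    src="$1/indexing/destination/indexes"
    dst="$2/indexing/destination/indexes"
    if [ -n "$DRY_RUN" ]; then
        echo "ln -s $src $dst"
    else
        mkdir -p "$2/indexing/destination" && ln -s "$src" "$dst"
    fi
}

# Usage: index_all FIRST_PART [OTHER_PART...]
index_all() {
    first="$1"
    shift
    index_part "$first" || return 1
    for part in "$@"; do
        link_indexes "$first" "$part" || return 1
        index_part "$part" || return 1
    done
}
```

For example, `DRY_RUN=1 index_all /data/part1 /data/part2` prints the three commands it would run without starting the indexing tool, which is a cheap way to check the paths before committing to a multi-day run.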