Hi,

I accidentally wiped out the logs for a clean start. I am also planning to run on a higher-end AWS instance, so I will keep you posted.
Thanks again for your continuous help.

With best regards,
Rajan

Sent from my iPhone

> On May 26, 2015, at 8:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>
> Hi
>
>> On Tue, May 26, 2015 at 2:13 PM, <raja...@gmail.com> wrote:
>>
>> Hi Rupert,
>>
>> After the last failure, I am only using language=en and it still fails.
>
> Can you provide some lines of the logging before the OOM? I would like
> to be sure that it really happens during the Solr optimization phase.
>
>> Thanks for the timely answer. Just to double-confirm: if I restarted the
>> index command this morning with a higher -Xmx option, is it too late to
>> run finalise?
>
> If the OOM exception really happened during the Solr optimization, calling
>
> java -jar -Xmx{higher-value}g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
>
> will use the data of the previous indexing call and just repeat the
> finalization steps.
>
> best
> Rupert
>
>> With best regards,
>> Rajan
>>
>> Sent from my iPhone
>>
>>> On May 26, 2015, at 7:47 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>
>>> Hi Rajan
>>>
>>>> On Mon, May 25, 2015 at 6:15 AM, Rajan Shah <raja...@gmail.com> wrote:
>>>>
>>>> Hi Rupert,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> As per your suggestion, I made the necessary changes; however, it failed
>>>> with "OutOfMemory" errors. At present, I am running with -Xmx48g, but at
>>>> this point it is a trial-and-error approach, with several days of effort
>>>> being wasted.
>>>
>>> I guess you are getting the OutOfMemory while optimizing the Solr
>>> Index (right?). The README [1] explicitly notes that a high amount of
>>> memory is needed by exactly this step of the indexing process.
>>>
>>> If the indexing fails at this step you can call the indexing tool with
>>> the `finalise` command (instead of `indexing`); see STANBOL-1047 [2]
>>> for details.
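The recovery call described above can be sketched as a small shell helper. `build_finalise_cmd` and the 64g heap value are illustrative; the jar name and the `finalise` argument come from the thread:

```shell
# Assemble the finalise invocation (heap size and jar path are caller-supplied),
# so the same call can be retried with progressively larger heaps.
build_finalise_cmd() {
  local heap="$1" jar="$2"
  printf 'java -Xmx%s -jar %s finalise\n' "$heap" "$jar"
}

# Example: re-run only the finalization steps with a larger heap.
build_finalise_cmd 64g org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar
# → java -Xmx64g -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar finalise
```

Because `finalise` reuses the data of the previous indexing call, only the Solr optimization and the freebase.solrindex.zip creation are repeated.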
>>> This will prevent the indexing from being repeated and will only
>>> execute the finalization steps (optimizing the Solr Index and creating
>>> the freebase.solrindex.zip file).
>>>
>>>> I am just throwing out an idea, but wanted to see
>>>>
>>>> a. Is it possible to publish a set of constraints and required
>>>> parameters, i.e. with a minimal set of entities within mappings.txt,
>>>> one needs to set these parameters?
>>>
>>> I do not understand this question. Do you want to filter entities
>>> based on their information? If so you might want to have a look at
>>> `org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter`.
>>> The generic RDF indexing tool has an example of how to use this
>>> processor to filter entities based on their rdf:type values.
>>>
>>> See also the "Entity Filters" section of [3].
>>>
>>>> b. Is it possible to split the file based on subject, generate a smaller
>>>> index for each part, and merge afterwards?
>>>
>>> Yes. You can split up the dump (by subject) and import the parts in
>>> different Indexing Tool instances (meaning different Jena TDB
>>> instances). Importing 4 x 500 million triples to Jena TDB is supposed
>>> to be much faster than 1 x 2 billion.
>>>
>>> If you still want to have all data in a single Entityhub Site you need
>>> to script the indexing process:
>>>
>>> * call indexing for the first part
>>> * after this finishes, link the {part1}/indexing/destination/indexes
>>>   folder to {part2..n}/indexing/destination/indexes
>>> * call indexing for the 2..n parts
>>>
>>> As the indexing tool only adds information to the Solr Index, you will
>>> get the union over all parts at the end of the process. All parts need
>>> to use the full incoming_links.txt file, because otherwise the rankings
>>> would not be correct.
>>>
>>> The "Indexing Datasets separately" section of [3] describes a similar
>>> trick for creating a union index over multiple datasets.
>>>
>>> best
>>> Rupert
>>>
>>>> c.
>>>> Work with the BaseKB guys to also make it available at a nominal charge?
>>>>
>>>> d. Maybe apply some Map/Reduce - an extension of idea b
>>>>
>>>> With best regards,
>>>> Rajan
>>>
>>> [1] http://svn.apache.org/repos/asf/stanbol/trunk/entityhub/indexing/freebase/README.md
>>> [2] https://issues.apache.org/jira/browse/STANBOL-1047
>>> [3] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/README.md
>>>
>>>> On Fri, May 22, 2015 at 9:29 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>
>>>>> Hi Rajan,
>>>>>
>>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO
>>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item):*
>>>>>
>>>>> You have not indexed a single entity, so something in your indexing
>>>>> configuration is wrong. Most likely you are not correctly building the
>>>>> URIs of the entities from the incoming_links.txt file. Can you provide
>>>>> me an example line of the 'incoming_links.txt' file and the contents
>>>>> of the 'iditerator.properties' file? Those specify how Entity URIs are
>>>>> built.
>>>>>
>>>>> Short answers to the other questions:
>>>>>
>>>>>> On Fri, May 22, 2015 at 2:10 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>
>>>>>> it ran for almost 3 days and generated the index.
>>>>>
>>>>> That's good. It means you now have the Freebase dump in your Jena
>>>>> TDB triple store. You will not need to repeat this (until you want to
>>>>> use a newer dump). On the next call to the indexing tool it will
>>>>> immediately start with the indexing step.
>>>>>
>>>>>> A couple of questions come to mind:
>>>>>>
>>>>>> a. Is there any particular log/error file the process generates besides
>>>>>> printing out on stdout/stderr?
>>>>>
>>>>> The indexer writes a zip archive with the IDs of all the indexed
>>>>> entities. It is in the indexing/destination folder.
>>>>>
>>>>>> b.
>>>>>> Is it a must-have to keep the full Stanbol launcher running all the
>>>>>> time while indexing is going on?
>>>>>
>>>>> No Stanbol instance is needed by the indexing process.
>>>>>
>>>>>> c. Is it possible that the machine not being connected to the internet
>>>>>> for a couple of minutes could cause some issues?
>>>>>
>>>>> No internet connectivity is needed during indexing. Only if you want
>>>>> to use the namespace prefix mappings of prefix.cc do you need internet
>>>>> connectivity when starting the indexing tool.
>>>>>
>>>>> best
>>>>> Rupert
>>>>>
>>>>>> I would really appreciate it if you could shed some light on "what
>>>>>> could be wrong" or a "potential approach to nail down this issue". If
>>>>>> needed, I am happy to share any additional logs/properties.
>>>>>>
>>>>>> With best regards,
>>>>>> Rajan
>>>>>>
>>>>>> *1. Configuration changes*
>>>>>>
>>>>>> a. set ns-prefix-state=false
>>>>>> *[within /indexing/config/iditerator.properties]*
>>>>>> b. add an empty-prefix mapping to http://rdf.freebase.com/ns/
>>>>>> *[within namespaceprefix.mappings]*
>>>>>> c. enable a bunch of properties within mappings.txt, such as the following:
>>>>>>
>>>>>> fb:music.artist.genre
>>>>>> fb:music.artist.label
>>>>>> fb:music.artist.album
>>>>>>
>>>>>> *2. Contents of indexing/dist directory*
>>>>>>
>>>>>> -rw-r--r-- 108899 May 22 05:11 freebase.solrindex.zip
>>>>>> -rw-r--r-- 3457 May 22 05:11 org.apache.stanbol.data.site.freebase-1.0.0.jar
>>>>>>
>>>>>> *3. Contents of /tmp/freebase/indexing/resources/imported directory*
>>>>>>
>>>>>> -rw-r--r-- 1 31026810858 May 20 07:32 freebase.nt.gz
>>>>>>
>>>>>> *4. Contents of /tmp/freebase/indexing/resources directory*
>>>>>>
>>>>>> -rw-r--r-- 1 1206745360 May 19 09:38 incoming_links.txt
>>>>>>
>>>>>> *5.
The indexer log* >>>>>> >>>>>> *04:31:57,236 [Thread-3] INFO jenatdb.RdfResourceImporter - Add: >>>>>> 570,850,000 triples (Batch: 2,604 / Avg: 3,621)* >>>>>> *04:32:00,727 [Thread-3] INFO jenatdb.RdfResourceImporter - Filtered: >>>>>> 2429800000 triples (80.97554853864854%)* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples data phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** Data: >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>>>> second]* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Start >>>>>> triples index phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples index phase* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - -- Finish >>>>>> triples load* >>>>>> *04:32:01,157 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>>>> Completed: >>>>>> 570,859,352 triples loaded in 157,619.39 seconds [Rate: 3,621.76 per >>>>>> second]* >>>>>> 04:32:56,880 [Thread-3] INFO source.ResourceLoader - ... moving >>>>>> imported file freebase.nt.gz to imported/freebase.nt.gz >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - - completed in >>>>>> 157675 seconds >>>>>> 04:32:56,883 [Thread-3] INFO source.ResourceLoader - > loading >>>>>> '/private/tmp/freebase/indexing/resources/rdfdata/fixit.sh' ... >>>>>> 04:32:56,944 [Thread-3] WARN jenatdb.RdfResourceImporter - ignore File >>>>> {} >>>>>> because of unknown extension >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - - completed in 0 >>>>>> seconds >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 2 files >>>>> imported >>>>>> in 157675 seconds >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - Loding 0 File ... >>>>>> 04:32:56,958 [Thread-3] INFO source.ResourceLoader - ... 0 files >>>>> imported >>>>>> in 0 seconds >>>>>> 04:32:56,971 [main] INFO impl.IndexerImpl - ... 
delete existing >>>>>> IndexedEntityId file >>>>>> /private/tmp/freebase/indexing/destination/indexed-entities-ids.zip >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Initialisation completed >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - ... initialisation >>>>> completed >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - start indexing ... >>>>>> 04:32:56,982 [main] INFO impl.IndexerImpl - Indexing started ... >>>>>> >>>>>> >>>>>> >>>>>> 04:45:48,075 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'nsogi' valid , namespace ' >>>>>> http://prefix.cc/nsogi:' invalid -> mapping ignored! >>>>>> 04:45:48,076 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'category' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'chebi' valid , namespace ' >>>>>> http://bio2rdf.org/chebi:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'hgnc' valid , namespace ' >>>>>> http://bio2rdf.org/hgnc:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbptmpl' valid , namespace ' >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>>>> 04:45:48,077 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbc' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'pubmed' valid , namespace ' >>>>>> http://bio2rdf.org/pubmed_vocabulary:' invalid -> mapping ignored! 
>>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbt' valid , namespace ' >>>>>> http://dbpedia.org/resource/Template:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbrc' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'call' valid , namespace ' >>>>>> http://webofcode.org/wfn/call:' invalid -> mapping ignored! >>>>>> 04:45:48,078 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'dbcat' valid , namespace ' >>>>>> http://dbpedia.org/resource/Category:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'affymetrix' valid , namespace ' >>>>>> http://bio2rdf.org/affymetrix_vocabulary:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'bgcat' valid , namespace ' >>>>>> http://bg.dbpedia.org/resource/Категория:' invalid -> mapping ignored! >>>>>> 04:45:48,084 [pool-1-thread-1] WARN impl.NamespacePrefixProviderImpl - >>>>>> Invalid Namespace Mapping: prefix 'condition' valid , namespace ' >>>>>> http://www.kinjal.com/condition:' invalid -> mapping ignored! >>>>>> 05:11:41,836 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - Indexing: Entity Source Reader Deamon completed (sequence=0) ... 
>>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - > current sequence : 0 >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - > new sequence: 1 >>>>>> 05:11:41,838 [Indexing: Entity Source Reader Deamon] INFO >>>>> impl.IndexerImpl >>>>>> - Send end-of-queue to Deamons with Sequence 1 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>> Indexing: Entity Processor Deamon completed (sequence=1) ... >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>>> current sequence : 1 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>>> new sequence: 2 >>>>>> 05:11:41,839 [Indexing: Entity Processor Deamon] INFO impl.IndexerImpl - >>>>>> Send end-of-queue to Deamons with Sequence 2 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>> Indexing: Entity Perstisting Deamon completed (sequence=2) ... 
>>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>>> current sequence : 2 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>>> new sequence: 3 >>>>>> 05:11:41,839 [Indexing: Entity Perstisting Deamon] INFO >>>>> impl.IndexerImpl - >>>>>> Send end-of-queue to Deamons with Sequence 3 >>>>>> *05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Indexed 0 items in 2059467sec (Infinityms/item): >>>>>> processing: -1.000ms/item | queue: -1.000ms* >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - source : -1.000ms/item >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - processing: -1.000ms/item >>>>>> 05:11:41,851 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - - store : -1.000ms/item >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Indexing: Finished Entity Logger Deamon completed >>>>>> (sequence=3) ... >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - > current sequence : 3 >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - > new sequence: 4 >>>>>> 05:11:41,906 [Indexing: Finished Entity Logger Deamon] INFO >>>>>> impl.IndexerImpl - Send end-of-queue to Deamons with Sequence 4 >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>>>> impl.IndexerImpl >>>>>> - Indexer: Entity Error Logging Daemon completed (sequence=4) ... >>>>>> 05:11:41,910 [Indexer: Entity Error Logging Daemon] INFO >>>>> impl.IndexerImpl >>>>>> - > current sequence : 4 >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... indexing completed >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - start post-processing ... >>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - PostProcessing started ... 
>>>>>> 05:11:41,910 [main] INFO impl.IndexerImpl - ... post-processing finished ...
>>>>>> 05:11:41,911 [main] INFO impl.IndexerImpl - start finalisation....
>>>>>>
>>>>>> On Wed, May 20, 2015 at 8:19 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 7:04 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Rupert and Antonio,
>>>>>>>>
>>>>>>>> Thanks a lot for the reply.
>>>>>>>>
>>>>>>>> I started to follow Rupert's suggestion; however, it failed again at
>>>>>>>>
>>>>>>>> 10:56:34,152 [Thread-3] ERROR jena.riot - [line: 8722294, col: 88] illegal
>>>>>>>> escape sequence value: $ (0x24) -- Is there any way it can be resolved for
>>>>>>>> the entire file?
>>>>>>>
>>>>>>> The indexing tool uses Apache Jena, and those are Jena parsing errors,
>>>>>>> so the Jena mailing lists would be the better place to look for
>>>>>>> answers.
>>>>>>> This specific issue looks like an invalid URI that is not fixed by the
>>>>>>> fixit script.
>>>>>>>
>>>>>>>> I requested access to the latest BaseKB bucket, as it doesn't seem to
>>>>>>>> be open.
>>>>>>>>
>>>>>>>> s3cmd ls s3://basekb-now/2015-04-15-18-54/ --add-header="x-amz-request-payer: requester"
>>>>>>>> ERROR: Access to bucket 'basekb-now' was denied
>>>>>>>>
>>>>>>>> *Couple additional questions:*
>>>>>>>>
>>>>>>>> *1. indexing enhancements:*
>>>>>>>> What settings/properties can one tweak to get the most out of the indexing?
>>>>>>>
>>>>>>> In general you only want the information needed for your application
>>>>>>> in the index.
>>>>>>> For EntityLinking only labels and types are required.
>>>>>>> Additional properties will only be used for dereferencing Entities, so
>>>>>>> this will depend on your application needs (your dereferencing
>>>>>>> configuration).
>>>>>>> In general I try to exclude as much information as possible from the
>>>>>>> index to keep the size of the Solr Index as small as possible.
>>>>>>>
>>>>>>>> a. e.g. domain-specific, such as Pharmaceutical, Law, etc., within
>>>>>>>> freebase
>>>>>>>> b. potential optimizations to speed up the overall indexing
>>>>>>>
>>>>>>> Most of the time is needed to load the Freebase dump into Jena
>>>>>>> TDB. Even with an SSD-equipped server this will take several days.
>>>>>>> Assigning more RAM will speed up this process, as Jena TDB can cache
>>>>>>> more things in RAM.
>>>>>>>
>>>>>>> Usually it is a good idea to cancel the indexing process after the
>>>>>>> import of the RDF data has finished (and the indexing of the
>>>>>>> Entities has started). This is because after the import all the RAM
>>>>>>> will be used by Jena TDB for caching things that are no longer needed
>>>>>>> for the read-only operations during indexing. So a fresh start can
>>>>>>> speed up the indexing part of the process.
>>>>>>>
>>>>>>> Also have a look at the Freebase Indexing Tool README.
>>>>>>>
>>>>>>>> *2. demo:*
>>>>>>>> I see that in recent GitHub commit(s) the eHealth and other demos have
>>>>>>>> been commented out. How can I get the demo source code and other
>>>>>>>> components for these demos? I prefer to build it myself to see the
>>>>>>>> power of Stanbol.
>>>>>>>
>>>>>>> The eHealth demo is still in the 0.12 branch [1]. This is fully
>>>>>>> compatible with the trunk version.
>>>>>>>
>>>>>>>> *3. custom vocabulary:*
>>>>>>>> Suppose I have a custom vocabulary in CSV format. Is there a preferred
>>>>>>>> way to upload it to Stanbol and have it recognize my entities?
>>>>>>>
>>>>>>> Google Refine [2] with the RDF extension [3]. You can also try to use
>>>>>>> the (newer) OpenRefine [4] with the RDF Refine 0.9.0 Alpha version,
>>>>>>> but AFAIK this combination is not so stable and might not work at all.
>>>>>>> * Google Refine allows you to import your CSV file.
>>>>>>> * Clean it up (if necessary).
>>>>>>> * The RDF extension allows you to map your CSV data to RDF.
>>>>>>> * Based on this mapping you can save your data as RDF.
>>>>>>> * After that you can import the RDF data to Apache Stanbol.
>>>>>>>
>>>>>>> hope this helps
>>>>>>> best
>>>>>>> Rupert
>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>> Rajan
>>>>>>>
>>>>>>> [1] http://svn.apache.org/repos/asf/stanbol/branches/release-0.12/demos/ehealth/
>>>>>>> [2] https://code.google.com/p/google-refine/
>>>>>>> [3] http://refine.deri.ie/
>>>>>>> [4] http://openrefine.org/
>>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 3:01 AM, Rupert Westenthaler <rupert.westentha...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Rajan,
>>>>>>>>>
>>>>>>>>> I think this is because you named your file
>>>>>>>>> "freebase-rdf-latest-fixed.gz". Jena assumes RDF/XML if the RDF
>>>>>>>>> format is not provided by the file extension. Renaming the file to
>>>>>>>>> "freebase-rdf-latest-fixed.nt.gz" should fix this issue.
>>>>>>>>>
>>>>>>>>> The suggestion of Antonio to use BaseKB is also a valid option.
>>>>>>>>>
>>>>>>>>> best
>>>>>>>>> Rupert
>>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 8:32 AM, Antonio David Perez Morales <ape...@zaizi.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rajan
>>>>>>>>>>
>>>>>>>>>> The Freebase dump contains some things that do not fit very well
>>>>>>>>>> with the indexer.
>>>>>>>>>> I advise you to use the dump provided by BaseKB (http://basekb.com),
>>>>>>>>>> which is a curated Freebase dump.
>>>>>>>>>> I did not have any problem indexing it using that dump.
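If access to the BaseKB bucket is denied, a supplementary stopgap (not part of the thread's fixit script; the function name and file names are illustrative) is to drop the N-Triples lines that Jena rejects, such as those containing the illegal `\$` escape reported earlier in the thread:

```shell
# Drop N-Triples lines containing a '\$' escape, which Jena's parser rejects
# ("illegal escape sequence value: $ (0x24)"). This discards the offending
# triples rather than repairing them.
strip_bad_escapes() {
  grep -v '\\\$' || true   # '|| true': an input with only bad lines is not an error
}

# Intended use (file names illustrative):
#   gunzip -c freebase-rdf-latest.gz | strip_bad_escapes | gzip > freebase.nt.gz
printf '<s> <p> "ok" .\n<s> <p> "bad\\$" .\n' | strip_bad_escapes
# → <s> <p> "ok" .
```

This trades a small amount of data loss for a parse that completes, so a curated dump such as BaseKB remains the cleaner option when it is available.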
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> On Mon, May 18, 2015 at 8:48 PM, Rajan Shah <raja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am working on indexing Freebase data within the Entityhub and
>>>>>>>>>>> observed the following issue:
>>>>>>>>>>>
>>>>>>>>>>> 01:06:01,547 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] Element or
>>>>>>>>>>> attribute do not match QName production: QName::=(NCName':')?NCName.
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate any help pertaining to this issue.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Rajan
>>>>>>>>>>>
>>>>>>>>>>> *Steps followed:*
>>>>>>>>>>>
>>>>>>>>>>> *1. Initialization:*
>>>>>>>>>>> java -jar org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar init
>>>>>>>>>>>
>>>>>>>>>>> *2. Download the data:*
>>>>>>>>>>> Download the data from https://developers.google.com/freebase/data
>>>>>>>>>>>
>>>>>>>>>>> *3. Performed execution of fbrankings-uri.sh*
>>>>>>>>>>> It generated incoming_links.txt under the resources directory, as follows:
>>>>>>>>>>>
>>>>>>>>>>> 10888430 m.0kpv11
>>>>>>>>>>> 3741261 m.019h
>>>>>>>>>>> 2667858 m.0775xx5
>>>>>>>>>>> 2667804 m.0775xvm
>>>>>>>>>>> 1875352 m.01xryvm
>>>>>>>>>>> 1739262 m.05zppz
>>>>>>>>>>> 1369590 m.01xrzlb
>>>>>>>>>>>
>>>>>>>>>>> *4. Performed execution of the fixit script*
>>>>>>>>>>>
>>>>>>>>>>> gunzip -c ${FB_DUMP} | fixit | gzip > ${FB_DUMP_fixed}
>>>>>>>>>>>
>>>>>>>>>>> *5. Renamed the fixed file to freebase.rdf.gz and copied it*
>>>>>>>>>>> to indexing/resources/rdfdata
>>>>>>>>>>>
>>>>>>>>>>> *6. config/iditer.properties has the following settings*
>>>>>>>>>>> #id-namespace=http://freebase.com/
>>>>>>>>>>> ns-prefix-state=false
>>>>>>>>>>>
>>>>>>>>>>> *7.
Performed run of following command:* >>>>>>>>>>> java -jar -Xmx32g >>>>>>>>>>> org.apache.stanbol.entityhub.indexing.freebase-1.0.0-SNAPSHOT.jar >>>>>>> index >>>>>>>>>>> >>>>>>>>>>> The error dump on stdout is as follows: >>>>>>>>>>> >>>>>>>>>>> 01:37:32,884 [Thread-0] INFO >>>>> solryard.SolrYardIndexingDestination - >>>>>>>>> ... >>>>>>>>>>> copy Solr Configuration form >>>>>>>>> /private/tmp/freebase/indexing/config/freebase >>>>>>>>>>> to >>>>>>> /private/tmp/freebase/indexing/destination/indexes/default/freebase >>>>>>>>>>> 01:37:32,895 [Thread-3] INFO jenatdb.RdfResourceImporter - - >>>>>>> bulk >>>>>>>>>>> loading File freebase.rdf.gz using Format Lang:RDF/XML >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Start >>>>>>>>>>> triples data phase >>>>>>>>>>> 01:37:32,896 [Thread-3] INFO jenatdb.RdfResourceImporter - ** >>>>> Load >>>>>>>>> empty >>>>>>>>>>> triples table >>>>>>>>>>> *01:37:32,948 [Thread-3] ERROR jena.riot - [line: 1, col: 7 ] >>>>>>> Element or >>>>>>>>>>> attribute do not match QName production: >>>>> QName::=(NCName':')?NCName.* >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Finish >>>>>>>>>>> triples data phase >>>>>>>>>>> 01:37:32,948 [Thread-3] INFO jenatdb.RdfResourceImporter - -- >>>>> Finish >>>>>>>>>>> triples load >>>>>>>>>>> 01:37:32,960 [Thread-3] INFO source.ResourceLoader - Ignore Error >>>>>>> for >>>>>>>>> File >>>>>>>>>>> /private/tmp/freebase/indexing/resources/rdfdata/freebase.rdf.gz >>>>> and >>>>>>>>>>> continue >>>>>>>>>>> >>>>>>>>>>> Additional Reference Point: >>>>>>>>>>> >>>>>>>>>>> *Original Freebase dump size:* 31025015397 May 14 18:10 >>>>>>>>>>> freebase-rdf-latest.gz >>>>>>>>>>> *Fixed Freebase dump size:* 31026818367 May 15 12:45 >>>>>>>>>>> freebase-rdf-latest-fixed.gz >>>>>>>>>>> *Incoming Links size: *1206745360 May 17 00:42 incoming_links.txt >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> >>>>>>>>>> ------------------------------ >>>>>>>>>> This message 
should be regarded as confidential. If you have received this email in error
please notify the sender and destroy it immediately. Statements of intent
shall only become binding when confirmed in hard copy by an authorised
signatory.

Zaizi Ltd is registered in England and Wales with the registration number
6440931. The Registered Office is Brook House, 229 Shepherds Bush Road,
London W6 7AN.

>>>>>>>>> --
>>>>>>>>> | Rupert Westenthaler rupert.westentha...@gmail.com
>>>>>>>>> | Bodenlehenstraße 11 ++43-699-11108907
>>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>> | REDLINK.CO
>>>>>>>>> ..........................................................................
>>>>>>>>> | http://redlink.co/